Random Selection
This method is the simplest: all we have to do is randomly split the data into a training set and a validation set. Let's say we want to assign 80% of the data to the training set and the remaining 20% to the validation set.
# Set the number of observations in the training set
train_number <- round(0.80 * nrow(mtcars), digits = 0)
# Set a seed for reproducible results
set.seed(999)
# The integers in 'train_index' index the rows that will go into the training set
train_index <- sample(seq_len(nrow(mtcars)), size = train_number)
The third command randomly draws 26 integers (80% of 32, rounded) between 1 and the number of rows in the dataset (32). We then use train_index to split the dataset into training and validation sets.
# Rows whose index is in train_index go to the training set; the rest go to validation
train_set <- mtcars[train_index, ]
validation_set <- mtcars[-train_index, ]
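As a quick sanity check (not part of the original post), the two sets should partition the 32 rows of mtcars without overlap: 26 observations for training and the remaining 6 for validation.

# Sanity check (added): the split should cover all 32 rows with no overlap
nrow(train_set)        # 26
nrow(validation_set)   # 6
length(intersect(rownames(train_set), rownames(validation_set)))  # 0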
Next, we train our models on train_set and see how well they perform on validation_set.
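The post does not fit a specific model at this point, so the sketch below simply assumes a linear regression of mpg on wt and hp and scores it with RMSE on the validation set; it is only meant to illustrate the train-then-validate step.

# Illustrative sketch (model choice assumed, not taken from the post):
# fit on the training set, then evaluate on the validation set
model <- lm(mpg ~ wt + hp, data = train_set)
predictions <- predict(model, newdata = validation_set)
rmse <- sqrt(mean((validation_set$mpg - predictions)^2))
rmse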
The advantage of this method is its ease of use: it takes only five lines of code to create a training set and a validation set. But that simplicity comes with an issue: since we split the data randomly, how do we know whether the training set represents the population well?
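One rough way to probe that question (an added check, not from the original post) is to compare a few summary statistics of the training set against the full dataset; a large gap would suggest the random split produced an unrepresentative subset.

# Added check: compare the training set to the full data on mean and sd of mpg
rbind(full  = c(mean_mpg = mean(mtcars$mpg),    sd_mpg = sd(mtcars$mpg)),
      train = c(mean_mpg = mean(train_set$mpg), sd_mpg = sd(train_set$mpg)))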
i <- 1
while (i <= 4) {
  # 70% of the observations go into each training set
  train_number <- round(0.70 * nrow(mtcars), digits = 0)
  # A different seed for each split
  set.seed(i)
  train_index <- sample(seq_len(nrow(mtcars)), size = train_number)
  # Store the result as train_1, train_2, train_3, train_4
  assign(paste('train', i, sep = "_"), mtcars[train_index, ])
  i <- i + 1
}
The while loop above creates four different training sets. Let's plot them to see whether they look the same.
library(ggplot2)    # for ggplot()
library(gridExtra)  # for grid.arrange()
# theme_moma() is a custom plot theme defined outside this snippet
gg_train_1 <- ggplot(train_1, aes(x = mpg, y = disp)) + geom_point() + theme_moma() + ggtitle("Train 1")
gg_train_2 <- ggplot(train_2, aes(x = mpg, y = disp)) + geom_point() + theme_moma() + ggtitle("Train 2")
gg_train_3 <- ggplot(train_3, aes(x = mpg, y = disp)) + geom_point() + theme_moma() + ggtitle("Train 3")
gg_train_4 <- ggplot(train_4, aes(x = mpg, y = disp)) + geom_point() + theme_moma() + ggtitle("Train 4")
grid.arrange(gg_train_1, gg_train_2, gg_train_3, gg_train_4, nrow = 2, ncol = 2)
They are different.
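The same point can be made numerically (this comparison is an addition to the post): the four training sets disagree even on a statistic as simple as the mean of mpg.

# Added check: mean mpg of each training set versus the full-data mean (about 20.1)
sapply(list(train_1, train_2, train_3, train_4), function(d) mean(d$mpg))
mean(mtcars$mpg)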