Random Selection
This method is the simplest: all we have to do is randomly split the data into a training set and a validation set. Let's say we want to assign 80% of the data to the training set and the remaining 20% to the validation set.
# Set the number of observations in the training set
train_number <- round(0.80 * nrow(mtcars), digits = 0)
# Set a seed for reproducible results
set.seed(999)
# The integers in 'train_index' index the rows that will go into the training set
train_index <- sample(seq_len(nrow(mtcars)), size = train_number)
The third command randomly draws 26 integers (80% of 32, rounded) between 1 and the number of rows in the dataset (32). We then use train_index to split the dataset into training and validation sets.
# Rows whose index is in train_index go to the training set; the rest go to validation
train_set <- mtcars[train_index, ]
validation_set <- mtcars[-train_index, ]
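As a quick sanity check (not part of the original post), the two sets should partition the 32 rows of mtcars without overlap: 26 observations for training and the remaining 6 for validation.

# Sanity check (added): the split should cover all 32 rows with no overlap
nrow(train_set)        # 26
nrow(validation_set)   # 6
length(intersect(rownames(train_set), rownames(validation_set)))  # 0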
Next, we train our models on train_set and see how well they perform on validation_set.
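The post does not fit a specific model at this point, so the sketch below simply assumes a linear regression of mpg on wt and hp and scores it with RMSE on the validation set; it is only meant to illustrate the train-then-validate step.

# Illustrative sketch (model choice assumed, not taken from the post):
# fit on the training set, then evaluate on the validation set
model <- lm(mpg ~ wt + hp, data = train_set)
predictions <- predict(model, newdata = validation_set)
rmse <- sqrt(mean((validation_set$mpg - predictions)^2))
rmse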
The advantage of this method is its ease of use: it takes only five lines of code to create a training set and a validation set. But that simplicity comes with an issue: since we split the data randomly, how do we know whether the training set represents the population well?
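One rough way to probe that question (an added check, not from the original post) is to compare a few summary statistics of the training set against the full dataset; a large gap would suggest the random split produced an unrepresentative subset.

# Added check: compare the training set to the full data on mean and sd of mpg
rbind(full  = c(mean_mpg = mean(mtcars$mpg),    sd_mpg = sd(mtcars$mpg)),
      train = c(mean_mpg = mean(train_set$mpg), sd_mpg = sd(train_set$mpg)))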
i <- 1
while (i <= 4) {
  # 70% of the observations go into each training set
  train_number <- round(0.70 * nrow(mtcars), digits = 0)
  # A different seed for each split
  set.seed(i)
  train_index <- sample(seq_len(nrow(mtcars)), size = train_number)
  # Store the result as train_1, train_2, train_3, train_4
  assign(paste('train', i, sep = "_"), mtcars[train_index, ])
  i <- i + 1
}
The while loop above creates four different training sets. Let's plot them to see whether they look the same.
library(ggplot2)    # for ggplot()
library(gridExtra)  # for grid.arrange()
# theme_moma() is a custom plot theme defined outside this snippet
gg_train_1 <- ggplot(train_1, aes(x = mpg, y = disp)) + geom_point() + theme_moma() + ggtitle("Train 1")
gg_train_2 <- ggplot(train_2, aes(x = mpg, y = disp)) + geom_point() + theme_moma() + ggtitle("Train 2")
gg_train_3 <- ggplot(train_3, aes(x = mpg, y = disp)) + geom_point() + theme_moma() + ggtitle("Train 3")
gg_train_4 <- ggplot(train_4, aes(x = mpg, y = disp)) + geom_point() + theme_moma() + ggtitle("Train 4")
grid.arrange(gg_train_1, gg_train_2, gg_train_3, gg_train_4, nrow = 2, ncol = 2)
They are different.
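The same point can be made numerically (this comparison is an addition to the post): the four training sets disagree even on a statistic as simple as the mean of mpg.

# Added check: mean mpg of each training set versus the full-data mean (about 20.1)
sapply(list(train_1, train_2, train_3, train_4), function(d) mean(d$mpg))
mean(mtcars$mpg)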