R - Feature Selection - Model selection with Direct validation (Validation Set or Cross validation)


About

Feature selection in R through model generation, followed by model selection with a direct validation approach: a validation set or cross-validation.

Method

Validation Set

We pick a subset of the observations, set it aside as a validation set, and use the remaining observations as the training set.

  • Let's make a training set and a validation set
# For reproducibility, set the random number seed.
set.seed(1)
sampleSize=dim(myDataFrame)[1]
# Two thirds of the observations go to the training set (floor to get an integer size)
trainingSetSize=floor(2/3*sampleSize)
# seq(sampleSize) creates the integers 1, 2, 3, ... up to sampleSize;
# trainingSetSize of them are drawn at random, without replacement.
trainingIndex=sample(seq(sampleSize),trainingSetSize,replace=FALSE)
  • Generate all models (here with the forward stepwise method)
library(leaps)
numberOfModel=dim(myDataFrame)[2]-1 # Number of candidate model sizes: all columns minus the response variable
trainingData=myDataFrame[trainingIndex,]
models=regsubsets(Response~.,data=trainingData,nvmax=numberOfModel,method="forward")
  • Calculate the test error of each model on the validation set
# Initialize a vector to contain all test errors
testErrors=rep(NA,numberOfModel)
# Validation data set (test data) as a matrix, in order to compute the predictions of each model manually
testData=myDataFrame[-trainingIndex,] # All observations that are not in the training set
# model.matrix builds the design matrix from the formula used to fit the models
# (Response~. means: model Response as a function of all the other variables in the data frame)
testDataMatrix=model.matrix(Response~.,data=testData)
# For each model, calculate the prediction and the test errors
for(i in 1:numberOfModel){
  # Get the coefficient of the model i
  coefi=coef(models,id=i)
  # Calculate the prediction
  # To get the right elements of testDataMatrix, index its columns by the names on the coefficient vector:
  # that is the subset of columns of testDataMatrix corresponding to the variables in the coefficient vector.
  pred=testDataMatrix[,names(coefi)]%*%coefi
  # Calculate the mean squared error on the validation set
  testErrors[i]=mean((myDataFrame$Response[-trainingIndex]-pred)^2)
}
  • Plot the data
# Plot the (validation|test) errors: the root mean squared error
plot(sqrt(testErrors),ylab="Root MSE",ylim=c(300,400),pch=19,type="b")
# Plot the training root MSE derived from the RSS
points(sqrt(models$rss[-1]/nrow(trainingData)),col="blue",pch=19,type="b") # [-1] to skip the null model
# Legend
legend("topright",legend=c("Training","Validation"),col=c("blue","black"),pch=19)

(Figure: R validation-set model selection, validation and training root MSE by model size)

  • The validation error is a little noisy: the curve is slightly jumpy.
  • The training root MSE is monotone decreasing, as it has to be: at each step, forward stepwise includes the variable that improves the fit the most, so by definition the residual sum of squares on the training data can only go down.

The plotting character pch=19 gives a solid dot, which is often nice to visualize on the screen.
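
To finish the selection with the validation-set approach, a minimal sketch (assuming the objects defined above): take the model size with the smallest validation error, then refit on the full data set so that the final coefficients are estimated from all the observations.

bestSize=which.min(testErrors) # Model size with the smallest validation error
# Refit the forward stepwise models on the full data set
bestModels=regsubsets(Response~.,data=myDataFrame,nvmax=numberOfModel,method="forward")
# Coefficients of the selected model
coef(bestModels,id=bestSize)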

Cross-Validation

With K=10 (10-fold cross-validation):

  • Create a vector of fold numbers from 1 to K, one for each row of the data. Each observation is assigned a fold at random.
set.seed(1)
K=10
# rep(1:K,length=...) repeats 1..K until every row has a fold number; sample() shuffles the assignment
folds=sample(rep(1:K,length=nrow(myDataFrame)))
folds
  [1]  3  5  9  7  6  9  5  1  3  3  2  4  1  7  5  2  3  5  3  3  7  6  7  4  6  9  1  3  3  3
 [31]  6  7  8  6  9  1  3  3 10  8  7  8  3  3  8  6  6  4  3  5  6  1 10  7  7  1  9  4  5  9
 [61]  5  6  1  8  3  9  1  1  2  6  6  7  7  8  2  2  5 10  2  6  9  1  9  3  8  5 10  2  1  2
 [91]  5  7  1  8  8  2  4  3  2  6  8  2  3  8  8  5  1  2  8  4  5  7  6  9  9  4  5  2  7  4
[121]  6  5  6  6  9  9  7  6 10  1  3  2  8  4  1  9  5  4  9  8  2  6 10 10  7  2  9  4  5  1
[151]  3  4  8  3 10  1  9  2  1  8 10  1  7  7  9  4  5 10  4  2  1  3 10  4  4  9 10  7  4  4
[181] 10  3  3  3  5  5  2 10 10  4  9  9  3  2 10  3  8  6  5  2  5  2  2  9  8  6  8  4  6  6
[211]  7  5  2  4  6  6 10  8  8  1  7  4 10  8 10  9  4 10  8  1 10 10  1  4  7  6  1  5  5 10
[241]  5  7 10  9  1 10  2  2  8  7  1  9  7  7  1  4  8  7  9  2  4 10  5

  • The table function gives the distribution of the fold assignments. It's pretty balanced: there are either 26 or 27 observations in each fold.
table(folds)
1  2  3  4  5  6  7  8  9 10 
27 27 27 26 26 26 26 26 26 26 
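
As a side note, the rep-then-sample construction above is what guarantees this balance. A comparison sketch: drawing each fold label independently, as below, does not guarantee balanced fold sizes.

# Naive alternative: each fold label drawn independently, so fold sizes can vary widely
table(sample(1:K,nrow(myDataFrame),replace=TRUE))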

  • Loop over the folds: fit the models on the K-1 training folds and compute the errors on the held-out fold
numberOfVariables=ncol(myDataFrame)-1 # Minus one for the response variable
# errorsMatrix saves the error for each fold (row) and each model size (column)
errorsMatrix=matrix(NA,K,numberOfVariables)
# Loop through the folds
for(k in 1:K){
  # Print the progression
  cat("Fold",k," Forward Stepwise Generation \n")
  # The training index (which observations will be used to train the models)
  trainingIndex=(folds!=k)
  # Generate the models with the forward method (one for each size up to numberOfVariables)
  subsetModels=regsubsets(Response~.,data=myDataFrame[trainingIndex,],nvmax=numberOfVariables,method="forward")
  # For each subset model (here: numberOfVariables of them), calculate the predictions and the errors
  for(i in 1:numberOfVariables){
    testIndex=(folds==k) # the held-out fold, equivalently !trainingIndex
    # Predictions with the function in the annex (see below)
    pred=predict(subsetModels,myDataFrame[testIndex,],id=i)
    # compute and save the mean squared error of the predictions
    errorsMatrix[k,i]=mean((myDataFrame$Response[testIndex]-pred)^2)
  }
}
  • Plot
# Root mean squared error:
# apply the mean function over the columns of errorsMatrix (2 indicates columns), then take the square root
rmse=sqrt(apply(errorsMatrix,2,mean))
# Show the ten-fold cross-validation curve
plot(rmse,pch=19,type="b")

This curve is a little smoother than the validation-set curve because each point is an error averaged over the full data set: every observation shows up in exactly one test fold.
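
As with the validation set, a minimal sketch (assuming the objects above) to complete the selection: pick the model size with the smallest cross-validated RMSE and refit on the full data set.

bestSize=which.min(rmse) # Model size with the smallest cross-validated error
# Refit the forward stepwise models on all the observations
finalModels=regsubsets(Response~.,data=myDataFrame,nvmax=numberOfVariables,method="forward")
# Coefficients of the selected model
coef(finalModels,id=bestSize)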

Annex

Predict Method for regsubsets

# Object = regsubset object
# newdata = data frame
# id = model id
predict.regsubsets=function(object,newdata,id,...){
  # Every fitted object keeps a component named call with its creation information; the second element of the call is the model formula.
  form=as.formula(object$call[[2]])
  mat=model.matrix(form,newdata)
  coefi=coef(object,id=id)
  mat[,names(coefi)]%*%coefi
}
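
Because the function is named predict.regsubsets, R's S3 dispatch picks it up whenever predict is called on a regsubsets object, which is what the cross-validation loop above relies on. A small usage sketch (the model size 5 is just an illustrative value):

# Predictions on the validation data from the size-5 model of the validation-set section
pred=predict(models,newdata=testData,id=5)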




