3 Training, validation, testing
Why split the data:
- To prevent overfitting.
- To test the generalization capacity of a model on new data.

A classic scheme, illustrated in the sketch below:
- Training set (70%): to fit the model.
- Validation set (20%): to tune the hyperparameters.
- Test set (10%): to evaluate final performance.
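A minimal sketch of this 70/20/10 split, assuming scikit-learn is available; the data `X`, `y`, the random seeds, and the sample size are placeholders chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # hypothetical features
y = rng.integers(0, 2, size=1000)     # hypothetical binary labels

# First carve off the 70% training set...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.7, random_state=0)
# ...then split the remaining 30% into validation (20%) and test (10%),
# i.e. 2/3 vs 1/3 of what is left.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=2/3, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 700 200 100
```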
3.1 Cross-validation
Cross-validation is a resampling-based method for estimating the reliability of a model.
3.1.1 K-fold cross-validation
The original sample is divided into K blocks of (roughly) equal size; in turn, each block serves as the validation set while the other \(K - 1\) blocks constitute the training set.
- Split the data into K sub-samples of the same size.
- For \(k = 1, \dots, K\):
  - estimate the rule on the data excluding block \(k\) (the other \(K - 1\) blocks);
  - predict the data of block \(k\) with this rule.
- Compute the performance criterion on the resulting predictions (see the sketch below).
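A minimal sketch of this procedure, assuming scikit-learn; the logistic regression model and the accuracy criterion are illustrative choices, not prescribed by the notes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))     # hypothetical features
y = rng.integers(0, 2, size=200)  # hypothetical binary labels

K = 5
scores = []
for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])   # estimate the rule on the data excluding block k
    y_pred = model.predict(X[val_idx])      # predict the data of block k with this rule
    scores.append(accuracy_score(y[val_idx], y_pred))

print(np.mean(scores))  # performance criterion averaged over the K folds
```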
3.1.2 Leave-one-out cross-validation (LOOCV)
Leave-one-out cross-validation is the special case of K-fold cross-validation with \(K = n\): at each iteration, the model is trained on the \(n - 1\) other observations and validated on the single remaining one.
- For \(i = 1, \dots, n\):
  - estimate the rule on the data without the \(i^{\text{th}}\) observation;
  - predict observation \(i\) with this rule.
- Compute the performance criterion on these \(n\) predictions.
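A minimal sketch of LOOCV, again assuming scikit-learn (which provides a dedicated LeaveOneOut splitter); the model and the error-rate criterion are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))     # hypothetical features
y = rng.integers(0, 2, size=50)  # hypothetical binary labels

preds = np.empty(len(y))
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])         # fit on the n - 1 other observations
    preds[test_idx] = model.predict(X[test_idx])  # predict the held-out observation i

print(np.mean(preds != y))  # performance criterion (error rate) on these n predictions
```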