Lecture Warning

Hi All,

I didn’t quite understand the warning paragraph at the end of the Manual Tuning lecture. I thought that when we used cross_validate, the data was split multiple times into train and test sets, and the model was trained on the train data and tested on the test data. So why does it say that we need to apply the selected model to new data? Isn’t that what the test data is for?

There’s obviously something I didn’t get, so I’d be grateful to anyone who could clarify. Here’s the warning in question:

Warning

When we evaluate a family of models on test data and pick the best performer, we can not trust the corresponding prediction accuracy, and we need to apply the selected model to new data. Indeed, the test data has been used to select the model, and it is thus no longer independent from this model.

Thanks,
Olga

I will simplify the procedure slightly and come back to cross-validation in a minute.

Let’s suppose we have two sets of data: one to train the model and one to evaluate it. Let’s suppose your model is a K-nearest-neighbors classifier, where K is the hyperparameter. Tuning the hyperparameter amounts to finding the best K for the given dataset.

So one procedure to tune K is to train the KNN on the training set, evaluate it on the test set, repeat for different values of K, and select the one that performs best.
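For illustration, here is a minimal sketch of that loop. The iris dataset and the candidate values of K are just placeholders standing in for your own data and grid:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

best_k, best_score = None, -1.0
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_test, y_test)  # accuracy on the test set
    if score > best_score:
        best_k, best_score = k, score

# This "best" score was used to pick K, hence the warning below.
print(f"best K = {best_k}, best test accuracy = {best_score:.3f}")
```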

The warning is given regarding the interpretation of the best score obtained. We highlighted in the previous notebook that when we evaluate a model, we want to use a completely independent test set that has not been used at any point in the training procedure. Here, the fact that we used the best score (and thus the test set) to decide the value of the hyperparameter K violates this assumption. Thus, we cannot use the best score reported during the hyperparameter search as a generalization score on future data.

Indeed, we need a new test set on which to evaluate the model with the tuned K, in order to get an unbiased estimate of the generalization performance of the tuned model.
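A sketch of the same idea with three splits: K is selected on a validation set, and a separate held-out test set, never touched during tuning, gives the final estimate. Again, the dataset and the candidate values of K are only placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# First carve out the final test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, random_state=0)

# Pick K using the validation set only.
best_k = max(
    [1, 3, 5, 7, 9],
    key=lambda k: KNeighborsClassifier(n_neighbors=k)
    .fit(X_train, y_train)
    .score(X_val, y_val),
)

# Report the score on data that was never used for tuning.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(f"K = {best_k}, generalization estimate = {final_model.score(X_test, y_test):.3f}")
```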

Regarding cross-validation, it is used instead of a single split to get an idea of the score distribution and of the effect of data randomization.
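For example, something like the following (again with placeholder data) returns one score per split, so you can look at the mean and the spread rather than a single number:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
cv_results = cross_validate(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
scores = cv_results["test_score"]  # one accuracy score per fold
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```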

Does that help?

It helps a lot, thank you very much for the answer!

So what we should do, after selecting the best hyperparameters through this iterative process, is to take a completely new set of data and test our final model on it. Then the accuracy scores will be valid. Is that correct?

Yes indeed. You will see in the next notebook that scikit-learn can perform such an evaluation by passing a search estimator (such as GridSearchCV) into scikit-learn’s cross-validation function. This results in a two-level (nested) cross-validation that splits the dataset adequately.
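For reference, a minimal sketch of what that looks like (the dataset and the parameter grid are placeholders): the inner GridSearchCV tunes K, and the outer cross_validate evaluates the resulting tuned model on splits that the search never saw.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Inner loop: selects the best K on each training portion.
inner_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
)

# Outer loop: evaluates the tuned model on held-out splits.
outer_results = cross_validate(inner_search, X, y, cv=5)
print(outer_results["test_score"])
```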
