Nested cross-validation is introduced in the notebook *Cross-validation and hyperparameter tuning* with the following passage:
> As mentioned earlier, using a single train-test split during the grid-search does not give any information regarding the different sources of variations: variations in terms of test score or hyperparameters values. To get reliable information, the hyperparameters search need to be nested within a cross-validation.
I think it could be a good idea to spend a bit more time explaining the rationale behind nested cross-validation. If I understand correctly, `GridSearchCV` by default performs a 5-fold cross-validation for each of the grid points (each combination of hyperparameters). Thus, for each such combination, we already get an idea of how the performance of the model varies due to variations in the train/test split. I'm not sure why we then need the outer cross-validation loop: if the hyperparameters were highly sensitive to the train/test split, `GridSearchCV` should already give us an indication of that. For example, suppose for simplicity that I have a 2x2 grid of hyperparameters; if I don't have hyperparameter stability, I should get something like this:
- combination #1: accuracy 0.790 +/- 0.2
- combination #2: accuracy 0.810 +/- 0.2
- combination #3: accuracy 0.815 +/- 0.4
- combination #4: accuracy 0.820 +/- 0.2
There are hyperparameter combinations which, on average, look better than others, but there is such a huge variability in the results due to the train/test split that we cannot really tell them apart (in particular, combination #3, with its +/- 0.4, seems mostly unstable). Why do we also need an outer CV loop? At least one paper has investigated the possibility that nested CV might be "too much": I'm not saying they're right, but at least my question isn't a stupid one.
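
To make concrete what I mean, here is roughly the kind of inspection I have in mind (a minimal sketch; the dataset, estimator and the 2x2 grid are made up for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy dataset, just for illustration
X, y = make_classification(n_samples=500, random_state=0)

# made-up 2x2 grid of hyperparameters
param_grid = {"C": [1, 10], "gamma": [0.01, 0.1]}

# GridSearchCV runs a 5-fold cross-validation by default (cv=5)
search = GridSearchCV(SVC(), param_grid)
search.fit(X, y)

# one mean +/- std of the test score per hyperparameter combination,
# computed over the 5 inner splits
for params, mean, std in zip(
    search.cv_results_["params"],
    search.cv_results_["mean_test_score"],
    search.cv_results_["std_test_score"],
):
    print(params, f"{mean:.3f} +/- {std:.3f}")
```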
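
And, if I read the notebook correctly, nested cross-validation just wraps that whole search in an outer loop, roughly (continuing from the sketch above):

```python
from sklearn.model_selection import cross_val_score

# outer 5-fold loop: the whole grid-search (with its own inner 5-fold CV)
# is re-fitted from scratch on each outer training fold
outer_scores = cross_val_score(search, X, y, cv=5)
print(f"{outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

My question is essentially what these outer scores tell us that the per-combination means and stds above do not already show.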