The rationale behind nested cross-validation

Nested cross-validation is introduced in the notebook Cross-validation and hyperparameter tuning with the following sentence:

As mentioned earlier, using a single train-test split during the grid-search does not give any information regarding the different sources of variation: variations in terms of test score or hyperparameter values.

To get reliable information, the hyperparameter search needs to be nested within a cross-validation.

I think it could be a good idea to spend a bit more time explaining the rationale behind nested cross-validation. If I understand correctly, GridSearchCV by default performs a 5-fold cross-validation for each grid point (each combination of hyperparameters). Thus, for each combination, we already get an idea of how the performance of the model varies with the train/test split. I’m not sure why we need the outer cross-validation loop then: if the hyperparameters were highly sensitive to the train/test split, GridSearchCV should already give an indication of that. For example, suppose for simplicity that I have a 2x2 grid of hyperparameters; if the hyperparameters are not stable, I should get something like this:

  1. accuracy .790 +/- 0.2
  2. accuracy .810 +/- 0.2
  3. accuracy .815 +/- 0.4
  4. accuracy .820 +/- 0.2

There are hyperparameter combinations which, on average, look better than others, but there is so much variability in the results due to the train/test split that we cannot really tell (in particular, combination #3 seems quite unstable). Why do we also need an outer CV loop? At least one paper investigated the possibility that nested CV might be “too much”: I’m not saying they’re right, but at least mine isn’t a stupid question :slightly_smiling_face:
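
To make what I mean concrete, here is a minimal sketch (not my actual notebook code; the SVC grid values are just placeholders) of the per-combination variability that GridSearchCV already reports across its internal folds:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # a 2x2 grid of hyperparameters, as in the toy example above
    p_grid = {"C": [1, 10], "gamma": [0.01, 0.1]}

    grid = GridSearchCV(SVC(kernel="rbf"), param_grid=p_grid, cv=5)
    grid.fit(X, y)

    # for each combination, GridSearchCV stores the mean and the standard
    # deviation of the test score across the 5 internal folds
    for params, mean, std in zip(
        grid.cv_results_["params"],
        grid.cv_results_["mean_test_score"],
        grid.cv_results_["std_test_score"],
    ):
        print(f"{params}: {mean:.3f} +/- {std:.3f}")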

The inner grid search is indeed enough to select a model. However, the inner CV does not give you a good estimate of the generalization performance of the selected model: the same scores are used both to pick the hyperparameters and to evaluate them, so best_score_ is optimistically biased. The outer CV is used to get a better estimate, on data that was never involved in the selection.

We have a simple example in scikit-learn, but it is not the most compelling example (the classification problem is too easy, I think): Nested versus non-nested cross-validation — scikit-learn 0.24.2 documentation
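
Schematically, the pattern in that example looks like this (a minimal sketch, not the exact code of the documentation page):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

    inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
    outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

    # inner loop: hyperparameter selection
    clf = GridSearchCV(SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)

    # outer loop: each outer test fold only evaluates a model whose
    # hyperparameters were selected without ever seeing that fold
    nested_scores = cross_val_score(clf, X, y, cv=outer_cv)
    print(nested_scores.mean(), nested_scores.std())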

Thanks for the answer, but the code in the link you shared actually increased my confusion :woozy_face: I’m starting to think there’s a bug in scikit-learn! I put my code, adapted from the example you linked, in the sandbox notebook. Let me know if you can read the code there; otherwise I’ll copy & paste it here.

The main issue I have is that inner_cv and outer_cv generate exactly the same index sets; this is obvious from how they are constructed:

    inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)

Now look at the following code:

    # Non-nested parameter search and scoring
    clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
    clf.fit(X_iris, y_iris)
    non_nested_scores[i] = clf.best_score_

    # Nested CV with parameter optimization
    nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
    nested_scores[i] = nested_score.mean()

GridSearchCV and cross_val_score are using exactly the same splits (inner_cv and outer_cv, which are identical) and the same best hyperparameters to compute clf.best_score_ and nested_score.mean(). However, the results are not the same. Why? In my sandbox notebook, for each of the 30 trials, I extracted from clf.cv_results_ the 4 test scores corresponding to the best hyperparameters of that trial and compared them to the test scores computed by cross_val_score. As you can see, they are either equal or differ for just one split, which leads me to suspect there might be a bug. Can you help me understand?
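
For reference, this is roughly how I extract those scores (a minimal sketch, assuming clf and the CV objects are the ones from the snippet above; the variable names in my sandbox notebook may differ):

    import numpy as np

    # per-split test scores of the best hyperparameter combination,
    # taken from the fitted GridSearchCV
    n_splits = inner_cv.get_n_splits()
    best_test_scores = np.array([
        clf.cv_results_[f"split{k}_test_score"][clf.best_index_]
        for k in range(n_splits)
    ])

    # these are the scores I compare with the ones returned by
    # cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)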

PS: to double-check that I correctly recovered the test scores corresponding to the best hyperparameters, I checked that their average is indeed equal to clf.best_score_:

    # check that the average of the test scores I recovered is equal to clf.best_score_
    assert best_test_scores.mean() == clf.best_score_

As you can verify yourself, the assertion holds for all 30 trials.

@glemaitre58 hi! Any comments on my code above?

They are defined with the same random state, but they will not produce the same indices here. The indices are generated for the specific X and y passed to split, and these are not the same for the inner and the outer CV (the inner CV only splits the training part of each outer split), so the indices differ.

The first snippet (the non-nested search) is equivalent to running a cross-validation on the full dataset, storing the test score of each candidate, and picking the best one.

In the second snippet, cross_val_score first splits the dataset using the outer_cv object, and clf then calls fit on the training set provided by cross_val_score, not on the full dataset. It is therefore different from the snippet above.
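
A small sketch to illustrate (not your exact code): the same KFold object generates different folds depending on the number of samples it is asked to split, so the inner splits made inside cross_val_score cannot match the ones made on the full dataset:

    import numpy as np
    from sklearn.model_selection import KFold

    cv = KFold(n_splits=4, shuffle=True, random_state=0)

    X_full = np.arange(150)          # the full dataset (150 samples, like iris)
    X_outer_train = np.arange(112)   # an outer training fold (~3/4 of the samples)

    # first test fold generated in each case
    test_full = next(iter(cv.split(X_full)))[1]
    test_outer_train = next(iter(cv.split(X_outer_train)))[1]

    # 38 vs 28 samples: the folds, and thus the indices, are different
    print(len(test_full), len(test_outer_train))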