Nested cross-validation with same random_state

RandomVariable · 2 July 2021 14:32

Why are inner_cv and outer_cv defined using the same random_state in KFold? Wouldn’t that result in the same folds?

RandomVariable · 2 July 2021 15:10

In fact, if I change outer_cv to inner_cv in the line
test_score = cross_val_score(model, data, target, cv=outer_cv, n_jobs=2)
in Nested cross-validation — Scikit-learn course
I get exactly the same results.

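Here are some tests:

from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# model_to_tune, param_grid, data and target are defined as in the course notebook
test_score_not_nested = []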
test_score_nested = []
test_score_nested_same_inner = []
test_score_nested_fixed = []
test_score_nested_best = []
test_score_nested_best_fixed = []

N_TRIALS = 20
for i in range(N_TRIALS):
    inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)    
    
    # Non_nested parameter search and scoring
    model = GridSearchCV(estimator=model_to_tune, param_grid=param_grid,
                         cv=inner_cv, n_jobs=2)
    model.fit(data, target)
    test_score_not_nested.append(model.best_score_)

    # Nested CV with parameter optimization
    test_score = cross_val_score(model, data, target, cv=outer_cv, n_jobs=2)
    test_score_nested.append(test_score.mean())
    
    ## Custom tests ##
    # Reuse inner_cv
    test_score = cross_val_score(model, data, target, cv=inner_cv, n_jobs=2)
    test_score_nested_same_inner.append(test_score.mean())
    
    # Fix: Use different folds
    outer_cv_fixed = KFold(n_splits=4, shuffle=True, random_state=i+N_TRIALS)
    test_score = cross_val_score(model, data, target, cv=outer_cv_fixed, n_jobs=2)
    test_score_nested_fixed.append(test_score.mean())
    
    # Eval best_estimator_ 
    test_score = cross_val_score(model.best_estimator_, data, target, cv=outer_cv, n_jobs=2)
    test_score_nested_best.append(test_score.mean())
    
    # Eval best_estimator_ with fix
    test_score = cross_val_score(model.best_estimator_, data, target, cv=outer_cv_fixed, n_jobs=2)
    test_score_nested_best_fixed.append(test_score.mean())


import pandas as pd

all_scores = {
    "Not nested CV": test_score_not_nested,
    "Nested CV": test_score_nested,
    "Nested CV - inner = outer": test_score_nested_same_inner,
    "Nested CV - fixed": test_score_nested_fixed,
    "Nested CV - best_estimator": test_score_nested_best,
    "Nested CV - best_estimator & fixed": test_score_nested_best_fixed,
}
all_scores = pd.DataFrame(all_scores)

Here is what I get.

I have another question: why are the results of cross_val_score on model not equal to those on model.best_estimator_, but worse?

glemaitre58 · 5 July 2021 08:29

No, because the input data will be different. The outer CV is given the entire dataset, while the inner CV is given only the training set produced by each outer split. Since the inputs differ, the sampled folds differ as well, even with the same random state.
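
To make this concrete, here is a minimal sketch on hypothetical toy data (20 samples, not the course dataset). It assumes only numpy and scikit-learn's KFold. It checks that two KFold objects with the same random_state give identical folds when they split the same array, and then shows that in the nested setting the inner splitter only ever sees the outer training subset, so its test rows cannot coincide with the outer test rows.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 20 samples, one feature.
X = np.arange(20).reshape(-1, 1)

outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)

# Same data and same random_state: both splitters yield identical folds.
# This is why swapping outer_cv for inner_cv in cross_val_score changes nothing.
same_folds = all(
    np.array_equal(test_a, test_b)
    for (_, test_a), (_, test_b) in zip(outer_cv.split(X), inner_cv.split(X))
)

# Nested setting: the inner CV only receives the outer training subset.
outer_train_idx, outer_test_idx = next(outer_cv.split(X))
X_train = X[outer_train_idx]
inner_train_idx, inner_test_idx = next(inner_cv.split(X_train))

# Inner indices are positions within X_train; map them back to rows of X.
inner_test_rows = outer_train_idx[inner_test_idx]

print("identical folds on the same data:", same_folds)
print("outer test rows:", outer_test_idx)
print("inner test rows:", inner_test_rows)
```

In the first outer split the inner CV is shuffling 15 samples rather than 20, so the same seed produces a different partition, and the inner test rows are drawn only from the outer training rows.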