Score from final test set vs CV score

Hi,
This is related to the following claim from this notebook:
“The score measured on the final test set (after train_test_split) is almost within the range of the internal CV score for the best hyper-parameter combination. This is reassuring as it means that the tuning procedure did not cause significant overfitting in itself (otherwise the final test score would have been lower than the internal CV scores).”
I have applied the same procedure to another dataset, using MAE as the scoring metric, and I am doubly confused by the results. First, the MAE on the final test set is about half the internal CV MAE for the best hyper-parameter combination (therefore better). I would have expected the opposite. Second, the MAE I get from model_grid_search.score(data_test, target_test) is quite different from the MAEs in model_grid_search.cv_results_ (it matches neither their mean nor any individual fold score). What does this mean? Thanks for your help.
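For concreteness, my setup looks roughly like the sketch below (the dataset, estimator and parameter grid are placeholders, not my actual code):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data and single train-test split
data, target = make_regression(n_samples=1000, noise=10, random_state=0)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=0
)

model_grid_search = GridSearchCV(
    HistGradientBoostingRegressor(),
    param_grid={"learning_rate": [0.05, 0.1, 0.5], "max_leaf_nodes": [15, 31]},
    scoring="neg_mean_absolute_error",  # MAE, negated so that greater is better
    cv=5,
)
model_grid_search.fit(data_train, target_train)

# Internal CV MAE for the best hyper-parameter combination (sign flipped back)
print("internal CV MAE:", -model_grid_search.best_score_)
# MAE of the refit model on the held-out test set
print("final test MAE:", -model_grid_search.score(data_test, target_test))
```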
Stéphane

You should also consider the size of your dataset. If you are working with a large dataset, then since the model is refit on a larger train set, the error can indeed decrease. This is linked to what is presented with the learning curve: more samples lead to a lower error, until a plateau is reached.
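As an illustration only (with a placeholder dataset and estimator), the learning-curve effect looks like this:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import learning_curve

data, target = make_regression(n_samples=2000, noise=10, random_state=0)

# Validation MAE for increasing training-set sizes: the error usually
# decreases with more samples until it reaches a plateau.
train_sizes, _, valid_scores = learning_curve(
    HistGradientBoostingRegressor(),
    data,
    target,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_mean_absolute_error",
    cv=5,
)
for size, scores in zip(train_sizes, valid_scores):
    print(f"{size:5d} training samples -> validation MAE {-scores.mean():.2f}")
```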

Indeed, it might not only be the only cause but it is the one that I can think of right now.

It is indeed the same thing as I explained in the paragraph above. The scores in cv_results_ are computed by training the model on the train portion of each fold and keeping the remaining portion for validation. When calling score, you are evaluating a single model refit on the full training set (internal training + validation together), so it was trained on more data points. I would therefore expect the error from score to potentially be smaller than the ones reported in cv_results_.
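To make the distinction concrete, here is a small self-contained sketch; the dataset, estimator and parameter grid are placeholders, not the code from your actual experiment:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

data, target = make_regression(n_samples=1000, noise=10, random_state=0)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=0
)

model_grid_search = GridSearchCV(
    HistGradientBoostingRegressor(),
    param_grid={"learning_rate": [0.05, 0.1, 0.5]},
    scoring="neg_mean_absolute_error",
    cv=5,
).fit(data_train, target_train)

# Each "splitX_test_score" entry in cv_results_ comes from a model trained on
# only 4/5 of the training set and validated on the remaining 1/5.
cv_results = pd.DataFrame(model_grid_search.cv_results_)
fold_columns = [
    c for c in cv_results.columns
    if c.startswith("split") and c.endswith("_test_score")
]
fold_maes = -cv_results.loc[model_grid_search.best_index_, fold_columns]
print("internal CV MAEs per fold:", fold_maes.round(2).to_list())

# score() evaluates a single model refit on the full training set (internal
# training + validation folds together), hence the possibly lower error.
print("final test MAE:", round(-model_grid_search.score(data_test, target_test), 2))
```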

Thanks. I think I got it.
When using a single train-test split and a 5-fold cross-validation strategy, the final model is fitted on the concatenation of the train (16/25 of the dataset) and validation (4/25) samples and evaluated on the test samples (1/5). So the total size of the concatenated train set is 20/25 = 4/5.
When using a grid-search strategy without the train-test split, the model is fitted on 4/5 of the dataset.
In conclusion, the model is trained on the same number of data points with the initial train-test split (80%) as without it (80%). Therefore, we could expect the same error with the initial train-test split as without it. Am I right? However, I get an MAE that is half as large with the train-test split. Besides, using nested CV, I get errors compatible with what I have without the single train-test split. So my conclusion is that the single train-test split happened to pick a particular configuration of train and test sets that gives a small error which is not representative. Do you think that is reasonable?
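For reference, the kind of checks I mean look roughly like this (with a placeholder dataset, estimator and grid instead of my actual ones):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

data, target = make_regression(n_samples=1000, noise=10, random_state=0)
param_grid = {"learning_rate": [0.05, 0.1, 0.5]}

# 1. The same single split repeated with different random_state values shows
#    how much the final test MAE depends on which samples end up in the test set.
for seed in range(3):
    data_train, data_test, target_train, target_test = train_test_split(
        data, target, test_size=0.2, random_state=seed
    )
    search = GridSearchCV(
        HistGradientBoostingRegressor(), param_grid,
        scoring="neg_mean_absolute_error", cv=5,
    ).fit(data_train, target_train)
    print(f"split seed {seed}: final test MAE = {-search.score(data_test, target_test):.2f}")

# 2. Nested CV: the outer loop replaces the single train-test split, so the
#    reported MAE is averaged over several test sets instead of just one.
inner_search = GridSearchCV(
    HistGradientBoostingRegressor(), param_grid,
    scoring="neg_mean_absolute_error", cv=5,
)
outer_scores = cross_val_score(
    inner_search, data, target, scoring="neg_mean_absolute_error", cv=5
)
print(f"nested CV MAE: {-outer_scores.mean():.2f} +/- {outer_scores.std():.2f}")
```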