Cross-validation with GridSearchCV

Hi,

My question relates to the nesting of cross-validation and hyperparameter tuning.

I had difficulties understanding the difference between the internal cross-validation procedure inside GridSearchCV (defined by cv) and the cross-validation procedure of cross_validate.

Maybe they proceed in the same way but their objectives are different?

If I understand correctly, the internal cross-validation aims at finding a representative mean score for each combination of hyperparameter values and hence, globally, at finding the best combination of values.

cross_validate aims at testing the accuracy of the model on test data and the stability of the best parameters across folds, notably whether the best hyperparameters are the same from one fold to another?

If the combination of hyperparameters changes a bit from one fold to another, how do we finally select the best combination of values?

And if it changes a lot, what should we do?

For instance, with this example, what should we do next?

Best parameter found on fold #1
{'classifier__learning_rate': 0.1, 'classifier__max_leaf_nodes': 40}
Best parameter found on fold #2
{'classifier__learning_rate': 0.1, 'classifier__max_leaf_nodes': 30}
Best parameter found on fold #3
{'classifier__learning_rate': 0.05, 'classifier__max_leaf_nodes': 30}

I’d be glad to have your advice!


Cross-validation is used to check the performance of a model. The grid search adds a new purpose: selecting the hyperparameters that lead to the best performance. Nesting a grid-search cross-validation inside another cross-validation makes sure that the model is always evaluated on testing data that were never used to select the hyperparameters.

By adding the outer cross-validation, we vary the data seen by the inner cross-validation and thus potentially make the selected hyperparameters vary. This lets us evaluate the stability of these hyperparameters.
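To make the nesting concrete, here is a minimal sketch (not your actual code): the dataset, the pipeline, and the grid values are assumptions based on the parameter names shown in your output, with GridSearchCV as the inner cross-validation and cross_validate as the outer one.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# Pipeline with a step named "classifier", matching the
# "classifier__..." parameter names shown in the question.
model = Pipeline([
    ("classifier", HistGradientBoostingClassifier(random_state=0)),
])

param_grid = {
    "classifier__learning_rate": [0.05, 0.1, 1],
    "classifier__max_leaf_nodes": [30, 40],
}

# Inner cross-validation: for each outer training set, GridSearchCV
# selects the combination with the best mean validation score.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(model, param_grid=param_grid, cv=inner_cv)

# Outer cross-validation: evaluates the tuned model on data that were
# never used to select the hyperparameters.
outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)
results = cross_validate(search, X, y, cv=outer_cv, return_estimator=True)

for fold_id, estimator in enumerate(results["estimator"], start=1):
    print(f"Best parameters found on fold #{fold_id}")
    print(estimator.best_params_)
print("Outer test scores:", results["test_score"])

Each fitted GridSearchCV in results["estimator"] corresponds to one outer fold, which is where per-fold best parameters like the ones in your output come from.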

If the hyperparameters vary but the evaluation still gives good results, it means that several model configurations can work. In this case, you can constrain the hyperparameter search to one of the solutions. If the evaluation gives bad results, it means that your pipeline is not working and you probably have to act on the input data or on the design of the machine-learning pipeline. However, I don’t think there is a straightforward way to solve that issue.
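For instance, continuing from the sketch above (the chosen values are only illustrative), constraining the search to one of the retained solutions just means fixing the hyperparameters and re-evaluating the resulting, simpler model:

# Fix the hyperparameters to one of the combinations selected above
# (illustrative values) and evaluate that model directly, without a grid search.
model.set_params(
    classifier__learning_rate=0.1,
    classifier__max_leaf_nodes=30,
)
fixed_results = cross_validate(model, X, y, cv=outer_cv)
print("Test scores with fixed hyperparameters:", fixed_results["test_score"])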

In your example, the values are indeed changing, but only slightly. A huge change would be the learning rate moving from 0.1 to 1, or max_leaf_nodes changing from 40 to 400. A good next step is to repeat the cross-validation using sklearn.model_selection.RepeatedKFold so that you get many folds for each hyperparameter and can analyze the variance.
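Continuing from the same sketch, swapping the outer KFold for RepeatedKFold gives many more outer folds, so you can look at how often each combination is selected and at the spread of the test scores:

from sklearn.model_selection import RepeatedKFold

# Repeat the 3-fold outer cross-validation 10 times with different shuffles.
outer_cv = RepeatedKFold(n_splits=3, n_repeats=10, random_state=0)
results = cross_validate(search, X, y, cv=outer_cv, return_estimator=True)

# Inspect the selected hyperparameters and the score variability.
for estimator in results["estimator"]:
    print(estimator.best_params_)
print("Mean test score:", results["test_score"].mean())
print("Std of test score:", results["test_score"].std())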
