Question 4 on hyperparameter tuning

Identification

In my script, I used the sklearn.model_selection.train_test_split function to perform a 20% split of the data set to form the training set.
If we tune the number of neighbors on the training set after splitting, the difference between 5 and 51 neighbors is very large (~0.65 for 51 vs. ~0.95 for 5). Therefore, this changes the answer to question 4.
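For reference, a minimal sketch of this setup, using a synthetic dataset and assuming a 20% hold-out (test_size=0.2); the actual dataset and preprocessing used in the exercise differ:

```python
# Minimal sketch (synthetic data; the exercise uses its own dataset).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, random_state=0)

# Hold out 20% of the data for evaluation (test_size=0.2 is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

for n_neighbors in (5, 51):
    model = make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=n_neighbors)
    )
    model.fit(X_train, y_train)
    print(n_neighbors, model.score(X_test, y_test))
```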

Question

My question is: when can we skip the splitting part?

In the notebook Evaluation and hyperparameter tuning, we recall that if the intention is not to tune hyperparameters, a simple cross-validation is enough to evaluate/compare the generalization performance of a given set of models. This does not require a further train-test split.
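A sketch of that case (synthetic data and illustrative models, not the exercise's setup): fixed models are evaluated with plain cross-validation and no extra split.

```python
# Minimal sketch: comparing fixed models with plain cross-validation,
# with no additional train-test split (synthetic data, illustrative models).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1_000, random_state=0)

for name, model in [
    ("5-NN", KNeighborsClassifier(n_neighbors=5)),
    ("logistic regression", LogisticRegression(max_iter=1_000)),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```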

If one wants to evaluate the model with hyperparameter tuning, an extra step is required to select the best set of parameters: either a train-test split or nested cross-validation, as sketched below.
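A minimal sketch of the nested cross-validation option (synthetic data, an illustrative grid): GridSearchCV performs the inner hyperparameter search, and cross_validate provides the outer evaluation loop.

```python
# Minimal sketch of nested cross-validation: GridSearchCV handles the inner
# hyperparameter search, cross_validate provides the outer evaluation loop.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1_000, random_state=0)

param_grid = {"n_neighbors": [5, 51]}  # illustrative grid, not the exercise's
inner_search = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)

outer_results = cross_validate(inner_search, X, y, cv=5)
print(outer_results["test_score"].mean())
```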

The one thing we want to avoid is using knowledge from the full dataset both to decide the model's hyperparameters and to train the refitted model.
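The train-test split variant respects this constraint as follows (again a sketch on synthetic data with an illustrative grid): the hyperparameters are chosen using only the training portion, and the held-out test set is touched exactly once for the final evaluation.

```python
# Minimal sketch of the train-test split variant: tuning uses only the
# training data, and the held-out test set is used once for evaluation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

search = GridSearchCV(
    KNeighborsClassifier(), param_grid={"n_neighbors": [5, 51]}, cv=5
)
search.fit(X_train, y_train)         # tuning uses the training data only
print(search.best_params_)
print(search.score(X_test, y_test))  # final evaluation on unseen data
```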