M3-Lecture-Test / Train / Cross Validation

blooridian · 13 June 2021 14:27

Hello, maybe someone can help me understand what the best practice is (a basic question, I believe).

I understand that cross-validation generates several train/test splits and provides (simply speaking) performance results and their spread for the given model.

In the M3 lecture (Cross-validation and hyperparameter tuning), an initial train/test split is performed, and then cross-validation seems to be applied to the entire dataset.

I am slightly confused about the purpose of the initial train/test split, even if it seems to be recommended in the docs.

Finally:
What is a good workflow?

If cross-validation shows appropriate model results, e.g. no difference between train and test results, why can’t I just apply this model to the entire dataset to get the full model metrics?

Thanks

glemaitre58 · 14 June 2021 08:30

Could you provide the exact name and section of the notebook where we show this train/test split, so that we can give an accurate answer?

The right workflow would be nested cross-validation. Some minimal code, just to illustrate:

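from sklearn.model_selection import GridSearchCV, cross_validate

X, y = load_data(...)
search_cv = GridSearchCV(model, param_grid, ...)  # inner cross-validation
cross_validate(search_cv, X, y, ...)  # outer cross-validation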

cross_validate applies an outer cross-validation, while GridSearchCV splits each training set provided by cross_validate to run an inner cross-validation.

So you do pass the full dataset, but it is split internally by the different cross-validation loops.
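To make the snippet above concrete, here is a minimal runnable sketch of the same nested pattern. The dataset (`load_breast_cancer`), the pipeline, the grid over `C` and the fold counts are assumptions chosen only for illustration; they are not taken from the lecture notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a small built-in dataset stands in for load_data(...).
X, y = load_breast_cancer(return_X_y=True)

# Assumption: the model and the grid are illustrative choices.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# Inner cross-validation: picks C using only the data passed to fit().
search_cv = GridSearchCV(model, param_grid, cv=3)

# Outer cross-validation: each outer training fold is handed to search_cv,
# and the refitted best model is scored on the matching outer test fold.
cv_results = cross_validate(search_cv, X, y, cv=5, return_estimator=True)

test_scores = cv_results["test_score"]
print(f"accuracy: {test_scores.mean():.3f} +/- {test_scores.std():.3f}")

# The selected hyperparameters may differ from one outer fold to the next.
for estimator in cv_results["estimator"]:
    print(estimator.best_params_)
```

The outer test scores estimate how well the whole procedure (tuning included) generalizes, because no outer test fold is ever seen by the inner search.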

ag_38 · 15 June 2021 12:53

I had a similar question.
It’s from the section Cross-validation and hyperparameter tuning, where there is:

Once the dataset is loaded, we split it into a training and testing sets.

These sets are never used, which led us to think there was a typo or a misunderstanding on our side.
Your answer is clear; maybe it would be useful to incorporate it in the lecture (one tends to get lost with the nested parts, and since the splitting is automatic, it’s not that obvious that GridSearchCV only gets the training set).
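One way to make the automatic splitting visible is to write the outer loop out by hand. This is only a sketch of what cross_validate does conceptually, not its actual implementation; the iris dataset, the decision tree and the `max_depth` grid are assumptions picked to keep it short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris (150 samples) is used only because it is small.
X, y = load_iris(return_X_y=True)

outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"max_depth": [1, 2, 3]}  # illustrative grid

fold_sizes = []
outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    search_cv = GridSearchCV(
        DecisionTreeClassifier(random_state=0), param_grid, cv=3
    )
    # The search is fitted on the outer training fold only, so its
    # inner cross-validation never sees the outer test samples.
    search_cv.fit(X[train_idx], y[train_idx])
    # The refitted best model is scored on the held-out outer fold.
    outer_scores.append(search_cv.score(X[test_idx], y[test_idx]))
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)  # five (120, 30) pairs for 150 samples and 5 folds
```

Each GridSearchCV instance works with 120 samples, which it splits again for its inner cross-validation, while the remaining 30 are used only for the final score of that fold.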

glemaitre58 · 16 June 2021 08:15

You are right, it is a cell that we should remove. It is indeed not linked to the later discussion.

glemaitre58 · 16 June 2021 08:19

It has been corrected in: FIX remove useless cell · INRIA/scikit-learn-mooc@608dc0d · GitHub

Changes will appear by synchronizing the notebook (clicking on File → Revert to original).