Cross-validation and data

FSeraphine · 17 February 2022 11:13

On which data do we do the cross-validation? Training data? Test data? All the data?

ArturoAmorQ · 17 February 2022 11:21

When we are only estimating the generalization performance of a model, cross-validation is already equivalent to making a train-test split several times. No further train-test split is needed at this point, so the whole dataset is used.
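To illustrate the "train-test split several times" idea, here is a minimal sketch that writes the loop out by hand and compares it with `cross_val_score`. The iris dataset, the `LogisticRegression` model, and the 5-fold `KFold` setup are assumptions chosen for illustration, not what the course notebooks use.

```python
# Minimal sketch: cross-validation on the whole dataset, written out as
# repeated train-test splits. Dataset and model are illustrative assumptions.
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# A fixed random_state makes the splitter yield the same folds every time.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual version: each fold is one train-test split of the full data.
manual_scores = []
for train_idx, test_idx in cv.split(X):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# Helper version: the same procedure in a single call on the full data.
helper_scores = cross_val_score(model, X, y, cv=cv)

print("manual:", manual_scores)
print("helper:", helper_scores)
```

Every sample ends up in a test fold exactly once, and the model is refit from scratch on each training part, which is why no separate held-out test set is needed for this kind of evaluation.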

Other cases of cross-validation are presented in Modules 3 and 7.

iauhs2 · 17 February 2022 21:29

But if we do the cross-validation with the whole data, this will pollute the accuracy on the test data.

glemaitre58 · 17 February 2022 21:34

What do you mean by polluting the test data?

iauhs2 · 17 February 2022 22:07

Don't we use CV to find the best model with only the training data, and then calculate the score on the test data to see if there is overfitting? If we use the whole data to train the model with CV, we can’t recheck the model with the test data, because it was already used in the CV?

glemaitre58 · 17 February 2022 23:16

As far as I remember, at this stage of the course, you only use cross-validation to evaluate a given model. There is no selection of any kind (hyperparameter tuning comes afterwards).

Therefore, it is completely fine to provide the entire set to the cross-validation. Later on, in the chapter “Select the best model”, you will want to act on the complexity of the model by setting its hyperparameters. In that case, you will apply nested cross-validation, where an inner cross-validation is applied only on the training set to select the best parameters. But this comes in the upcoming module.
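For readers who want a preview, nested cross-validation could look roughly like the sketch below. The `SVC` pipeline, the `param_grid` values, and the fold counts are assumptions picked for illustration; the course module may use a different model and grid.

```python
# Minimal sketch of nested cross-validation. Model, grid and fold counts
# are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # selects hyperparameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # evaluates the whole procedure

# make_pipeline names the SVC step "svc", hence the "svc__C" key.
param_grid = {"svc__C": [0.1, 1, 10]}
search = GridSearchCV(make_pipeline(StandardScaler(), SVC()), param_grid, cv=inner_cv)

# For each outer fold, GridSearchCV is fit on the outer training part only,
# so the outer test part never takes part in choosing C.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)

print("nested CV accuracy: %.3f +/- %.3f" % (nested_scores.mean(), nested_scores.std()))
```

The whole dataset is still passed to the outer `cross_val_score` call; the separation between selection and evaluation happens inside each outer fold.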

twesigyeronald · 20 February 2022 19:05

You perform cross-validation on all the data.

As the tutor says in the video, cross-validation is a more systematic way of evaluating the generalization performance of a model. Therefore, just as you provide all your data to the train_test_split function, you also supply all your data to the cross_val_score function.

I personally think that it doesn’t make sense to provide only the test data for cross-validation, because the cross-validation process is going to split it (again). That is why you have to provide all the data, not just part of it (training data or test data).

Also, the cross-validation process is not performed after a train-test split. It is performed on its own. A train-test split and cross-validation are two separate methods that are used (separately) to evaluate the generalization of the model. You choose one or the other, though cross-validation does a better job.
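As a rough side-by-side of the two approaches, here is a small sketch. The iris dataset, the `LogisticRegression` model, the 80/20 split, and `cv=5` are assumptions I picked for the example.

```python
# Minimal sketch: a single train-test split versus cross-validation,
# both starting from the full dataset. Dataset and model are assumptions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Option 1: one split, one score. The result depends on which samples
# happen to land in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
single_score = model.fit(X_train, y_train).score(X_test, y_test)

# Option 2: cross-validation on all the data. Several scores, so you also
# see how much the estimate varies from fold to fold.
cv_scores = cross_val_score(model, X, y, cv=5)

print("single split score:", single_score)
print("CV scores:", cv_scores)
print("CV mean and std:", cv_scores.mean(), cv_scores.std())
```

Both calls receive all of `X` and `y`; neither is applied to the output of the other.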

iauhs2 · 6 March 2022 09:32

Thanks for the reply. You are right, CV with only the test data makes no sense.