Cross validation on train data?

nichorm · 24 February 2022 10:34

Hi,

My question is about the change from the previous topic: there, cross_validate was given the whole dataset, whereas now cross_val_score assesses the model using only the train data. Why the difference?

As I understand it, the idea of cross-validation is that it creates the different folds internally, so the score is more reliable and does not depend on a single data split (so no train_test_split is required).

Thanks!

ArturoAmorQ · 24 February 2022 10:41

Here, cross-validation is used on the training set to select the hyperparameters, and the generalization performance is then evaluated on the test set. This is explained in more detail later in this same module.
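
A minimal sketch of that workflow, in case it helps. The synthetic dataset, the LogisticRegression model, and the candidate values of C are assumptions chosen only for illustration; they are not the ones used in the course notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data (an assumption, standing in for the course dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Set the test set aside first; it is not touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid of regularization strengths to compare.
candidate_C = [0.01, 0.1, 1.0, 10.0]

# Cross-validate each candidate on the training set only.
cv_means = {}
for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    cv_means[C] = scores.mean()

# Keep the candidate with the best mean cross-validation accuracy.
best_C = max(cv_means, key=cv_means.get)

# Refit on the full training set, then score once on the held-out test set.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)

print(f"best C: {best_C}, test accuracy: {test_score:.3f}")
```

The test set plays no role in choosing C, so the final score estimates how the tuned model behaves on data it has never influenced.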

nktnlx · 10 April 2022 06:19

I guess that if you use the whole set, there might be some data leakage. You would then get a “feeling” of better model performance on your test data, which in reality would have nothing to do with the actual generalization ability of your model.
That’s how I understand it. Please correct me if I’m wrong.
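
A rough sketch of the two setups being contrasted, for anyone who wants to try it. The synthetic data, the KNeighborsClassifier model, and the n_neighbors grid are assumptions for illustration only, and how far apart the two numbers end up depends on the data and the random seed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption, not the course dataset).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)
param_grid = {"n_neighbors": [1, 3, 5, 11, 21]}  # hypothetical grid

# Setup A: tune on the whole dataset and report the best CV score.
# The same data both picked the hyperparameter and produced the score,
# so this number can be optimistic.
search_all = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5).fit(X, y)
score_seen = search_all.best_score_

# Setup B: tune on the training part only, then score on an untouched test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
search_train = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search_train.fit(X_train, y_train)
score_unseen = search_train.score(X_test, y_test)

print(f"CV score on data used for tuning: {score_seen:.3f}")
print(f"score on held-out test data:      {score_unseen:.3f}")
```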