cv=ShuffleSplit vs cv=int

Hello all,

I noticed a difference between the train and test scores when I use cross-validation (CV) with ShuffleSplit versus CV with a number (integer).

Could you please explain the difference between these two choices?

When using an integer, scikit-learn will use KFold for regression and StratifiedKFold for classification, where the number of folds K is the value of the integer. Basically, the following snippets are equivalent (for regression):

from sklearn.model_selection import KFold, cross_validate

cv = KFold(n_splits=5)
cross_validate(model, X, y, cv=cv)

and

cross_validate(model, X, y, cv=5)

Regarding the cross-validation strategies (the last module will present some of them), ShuffleSplit is equivalent to shuffling the data, splitting it into a train and a test set, and repeating this process for the requested number of splits.
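To make this concrete, here is a small sketch on a toy dataset of 10 samples (the data and parameters here are just for illustration): each ShuffleSplit split independently shuffles the indices and carves out a fresh test set.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10).reshape(-1, 1)  # toy dataset with 10 samples

# 3 independent shuffles, each holding out 20% of the data for testing
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for i, (train_idx, test_idx) in enumerate(cv.split(X)):
    print(f"Split {i}: test={sorted(test_idx)}")
```

Because each split is drawn independently, the test sets of different splits may overlap.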

For KFold, the dataset is cut into K partitions (5 in our example). We then make 5 repetitions, each time selecting one partition for testing while keeping the remaining partitions for training.
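The same toy example with KFold (again, the 10-sample dataset is just for illustration) shows that the 5 test partitions tile the dataset with no overlap:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # toy dataset with 10 samples

# Cut the dataset into 5 contiguous partitions; each one is used
# exactly once as the test set.
cv = KFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(cv.split(X)):
    print(f"Fold {i}: test={list(test_idx)}")
```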

Therefore, while a sample can appear in the test set of several splits with ShuffleSplit (or in none at all), this cannot happen with KFold, where each sample is tested exactly once. This is one of the reasons why it is good practice to use a high number of splits with ShuffleSplit.

This is also covered in the video Validation of a model in Module 1.