Hello all,
I noticed a difference between the train and test scores when I use cross_validate (CV) with a ShuffleSplit versus with an integer.
Could you please explain the difference between these two choices?
When using an integer, scikit-learn will use a KFold for regression and a StratifiedKFold for classification, where K is the number of splits, i.e. the value of the integer. Basically, the two following snippets are equivalent (for regression):
from sklearn.model_selection import KFold, cross_validate

cv = KFold(n_splits=5)
cross_validate(model, X, y, cv=cv)

and

cross_validate(model, X, y, cv=5)
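To make the equivalence concrete, here is a small self-contained sketch (the dataset, model, and random seed are placeholders chosen for the demo, not from the original post) showing that both forms produce the same test scores on a regression problem:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Toy regression data and model, just for the demonstration.
X, y = make_regression(n_samples=100, n_features=5, random_state=0)
model = LinearRegression()

# cv=5 falls back to an unshuffled KFold(n_splits=5) for regression,
# so both calls evaluate the model on exactly the same splits.
scores_int = cross_validate(model, X, y, cv=5)
scores_kfold = cross_validate(model, X, y, cv=KFold(n_splits=5))

print(scores_int["test_score"])
print(scores_kfold["test_score"])
```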
Regarding the cross-validation strategies (the last module will present some of them), ShuffleSplit is equivalent to shuffling the data, splitting it into a train set and a test set, and repeating this process for the requested number of splits. With KFold, the dataset is cut into K partitions; we then make K repetitions, each time selecting one partition for testing while keeping the other partitions for training. Therefore, while a sample can end up in the test set several times with ShuffleSplit, this cannot happen with KFold. This is one of the reasons why it is good practice to use a high number of splits with ShuffleSplit.
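The difference is easy to see by counting how often each sample lands in a test set. This is a minimal sketch on ten toy samples (the array, number of splits, and random seed are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(10).reshape(-1, 1)

# KFold: the 5 test partitions tile the data, so every sample is
# tested exactly once across the splits.
kfold_counts = np.zeros(10, dtype=int)
for _, test_idx in KFold(n_splits=5).split(X):
    kfold_counts[test_idx] += 1
print(kfold_counts)

# ShuffleSplit: each split draws an independent random test set, so a
# sample may be tested several times, or never, across the splits.
shuffle_counts = np.zeros(10, dtype=int)
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for _, test_idx in ss.split(X):
    shuffle_counts[test_idx] += 1
print(shuffle_counts)
```

With more ShuffleSplit repetitions, these counts even out, which is why a high number of splits is recommended.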