Difference between KFold with shuffle=True and ShuffleSplit?

Hi everyone,
I’m wondering about the difference between using KFold(shuffle=True) and ShuffleSplit() to shuffle the data.
I tried replacing KFold(shuffle=True) with ShuffleSplit() in exercise M6.05 and obtained the same best parameters, but with better R2 scores for the latter.

After KFold(shuffle=True):
mean R2 score = 0.839 +/- 0.006


After ShuffleSplit(n_splits=5, random_state=0):
mean R2 score = 0.850 +/- 0.009


After ShuffleSplit(n_splits=5, test_size=0.25, random_state=0):
mean R2 score = 0.844 +/- 0.007


So what is the difference between the two, and could we use ShuffleSplit() here?
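For context, here is roughly how I swapped the cross-validation strategy (a simplified sketch, not the exact exercise code; the model and the California housing dataset below are my assumptions):

```python
# Compare the same model under two cross-validation strategies and report
# the mean R2 score, similar in spirit to what I did in the exercise.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import KFold, ShuffleSplit, cross_validate

X, y = fetch_california_housing(return_X_y=True)
model = HistGradientBoostingRegressor(random_state=0)

for cv in (
    KFold(n_splits=5, shuffle=True, random_state=0),
    ShuffleSplit(n_splits=5, random_state=0),
):
    results = cross_validate(model, X, y, cv=cv, scoring="r2")
    scores = results["test_score"]
    print(f"{cv.__class__.__name__}: {scores.mean():.3f} +/- {scores.std():.3f}")
```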

KFold(shuffle=True) shuffles the rows of the dataset and then splits them into k disjoint partitions.
ShuffleSplit(n_splits=5) is different: we shuffle the data and uniformly sample a fraction of it (given by test_size) for the test set, and we repeat this shuffling + sampling 5 times. Thus, in this case, the different test sets are not necessarily disjoint: a sample can be picked several times for the test set.
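A small sketch to make the distinction concrete (10 samples, 5 splits; the exact indices depend on random_state):

```python
# KFold(shuffle=True) partitions the indices, ShuffleSplit resamples them.
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
print("KFold test folds:")
for _, test_idx in kfold.split(X):
    print(sorted(test_idx))  # 5 disjoint folds covering every index exactly once

shuffle_split = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
print("ShuffleSplit test sets:")
for _, test_idx in shuffle_split.split(X):
    print(sorted(test_idx))  # test sets can overlap; some indices may never appear
```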

This could explain why you might get a better score with ShuffleSplit; it could also be linked to some underlying structure of the dataset.

Regarding the structure of the dataset, we present some of these aspects in more detail in Module 7, “choice of cross-validation”.

I get the difference between KFold and ShuffleSplit, but in this precise case, is KFold recommended over ShuffleSplit?

I have another question: in the graph for the shuffled KFold, none of the parameter combinations reached the mean R2 score of the model. How is that possible? (I probably missed something in the ensemble methods.)

I don’t think that ShuffleSplit is more appropriate. I even think that one should use more splits with ShuffleSplit than with KFold: the fact that test samples can be selected several times could be an issue.
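A rough illustration of that coverage issue (the 100-sample array and test_size=0.2 are arbitrary choices, just for the sketch):

```python
# Count how often each sample ends up in a test set. With only a few splits,
# some samples may never be tested; with more splits the coverage evens out.
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(-1, 1)

for n_splits in (5, 30):
    counts = np.zeros(len(X), dtype=int)
    cv = ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=0)
    for _, test_idx in cv.split(X):
        counts[test_idx] += 1
    print(
        f"n_splits={n_splits}: {(counts == 0).sum()} samples never tested, "
        f"max appearances in a test set: {counts.max()}"
    )
```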

I think that it might be due to some dataset structure (that I am not aware of :slight_smile: ). At the end of the grid search, we refit `best_estimator_` on the full training set. This means that the scores obtained during the cross-validation are no longer a good estimate of the performance of `best_estimator_`.
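A hedged sketch of this refit behaviour (the model, dataset, and parameter grid below are placeholders, not the exact exercise setup): the cross-validated score of the best parameters comes from models trained on only part of the training data, while `best_estimator_` is refit on all of it, so its score on a held-out set can legitimately differ.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"max_leaf_nodes": [15, 31, 63]}  # illustrative grid only
search = GridSearchCV(
    HistGradientBoostingRegressor(random_state=0),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)
search.fit(X_train, y_train)

# Mean cross-validated R2 of the best parameters (models trained on 4/5 of X_train)
print("cross-validated R2 of best params:", search.best_score_)
# R2 of the refit best_estimator_ (trained on all of X_train) on a held-out set
print("R2 of refit best_estimator_ on test set:", search.score(X_test, y_test))
```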

The fact that this is not the case with ShuffleSplit makes me think that there is an underlying structure in the dataset.

All clear :+1: