Difference between cross_validate with shuffle=True and ShuffleSplit?

christonikos · 8 November 2022 13:15

Hi, I am a bit puzzled when it comes to the differences between cross_validate(shuffle=True) and ShuffleSplit.

My understanding is that with ShuffleSplit the same data points can end up in the test set of several splits, whereas with cross_validate(shuffle=True) this cannot happen.

In general, what is the criterion in selecting one over the other?

Thanks again for your time and this beautiful course.

ArturoAmorQ · 9 November 2022 10:07

Hi, your understanding is correct. I would say that the criterion depends on the intended application.

For most practical purposes, KFold gives a good enough approximation of the variability of the generalization performance. Large values of n_splits only increase the computational cost and shrink each test set, and a very small test set increases the variability of the score distribution.

[Image: cv1]

[Image: cv2]
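The link between n_splits and the test size in KFold can be checked directly. This is only a sketch on 100 placeholder samples; the sample count and the n_splits values are arbitrary choices, not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((100, 1))  # 100 placeholder samples (values are irrelevant here)

kfold_test_sizes = {}
for n_splits in (2, 5, 10, 50):
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    # Size of the test set in each fold; they are all equal here because
    # 100 is divisible by every n_splits value used.
    sizes = [len(test_idx) for _, test_idx in cv.split(X)]
    kfold_test_sizes[n_splits] = sizes[0]
    print(f"n_splits={n_splits:>2}: {sizes[0]} test samples per fold")
```

Each fold holds roughly n_samples / n_splits samples, so asking for more splits necessarily means scoring on fewer points per split.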

Since the test size is independent of the number of splits for ShuffleSplit, increasing n_splits gives a better approximation of the “real” score distribution (one can’t really get rid of sampling noise), at the price of more computing resources.

[Image: cv3]
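For comparison, a rough sketch of the same check with ShuffleSplit passed to cross_validate. The synthetic regression data and the LinearRegression model are assumptions chosen only to keep the example self-contained; they are not the setup used for the plots.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic data, used only so the example runs on its own.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

for n_splits in (5, 50):
    cv = ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=0)
    # The test size is set by test_size, not by n_splits.
    test_sizes = {len(test_idx) for _, test_idx in cv.split(X)}
    results = cross_validate(LinearRegression(), X, y, cv=cv)
    scores = results["test_score"]  # one R2 score per split
    print(
        f"n_splits={n_splits:>2}: test sizes={test_sizes}, "
        f"mean R2={scores.mean():.3f}, std={scores.std():.3f}"
    )
```

Going from 5 to 50 splits gives ten times more scores (and ten times more model fits), while every score is still computed on 20 test samples.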

See this notebook from my EuroSciPy tutorial to reproduce the plots and for additional info.

christonikos · 9 November 2022 10:43

Thanks for getting back to me, Arturo. Great explanation.