As mentioned in the notebook “Cross-validation framework” in the “Overfitting and Underfitting” section of Module 2:
a single train-test split does not give any indication of the robustness of the evaluation of our predictive model: in particular, if the test set is small, the estimate of the testing error will be unstable and will not reflect the “true error rate” we would have observed with the same model on an unlimited amount of test data.
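To see this instability concretely, here is a minimal sketch (not taken from the notebook, using a small synthetic dataset from make_classification as a stand-in for real data): the same model is scored on several different random train-test splits, and the resulting test accuracies vary noticeably from one split to the next.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset, used only for illustration
X, y = make_classification(n_samples=200, random_state=0)

# Score the same model on several different random splits
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LogisticRegression().fit(X_train, y_train)
    print(f"split {seed}: test accuracy = {model.score(X_test, y_test):.3f}")
```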
In that sense, it is better to use cross-validation: either with the KFold strategy, the ShuffleSplit strategy, or other strategies that will be covered in more detail in Module 7, as sketched below.
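As a hedged sketch of how both strategies can be used (again with a synthetic dataset, which is an assumption and not the data from the course), each splitter can simply be passed as the `cv` argument of cross_val_score:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression()

# KFold: 5 non-overlapping test folds, each sample is used for testing exactly once
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5))

# ShuffleSplit: 10 independent random splits, test sets may overlap
shuffle_scores = cross_val_score(
    model, X, y, cv=ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
)

print("KFold scores:       ", kfold_scores)
print("ShuffleSplit scores:", shuffle_scores)
```

Instead of a single test score, each strategy returns one score per split, which already gives a first idea of the variability of the evaluation.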
Which strategy to use depends on the dataset and the use case: for instance, KFold may suffice to estimate the generalization performance of a model, whereas ShuffleSplit provides more information about how the scores are distributed across splits and is more robust to the ordering of the dataset, as will be covered in Module 7 and illustrated in the sketch below.
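The sensitivity to dataset ordering can be illustrated with the iris dataset, which is stored sorted by class (this choice of dataset is mine, not the course's). With non-shuffled KFold and 3 splits, each test fold contains a single class that is absent from the training set, so the scores collapse, while ShuffleSplit is unaffected because it draws random splits:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

# The iris samples are ordered by class: 50 of class 0, then 50 of class 1, then 50 of class 2
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# KFold without shuffling builds contiguous folds: each test fold holds one
# class that never appears in the corresponding training set
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=3))

# ShuffleSplit shuffles before splitting, so the ordering does not matter
shuffle_scores = cross_val_score(
    model, X, y, cv=ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
)

print("KFold (no shuffling):", kfold_scores)   # all scores are 0.0
print("ShuffleSplit:        ", shuffle_scores)  # scores close to 1.0
```

Passing `shuffle=True` to KFold would also remove this ordering effect, but ShuffleSplit additionally lets you choose the number of splits and the test size independently of each other.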