Data chosen by learning_curve

MSavel · 23 November 2022 18:33

Hi,

I was wondering how learning_curve chooses the part of the data when it is not 100%.
For example, take a value of 0.1 in train_sizes.
Does learning_curve pick an arbitrary 10% of the dataset and run the cross-validation with that same 10% every time? Or does it pick a different 10% of the dataset for each iteration of the cross-validation?

Thank you!

ArturoAmorQ · 24 November 2022 10:26

Hi @MSavel,

The learning_curve utility first performs the CV splits and only then subsamples the resulting training sets, so each CV iteration gets its own subset, drawn from that iteration's training set. This means that the cv parameter dominates the overall composition of the training sets; in particular, if the CV splitter shuffles the data (as ShuffleSplit does), the subsampled training sets are shuffled with respect to the original dataset ordering.
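To make the "split first, subsample second" idea concrete, here is a minimal sketch that mimics it by hand with KFold. It is not the library's actual code: it assumes the subset is simply the leading part of each split's training indices, and `fraction` and `subsets` are names made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20
X = np.arange(n_samples).reshape(-1, 1)  # each row holds its own index

cv = KFold(n_splits=5)  # no shuffling: test folds are contiguous blocks
fraction = 0.25         # hypothetical share of each training set to keep

subsets = []
for train_idx, test_idx in cv.split(X):
    # Assumption for this sketch: keep the leading part of the training indices.
    n_keep = int(fraction * len(train_idx))
    subset = train_idx[:n_keep]
    subsets.append(subset.tolist())
    print(f"test={test_idx.tolist()} -> training subset={subset.tolist()}")
```

With a non-shuffling splitter like this one, the subsets follow the dataset ordering and largely overlap between splits. They still differ from split to split, because a subset can never contain samples from that split's test fold.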

Additionally, learning_curve has a shuffle parameter (False by default) which, when set to True, ensures that the elements selected when subsampling X_train and y_train are drawn at random.
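As a rough usage sketch, this is how shuffle can be combined with random_state so that the random subsampling is reproducible. The synthetic dataset, the Ridge estimator and the chosen sizes are only assumptions for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, learning_curve

# Synthetic data, only for illustration.
X, y = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    Ridge(),
    X,
    y,
    train_sizes=[0.1, 0.5, 1.0],  # fractions of each split's training set
    cv=KFold(n_splits=5),         # 5 splits, 80 training samples in each
    shuffle=True,                 # draw the subsampled elements at random
    random_state=0,               # make the random draw reproducible
)

print(sizes)                      # absolute number of training samples used
print(test_scores.mean(axis=1))   # mean test score for each training size
```

Note that the fractions in train_sizes are relative to the size of the training set produced by the splitter (80 samples here), not to the full dataset of 100 samples.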

MSavel · 26 November 2022 14:19

OK, thank you for your answer. It is clear to me.