learning_curve with train_size=1

In the notebook on learning_curve, the last train_size is equal to 1.

My understanding was that this meant training on all the samples, and testing on… no samples? How can we get a test_score if we test on no samples?

Obviously I am getting something wrong here; please help me spot what it is.

The learning_curve utility first performs the CV splits and only then subsamples the resulting training sets.

So the test sets (one per CV iteration) have a fixed size for all points of the curve.
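
For instance (a minimal sketch, assuming a hypothetical toy dataset from make_classification and LogisticRegression as the estimator), you can check that with train_sizes=[1.0] the test scores are still computed on the held-out CV folds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, learning_curve

# Hypothetical toy problem: 100 samples, binary classification.
X, y = make_classification(n_samples=100, random_state=0)

# train_sizes=[1.0] means "100% of each CV training split",
# not "100% of the whole dataset, leaving nothing to test on".
train_sizes_abs, train_scores, test_scores = learning_curve(
    LogisticRegression(), X, y,
    train_sizes=[1.0],
    cv=KFold(n_splits=2),
)

print(train_sizes_abs)    # [50]: each 2-fold training split has 50 samples
print(test_scores.shape)  # (1, 2): one train_size, scored on two test folds
```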


OK, so I understand now that train_size has no relation to cv,
but there's still something that I don't quite get.

Imagine I run learning_curve with train_sizes=[1.0] and cv=KFold(n_splits=2),
so in total we run fit() + score() twice, right?
Can you please make explicit what the training and testing sets will be for each of these 2 runs?
Does the 100% apply to the whole dataset, or to the 50% that was tagged as test data?

Thanks!

Here is the procedure step by step (a runnable sketch follows the list):

  • KFold(n_splits=2) partitions X into X_a and X_b, and y into y_a and y_b, of equal size (X.shape[0] // 2).

  • For the first CV iteration:

    • define X_train = X_a, X_test = X_b, y_train = y_a, y_test = y_b;
    • for each train_size in train_sizes:
      • randomly subsample X_train and y_train down to train_size samples;
      • fit a model on the subsample and score it on the full (X_test, y_test);
      • record the score value for that train_size.
  • For the second CV iteration:

    • define X_train = X_b, X_test = X_a, y_train = y_b, y_test = y_a;
    • for each train_size in train_sizes:
      • randomly subsample X_train and y_train down to train_size samples;
      • fit a model on the subsample and score it on the full (X_test, y_test);
      • record the score value for that train_size.
  • Finally, for each train_size in train_sizes:

    • compute the average of all the scores (across CV iterations);
    • plot the point with the average score on the learning curve.
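
Here is that logic written out by hand (a sketch under assumptions: a toy make_classification dataset, LogisticRegression as the estimator, and random subsampling as described above; the real learning_curve has more machinery, so this illustrates the logic rather than the actual implementation):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=100, random_state=0)
estimator = LogisticRegression()
train_sizes = [0.5, 1.0]  # fractions of each CV training split
rng = np.random.default_rng(0)

scores = {size: [] for size in train_sizes}

for train_idx, test_idx in KFold(n_splits=2).split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    for size in train_sizes:
        # Subsample the training split only; the test split stays fixed.
        n = int(size * len(train_idx))
        sub = rng.choice(len(train_idx), size=n, replace=False)
        model = clone(estimator).fit(X_train[sub], y_train[sub])
        scores[size].append(model.score(X_test, y_test))

# Average across CV iterations: one point per train_size on the curve.
for size, vals in scores.items():
    print(size, np.mean(vals))
```

With train_sizes=[1.0] this runs fit() + score() exactly twice, each time training on one 50-sample half and scoring on the other, which answers the question above: the 100% applies to the training half, never to the test half.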

If you want more details, have a look at the source code.


OK, now I get it (slap on the forehead).
Thanks for spelling it out for us!