Sorry to keep going around this concept, but I wanted to test whether the reduction of the std is really linked to the grouping.
I reasoned that if I resampled the data so that the samples were no longer linked to the writers, grouping would have no effect anymore.
First I repeated the first steps of the lecture:
from sklearn.datasets import load_digits
digits = load_digits()
data, target = digits.data, digits.target
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(), LogisticRegression())
from itertools import count
import numpy as np
# defines the lower and upper bounds of sample indices
# for each writer
writer_boundaries = [0, 130, 256, 386, 516, 646, 776, 915, 1029,
1157, 1287, 1415, 1545, 1667, 1797]
groups = np.zeros_like(target)
lower_bounds = writer_boundaries[:-1]
upper_bounds = writer_boundaries[1:]
for group_id, lb, up in zip(count(), lower_bounds, upper_bounds):
groups[lb:up] = group_id
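As a quick sanity check of my own (not from the lecture), one can count how many samples each group gets; the counts should match the writer boundaries and sum to the 1797 samples of the digits dataset:

```python
import numpy as np

# same writer boundaries as above
writer_boundaries = [0, 130, 256, 386, 516, 646, 776, 915, 1029,
                     1157, 1287, 1415, 1545, 1667, 1797]
groups = np.zeros(1797, dtype=int)
for group_id, (lb, up) in enumerate(zip(writer_boundaries[:-1],
                                        writer_boundaries[1:])):
    groups[lb:up] = group_id

# one count per writer; e.g. the first writer has 130 samples
counts = np.bincount(groups)
print(counts)
print(counts.sum())
```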
Then I created a DataFrame with the data and the target:
import pandas as pd
digits_rec = pd.DataFrame(digits.data)
digits_rec["target"] = pd.Series(digits.target)
and split it back into data and target:
data_rec = digits_rec.drop('target', axis=1)
target_rec = digits_rec['target']
I checked that everything was fine by recomputing the accuracies and the std as in the lecture:
from sklearn.model_selection import cross_val_score, KFold, GroupKFold
kf = KFold(shuffle=False)
test_score_no_shuffling = cross_val_score(model, data_rec, target_rec, cv=kf,
n_jobs=-1)
print(f"The average accuracy is "
f"{test_score_no_shuffling.mean():.3f} +/- "
f"{test_score_no_shuffling.std():.3f}")
output:
The average accuracy is 0.921 +/- 0.028
cv = KFold(shuffle=True)
test_score_shuffling = cross_val_score(model, data_rec, target_rec, cv=cv,
n_jobs=-1)
print(f"The average accuracy is "
f"{test_score_shuffling.mean():.3f} +/- "
f"{test_score_shuffling.std():.3f}")
output:
The average accuracy is 0.970 +/- 0.007
cv = GroupKFold()
test_score = cross_val_score(model, data_rec, target_rec, groups=groups, cv=cv,
n_jobs=-1)
print(f"The average accuracy is "
f"{test_score.mean():.3f} +/- "
f"{test_score.std():.3f}")
output:
The average accuracy is 0.919 +/- 0.015
So far so good: I correctly recover the scores from the lecture.
Now I shuffle the data and reset the index:
digits_rec_shuffled = digits_rec.iloc[np.random.permutation(digits_rec.index)].reset_index(drop=True)
data_shuffled = digits_rec_shuffled.drop('target', axis=1)
target_shuffled = digits_rec_shuffled['target']
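A small aside, my own addition rather than something from the lecture: using a seeded generator for the permutation makes the experiment reproducible from one run to the next. A minimal sketch on a toy frame standing in for digits_rec:

```python
import numpy as np
import pandas as pd

# toy frame standing in for digits_rec
df = pd.DataFrame({"x": range(6), "target": [0, 1, 0, 1, 0, 1]})

# a fixed seed gives the same permutation on every run
rng = np.random.default_rng(0)
shuffled = df.iloc[rng.permutation(len(df))].reset_index(drop=True)
print(shuffled)
```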
Then I recomputed the accuracies and the std:
kf = KFold(shuffle=False)
test_score_no_shuffling = cross_val_score(model, data_shuffled, target_shuffled, cv=kf,
n_jobs=-1)
print(f"The average accuracy is "
f"{test_score_no_shuffling.mean():.3f} +/- "
f"{test_score_no_shuffling.std():.3f}")
output:
The average accuracy is 0.968 +/- 0.011
kfs = KFold(shuffle=True)
test_score_shuffling = cross_val_score(model, data_shuffled, target_shuffled, cv=kfs,
n_jobs=-1)
print(f"The average accuracy is "
f"{test_score_shuffling.mean():.3f} +/- "
f"{test_score_shuffling.std():.3f}")
output:
The average accuracy is 0.968 +/- 0.008
As expected, my resampling/shuffling of the data leads to equivalent results between KFold(shuffle=False)
and KFold(shuffle=True).
Now with GroupKFold(), using the group indices that no longer correspond to the writers:
gkf = GroupKFold()
test_score = cross_val_score(model, data_shuffled, target_shuffled, groups=groups, cv=gkf,
n_jobs=-1)
print(f"The average accuracy is "
f"{test_score.mean():.3f} +/- "
f"{test_score.std():.3f}")
output:
The average accuracy is 0.971 +/- 0.006
As expected, the mean accuracy is not improved, but we see that the decrease of the std compared to KFold(shuffle=False)
is of the same order as in the lecture.
I repeated the experiment several times, resampling the data in different ways, but each time I see that the std is lower with GroupKFold().
So I wonder whether that low std is linked more to the algorithm used than being proof of a hidden structure in the data?
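To probe whether the difference in std is just fold-to-fold noise, here is a sketch of my own (same digits data and model as above): repeat 5-fold shuffled cross-validation with several seeds and look at how much the std estimate itself moves. With only 5 fold scores, the std is a noisy quantity, so some of the gap between the splitters may be sampling noise:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
model = make_pipeline(StandardScaler(), LogisticRegression())

# the std of 5 fold scores is itself a noisy estimate: it moves
# from seed to seed even though nothing else changes
stds = []
for seed in range(5):
    cv = KFold(shuffle=True, random_state=seed)
    scores = cross_val_score(model, digits.data, digits.target,
                             cv=cv, n_jobs=-1)
    stds.append(scores.std())
print([f"{s:.3f}" for s in stds])
```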