Sorry to keep going around this concept, but I wanted to test whether the reduction of the std is really linked to the grouping.
I reasoned that if I resampled the data so that the samples were no longer linked to the writers, grouping would have no effect anymore.
First I repeated the first steps of the lecture:
from sklearn.datasets import load_digits
digits = load_digits()
data, target = digits.data, digits.target
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(), LogisticRegression())
from itertools import count
import numpy as np
# defines the lower and upper bounds of sample indices
# for each writer
writer_boundaries = [0, 130, 256, 386, 516, 646, 776, 915, 1029,
1157, 1287, 1415, 1545, 1667, 1797]
groups = np.zeros_like(target)
lower_bounds = writer_boundaries[:-1]
upper_bounds = writer_boundaries[1:]
for group_id, lb, up in zip(count(), lower_bounds, upper_bounds):
groups[lb:up] = group_id
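As a quick sanity check of my own (not from the lecture), one can count how many samples each group gets; the counts should match the writer boundaries and sum to the 1797 samples of the digits dataset:

```python
import numpy as np

# same writer boundaries as above
writer_boundaries = [0, 130, 256, 386, 516, 646, 776, 915, 1029,
                     1157, 1287, 1415, 1545, 1667, 1797]
groups = np.zeros(1797, dtype=int)
for group_id, (lb, up) in enumerate(zip(writer_boundaries[:-1],
                                        writer_boundaries[1:])):
    groups[lb:up] = group_id

# one count per writer; e.g. the first writer has 130 samples
counts = np.bincount(groups)
print(counts)
print(counts.sum())
```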
Then I created a DataFrame with the data and the target:
import pandas as pd
digits_rec = pd.DataFrame(digits.data)
digits_rec["target"] = pd.Series(digits.target)
and split it back into data and target:
data_rec = digits_rec.drop('target', axis=1)
target_rec = digits_rec['target']
I checked that everything was fine by recomputing the accuracies and the std as in the lecture:
from sklearn.model_selection import cross_val_score, KFold, GroupKFold
kf = KFold(shuffle=False)
test_score_no_shuffling = cross_val_score(model, data_rec, target_rec, cv=kf,
n_jobs=-1)
print(f"The average accuracy is "
f"{test_score_no_shuffling.mean():.3f} +/- "
f"{test_score_no_shuffling.std():.3f}")
output:
The average accuracy is 0.921 +/- 0.028
cv = KFold(shuffle=True)
test_score_shuffling = cross_val_score(model, data_rec, target_rec, cv=cv,
n_jobs=-1)
print(f"The average accuracy is "
f"{test_score_shuffling.mean():.3f} +/- "
f"{test_score_shuffling.std():.3f}")
output:
The average accuracy is 0.970 +/- 0.007
cv = GroupKFold()
test_score = cross_val_score(model, data_rec, target_rec, groups=groups, cv=cv,
n_jobs=-1)
print(f"The average accuracy is "
f"{test_score.mean():.3f} +/- "
f"{test_score.std():.3f}")
output:
The average accuracy is 0.919 +/- 0.015
So far so good: I correctly recover the scores from the lecture.
Now I shuffle the data and reset the index:
digits_rec_shuffled = digits_rec.iloc[np.random.permutation(digits_rec.index)].reset_index(drop=True)
data_shuffled = digits_rec_shuffled.drop('target', axis=1)
target_shuffled = digits_rec_shuffled['target']
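A small aside, my own addition rather than something from the lecture: using a seeded generator for the permutation makes the experiment reproducible from one run to the next. A minimal sketch on a toy frame standing in for digits_rec:

```python
import numpy as np
import pandas as pd

# toy frame standing in for digits_rec
df = pd.DataFrame({"x": range(6), "target": [0, 1, 0, 1, 0, 1]})

# a fixed seed gives the same permutation on every run
rng = np.random.default_rng(0)
shuffled = df.iloc[rng.permutation(len(df))].reset_index(drop=True)
print(shuffled)
```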
Then I recomputed the accuracies and the std:
kf = KFold(shuffle=False)
test_score_no_shuffling = cross_val_score(model, data_shuffled, target_shuffled, cv=kf,
n_jobs=-1)
print(f"The average accuracy is "
f"{test_score_no_shuffling.mean():.3f} +/- "
f"{test_score_no_shuffling.std():.3f}")
output:
The average accuracy is 0.968 +/- 0.011
kfs = KFold(shuffle=True)
test_score_shuffling = cross_val_score(model, data_shuffled, target_shuffled, cv=kfs,
n_jobs=-1)
print(f"The average accuracy is "
f"{test_score_shuffling.mean():.3f} +/- "
f"{test_score_shuffling.std():.3f}")
output:
The average accuracy is 0.968 +/- 0.008
As expected, my resampling/shuffling of the data leads to equivalent results between KFold(shuffle=False)
and KFold(shuffle=True).
Now with GroupKFold(), using the group indices that no longer correspond to the writers:
gkf = GroupKFold()
test_score = cross_val_score(model, data_shuffled, target_shuffled, groups=groups, cv=gkf,
n_jobs=-1)
print(f"The average accuracy is "
f"{test_score.mean():.3f} +/- "
f"{test_score.std():.3f}")
output:
The average accuracy is 0.971 +/- 0.006
As expected, the mean accuracy is not improved, but we see that the decrease of the std compared to KFold(shuffle=False)
is of the same order as in the lecture.
I repeated the experiment several times, resampling the data in different ways, but each time I see that the std is lower with GroupKFold().
So I wonder whether that low std is linked more to the algorithm used than being proof of a hidden structure in the data?
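To probe whether the difference in std is just fold-to-fold noise, here is a sketch of my own (same digits data and model as above): repeat 5-fold shuffled cross-validation with several seeds and look at how much the std estimate itself moves. With only 5 fold scores, the std is a noisy quantity, so some of the gap between the splitters may be sampling noise:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
model = make_pipeline(StandardScaler(), LogisticRegression())

# the std of 5 fold scores is itself a noisy estimate: it moves
# from seed to seed even though nothing else changes
stds = []
for seed in range(5):
    cv = KFold(shuffle=True, random_state=seed)
    scores = cross_val_score(model, digits.data, digits.target,
                             cv=cv, n_jobs=-1)
    stds.append(scores.std())
print([f"{s:.3f}" for s in stds])
```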