I need some explanations about the sample grouping lecture

Hi,
I have several questions about this lesson.

First: how did you set the boundaries for the 13 or 14 writers? I tried to find the information in the original dataset but failed.

Second: you say:

The cross-validation testing error that uses the shuffling has less variance than the one that does not impose any shuffling. It means that some specific fold leads to a low score in this case.

Is it a general rule? Do we have to compare the standard deviations between shuffled and not shuffled to bring to light some bias due to data structure? If yes, what range of difference do you consider worth taking into account? Here we have a difference of 0.017 between the two.

Third: can you explain a bit the use of GroupKFold and the groups argument in cross_val_score? Since you used the default GroupKFold, that means you split the data into 5 non-overlapping folds, but I do not understand how it works with groups.

Fourth: after grouping you say:

Besides, we can as well see that the standard deviation was reduced

But:

  • std for KFold without shuffling = 0.025
  • std for KFold with shuffling = 0.008
  • std for KFold with groups = 0.015

So the standard deviation with grouping is not improved when compared to the one with shuffling.

We can also argue about:

We see that this strategy is less optimistic regarding the model statistical performance

since:
for KFold without shuffling: the average accuracy is 0.921 +/- 0.028
for KFold with groups: the average accuracy is 0.919 +/- 0.015

If we keep in mind your definition of “better” until now, we can see that these 2 scores are equivalent.

Sorry to ask so many questions, but I think it is important that I understand this lesson well.

It was some reverse engineering: Small clarification in 'Sample Grouping' notebook - #7 by glemaitre58

The general rule here is: if a structure exists, shuffling will break it and make the classification/regression easier, which will lead to a better score. Here, there is a non-negligible improvement in the mean test score, and it comes with a reduction of the variance. These two observations give an intuition that something is going on that is unrelated to randomness. However, in practice, it might be rather difficult to come up with a systematic manner to detect such leakage.
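As an illustration of how shuffling can reveal such a structure, here is a minimal sketch on the digits data (the pipeline is just an example, not necessarily the exact model used in the lecture):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Compare KFold without and with shuffling: if the score jumps and the
# variance drops once we shuffle, the original sample ordering carries
# some structure (here, samples from the same writer are contiguous).
for shuffle in (False, True):
    cv = KFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"shuffle={shuffle}: {scores.mean():.3f} +/- {scores.std():.3f}")
```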

It will create 5 folds and ensure that a group present in the training set will not be part of the testing set. Thus, the folds can be of different sizes.
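As a sketch of how the groups argument is used (the writer labels below are made up for illustration; in the lecture they come from the dataset boundaries):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Hypothetical writer labels: one integer per sample saying which writer
# produced it. GroupKFold uses them so that all samples of a writer end up
# on the same side of each train/test split.
rng = np.random.RandomState(0)
groups = rng.randint(0, 13, size=len(X))

cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=groups, cv=cv)
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```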

We wanted to compare between with shuffling and with grouping. The with shuffling was used to find that there is a structure.

Yes, here the scores are equivalent. We should change the message here. In general, it could happen that not taking care of the groups will lead to an overestimation of the statistical performance. Here, we see that the change is not dramatic, but we reduce the standard deviation.

Thanks for taking the time to answer so many questions at once.

If I may:

We wanted to compare between with shuffling and with grouping. The with shuffling was used to find that there is a structure.

I suppose you mean:

We wanted to compare between without shuffling and with grouping. The with shuffling was used to find that there is a structure.

For the reduction of the standard deviation, I am still doubting. You can argue that we have nearly a 2x reduction of the standard deviation between without shuffling and with grouping, but we are comparing very small numbers. In a real situation, if I had to defend my model on such a small difference, I would probably have some difficulty convincing anyone. Especially when you see that even with a “reduced” std there is no significant difference between the 2 accuracy scores.

Yes, you can read my mind :slight_smile:

I think here the point is not to defend the model but instead to get more insight into the data. Basically, is this reduced standard deviation linked with a pattern in the data? Is there a fold where my first model was completely off compared to the second CV approach (we might want to look at the individual scores and not only the std. dev.)? What are the reasons, if there is one?

So basically, the analysis does not necessarily answer the question of which model is best or which approach is better, but rather raises additional questions that a data scientist should investigate to go further.
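For instance, a rough sketch of that kind of per-fold inspection (again with made-up writer labels, and a model that is only an example, not necessarily the one from the lecture):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
rng = np.random.RandomState(0)
groups = rng.randint(0, 13, size=len(X))  # hypothetical writer labels

# Look at the individual fold scores, not only the aggregated mean/std,
# to spot whether a single fold drags one CV strategy down.
for name, cv in [("KFold (no shuffle)", KFold(n_splits=5)),
                 ("GroupKFold", GroupKFold(n_splits=5))]:
    scores = cross_val_score(model, X, y, groups=groups, cv=cv)
    print(name, np.round(scores, 3))
```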

I agree, but honestly, taking into account the results for the mean and std accuracy for without shuffling and with grouping that we obtained here, I would never have considered that something was hiding in the data.
The problem here is that we have no real clue about what difference between the stds is enough to start thinking something is strange with the pattern of the data. No rule of thumb, no statistical test to compare the stds, nor any other test that can confirm or refute that statement.

I suppose it could be possible to transform the “groups” into features and to test their impact on the data?

I don’t recall the groups exactly, but we might even have a number of samples that is almost the same for each writer, so a KFold would almost be the same as using the specific groups.

Yes. You could even try to predict the group from the data at hand to see if it is indeed possible.
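A minimal sketch of that idea, again with hypothetical writer labels standing in for the real ones:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, _ = load_digits(return_X_y=True)
rng = np.random.RandomState(0)
groups = rng.randint(0, 13, size=len(X))  # hypothetical writer labels

# Use the group label itself as the target: if it can be predicted well
# above chance from the pixels, the writer structure is really there.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, groups, cv=5)
print(f"Group-prediction accuracy: {scores.mean():.3f} "
      f"(chance is about {1 / len(np.unique(groups)):.3f})")
```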

I will not bother you anymore on the subject, but I feel very disturbed to state that a difference of 0.013 in a statistical metric is enough to be noted and taken into account. It is in contradiction with what I learned in my younger days. :sob:
Thanks again for trying to clear my mind. :+1: :slight_smile: