Hi,
I have several questions about this lesson.
First: how did you set the boundaries for the 13 or 14 writers? I tried to find this information in the original dataset but failed.
Second: you say:
The cross-validation testing error that uses the shuffling has less variance than the one that does not impose any shuffling. It means that some specific fold leads to a low score in this case.
Is this a general rule? Do we have to compare the standard deviations between the shuffled and non-shuffled cases to bring to light some bias due to the data structure? If so, what range of difference do you consider worth taking into account? Here we have a difference of 0.017 between the two.
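For reference, here is roughly how I computed the two standard deviations I am comparing. This is only a minimal sketch: the pipeline and the dataset below are my own placeholders, not necessarily the exact ones used in the lesson.

```python
# Hedged sketch: comparing the spread of cross-validation scores with and
# without shuffling. The estimator and dataset are placeholders.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression())

stds = {}
for shuffle in (False, True):
    # random_state only makes sense when shuffling is enabled
    cv = KFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    scores = cross_val_score(model, X, y, cv=cv)
    stds[shuffle] = scores.std()
    print(f"shuffle={shuffle}: {scores.mean():.3f} +/- {scores.std():.3f}")

print(f"difference between the two stds: {abs(stds[True] - stds[False]):.3f}")
```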
Third: can you explain a bit the use of `GroupKFold` and the `groups` argument in `cross_val_score`? Since you used the default `GroupKFold`, that means you split the data into 5 non-overlapping folds, but I do not understand how this works together with `groups`.
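To make the question concrete, here is roughly how I picture the call. This is a minimal sketch assuming the digits dataset; the writer boundaries are made up for illustration and are not the real ones from the lesson.

```python
# Hedged sketch of GroupKFold with the groups argument; the writer ids are
# illustrative placeholders, not the real writer boundaries from the lesson.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression())

# Pretend each consecutive block of samples was written by one of 13 writers.
groups = np.linspace(0, 13, num=len(y), endpoint=False).astype(int)

cv = GroupKFold(n_splits=5)  # default: 5 folds
scores = cross_val_score(model, X, y, groups=groups, cv=cv)
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```

My understanding is that `GroupKFold` then guarantees that all samples sharing a group id end up in the same fold, so no writer appears in both the train and the test set. Is that right?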
Fourth: after grouping you say:
Besides, we can as well see that the standard deviation was reduced
But:
- std for `KFold` without shuffling = 0.025
- std for `KFold` with shuffling = 0.008
- std for `KFold` with groups = 0.015

So the standard deviation with grouping is not improved compared to the one with shuffling.
We can also argue about:
We see that this strategy is less optimistic regarding the model statistical performance
since:
- for `KFold` without shuffling: the average accuracy is 0.921 +/- 0.028
- for `KFold` with groups: the average accuracy is 0.919 +/- 0.015

If we keep in mind your definition of "better" so far, these two scores look equivalent.
Sorry to ask so many questions, but I think it is important that I understand this lesson well.