Small clarification in 'Sample Grouping' notebook

Dear all,

I find the explanation below the print(digits.DESCR) cell a bit confusing. It states ‘If we read carefully, 13 writers wrote the digits of our dataset…’ but actually 43 writers wrote the digits (30 wrote the training set digits and 13 different writers wrote the test set). Maybe rephrase this as ‘If we read carefully, 13 writers wrote the digits in the test set, while 30 different writers wrote the digits in the training set’.
Thank you for your great work and best wishes,
Pia

So what is happening is that sklearn.datasets.load_digits only returns the test set of the original UCI dataset:

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

So I believe the text is actually correct: there are indeed 13 different writers who wrote the digits in our dataset.

We may try to make the wording clearer, though …
Edit: we also need to look into the 13 vs 14 groups question, see below …

Hi,
By the way, I see 14 groups defined in the notebook (indices 0 to 13) …?
Best wishes,
Camille

Indeed … to be honest, I am not sure where we got the writers' boundaries from. Maybe @glemaitre58 remembers?

By looking closely at the data by hand :) If you look at the series of digits in y, you can find the boundaries from the repeating sequence. But this is just an approximation obtained by reverse engineering.
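
For anyone who wants to repeat this kind of inspection, here is a minimal sketch of the idea on a toy sequence. It assumes that each series of digits starts with the same recognizable pattern (here 0 to 9 in order); the helper name `find_pattern_starts` and the toy data are made up for illustration and are not the code used for the notebook.

```python
def find_pattern_starts(y, pattern):
    """Return every index at which `pattern` occurs in the sequence `y`."""
    pattern = list(pattern)
    width = len(pattern)
    return [
        i
        for i in range(len(y) - width + 1)
        if list(y[i:i + width]) == pattern
    ]


# Toy stand-in for the target vector: three series that each begin with
# the digits 0..9 in order, followed by a few digits in another order.
block = list(range(10)) + [0, 9, 5, 5, 6]
y_toy = block * 3

# Assumption: the start of a series is marked by the pattern 0..9.
starts = find_pattern_starts(y_toy, range(10))
print(starts)
```

On the real data one could pass `load_digits().target` as `y`, but the pattern may occur more than once within a series, so the candidate boundaries would still need to be checked by eye.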

OK, so I guess mentioning that in the text would be good! Also we need to check the 13 vs 14 groups thing.

Yes, we can do that. The good news is that the current example is still relevant.

Indeed, it is possible that one writer did 2 series of digits, which would explain seeing 14 groups for 13 writers.
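
To make the counting concrete, here is a small sketch of how series boundaries turn into group labels. The start indices below are made up, and `groups_from_starts` is a hypothetical helper, not the notebook's code. The number of groups equals the number of detected series, so it can exceed the number of writers if one writer produced two series.

```python
def groups_from_starts(starts, n_samples):
    """Give each sample the id of the series it falls in.

    Assumes `starts` is sorted and begins at index 0.
    """
    bounds = list(starts) + [n_samples]
    groups = []
    for group_id, (lo, hi) in enumerate(zip(bounds[:-1], bounds[1:])):
        groups.extend([group_id] * (hi - lo))
    return groups


# Made-up boundaries: three series covering ten samples.
starts = [0, 4, 7]
groups = groups_from_starts(starts, n_samples=10)
n_groups = len(set(groups))
print(groups, n_groups)
```

A list like `groups` is what one would pass as the `groups` argument of a group-aware splitter such as `GroupKFold`.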

We added a note about this.