Small clarification in 'Sample Grouping' notebook

PiaBrinkmann · 28 June 2021 09:36

Dear all,

I find the explanation below the print(digits.DESCR) cell a bit confusing. It states ‘If we read carefully, 13 writers wrote the digits of our dataset…’ but actually, 43 writers wrote the digits (30 wrote the training set digits and 13 different writers wrote the test set). Maybe rephrase this as ‘If we read carefully, 13 writers wrote the digits in the test set, while 30 different writers wrote the digits in the training set’. Thank you for your great work and best wishes,
Pia

lesteve · 28 June 2021 14:00

So what is happening is that sklearn.datasets.load_digits only returns the test set of the original UCI dataset:

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

So I believe the text is actually correct, meaning there are indeed 13 different writers that have annotated the digits in our dataset.

We may try to make the wording clearer though …
Edit: We also need to look into the 13 vs 14 groups question, see below …

camille-anne · 29 June 2021 15:22

Hi,
By the way I see 14 groups defined in the notebook (indexes 0 to 13) …?
Best wishes,
Camille

lesteve · 1 July 2021 09:36

Indeed … to be honest I am not sure where we got the annotators' boundaries from, maybe @glemaitre58 remembers?

glemaitre58 · 1 July 2021 10:02

By looking closely at the data by hand. If you look at the series of digits in y, you can find the boundaries thanks to the repetitive sequence. But this is just a reverse-engineered approximation.
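
To make the idea concrete, here is a minimal sketch of spotting such boundaries automatically. It is not the code used in the notebook (the boundaries there were found by hand); the helper names `find_motif_starts` and `starts_to_groups`, the choice of motif, and the toy label sequence are all assumptions made for illustration.

```python
def find_motif_starts(labels, motif):
    """Return every index at which `motif` occurs in `labels`."""
    labels = list(labels)
    motif = list(motif)
    starts = []
    for i in range(len(labels) - len(motif) + 1):
        if labels[i:i + len(motif)] == motif:
            starts.append(i)
    return starts


def starts_to_groups(n_samples, starts):
    """Assign a group id to each sample from a sorted list of start indices.

    Samples located before the first start fall into group 0.
    """
    groups = [0] * n_samples
    for group_id, start in enumerate(starts):
        # A group runs until the next start, or until the end of the data.
        end = starts[group_id + 1] if group_id + 1 < len(starts) else n_samples
        for j in range(start, end):
            groups[j] = group_id
    return groups


# Toy data (hypothetical): each "series" opens with 0..9 written twice,
# followed by a few digits in another order.
series = list(range(10)) * 2 + [0, 9, 5, 5, 6]
y = series * 3

# Assumed motif: the opening pattern shared by every series.
motif = list(range(10)) * 2
starts = find_motif_starts(y, motif)
groups = starts_to_groups(len(y), starts)
print(starts)
print(len(set(groups)))
```

The same helpers could be tried on `digits.target`, with a motif read off the first few labels. Keep in mind that the number of series detected this way need not equal the number of writers, and that the result remains an approximation.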

lesteve · 1 July 2021 11:42

If you look at the series of digits in y, you can find the boundaries due to the repetitive sequence. But this is just some reverse engineering approximation.

OK, so I guess mentioning that in the text would be good! Also we need to check the 13 vs 14 groups thing.

glemaitre58 · 5 July 2021 07:26

Yes, we can do that. The good news is that the current example is still relevant.

glemaitre58 · 5 July 2021 07:28

Indeed, it could be possible that a writer did 2 series of digits.

lesteve · 28 January 2022 16:52

We added a note about this.