Data loss during encoding

Hello, isn’t there a risk that data could be lost during encoding?
Isn’t that a significant risk for a dataset?
Thank you for your answer.
Regards,

Hello!

In scikit-learn, encoders have an inverse_transform method that converts encoded data back to its original representation. As mentioned in the notebook “Encoding of categorical variables”, some categories may occur only rarely, in this case Holand-Netherlands and Hungary. If all the samples from both categories end up in the test set during splitting, then the classifier will not have seen those categories during training and will not be able to encode them. Using the parameter handle_unknown in such cases may indeed lead to data loss, in the sense that inverse_transform would map both categories to None. In that scenario you can bypass the issue by listing all the possible categories and providing them to the encoder via the keyword argument categories.
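Here is a minimal sketch of both behaviors, assuming a "native-country" column as in the adult census example (the sparse_output argument requires scikit-learn 1.2 or later):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"native-country": ["United-States", "Mexico"]})
test = pd.DataFrame({"native-country": ["Holand-Netherlands"]})

# With handle_unknown="ignore", an unseen category is encoded as all zeros,
# and inverse_transform maps it back to None (the data loss in question).
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoder.fit(train)
encoded = encoder.transform(test)
print(encoder.inverse_transform(encoded))  # [[None]]

# Listing all possible categories up front avoids the issue: the rare
# category gets its own column even though it is absent from the training set.
all_categories = [["Holand-Netherlands", "Mexico", "United-States"]]
encoder = OneHotEncoder(categories=all_categories, sparse_output=False)
encoder.fit(train)
encoded = encoder.transform(test)
print(encoder.inverse_transform(encoded))  # [['Holand-Netherlands']]
```

The trade-off is that the categories argument requires knowing the full set of possible values in advance, whereas handle_unknown="ignore" degrades gracefully on truly unexpected values.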

In any case, keep in mind that heavily imbalanced categories can harm the generalizability of any model trained on them.

On an unrelated topic: I edited the title of this question to make it more informative about its content, unfortunately losing its previous friendliness.

Thanks!