Data loss during encoding

Hello, isn’t there a risk that data could be lost during encoding?
Isn’t that a significant risk for a dataset?
Thank you for your answer.
Regards,

Hello!

In scikit-learn, encoders have an inverse_transform method that converts encoded data back to its original representation. As mentioned in the notebook “Encoding of categorical variables”, some categories may occur only rarely, in this case Holand-Netherlands and Hungary. If all the samples from both categories end up in the test set during splitting, then the classifier will not have seen those categories during training and will not be able to encode them. Using the parameter handle_unknown in such cases may indeed lead to data loss, in the sense that inverse_transform would map both categories to None. In that scenario you can bypass the issue by listing all the possible categories and providing them to the encoder via the keyword argument categories.
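Here is a minimal sketch of both behaviors, assuming a "native-country" column as in the adult census example (the sparse_output argument requires scikit-learn 1.2 or later):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"native-country": ["United-States", "Mexico"]})
test = pd.DataFrame({"native-country": ["Holand-Netherlands"]})

# With handle_unknown="ignore", an unseen category is encoded as all zeros,
# and inverse_transform maps it back to None (the data loss in question).
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoder.fit(train)
encoded = encoder.transform(test)
print(encoder.inverse_transform(encoded))  # [[None]]

# Listing all possible categories up front avoids the issue: the rare
# category gets its own column even though it is absent from the training set.
all_categories = [["Holand-Netherlands", "Mexico", "United-States"]]
encoder = OneHotEncoder(categories=all_categories, sparse_output=False)
encoder.fit(train)
encoded = encoder.transform(test)
print(encoder.inverse_transform(encoded))  # [['Holand-Netherlands']]
```

The trade-off is that the categories argument requires knowing the full set of possible values in advance, whereas handle_unknown="ignore" degrades gracefully on truly unexpected values.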

In any case, keep in mind that heavily imbalanced categories can harm the generalizability of any model trained on them.

On an unrelated topic: I edited the title of this question to make it more informative about its content, unfortunately losing its previous friendliness.

Thanks!