Use of handle_unknown for several (rarely occurring) categories

Miguel_Llamas_Lanza · 4 April 2022 16:53

Regarding the handle_unknown parameter in OrdinalEncoder, what should we do if more than one category occurs rarely? (In our case it is only Holand-Netherlands.)
I assume that using (handle_unknown="use_encoded_value", unknown_value=-1) will encode every category not seen during training as -1. The model will therefore treat all those categories as a single one, since they are encoded with the same number.
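
A minimal sketch of that behaviour, assuming a made-up toy column (the category "Atlantis" and the training values are purely illustrative, not taken from the actual dataset):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy training column; values are illustrative only.
X_train = np.array([["United-States"], ["Mexico"], ["United-States"]], dtype=object)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(X_train)

# Two *different* categories that were never seen during fit, plus a known one.
X_new = np.array([["Holand-Netherlands"], ["Atlantis"], ["Mexico"]], dtype=object)
encoded = encoder.transform(X_new)

# Both unseen categories receive the same code (unknown_value),
# so downstream models cannot tell them apart.
print(encoded.ravel())
```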

Is there any way of avoiding this? That is, is there any way of encoding each of those unknown categories with a different numerical value (even if these numerical values are randomly chosen)?

ArturoAmorQ · 5 April 2022 09:25

In the near future (next scikit-learn release), we will have a more dedicated strategy for these infrequent categories by setting handle_unknown="infrequent_if_exist", as mentioned in the documentation.

Miguel_Llamas_Lanza · 21 April 2022 14:31

Thank you for your answer. The hyperlink to the docs does not work for me. For others, the URL of the aforementioned documentation is https://scikit-learn.org/dev/modules/preprocessing.html#encoding-categorical-features

ArturoAmorQ · 21 April 2022 17:39

You are right @Miguel_Llamas_Lanza, I made a mistake with the hyperlink. It should work now.