OrdinalEncoder with unknown_value

The text suggests using the OrdinalEncoder transformer with the unknown_value parameter.
I tried reading the documentation but could not figure out how the parameter is used. In particular, I tried setting unknown_value=-1 or unknown_value=1000 and I apparently get the same result…


(Sorry, my answer will be in English.)

I will start by stating why we want to use this option.

Let’s say that you have 2 datasets, a training and a testing set. Calling fit will learn the categories and map them to integer values in the range [0, n_categories - 1]. Calling transform will map a category to the numerical value it was associated with during fit.
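For instance, a minimal sketch (with a made-up color column) of what fit learns and how transform uses it:

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
encoder.fit([["red"], ["blue"], ["red"]])
print(encoder.categories_)           # [array(['blue', 'red'], ...)]: blue -> 0, red -> 1
print(encoder.transform([["red"]]))  # [[1.]]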

A problem arises when a category to be transformed was not seen during fit. In this case, we cannot map it to any numerical value and thus an error is raised. We could be more lenient and decide that any unknown category seen during transform should be mapped to an arbitrary integer (e.g. -1 or 1000).

Thus in scikit-learn, you need to create an ordinal encoder as:

encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)

In this case, a category not seen during fit will be transformed to -1 without raising an error.
If all the categories to be transformed were already seen during fit, then you will not see any difference in the resulting transform, whatever value you chose for unknown_value.
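To make this concrete, here is a small sketch with a hypothetical color column where "green" only appears in the test set:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_train = pd.DataFrame({"color": ["red", "blue", "red"]})
X_test = pd.DataFrame({"color": ["blue", "green"]})  # "green" was never seen at fit time

# Default behaviour: transform raises an error on the unseen "green".
# OrdinalEncoder().fit(X_train).transform(X_test)  # ValueError

# With use_encoded_value, the unseen category is mapped to -1 instead.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(X_train)
print(encoder.transform(X_test))  # [[0.] [-1.]]  (blue -> 0, green -> -1)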

In practice, this parameter is useful with rare categories and when using cross-validation: depending on the cross-validation split, a rare category can end up either in the training set or in the testing set. If it only appears in the testing set, transform will raise an error unless we use the trick above.
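For example, here is a sketch of that situation with a hypothetical rare category "purple" (the column and model are made up):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# "purple" is rare: in some splits it only appears in the test fold.
X = pd.DataFrame({"color": ["red", "blue", "red", "blue", "purple", "red"]})
y = [0, 1, 0, 1, 1, 0]

model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    LogisticRegression(),
)
# Without handle_unknown="use_encoded_value", the fold where "purple" is
# only in the test data would fail to score.
scores = cross_val_score(model, X, y, cv=3)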


Thank you very much for your reply, which I found very clear and complete at first.
But today I went back to the specific case in the notebook. I noticed that the handle_unknown option is mandatory, otherwise the following cross-validation fails. This means that there are some cases where rare categories appear in the testing set but not in the training set. In these cases I should expect a different behaviour when setting a different value for the unknown_value parameter, shouldn't I?

I was less lucky than Marco. I had to try several values of unknown_value before one was accepted. 1 and 2, for instance, returned an error saying that the value was already used for encoding the seen categories. I had to choose 1000 to make it pass.

Is there a way to choose a value that will never be used during fit, even if the training set changes?

OrdinalEncoder will encode from 0 to n_categories - 1, so -1 is a good default value: it can never collide with a code that is already in use.
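As a quick check (a sketch with made-up categories), unknown_value is validated at fit time against the codes that are already in use:

from sklearn.preprocessing import OrdinalEncoder

X_train = [["red"], ["blue"], ["green"]]  # encoded as blue -> 0, green -> 1, red -> 2

# unknown_value=1 collides with the code already used for "green",
# so this raises a ValueError when fitting:
# OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=1).fit(X_train)

# -1 can never collide, because the learned codes always start at 0.
OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1).fit(X_train)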


Setting unknown values to -1 worked for OrdinalEncoder, but not for OneHotEncoder, since it does not even have unknown_value as an option. Why is this, and how can we avoid getting zero scores with OneHotEncoder?

EDIT: So the answer is simply to set handle_unknown="ignore".
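For reference, a small sketch of that (with a made-up color column): with handle_unknown="ignore", an unseen category is encoded as a row of all zeros instead of raising an error.

from sklearn.preprocessing import OneHotEncoder

X_train = [["red"], ["blue"]]
X_test = [["green"]]  # unseen category

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)
# The unseen "green" activates none of the learned columns.
print(encoder.transform(X_test).toarray())  # [[0. 0.]]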


Hello!

It’s clear, but how do we know that we have to set these parameters (handle_unknown="use_encoded_value", unknown_value=-1)?
Should we consider them as the default setup for any other dataset?

And does "use_encoded_value" refer to the unknown_value (here -1), so that these 2 parameters depend on each other?

Thank you in advance!