OrdinalEncoder and unknown_value

When using the handle_unknown="use_encoded_value" parameter of the OrdinalEncoder class, how should one choose the value to set for unknown_value? What is the effect of setting it to -1?

It will depend on the subsequent predictive model.

If you are dealing with a linear model, this value will be multiplied by the model’s coefficient for that feature, and the product will contribute to the predicted target.
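To make the linear case concrete, here is a minimal sketch of that multiplication. The coefficient value and the helper name `contribution` are hypothetical, chosen only for illustration; a real model would learn its own coefficient.

```python
# Hypothetical coefficient that a linear model could have learned
# for the ordinal-encoded feature (illustration only).
coef = 2.0


def contribution(encoded_value, coef=coef):
    """Term that the encoded feature adds to the linear prediction."""
    return coef * encoded_value


# Compare a few candidate choices for unknown_value.
for unknown_value in (-1, 0, 1000):
    print(unknown_value, contribution(unknown_value))
```

The larger the magnitude of unknown_value, the more the prediction for an unseen category is shifted.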

If you are dealing with a tree-based model, it is possible that the feature has been used to split the data. Let’s take a concrete example: at a given split, let’s have samples with 3 categories (0, 1, 2). In this case, the split threshold is likely to be either 0.5 or 1.5: it will isolate samples of category 0 from categories [1, 2], or samples of category 2 from categories [0, 1]. Since unknown categories appear at predict time, samples with category -1 will be grouped with the samples of category 0 at this specific tree node (but keep in mind that samples are successively split into smaller batches further down the tree).
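The split described above can be sketched in a few lines. This is not an actual tree implementation: the helper `goes_left` is hypothetical and only assumes the usual convention that values less than or equal to the threshold go to the left child.

```python
def goes_left(encoded_value, threshold):
    # Assumed node convention: values <= threshold go to the left child.
    return encoded_value <= threshold


known_categories = [0, 1, 2]
unknown_value = -1

for threshold in (0.5, 1.5):
    left = [c for c in known_categories if goes_left(c, threshold)]
    # The unknown category always lands on the same side as category 0.
    print(threshold, left, goes_left(unknown_value, threshold))
```

With either threshold, -1 ends up in the same child as category 0, because it is smaller than every known code.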

It will depend on the previous analysis. If you use a linear model, it would be bad to assign a very large value to the unknown category. Setting it to zero means it makes no contribution to the target.

For the tree-based model, I think the choice is more arbitrary, and I am not sure that there is a right value to assign.

To add to @glemaitre58’s reply, OrdinalEncoder is almost never a good representation of categorical variables for linear models, whatever the value of unknown_value.

Tree-based models are robust, and the precise value of unknown_value should not matter much as long as it does not collide with the encoded value of a known category. Those start at zero by default, so using unknown_value=-1 is safe in that respect.
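As a quick check of the point about collisions, here is a small sketch with made-up toy categories (the animal names are only for illustration). It assumes a scikit-learn version that supports handle_unknown="use_encoded_value".

```python
from sklearn.preprocessing import OrdinalEncoder

# Toy data: "horse" is never seen during fit.
X_train = [["cat"], ["dog"], ["fish"]]
X_test = [["dog"], ["horse"]]

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(X_train)

# Known categories are encoded starting at 0, in sorted order.
train_codes = encoder.transform(X_train)
# The unseen category is mapped to unknown_value.
test_codes = encoder.transform(X_test)

print(train_codes.ravel())
print(test_codes.ravel())
```

Since the known codes are 0, 1 and 2 here, -1 cannot be confused with any of them.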

I think that the setting of ‘unknown_value’ in exercise M1.04 should be explained in the text, since the documentation gives no clue as to which value is good for linear models.
As beginners, we can’t guess that ‘-1’ is the correct value…