OrdinalEncoder parameters

laurabrz · 18 October 2022 12:47

Hello, I don’t understand the handle_unknown and unknown_value parameters. I tried handle_unknown="use_encoded_value" with unknown_value=np.nan, but it doesn’t work: a nan value appears in test_score. What value should I use for unknown_value? Thanks.

cb67 · 18 October 2022 16:12

Hello Laura,

I had the same issue with this parameter. I used np.nan, but LogisticRegression does not accept NaN values in its input.

So I tried unknown_value=0, but 0 was already in use as an encoded value, so that did not work either.

Then I tried unknown_value=-9 and it worked!

So, as I understand it, you can use any integer that is not already used as an encoded value.

It worked for -99 as well, but not for 1 or 2. I got that from the error message:

ValueError: The used value for unknown_value 2 is one of the values already used for encoding the seen categories.
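
To make this concrete, here is a minimal sketch of what I observed. The colour data is made up for illustration; only OrdinalEncoder and its two parameters come from scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy training column with three categories (illustrative data).
X_train = np.array([["red"], ["green"], ["blue"]], dtype=object)
# Test column containing a category ("purple") never seen during fit.
X_test = np.array([["green"], ["purple"]], dtype=object)

# -1 cannot collide: the three seen categories are encoded as 0, 1, 2.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(X_train)
encoded = encoder.transform(X_test)
print(encoded)  # "green" keeps its own code, "purple" becomes -1

# 0 is already the code of a seen category, so fit raises a ValueError.
collision_error = None
try:
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=0).fit(X_train)
except ValueError as exc:
    collision_error = str(exc)
print(collision_error)
```

With n seen categories the codes 0 to n-1 are taken, which is presumably why negative values such as -9 or -99 are a safe choice.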

MarkoZ · 18 October 2022 18:33

handle_unknown {'error', 'use_encoded_value'}, default='error'

When set to 'error' an error will be raised in case an unknown categorical feature is present during transform. When set to 'use_encoded_value', the encoded value of unknown categories will be set to the value given for the parameter unknown_value. In inverse_transform, an unknown category will be denoted as None.
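
A small sketch of the inverse_transform behaviour described in that passage, using made-up animal names and assuming unknown_value=-1:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Illustrative training data: "cat" -> 0, "dog" -> 1 (sorted order).
X_train = np.array([["cat"], ["dog"]], dtype=object)
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(X_train)

# "bird" was not seen during fit, so it is encoded as unknown_value.
codes = encoder.transform(np.array([["dog"], ["bird"]], dtype=object))

# Mapping back: the known code returns its category, the unknown one returns None.
decoded = encoder.inverse_transform(codes)
print(codes)
print(decoded)
```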

MarkoZ · 18 October 2022 18:34

unknown_value int or np.nan, default=None

When the parameter handle_unknown is set to 'use_encoded_value', this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype.
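
For the np.nan case, a minimal sketch with invented "low"/"high" data. It shows that np.nan is accepted with a float dtype but leaves a NaN in the encoded output, which is what an estimator like LogisticRegression then refuses.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data: "high" -> 0, "low" -> 1 (sorted order).
X_train = np.array([["low"], ["high"]], dtype=object)
X_test = np.array([["low"], ["medium"]], dtype=object)  # "medium" is unseen

# np.nan is allowed because the output dtype is a float dtype.
nan_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=np.nan, dtype=np.float64
)
nan_encoded = nan_encoder.fit(X_train).transform(X_test)
print(nan_encoded)  # the unseen category shows up as nan

# With an integer dtype, np.nan cannot be represented, so fit raises.
int_dtype_error = None
try:
    OrdinalEncoder(
        handle_unknown="use_encoded_value", unknown_value=np.nan, dtype=np.int64
    ).fit(X_train)
except ValueError as exc:
    int_dtype_error = str(exc)
print(int_dtype_error)
```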

laurabrz · 19 October 2022 06:53

Hello,

Thank you! I used unknown_value=-1 and it works as well! I think we should choose a value that is not already used to encode an existing category, so that it only stands for the unknown categories.
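
For anyone landing here later, this is roughly the setup that works, as a sketch on a tiny invented dataset (the letters and labels are placeholders, not the course data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Invented training data: one categorical column and a binary target.
X_train = np.array([["a"], ["a"], ["b"], ["b"], ["c"], ["c"]], dtype=object)
y_train = np.array([0, 0, 1, 1, 1, 0])
# "z" never appears in the training data.
X_test = np.array([["b"], ["z"]], dtype=object)

# Unknown categories are encoded as -1, so no NaN reaches LogisticRegression.
model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    LogisticRegression(),
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```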