Hello, I didn't understand the handle_unknown and unknown_value parameters. I have tried handle_unknown="use_encoded_value" and unknown_value=np.nan, but it doesn't work: a nan value appears in the test_score. What value should I use for the parameter unknown_value? Thanks.
Hello Laura,
I had the same issue with this parameter. I used np.nan, but the LogisticRegression object does not accept NaN values in its computation.
So I tried unknown_value=0, but that happened to be a value already used to encode a category, so it did not work either.
Hence I tried unknown_value=-9 and it worked!
So as I understand it, you can use any integer that is not already used as an encoded value.
It worked for -99 as well, but not for 1 or 2… I got it from the error message:
ValueError: The used value for unknown_value 2 is one of the values already used for encoding the seen categories.
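To see that error outside of a full pipeline, here is a minimal sketch. The toy column and the helper try_fit are made up for illustration; only OrdinalEncoder and its two parameters come from the thread. With three categories the encoder uses the codes 0, 1 and 2, so asking for unknown_value=2 should be rejected at fit time, while -9 is free.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy training column (hypothetical data): three categories, encoded as 0, 1, 2.
X_train = np.array([["a"], ["b"], ["c"]], dtype=object)


def try_fit(unknown_value):
    """Return None if fit succeeds, otherwise the ValueError message."""
    encoder = OrdinalEncoder(
        handle_unknown="use_encoded_value", unknown_value=unknown_value
    )
    try:
        encoder.fit(X_train)
    except ValueError as exc:
        return str(exc)
    return None


collision_message = try_fit(2)    # 2 is already the code of "c"
free_value_message = try_fit(-9)  # -9 is not the code of any category

print(collision_message)
print(free_value_message)
```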
handle_unknown {'error', 'use_encoded_value'}, default='error'
When set to 'error' an error will be raised in case an unknown categorical feature is present during transform. When set to 'use_encoded_value', the encoded value of unknown categories will be set to the value given for the parameter unknown_value. In inverse_transform, an unknown category will be denoted as None.
unknown_value int or np.nan, default=None
When the parameter handle_unknown is set to 'use_encoded_value', this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype.
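As a small illustration of the two parameter descriptions above, the sketch below fits an encoder on a toy column and then transforms a column containing a category that was never seen during fit. The data is invented for the example; the behaviour shown (unknown categories mapped to unknown_value, and mapped back to None by inverse_transform) is what the documentation text describes.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "very high" does not appear in the training column.
X_train = np.array([["low"], ["medium"], ["high"]], dtype=object)
X_test = np.array([["low"], ["very high"]], dtype=object)

# -1 cannot collide with the codes 0, 1, 2 assigned to the seen categories.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(X_train)

encoded = encoder.transform(X_test)           # unknown category becomes -1
decoded = encoder.inverse_transform(encoded)  # unknown category becomes None

print(encoder.categories_)
print(encoded)
print(decoded)
```

Categories are sorted before encoding, so "low" does not necessarily get the code 0; printing categories_ shows which code belongs to which category.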
Hello,
Thank you! I used unknown_value=-1 and it works as well! I think we should choose, for the unknown categories, a value that is not already used to encode an existing category.
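For anyone landing here later, this is a rough sketch of the difference between the two choices discussed in this thread. The data and the helper predict_with are invented for the example, and the pipeline is only a guess at the kind of setup used in the exercise. With unknown_value=np.nan the encoder passes a NaN to LogisticRegression, which refuses it; that is likely why the score came out as nan during cross-validation. With unknown_value=-1 the prediction goes through.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "z" is only present at prediction time.
X_train = np.array([["a"], ["b"], ["a"], ["b"]], dtype=object)
y_train = np.array([0, 1, 0, 1])
X_test = np.array([["a"], ["z"]], dtype=object)


def predict_with(unknown_value):
    """Return predictions, or the error message if prediction fails."""
    model = make_pipeline(
        OrdinalEncoder(
            handle_unknown="use_encoded_value", unknown_value=unknown_value
        ),
        LogisticRegression(),
    )
    model.fit(X_train, y_train)
    try:
        return model.predict(X_test)
    except ValueError as exc:
        return str(exc)


nan_result = predict_with(np.nan)  # NaN reaches LogisticRegression
int_result = predict_with(-1)      # -1 is an ordinary number for the model

print(nan_result)
print(int_result)
```

Note that -1 is still treated by the linear model as a number on the same scale as the other codes, so this only avoids the error; it does not say anything about how meaningful the prediction for an unknown category is.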