Question 4 - difference between missing and unknown

geogeo14000 · 14 June 2021 07:17

Hi,

When preprocessing the categorical data, we use SimpleImputer() and OrdinalEncoder(), and the encoder deals with unknown values.

My question is: are unknown values different from missing values, and treated differently? Is that why we have to use a SimpleImputer? I tried using OneHotEncoder on categorical_data without applying SimpleImputer first, and I got an error related to indices and columns, which I think is linked to the absence of the imputer.

So the imputer is necessary to deal with missing values, whereas the encoder can deal with unknown categories, i.e. categories it did not see during training (for example because of how the samples were distributed), but it cannot deal with missing values. Are these two different things?

Thanks,

Geoffrey

glemaitre58 · 14 June 2021 08:56

Indeed, missing and unknown have two different meanings.

Missing means that the data was not collected. In encoders, “unknown” has a different meaning and covers the following case: a category appears during testing but was never seen during training. It is therefore unknown from the model's point of view, but it is not missing, because the data was collected.
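
To make the "unknown" case concrete, here is a minimal sketch. The toy arrays and variable names are made up for illustration; only OneHotEncoder and its handle_unknown parameter come from scikit-learn. The category "green" is present in the test data but was never seen during fit, so it is unknown rather than missing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): "green" only shows up at test time.
X_train = np.array([["red"], ["blue"], ["red"]], dtype=object)
X_test = np.array([["green"]], dtype=object)

# Default behaviour (handle_unknown="error"): an unseen category raises.
strict = OneHotEncoder()
strict.fit(X_train)
try:
    strict.transform(X_test)
    strict_raised = False
except ValueError:
    strict_raised = True

# handle_unknown="ignore": the unseen category is encoded as all zeros.
tolerant = OneHotEncoder(handle_unknown="ignore")
tolerant.fit(X_train)
encoded_unknown = tolerant.transform(X_test).toarray()

print(strict_raised)
print(encoded_unknown)
```

With handle_unknown="ignore", the row for "green" has no column set, since the encoder only learned columns for "blue" and "red".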

Missing values are represented as NaN. The encoders check for them and raise an error (in the future scikit-learn might discard them, but this is not the case currently). Thus, you need to impute before passing the data to the encoders.
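
As a rough sketch of "impute, then encode", the two steps can be chained in a pipeline. The toy data, the "missing" placeholder string and the -1 code are arbitrary choices for this example, and the handle_unknown="use_encoded_value" option of OrdinalEncoder assumes scikit-learn 0.24 or later.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): NaN marks a value that was not collected.
X_train = np.array([["red"], [np.nan], ["blue"], ["red"]], dtype=object)
# At test time: one unseen category ("green") and one missing value.
X_test = np.array([["green"], [np.nan]], dtype=object)

preprocessor = make_pipeline(
    # Step 1: replace NaN with an explicit "missing" category.
    SimpleImputer(strategy="constant", fill_value="missing"),
    # Step 2: encode categories; unseen ones get the code -1.
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
)
preprocessor.fit(X_train)
codes = preprocessor.transform(X_test)

print(codes)
```

Here the missing value is handled by the imputer (it becomes the "missing" category, which the encoder learned during fit), while "green" is handled by the encoder's unknown-category logic.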

geogeo14000 · 14 June 2021 12:34

OK great, I get it, thanks!