Importance of handling missing categorical data

mab66 · 1 July 2021 13:11

Hello,

For categorical data (without changing anything about the numerical data, and all else being equal):

If I use the preprocessor (given by Metssie)
SimpleImputer(strategy="most_frequent")
I get the result 0.754

If I use the preprocessor given by the correction
SimpleImputer(strategy="most_frequent", fill_value="missing")
I get the result 0.742

How can this difference be explained? Which setting is the most appropriate? Is there a general rule to apply when processing categorical data?

Thank you for your answer.

glemaitre58 · 1 July 2021 20:57

Be careful with the improvements that you observe. You are looking only at the mean value of the score; you should also look at the standard deviation. These scores form a distribution (you could visualize them with a histogram), and the two distributions might overlap a lot. That would tell you that the gain is marginal and that the improvement could simply be due to noise rather than being really significant.

Changing the strategy of the imputer could produce exactly this kind of random fluctuation (but one would need to look at the score distributions to tell).
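
As a rough illustration of comparing score distributions rather than single means, here is a minimal sketch. It rests on several assumptions that are not from the thread: the data is synthetic, the classifier is a plain LogisticRegression, and the names `results`, `color` and `size` are made up for the example. Also note that in scikit-learn `fill_value` is only used when `strategy="constant"`, so the second variant below uses that strategy; this is a guess at what the correction intended, not a statement of what it contains.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n_samples = 500

# Synthetic categorical features (hypothetical, for illustration only).
color = rng.choice(["red", "green", "blue"], size=n_samples)
size = rng.choice(["S", "M", "L"], size=n_samples)

# Target depends on `color`, with 20% of the labels flipped as noise.
y = (color == "red").astype(int)
flip = rng.random(n_samples) < 0.2
y = np.where(flip, 1 - y, y)

# Object array so that strings and NaN can coexist; ~15% of `color` is missing.
X = np.column_stack([color, size]).astype(object)
missing = rng.random(n_samples) < 0.15
X[missing, 0] = np.nan

imputers = {
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    # Assumption: fill_value only takes effect with strategy="constant".
    "constant": SimpleImputer(strategy="constant", fill_value="missing"),
}

# Same random splits for both variants, so the scores are comparable.
cv = ShuffleSplit(n_splits=30, test_size=0.2, random_state=0)

results = {}
for name, imputer in imputers.items():
    model = make_pipeline(
        imputer,
        OneHotEncoder(handle_unknown="ignore"),
        LogisticRegression(),
    )
    results[name] = cross_val_score(model, X, y, cv=cv)

for name, scores in results.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the difference between the two means is small compared with the standard deviations, the two settings cannot really be told apart on this data. Passing each array in `results` to a histogram plot shows the overlap directly.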