Which categorical encoding for models other than linear and tree-based models

AndreaPie · 19 May 2021 21:24

Hi,

in the notebook Encoding of categorical variables you state that

Thus, in general OneHotEncoder is the encoding strategy used when the downstream models are linear models while OrdinalEncoder is used with tree-based models.

which is very good advice! But I wonder, which encoding strategy should be used for nonlinear, non-tree-based models such as Neural Networks or GAMs? Am I right to suspect that OneHotEncoder should be used, unless the original categories (before encoding) have an ordering, i.e., the same strategy as for linear models? Thanks!

glemaitre58 · 20 May 2021 07:54

Yes, you are right. OneHotEncoder should generally be used in these cases. Indeed, OrdinalEncoder would be a good default only for tree-based models. In this MOOC, we present only linear and tree-based models, which explains why we did not make a more complex statement. However, we could consider being more explicit, with some additional information for predictors that do not fit into these categories.
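To make the difference concrete, here is a minimal sketch. The toy colour column, the labels, and the choice of MLPClassifier and DecisionTreeClassifier are assumptions made up for illustration; they are not taken from the MOOC notebook.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical feature with no natural ordering.
X = np.array([["red"], ["green"], ["blue"], ["green"]])
y = np.array([0, 1, 0, 1])

# One-hot encoding: one binary column per category, so no order is implied.
X_one_hot = OneHotEncoder().fit_transform(X).toarray()

# Ordinal encoding: a single column of integers. The integers follow the
# sorted category names (blue=0, green=1, red=2), which is arbitrary here.
X_ordinal = OrdinalEncoder().fit_transform(X)

# Non-tree model (here a small neural network): one-hot encode first.
nn_model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0),
)
nn_model.fit(X, y)

# Tree-based model: ordinal encoding is usually a reasonable default,
# since a tree can split the integer codes repeatedly.
tree_model = make_pipeline(
    OrdinalEncoder(),
    DecisionTreeClassifier(random_state=0),
)
tree_model.fit(X, y)
```

A model that combines its inputs through weighted sums would treat the integer codes as magnitudes (red "larger" than blue), which is why one-hot encoding is the safer default there unless the categories really are ordered.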

AndreaPie · 20 May 2021 12:18

Thanks for the answer! Mine was not a criticism: in the context of the MOOC, it makes sense to focus only on the predictors you’re going to talk about. But I wanted to know more, so I took the opportunity to ask here.

glemaitre58 · 20 May 2021 12:36

We also accept criticism to improve the content.