Add a short summary about one-hot vs ordinal encoding

Not much to say for this one except:

  • could you show an example of how to use the categories parameter in OrdinalEncoder?
  • could you give us hints on when to use ordinal vs one-hot encoding?

Look at https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features (search for “categories=” to find the exact right place). We probably don’t want to go into more detail in this notebook, because the goal is to quickly build a predictive modelling pipeline, not to cover every possible variation.
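
For reference, a minimal sketch of what such an example could look like (the column name and size values are made up for illustration, they are not taken from the notebook):

```python
# Minimal sketch (not from the notebook): the column and its values are
# made up for illustration.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"size": ["M", "S", "XL", "L", "S"]})

# Passing `categories` explicitly makes the encoder use the natural order
# S < M < L < XL instead of the default lexicographic order.
encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
print(encoder.fit_transform(df[["size"]]))  # S=0, M=1, L=2, XL=3
```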

I think we already say this to some extent in the text, but maybe you want more? What we say in the notebook (I think) is:

  • for linear models, use one-hot encoding unless there is a natural order within your category (e.g. t-shirt sizes S, M, L, XL, etc …); if there is a natural order, make sure the ordinal encoding follows it
  • for tree-based models, use ordinal encoding: it doesn’t increase the number of columns, and tree-based models can deal with ordinal encoding even when the ordering is arbitrary
  • these are simple guidelines; in reality it can be more complicated, so one-hot encoding may work better even when there is a natural order, and trying both encodings can be worth it as well (see the sketch after this list)
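
To make the guidelines concrete, here is a minimal sketch of what the two setups could look like; the column names are just illustrative, not necessarily the ones used in the notebook:

```python
# Sketch only: column names below are hypothetical, adapt to your data.
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

categorical_columns = ["workclass", "education", "native-country"]

# Linear model: one-hot encode the categorical columns.
linear_model = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
        remainder="passthrough",
    ),
    LogisticRegression(max_iter=500),
)

# Tree-based model: ordinal encoding is enough and keeps fewer columns.
tree_model = make_pipeline(
    make_column_transformer(
        (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         categorical_columns),
        remainder="passthrough",
    ),
    HistGradientBoostingClassifier(),
)
```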

Hmm, I don’t remember seeing these examples… I might have missed them.
It’s worth checking, and maybe highlighting them to make them more visible…

Agreed, the information is probably spread across different places; maybe we need some kind of summary to highlight the take-home messages.

I am going to edit the topic title so that it is about adding a short summary on one-hot vs ordinal encoding.

Solved in https://github.com/INRIA/scikit-learn-mooc/pull/331