Add a short summary about one-hot vs ordinal encoding

Not much to say for this one except:

  • could you show an example of how to use the categories parameter in OrdinalEncoder?
  • could you give us hints on when to use ordinal vs one-hot encoding?

Look at https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features (search for “categories=” to find the exact right place). We probably don’t want to go into more detail in this notebook, because the goal is to quickly build a predictive modelling pipeline, not to cover every possible variation.
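
For reference, a minimal sketch of what such an example could look like (the column name and size values are made up for illustration, they are not taken from the notebook):

```python
# Minimal sketch (not from the notebook): the column and its values are
# made up for illustration.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"size": ["M", "S", "XL", "L", "S"]})

# Passing `categories` explicitly makes the encoder use the natural order
# S < M < L < XL instead of the default lexicographic order.
encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
print(encoder.fit_transform(df[["size"]]))  # S=0, M=1, L=2, XL=3
```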

I think we already say this to some extent in the text, but maybe you want more? What we say in the notebook (I think) is:

  • for linear models, use one-hot encoding unless there is a natural order within your category (e.g. t-shirt sizes S, M, L, XL, etc …); if there is a natural order, make sure the ordinal encoding follows it
  • for tree-based models, use ordinal encoding: it doesn’t increase the number of columns, and tree-based models can deal with ordinal encoding even when the ordering is arbitrary
  • these are simple guidelines; in reality it can be more complicated, so one-hot encoding may work better even when there is a natural order, and trying both encodings can be worth it as well (see the sketch after this list)
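
To make the guidelines concrete, here is a minimal sketch of what the two setups could look like; the column names are just illustrative, not necessarily the ones used in the notebook:

```python
# Sketch only: column names below are hypothetical, adapt to your data.
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

categorical_columns = ["workclass", "education", "native-country"]

# Linear model: one-hot encode the categorical columns.
linear_model = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
        remainder="passthrough",
    ),
    LogisticRegression(max_iter=500),
)

# Tree-based model: ordinal encoding is enough and keeps fewer columns.
tree_model = make_pipeline(
    make_column_transformer(
        (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         categorical_columns),
        remainder="passthrough",
    ),
    HistGradientBoostingClassifier(),
)
```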

Hmm, I don’t remember seeing these examples… I might have missed them.
It’s worth checking, and maybe highlighting them to make them more visible…

Agreed, the information is probably spread across different places; maybe we need some kind of summary to highlight the take-home messages.

I am going to edit the topic title so that it is about adding a short summary on one-hot vs ordinal encoding.

Solved in https://github.com/INRIA/scikit-learn-mooc/pull/331