From what I have gathered, tree models are not affected by one-hot encoding the values of categorical variables, so why is OrdinalEncoder used in the first lecture, apart from its handling of unknown values?
Tree model fitting works in a brute-force manner: to define a split, it goes over all features, finds the best split per feature, and then keeps only the best of these best splits.
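That brute-force search can be sketched in a few lines. This is a simplified illustration, not scikit-learn's actual implementation; it uses weighted variance as the per-split criterion for a regression target:

```python
# Minimal sketch of the brute-force split search described above
# (an illustration only, not scikit-learn's real algorithm).
import numpy as np

def best_split(X, y):
    """Return (feature_index, threshold) of the best split over all features."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):                # loop over every feature
        for t in np.unique(X[:, j])[:-1]:      # every candidate threshold
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # weighted sum of variances: lower is a purer split
            score = left.var() * len(left) + right.var() * len(right)
            if score < best[2]:                # keep the best of the best splits
                best = (j, t, score)
    return best[0], best[1]
```

The outer loop over `range(X.shape[1])` is why the number of columns matters: every extra feature adds one more full threshold scan.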
Using a OneHotEncoder will not decrease the statistical performance, but it will be hugely costly in terms of computation. Indeed, OneHotEncoder creates one feature per category and, as explained in the previous paragraph, the tree will explore all of these new features. OrdinalEncoder keeps the number of features the same as in the original data while still dealing with the categories.
Yes, but OrdinalEncoder converts the categorical features into numerical ones, so I am guessing it will also increase the computation cost without improving statistical performance, since HistGradientBoostingClassifier is based on decision trees. So its use in lecture 1 of this module is not really necessary, apart from handling unknown values by setting them to -1?
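The unknown-value handling mentioned here can be sketched as follows (a minimal example; the category strings are made up):

```python
# Hedged sketch: categories unseen during fit are mapped to -1
# via handle_unknown="use_encoded_value" and unknown_value=-1.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(np.array([["cat"], ["dog"]]))

# "fish" was never seen during fit, so it is encoded as -1
encoded = encoder.transform(np.array([["dog"], ["fish"]]))
print(encoded)  # [[ 1.] [-1.]]
```

Without these two parameters, OrdinalEncoder raises an error when it meets an unseen category at transform time.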
You can go back to the categorical encoding section for more illustrations.
Let’s take the case where you have a single feature with 10 categories.
OrdinalEncoder will output a single feature where the ten categories are encoded as numerical values. A tree will only have to explore this single feature to find a split. OneHotEncoder will output 10 features with 0/1 entries. A tree will have to explore all 10 features to find the 10 best splits and then select the best one. Thus, OneHotEncoder makes the tree roughly 10 times slower than OrdinalEncoder for finding a similar criterion.
It was the point of the following exercise in the first module: 📃 Solution for Exercise M1.05 — Scikit-learn course
This sharpened my understanding of the two methods, thank you so much.
I should have done this before, sorry.