From what I have gathered, tree models are not affected by one-hot encoding the values of categorical variables, so why is OrdinalEncoder used in the first lecture, apart from its handling of unknown values?
Tree model fitting works in a brute-force manner: to define a split, it goes over all features, finds the best split per feature, and then keeps only the best of these best splits.
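That brute-force search can be sketched in a few lines. This is a simplified illustration, not scikit-learn's actual implementation; it uses weighted variance as the per-split criterion for a regression target:

```python
# Minimal sketch of the brute-force split search described above
# (an illustration only, not scikit-learn's real algorithm).
import numpy as np

def best_split(X, y):
    """Return (feature_index, threshold) of the best split over all features."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):                # loop over every feature
        for t in np.unique(X[:, j])[:-1]:      # every candidate threshold
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # weighted sum of variances: lower is a purer split
            score = left.var() * len(left) + right.var() * len(right)
            if score < best[2]:                # keep the best of the best splits
                best = (j, t, score)
    return best[0], best[1]
```

The outer loop over `range(X.shape[1])` is why the number of columns matters: every extra feature adds one more full threshold scan.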
Using a OneHotEncoder will not decrease the statistical performance, but it will be hugely costly in terms of computation. Indeed, OneHotEncoder creates one feature per category and, as explained in the previous paragraph, the tree will explore all of these new features. OrdinalEncoder keeps the number of features the same as in the original data while still dealing with the categories.
Yes, but OrdinalEncoder converts the categorical features into numerical ones, so I am guessing it will also increase the computation cost without improving statistical performance, since HistGradientBoostingClassifier is based on decision trees. So its use in lecture 1 of this module is not really necessary, apart from handling unknown values by setting them to -1?
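The unknown-value handling mentioned here can be sketched as follows (a minimal example; the category strings are made up):

```python
# Hedged sketch: categories unseen during fit are mapped to -1
# via handle_unknown="use_encoded_value" and unknown_value=-1.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(np.array([["cat"], ["dog"]]))

# "fish" was never seen during fit, so it is encoded as -1
encoded = encoder.transform(np.array([["dog"], ["fish"]]))
print(encoded)  # [[ 1.] [-1.]]
```

Without these two parameters, OrdinalEncoder raises an error when it meets an unseen category at transform time.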
You can go back to the categorical encoding section for more illustrations.
Let’s take the case where you have a single feature with 10 categories.
OrdinalEncoder will output a single feature where the ten categories are encoded as numerical values. A tree will only have to explore this single feature to find a split. OneHotEncoder will output 10 features with 0/1 entries. A tree will have to explore all 10 features to find the 10 best splits and then select the best one. Thus, OneHotEncoder makes the tree roughly 10 times slower than OrdinalEncoder for finding a similar criterion.
It was the point of the following exercise in the first module: 📃 Solution for Exercise M1.05 — Scikit-learn course
This sharpened my understanding of the two methods, thank you so much.
I should have done this before, sorry.