Hi! I understand that collinearity between features must be avoided for linear models (e.g., we have dropped the `education-num` column). However, would this also affect a decision-tree-based classifier?
Thanks in advance!
In `DecisionTree`, `RandomForest`, and `GradientBoosting`, the features are explored in a random order when searching for a split. Therefore, in the case of collinearity, the model will randomly pick either `education` or `education-num` whenever this feature provides the best split. In that case, we would expect these models to assign roughly half of the importance to each of the two features.
For `HistGradientBoosting`, the same feature (the first one explored) will always be picked because there is no such randomization (it would indeed be better to improve this part). Therefore, if `education` is used, then the model will ignore `education-num`, and vice versa.
Unlike with linear models, collinearity will not cause numerical issues while training tree-based models.