Hi! I understand that collinearity between features must be avoided for linear models (e.g., we have dropped the `education-num` column). However, would this also affect a decision-tree-based classifier?
Thanks in advance!
In `DecisionTree`, `RandomForest`, and `GradientBoosting`, the features are explored in a random order when searching for a split. Therefore, in the case of collinearity, the model will randomly pick either `education` or `education-num` whenever this feature provides the best split. In that case, we would expect these models to assign roughly half of the importance to each of the two features.
For `HistGradientBoosting`, the same feature (the first one explored) will always be picked because there is no such randomization (it would indeed be better to improve this part). Therefore, if `education` is used, then the model will ignore `education-num`, and vice versa.
Unlike with linear models, collinearity will not cause numerical issues while training tree-based models.