Prioritizing certain features in RandomForest

Hello,
Random Forest trees are trained using a subset of features. Is it possible to specify that some features are always included in the training process, or at least to give them a higher probability of being chosen (because we know they have a major influence)?
Regards,

I don’t think it is possible in scikit-learn.

It feels like this would also go against one of the ideas of RandomForest, which is to create diverse trees so that an ensemble of these diverse trees performs well. By giving a higher probability to some features, you would make the trees more similar to each other.

Having said that, this is the kind of tweak that someone may have tried already, and you may find articles about it …
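One workaround I have seen discussed (not a scikit-learn feature, just a sketch under that assumption) is to duplicate the columns of the features you believe are important: since only `max_features` candidate features are drawn at each split, a duplicated column is proportionally more likely to be among the candidates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical trick: stack an extra copy of feature 0 so it is twice
# as likely to be among the max_features candidates drawn at each split.
X_boosted = np.hstack([X, X[:, [0]]])

clf = RandomForestClassifier(max_features="sqrt", random_state=0)
clf.fit(X_boosted, y)
```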

Maybe I’ll give an example. I have a dataset with a few features (~5). One of them contains free text. I extract the words from this field using CountVectorizer, ending up with hundreds of features, which are quite sparse, with a lot of zeros. It seems counterintuitive to me to treat these derived features exactly the same as the original ones, hence this idea of giving them different weights when growing the trees.
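To make this concrete, here is a minimal sketch of the kind of pipeline I mean (the column names and toy data are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Stand-in for the dataset: a few original columns plus one free-text column.
df = pd.DataFrame({
    "num_a": [1.0, 2.5, 0.3, 4.1],
    "num_b": [0, 1, 1, 0],
    "text": ["red apple", "green pear", "red pear", "green apple"],
})
y = [0, 1, 1, 0]

# CountVectorizer expands the text column into many sparse word-count
# features; the original columns pass through unchanged.
pre = ColumnTransformer(
    [("words", CountVectorizer(), "text")],
    remainder="passthrough",
)

model = make_pipeline(pre, RandomForestClassifier(random_state=0))
model.fit(df, y)
```

The forest then sees the handful of original features and the hundreds of word-count features as one flat feature set, which is what feels counterintuitive to me.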

But maybe I’m missing an important point, and the one you mention (decorrelating the different trees) is definitely one of them. I’m however curious about the impact on the prediction performance scores of taking important features more into account.

If the feature is important, the tree will pick it up. Do not forget that at each node of the tree, we only keep a single feature for the split. The random subsampling of features just masks some of the features at some nodes, at random.
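A quick sketch to illustrate this (synthetic data, made-up sizes): even though only `max_features` candidates are examined at each split, the informative features still end up dominating the learned importances.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 3 informative features in columns 0, 1 and 2.
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

# Only max_features candidates are examined at each split, yet over many
# nodes and trees the informative features are still picked up.
clf = RandomForestClassifier(max_features="sqrt", random_state=0).fit(X, y)
print(clf.feature_importances_.argsort()[::-1][:3])  # mostly in {0, 1, 2}
```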

My intuition would be that the ensemble would overfit more :slight_smile: The randomness in this algorithm is a way to alleviate overfitting. It is a bit in the same spirit as “dropout” in deep learning, for instance: one uses these tricks to avoid memorizing the dataset. Giving explicit weights to some features goes against this principle.

This is my 2 cents on that :slight_smile:

I was actually forgetting this! :shushing_face:
Thanks a lot for the answer…


From a quick discussion IRL with @lesteve: since we pick a subset of features at random, it might be necessary to grow deeper individual trees to compensate. But this is more of a hunch.

If I’m not misreading the scikit-learn documentation, that seems to be the default (max_depth=None). Or am I misreading it?

Yes, this is the default.
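You can check it directly (a one-liner, assuming a recent scikit-learn version):

```python
from sklearn.ensemble import RandomForestClassifier

# max_depth defaults to None: nodes are expanded until leaves are pure
# or contain fewer than min_samples_split samples, so trees can be deep.
print(RandomForestClassifier().max_depth)  # None
```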