Residuals, parallel running and scale

Hello again,

I have 3 questions concerning this part of the MOOC.

1/ I did not understand the part of the lecture about GBDT with residuals where it is said that the second tree makes no error (a perfect prediction), whereas there is still an error according to what is printed (Error of the tree: 0.118). Maybe that figure is for the whole ensemble and not the second tree? In that case, I still don't see how we can tell that the second tree's prediction is perfect, or how it operates. I also can't figure out where the 0.264 comes from; how was it computed, please?

2/ We say that random forest can be fast because it can run on several cores in parallel, but we also say that gradient boosting is very fast. In conclusion, which strategy is the fastest?
On some hardware, can the parallel approach be nearly as fast as, or even faster than, gradient boosting? Will quantum computing lead to an even faster random forest? (Just joking for the last question ^^)

3/ It is said "The histogram gradient-boosting is the best algorithm in terms of score. It will also scale when the number of samples increases, while the normal gradient-boosting will not."
I don't really understand the meaning of 'scale' here. What does it mean concretely that it will scale or will not scale?

Thanks a lot !

Geoffrey


Actually, we had a bug in selecting the right sample. You might need to synchronize your notebook (File → Revert to original). In case you don't want to lose your changes, you can check the latest static version that we host in the jupyter-book: Gradient-boosting decision tree (GBDT) — Scikit-learn course

There you will see that the error is indeed 0.0 for the selected sample.
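To make the residual mechanism concrete, here is a minimal sketch (not the course notebook's exact code, and with hypothetical data) of how the second tree is fitted on the residuals of the first one. The error on a training sample can drop to (nearly) zero when the combined prediction is used, which is what the corrected notebook reports for the selected sample.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X.ravel() ** 2 + rng.normal(scale=0.1, size=100)

# First tree fits the target directly.
first_tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
residuals = y - first_tree.predict(X)

# Second tree fits the residuals, i.e. the errors left by the first tree.
second_tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)

# Check a single training sample: adding the second tree's prediction
# to the first one's should bring the error close to (or exactly) zero.
sample, target = X[[0]], y[0]
error_first = abs(target - first_tree.predict(sample)[0])
error_combined = abs(target - (first_tree.predict(sample)[0]
                               + second_tree.predict(sample)[0]))
print(f"Error of the first tree alone:      {error_first:.3f}")
print(f"Error after adding the second tree: {error_combined:.3f}")
```

The exact numbers depend on the data and the tree depth; the point is that the second tree corrects the first one, so the combined error on that sample is much smaller than the first tree's error.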

GradientBoostingClassifier/Regressor will start to be too slow with 100,000+ samples. Random forest will still be able to benefit from parallelization. However, the trick of discretizing the data in HistGradientBoostingClassifier/Regressor means that this model is able to scale where the normal gradient boosting is not. Bottom line: HistGradientBoosting is the state of the art.

If you build a lot of trees and have a lot of available cores, that is the optimal setting for the random forest algorithm.
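As a rough illustration (synthetic data, hypothetical sizes, timings depend entirely on your machine), the `n_jobs` parameter is what lets the forest grow its trees on several cores at once, since the trees are independent of each other:

```python
from time import perf_counter

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=10_000, n_features=20, random_state=0)

for n_jobs in (1, -1):  # one core versus all available cores
    forest = RandomForestRegressor(n_estimators=200, n_jobs=n_jobs, random_state=0)
    start = perf_counter()
    forest.fit(X, y)
    print(f"n_jobs={n_jobs}: fitted in {perf_counter() - start:.1f} s")
```

Boosting cannot be parallelized the same way across its iterations, because each new tree depends on the residuals of the previous ones.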

Yes, we are bad people; we used some jargon here. "Scale" in this context means that the algorithm can be trained efficiently even when the number of samples increases. This is linked to the computational complexity of the algorithm. For instance, we say that SVMs do not scale because when the number of samples becomes too large, training takes too long to be usable in practice. The same goes for the normal gradient boosting. However, with the discretization trick, we reduce the complexity of training and can therefore use this model on larger datasets.
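A rough sketch of what "scaling" means in practice (again with synthetic data and machine-dependent timings): the exact-split GradientBoostingRegressor becomes noticeably slower as the number of samples grows, while the histogram version, which bins the features into a small number of integer buckets, stays fast.

```python
from time import perf_counter

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor

for n_samples in (10_000, 100_000):
    X, y = make_regression(n_samples=n_samples, n_features=20, random_state=0)
    for model in (GradientBoostingRegressor(), HistGradientBoostingRegressor()):
        start = perf_counter()
        model.fit(X, y)
        print(f"{model.__class__.__name__} on {n_samples:>7} samples: "
              f"{perf_counter() - start:.1f} s")
```

Expect the gap between the two models to widen as `n_samples` increases; that widening gap is exactly what we mean when we say one model scales and the other does not.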


OK great, I understand better now. A huge thanks for all these valuable explanations!

Geoffrey