Several questions about ensemble methods

Hello,

I found this topic very interesting and have several questions about it.

1/ First, could you tell me a bit more about the mathematical reasons that make the aggregation work? What is the mathematical formalism to describe this procedure and explain why the error decreases?
In some sense, the process made me think of the strong law of large numbers ("loi forte des grands nombres" in French): when you aggregate, the more events/predictions you have, the closer you get to the mathematical expectation.

2/ At one point you talk about decorrelated errors, but what does that mean? What is a correlated error exactly?

3/ What is the R2 score, and what does "R2" stand for? I do not remember there being an explanation about this (but my bad if I did not pay enough attention).

4/ The way ensembles are created made me think of mini-batches in gradient descent, especially when the training is sequential (the fact that we do different resamplings and subsets and adapt the model at each step to try to correct the error). Is there any resemblance, or is the comparison misleading here?

5/ Could you explain a bit further what the point is and what happens when we put larger weights on the mispredicted points? How exactly does that make the model focus on those particular points?
Regarding the first example of the video concerning this, I may have an idea but I don't know if it is true and if it generalizes. As the mispredicted points were very high, I assumed that the decision tree may build its decision rule from the mean, and so giving larger weights artificially increases the mean in the region where there were mispredictions, such that the rule represented by the horizontal separation line ends up high and close to the mispredicted data points with the big weights. But I'm not sure at all that this is what is really happening ^^ Could you please give more details about what is going on?

Thank you again !

Geoffrey

If you recall the chapter about overfitting, we mentioned the concepts of bias and variance. Aggregating the predictions of several trees reduces the variance while the bias stays really low. There is a very good explanation in "The Elements of Statistical Learning", Sect. 15.2, pp. 587-588. I am not sure that I can explain it better than this short paragraph, so I prefer to link it directly.
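To give a bit of intuition without the full derivation, here is a small NumPy simulation (my own toy example, not from the course material): averaging decorrelated errors shrinks the variance without touching the bias.

```python
import numpy as np

rng = np.random.default_rng(0)

n_trees, n_repeats = 100, 10_000
# Each row simulates the (independent, i.e. decorrelated) errors of
# `n_trees` trees on the same sample.
errors = rng.normal(loc=0.0, scale=1.0, size=(n_repeats, n_trees))

print("variance of a single tree:", errors[:, 0].var())
print("variance of the average  :", errors.mean(axis=1).var())
# ~1.0 vs ~0.01: averaging leaves the bias untouched but divides the
# variance by (roughly) the number of trees when the errors are independent.
```

With correlated errors the reduction is smaller, which is exactly why the extra randomization in random forests (decorrelating the trees) helps.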

Do not hesitate to ask if something is not clear from what is written.

Could you point me to the notebook and the section (or the video with a timestamp) so that I can see which concept we wanted to introduce (to be honest, I don't recall exactly :slight_smile: )

The R2 score is also known as the coefficient of determination. Indeed, we might use it here (it is the default score for regressors in scikit-learn) while we only present it officially in the next module. Maybe the only thing to know here is that this is a score: 1 is the best possible score, 0 is equivalent to a model always predicting the mean, and negative values represent bad models.
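If it helps, here is a tiny toy example (numbers made up for illustration) showing those three regimes with scikit-learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

print(r2_score(y_true, y_true))                               # 1.0 : perfect predictions
print(r2_score(y_true, np.full_like(y_true, y_true.mean())))  # 0.0 : always predicting the mean
print(r2_score(y_true, y_true[::-1]))                         # -3.0: worse than predicting the mean
```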

Focusing only on gradient boosting, it is indeed a gradient descent, but in function space. Nicolas Hug (a scikit-learn developer) wrote a blog post specifically about this topic: Understanding Gradient Boosting as a gradient descent | Nicolas Hug. I find the post quite intuitive, so I prefer to direct you there, and if you have any question, do not hesitate to comment :slight_smile:
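To give the flavor of "gradient descent in function space", here is a minimal sketch for the squared loss (a simplified version, not the actual scikit-learn implementation): each new tree is fitted on the residuals, which are the negative gradient of the loss, and the prediction is updated by a small step in that direction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y)            # start from a constant prediction

for _ in range(100):
    residuals = y - prediction           # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # one "gradient step" in function space

print("training mean squared error:", np.mean((y - prediction) ** 2))
```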

This is more related to AdaBoost here. By changing the weights, we are changing the error. Indeed, a small weight means that a classification mistake on this point contributes little to the error (the mistake is multiplied by this weight); on the contrary, a high weight increases the error. Since we fit the model to have the lowest possible error, it will try hard to correctly classify the points with high weights.
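Here is a small sketch (my own toy example, not from the course) using `sample_weight` with decision stumps to mimic one boosting round: after up-weighting the misclassified points, the next stump is pushed to get those points right.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # XOR-like labels: a single stump cannot get them all

# First stump trained with uniform sample weights.
sample_weight = np.full(len(y), 1 / len(y))
stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=sample_weight)
misclassified = stump.predict(X) != y

# Up-weight the misclassified points: in the weighted error, a mistake on
# these points now costs much more, so minimizing that error pushes the
# next stump to classify them correctly.
sample_weight[misclassified] *= 5
sample_weight /= sample_weight.sum()
next_stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=sample_weight)

print("previously misclassified points now correct:",
      (next_stump.predict(X[misclassified]) == y[misclassified]).sum(),
      "out of", misclassified.sum())
```

AdaBoost additionally uses a specific rule to compute the new weights and to combine the successive stumps, but the mechanism of "higher weight means a more costly mistake" is the same.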


Thank you very much for your detailed and very informative answer.

Yes, sorry for the notebook: it is in Ensemble method using bootstrapping – Random Forest Notebook. But we speak about it mainly in the video, in the slide about random forests where it is said that extra randomization decorrelates the errors. I know we speak about this and extra-trees later, but I don't remember exactly where.

I will check for the reference, thanks again !

Geoffrey