Hello,
I found this topic very interesting and have several questions about it.
1/ First, could you tell me a bit more about the mathematical reasons that make aggregation work? What is the mathematical formalism that describes this procedure and explains why the error decreases?
In some sense, the process made me think of the law of large numbers (the strong law, “loi forte des grands nombres” in French): when you aggregate, the more events/predictions you have, the closer you get to the mathematical expectation. I tried to write down what I mean just below.
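To make the question concrete, here is the kind of statement I have in mind (my own notation, not from the course: n predictors, each with the same variance σ² and pairwise error correlation ρ):

```latex
% n predictors \hat{f}_i, each with variance \sigma^2 and
% pairwise correlation \rho, averaged into \hat{f}
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}_i(x),
\qquad
\operatorname{Var}\!\left[\hat{f}(x)\right]
  = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2
```

If ρ = 0 the variance decays like σ²/n, which is exactly the law-of-large-numbers behaviour I describe above; if ρ = 1, averaging changes nothing. Is this the right formalism?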
2/ At one point decorrelated errors are mentioned, but what does that mean? What is a correlated error, exactly? I sketched my current understanding just below.
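In the toy simulation below (entirely my own, including the 0.9/0.1 mixing weights), each model's error is either drawn independently or shares a common component, and averaging only helps a lot in the first case:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_points = 50, 10_000

# Decorrelated errors: each model makes its own, independent mistakes.
independent = rng.normal(0, 1, size=(n_models, n_points))

# Correlated errors: a shared component makes all the models wrong together.
shared = rng.normal(0, 1, size=n_points)
correlated = 0.9 * shared + 0.1 * rng.normal(0, 1, size=(n_models, n_points))

for name, errors in [("decorrelated", independent), ("correlated", correlated)]:
    aggregated = errors.mean(axis=0)  # error of the averaged ensemble
    print(f"{name}: single model std = {errors[0].std():.3f}, "
          f"ensemble std = {aggregated.std():.3f}")
```

Running this, the decorrelated ensemble's error std shrinks by roughly a factor of √50 compared to a single model, while the correlated one barely improves. Is that roughly the right picture?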
3/ What is the R2 score, and what does the ‘R2’ stand for? I don't remember this being explained (but my bad if I did not pay enough attention).
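For reference, here is the definition I found, with a small self-check against scikit-learn on made-up numbers; from what I read it is called the “coefficient of determination”, so is the R the r of the correlation coefficient?

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # made-up targets
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # made-up predictions

# R^2 = 1 - SS_res / SS_tot: the fraction of the target's variance
# explained by the model (1.0 is perfect, 0.0 is "predict the mean").
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot)        # manual computation
print(r2_score(y_true, y_pred))   # scikit-learn gives the same value
```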
4/ The way ensembles are created made me think of mini-batches in gradient descent, especially when the training is sequential (the fact that we do different resamplings and subsets and adapt the model at each step to try to correct the errors). Is there a real resemblance, or is the comparison misleading here? The toy loop below shows what I mean by “sequential”.
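This is a residual-fitting loop I wrote myself to frame the question (the data, tree depth, learning rate and number of rounds are all arbitrary choices of mine, not the course's code):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# Sequential ensemble: each new tree is fit on the residual errors of
# the current ensemble, then added with a small step size; this is what
# reminded me of taking one gradient step per mini-batch.
prediction = np.zeros_like(y)
learning_rate = 0.3
for _ in range(20):
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)

print("training mean squared error:", np.mean((y - prediction) ** 2))
```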
5/ Could you explain a bit further what the point is, and what happens, when we put larger weights on the mispredicted points? How exactly does that make the model focus on those particular points?
Regarding the first example in the video concerning this, I may have an idea, but I don't know if it's true or whether it generalizes. Since the mispredicted points were very high, I assumed that the decision tree builds its decision rule from the mean, so that larger weights artificially increase the (weighted) mean in the region containing the mispredictions; the rule, represented by the horizontal separation line, then ends up high, close to the mispredicted data points with the big weights. But I'm not sure at all that this is what is really happening ^^ Could you please give more details about what is going on? I tried to reproduce my intuition in the small experiment below.
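Here is the experiment (the data and the weight value 10 are my own invention, not the course's example):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1D data: the point at x=2 sits high among low neighbours, so a
# depth-1 tree ("one horizontal line per region") mispredicts it badly.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0.0, 0.1, 4.8, 0.1, 5.0, 5.2])

stump = DecisionTreeRegressor(max_depth=1)

stump.fit(X, y)
print("unweighted prediction at x=2:", stump.predict([[2.0]]))  # far below 4.8

# Upweight the mispredicted point: leaf values are *weighted* means, so
# the horizontal line of its region is pulled up close to y = 4.8.
weights = np.array([1.0, 1.0, 10.0, 1.0, 1.0, 1.0])
stump.fit(X, y, sample_weight=weights)
print("weighted prediction at x=2:  ", stump.predict([[2.0]]))
```

In this run the weights change both the chosen split and the leaf value, so the prediction at the upweighted point jumps from about 1.2 to about 4.5. Is that the mechanism at work in the video, or is something else going on?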
Thank you again !
Geoffrey