Question Number 5

This is likely because I’m just being dumb today. I thought that with RandomForest, subsetting happens at both the sample level and the feature level. It is this additional feature-level randomization that led me to think that there are “random splits”. Bootstrapping happens here because RandomForest is just a bagging algorithm with a decision tree as the base estimator. So I’m not sure whether it is my understanding that is confused or whether the wording could be better :slight_smile: -Pritam

I think the question is clear because it starts with “For a given feature”, so it is independent of which feature is chosen. A “random split” would be the opposite of a “best split”.

Hi, I don’t fully understand the correct answer:

The other answers are clearly wrong, but even though bootstrap=True by default, the other parameter, max_samples, defaults to None, and then each tree is trained with X.shape[0] samples (according to the RandomForest documentation).

I don’t get the hint about out-of-bag samples; only an OOB score is available, and I don’t understand exactly what the number or ratio of OOB samples is in the default case.

If someone can enlighten…

I am not sure which ratio you are referring to. Could you be more explicit?

Maybe this answer can help: out-of-bag samples are the samples not selected by bootstrapping (sampling with replacement). In theory, for a large number of samples, a bootstrap sample contains only about 63.2% of the unique data points (the remaining 36.8% of the draws are repetitions of already selected samples). The out-of-bag samples are therefore the roughly 36.8% of data points that were not selected and thus not used to train that tree. With an overfitted tree, it is then unlikely that we properly classify these OOB samples.

But I am not sure that your question was related.
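To make the 63.2% / 36.8% figures above concrete, here is a minimal sketch that simulates one bootstrap draw using only the Python standard library. It does not call scikit-learn; the function name `bootstrap_unique_fraction` and the choice of `n = 100_000` are made up for illustration, and the draw only mimics what happens for a single tree when bootstrap=True and max_samples=None.

```python
import random


def bootstrap_unique_fraction(n_samples, seed=0):
    """Draw n_samples indices with replacement and return the
    fraction of distinct original indices that were drawn."""
    rng = random.Random(seed)
    # Same number of draws as there are samples, with replacement.
    drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
    return len(set(drawn)) / n_samples


n = 100_000  # illustrative dataset size
in_bag = bootstrap_unique_fraction(n)
oob = 1.0 - in_bag  # fraction of samples never drawn (out-of-bag)

print(f"in-bag fraction:     {in_bag:.3f}")
print(f"out-of-bag fraction: {oob:.3f}")
```

For a large `n`, the two printed values should come out close to 0.632 and 0.368 respectively, with small fluctuations depending on the seed.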

Ah, thanks, that’s it indeed.
I thought answer a) was a trap because even with bootstrap=True, max_samples=None implies that the number of samples is X.shape[0]: I believed it simply took the whole set of samples.
Sorry about that, I forgot that it is in fact a draw of the same number of samples but with replacement, and that the ratio of selected samples corresponds to an occupancy problem.

Many thanks again!
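
For reference, the occupancy argument mentioned above can be written out as a short derivation. This is a sketch under the assumption that each tree draws $n$ samples uniformly and independently with replacement from $n$ data points (the symbol $n$ stands for X.shape[0] and is introduced here only for illustration):

$$
P(\text{a given sample is never drawn}) = \left(1 - \frac{1}{n}\right)^{n} \xrightarrow[n \to \infty]{} e^{-1} \approx 0.368
$$

$$
P(\text{a given sample is drawn at least once}) = 1 - \left(1 - \frac{1}{n}\right)^{n} \xrightarrow[n \to \infty]{} 1 - e^{-1} \approx 0.632
$$

So on average about 36.8% of the data points are out-of-bag for each tree, even though the bootstrap sample has the same size as the original dataset.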