Distribution for bootstrap method

PvPDantes · 17 April 2022 10:09

Dear colleagues,

Which distribution (uniform, normal…) is used to randomly generate subsamples in the bootstrap step of bagging in sklearn? How does the choice of distribution influence the resulting subsamples?

ArturoAmorQ · 19 April 2022 09:54

Inspecting the code used for bagging in scikit-learn, you may notice that the indices are selected using numpy.random.randint, which draws integers from a discrete uniform distribution.
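To make this concrete, here is a minimal sketch of drawing bootstrap indices with randint. It uses NumPy only and is not scikit-learn's actual code; the names n_population, n_samples and X, the toy data, and the seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed, only for reproducibility
n_population = 10  # hypothetical dataset size
n_samples = 10     # hypothetical size of one bootstrap sample

# Each index is drawn independently and uniformly from
# {0, ..., n_population - 1}, i.e. with replacement, so some rows
# can appear several times and others not at all.
indices = rng.randint(0, n_population, n_samples)

X = np.arange(n_population) * 10.0  # toy data: row i holds the value 10 * i
X_bootstrap = X[indices]

print(indices)
print(X_bootstrap)
```

Every row has the same probability 1 / n_population of being picked at each draw, which is what "discrete uniform" means here.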

I would argue that using a non-uniform distribution may introduce additional bias into the model, as a specific subset of the data would be reinforced just because it is selected more often.

PvPDantes · 24 April 2022 21:21

Of course, it will introduce more bias. But if we expect the data itself to follow, e.g., a normal distribution as a natural phenomenon, wouldn't it be better to train the model using a “more natural” distribution for that specific case? What do you think about it?

glemaitre58 · 26 April 2022 14:22

Using uniform sampling to resample the data gives you a new sample that has the same distribution as the original sample. Sampling with a normal distribution would alter the output distribution, which would no longer resemble the original one.
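A small experiment can illustrate this. The sketch below is only an assumption-laden toy: the exponential data, the bell-shaped weights over the sorted positions, and all variable names are made up for the example, and "normal sampling" is emulated by weighting rows, since rows have to be picked by index.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
data = np.sort(rng.exponential(scale=1.0, size=5000))  # skewed toy sample, sorted
n = data.size

# Uniform bootstrap: every data point is picked with probability 1 / n.
uniform_sample = data[rng.integers(0, n, size=n)]

# Non-uniform alternative (hypothetical): bell-shaped weights centred on
# the middle of the sorted data, so central values are picked far more often.
positions = np.arange(n)
weights = np.exp(-0.5 * ((positions - n / 2) / (n / 10)) ** 2)
weights /= weights.sum()
weighted_sample = data[rng.choice(n, size=n, p=weights)]

print("mean:", data.mean(), uniform_sample.mean(), weighted_sample.mean())
print("std: ", data.std(), uniform_sample.std(), weighted_sample.std())
```

The uniform resample should stay close to the original mean and spread, while the weighted resample loses most of the tails of the original sample.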

ogrisel · 27 April 2022 09:03

But if we expect to have e.g. normal distribution in the data itself as a natural phenomenon, therefore it would be better to train the model using “more natural” distribution for the specific case. What do you think about it?

I think Guillaume answered this concern but if this is not the case, here is another slightly different way to answer it:

The sampling strategy of the bootstrap is a uniform distribution over the finite set of the data points in the original dataset.

This is by construction a discrete distribution. It would not be possible to use a continuous distribution such as the normal distribution to sample data points from the original dataset. The normal distribution can only be used to sample real values, not elements of a finite set.
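As a rough illustration of that last point (a sketch with made-up data and variable names, not anything taken from scikit-learn): a discrete uniform draw yields valid integer positions directly, whereas normal draws are real numbers that NumPy refuses to use as indices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
data = np.array(["a", "b", "c", "d", "e"])  # a finite set of data points
n = data.size

# Discrete uniform draw: integers in {0, ..., n - 1}, usable as indices.
uniform_idx = rng.integers(0, n, size=n)
print(data[uniform_idx])

# Normal draw: real values that do not identify any data point.
normal_draws = rng.normal(loc=n / 2, scale=1.0, size=n)
try:
    data[normal_draws]
    float_index_worked = True
except IndexError:
    # NumPy only accepts integer or boolean arrays as indices.
    float_index_worked = False
print(float_index_worked)
```

One could round and clip the normal draws to force them into valid indices, but that would just define another discrete distribution, and one that depends on the arbitrary order of the rows.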