Distribution for bootstrap method

Dear colleagues,

Which distribution (uniform, normal…) is used to randomly generate the bootstrap subsamples inside bagging in sklearn? How does the distribution influence the resulting subsamples?

Inspecting the bagging code in scikit-learn, you may notice that the indices are selected using numpy.random.randint, which draws integers (dtype=int) from a discrete uniform distribution.
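To make this concrete, here is a minimal sketch of bootstrap index sampling with randint. It is not the actual scikit-learn code: the toy array X, the seed, and the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)  # seed chosen arbitrarily, for reproducibility only

X = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # toy dataset (hypothetical values)
n_samples = X.shape[0]

# Draw n_samples indices uniformly from {0, ..., n_samples - 1}, with replacement.
# Every data point has the same probability 1 / n_samples of being picked at each draw.
indices = rng.randint(0, n_samples, n_samples)
bootstrap_sample = X[indices]

print(indices)
print(bootstrap_sample)
```

Because the draw is with replacement, some points will usually appear several times in bootstrap_sample and others not at all.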

I would argue that using a non-uniform distribution may introduce additional bias into the model, as a specific subset of the data would be reinforced simply because it is selected more often.


Of course, it will introduce more bias. But if we expect the data itself to follow, e.g., a normal distribution as a natural phenomenon, wouldn't it be better to train the model using a “more natural” distribution for that specific case? What do you think about it?

Sampling uniformly from the data gives you a new sample that follows the same distribution as the original sample. Sampling with a normal distribution would alter the output distribution, and it would no longer resemble the original one.
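A small experiment can illustrate this point. The sketch below is only an illustration under assumptions of my own: the exponential toy data, the bell-shaped weights over the sorted positions, and all variable names are hypothetical and have nothing to do with what scikit-learn does internally.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Original sample: skewed (exponential) data, chosen because it is clearly not normal.
data = rng.exponential(scale=1.0, size=10_000)
n = data.size

# Uniform bootstrap: each data point is picked with probability 1 / n.
uniform_idx = rng.integers(0, n, size=n)
uniform_resample = data[uniform_idx]

# Non-uniform resampling (illustrative only): bell-shaped weights over the
# sorted positions, so points near the median are picked far more often.
order = np.argsort(data)
positions = np.arange(n)
weights = np.exp(-0.5 * ((positions - n / 2) / (n / 10)) ** 2)
weights /= weights.sum()
weighted_idx = order[rng.choice(n, size=n, p=weights)]
weighted_resample = data[weighted_idx]

print("original std:", data.std())
print("uniform  std:", uniform_resample.std())
print("weighted std:", weighted_resample.std())
```

The uniform resample should keep roughly the spread of the original data, while the weighted resample concentrates around the median and loses the tails.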


But if we expect the data itself to follow, e.g., a normal distribution as a natural phenomenon, wouldn't it be better to train the model using a “more natural” distribution for that specific case? What do you think about it?

I think Guillaume answered this concern, but in case he did not, here is a slightly different way to answer it:

The sampling strategy of the bootstrap is a uniform distribution over the finite set of the data points in the original dataset.

This is by construction a discrete distribution. It would not be possible to use a continuous distribution such as the normal distribution to sample data points from the original dataset. The normal distribution can only be used to sample real values, not elements of a finite set.
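A short sketch of why this is so, using NumPy only. The toy array and the variable names are hypothetical, chosen just for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
data = np.array([1.5, 2.0, 3.5, 4.0])  # toy dataset (hypothetical values)

# A normal draw yields arbitrary real numbers, not positions in the dataset.
normal_draws = rng.normal(loc=0.0, scale=1.0, size=4)

try:
    data[normal_draws]
    float_indexing_failed = False
except IndexError:
    # NumPy rejects non-integer arrays as indices.
    float_indexing_failed = True

# A discrete uniform draw yields valid integer positions in the dataset.
uniform_idx = rng.integers(0, data.size, size=4)
resample = data[uniform_idx]

print(float_indexing_failed)
print(resample)
```

To pick data points with unequal probabilities, one would need a discrete distribution over the indices (a vector of weights summing to 1), not a continuous density.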
