Hi,
Looking at the Bagging lecture, I was a bit confused by the way subsampling is implemented in the function bootstrap_sample. It seems that, for each bootstrap sample, the dataset is resampled with replacement one point at a time, so a given data point can appear several times within the same bootstrap sample (as pointed out later in the explanation).
Is this also how bootstrap sampling is implemented in scikit-learn or is this just for the sake of example?
My understanding of bootstrapping was that for a given bootstrap sample we take n samples from the dataset at random, then for the next bootstrap sample we again take n samples at random from the whole dataset, and so on. In other words, the "with replacement" criterion applies across bootstrap samples, not within a given bootstrap sample. This also seems to be how the video lecture describes the procedure (i.e., each data point is included at most once in a given bootstrap sample).
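To make the two interpretations concrete, here is a small NumPy sketch (my own illustration, not the lecture's bootstrap_sample code): drawing with replacement within a sample allows duplicates, while drawing without replacement gives each point at most once per sample.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # a toy dataset of 10 points

# Interpretation 1 (what bootstrap_sample appears to do):
# draw n points *with* replacement, so a point can repeat
# within a single bootstrap sample.
with_repl = rng.choice(data, size=data.size, replace=True)

# Interpretation 2 (my reading of the video): draw n points
# *without* replacement, so within one bootstrap sample each
# point appears exactly once (it is a permutation of the data).
without_repl = rng.choice(data, size=data.size, replace=False)

print(sorted(with_repl))     # duplicates are likely
print(sorted(without_repl))  # always the full set 0..9
```

Note that interpretation 2 with n equal to the dataset size just reshuffles the data, which is part of why I am unsure it is what bagging actually does.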
At any rate, I wonder how classifier/regressor performance is affected if the training set contains multiple identical data points (as implemented in bootstrap_sample())?
Many thanks,
Máté