Confusing definition of bootstrap sample

hafeezcse · 1 April 2022 05:13

Hi,
I am confused by the following definition.

A bootstrap sample corresponds to a resampling with replacement, of the original dataset, a sample that is the same size as the original dataset.

“a sample that is the same size as the original dataset”
I don’t get what the “sample” means here.

But even if it refers to the “bootstrap sample”, I think the sample is a subset of the dataset, as shown in the video. Hence it cannot be the “same size as the original dataset”.

Please help me with this.

ArturoAmorQ · 1 April 2022 09:02

Maybe the problem with our current definition is the word “size”. Maybe a better wording would be something similar to

Bootstrapping corresponds to a resampling with replacement of the original dataset. A bootstrap sample is then a sample that has the same values of the input features and target as a given sample in the original dataset. Thus, the bootstrap resampling will contain some data points several times while some of the original data points will not be present.

We will improve our wording, thanks for the feedback!

ogrisel · 5 April 2022 15:44

Since it is a resampling with replacement, there can be copies: the bootstrap sample has the same size (number of data points) as the original dataset, with some points duplicated.

Here is a simple example of bootstrap resampling of a dataset with 3 elements with numpy:

>>> import numpy as np
>>> original_data = np.asarray(["a", "b", "c"])
>>> original_data
array(['a', 'b', 'c'], dtype='<U1')
>>> np.random.choice(["a", "b", "c"], size=3, replace=True)
array(['c', 'b', 'b'], dtype='<U1')
>>> np.random.choice(["a", "b", "c"], size=3, replace=True)
array(['a', 'c', 'c'], dtype='<U1')
>>> np.random.choice(["a", "b", "c"], size=3, replace=True)
array(['b', 'a', 'a'], dtype='<U1')
>>> np.random.choice(["a", "b", "c"], size=3, replace=True)
array(['b', 'b', 'c'], dtype='<U1')
>>> np.random.choice(["a", "b", "c"], size=3, replace=True)
array(['a', 'a', 'c'], dtype='<U1')

Each time the resulting array has size 3.
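To make the duplicates explicit, here is a minimal sketch on a slightly larger dataset. It counts how many distinct points end up in one bootstrap sample and how many original points are left out. The 10-element dataset, the seed, and the variable names are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
original_data = np.asarray(["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"])

# Draw as many indices as there are data points, with replacement.
indices = rng.choice(len(original_data), size=len(original_data), replace=True)
bootstrap_sample = original_data[indices]

# Distinct values present in the bootstrap sample, and how often each appears.
unique_values, counts = np.unique(bootstrap_sample, return_counts=True)

# Original points that were never drawn.
left_out = np.setdiff1d(original_data, bootstrap_sample)

print("bootstrap sample size:", bootstrap_sample.size)
print("distinct points drawn:", unique_values.size)
print("points left out:", left_out.size)
```

The sample size always equals the original size, while the number of distinct points is usually smaller: every point that is drawn more than once takes a slot that some other point would otherwise have filled.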

hafeezcse · 7 April 2022 04:23

Hi,
I am convinced by the example here and also by the definition. However, the confusion was created by the slide of the video lecture, as the size (number of data points) of samples is less than that of the dataset as shown below.

ogrisel · 7 April 2022 08:28

Actually the slides are fine. It’s just not possible to visualize that some samples are duplicated (and therefore hide one another). I should have made that clearer in the audio.

hafeezcse · 7 April 2022 09:50

Fine, and good to know who is delivering the video lectures.

glemaitre58 · 8 April 2022 09:42

I think that the subsequent notebook that uses different transparency to show the replicated sample will lift the doubt.
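As a rough sketch of the idea (not the notebook's actual code), one could count how many times each original point is drawn and turn that count into an opacity value. The names `multiplicity` and `alpha`, and the proportional mapping, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n_samples = 10

# Indices of the original points drawn into one bootstrap sample.
indices = rng.choice(n_samples, size=n_samples, replace=True)

# Number of times each original point was drawn (0 means it was left out).
multiplicity = np.bincount(indices, minlength=n_samples)

# Hypothetical mapping: opacity proportional to the number of copies,
# so points drawn most often are fully opaque and left-out points vanish.
alpha = multiplicity / multiplicity.max()

print(multiplicity)
print(alpha)
```

Passing such per-point opacities to a scatter plot would make duplicated points look darker instead of hiding one another.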

ogrisel · 8 April 2022 09:55

I tagged this conversation as suggestion + mooc-v3, as a potential improvement of the video to avoid this confusion in a future version of the MOOC.