Hi,
Looking at the Bagging lecture, I was a bit confused by the way subsampling is implemented in the function bootstrap_sample. It seems that, for each bootstrap sample, the dataset is resampled with replacement one point at a time, so a given data point can appear several times within the same bootstrap sample (as pointed out later in the explanation).
Is this also how bootstrap sampling is implemented in scikit-learn or is this just for the sake of example?
My understanding of bootstrapping was that for a given bootstrap sample we take n samples from the dataset at random, then for the next bootstrap sample we again take n samples at random from the whole dataset, and so on. In other words, the "with replacement" criterion applies across bootstrap samples, not within a given bootstrap sample. This also seems to be how the video lecture describes the procedure (i.e., each data point is included at most once in a given bootstrap sample).
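To make the two interpretations concrete, here is a small NumPy sketch (my own illustration, not the lecture's bootstrap_sample code): drawing with replacement within a sample allows duplicates, while drawing without replacement gives each point at most once per sample.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # a toy dataset of 10 points

# Interpretation 1 (what bootstrap_sample appears to do):
# draw n points *with* replacement, so a point can repeat
# within a single bootstrap sample.
with_repl = rng.choice(data, size=data.size, replace=True)

# Interpretation 2 (my reading of the video): draw n points
# *without* replacement, so within one bootstrap sample each
# point appears exactly once (it is a permutation of the data).
without_repl = rng.choice(data, size=data.size, replace=False)

print(sorted(with_repl))     # duplicates are likely
print(sorted(without_repl))  # always the full set 0..9
```

Note that interpretation 2 with n equal to the dataset size just reshuffles the data, which is part of why I am unsure it is what bagging actually does.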
At any rate, I wonder how classifier/regressor performance is affected if the training set contains multiple identical data points (as implemented in bootstrap_sample())?
Many thanks,
Máté