Weight of data points in bootstrap sample

MSavel · 11 December 2022 21:26

Hi,

I have a question about the following:
When a bootstrap sample is created, some points appear multiple times in the generated bootstrap dataset. When you fit a model on this bootstrap dataset, do these duplicates have an influence on the tuning of the model's parameters? Or would the result be the same if you fit the model on a dataset containing exactly the same data points but without the duplicates?

Thank you !

lesteve · 14 December 2022 09:26

The repeated data points do influence the learned model parameters: in a sense, repeated points are seen as more important by the model.

For example, if your model is a linear regression, the model parameters minimize a squared error loss similar to this:

loss = sum((model(x_i) - y_i)**2 for x_i, y_i in zip(data_train, target_train))

Repeated data points count multiple times in this loss: a point that appears k times contributes k times its squared error.
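To make this concrete, here is a minimal sketch using only NumPy. It fits a least-squares line three ways: on a bootstrap sample with its duplicates, on the deduplicated points weighted by how often each was drawn, and on the deduplicated points with no weights. The toy data, the seed, and the helper `fit_line` are assumptions made up for this illustration; they are not taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy linear relationship.
n = 20
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=n)

# Bootstrap sample: draw n indices with replacement, then count repeats.
idx = rng.integers(0, n, size=n)
unique_idx, counts = np.unique(idx, return_counts=True)


def fit_line(x, y, weights=None):
    """Return (intercept, slope) minimizing sum(w_i * (a + b * x_i - y_i) ** 2)."""
    if weights is None:
        weights = np.ones_like(x)
    design = np.column_stack([np.ones_like(x), x])
    # Scaling each row by sqrt(w_i) turns weighted least squares into ordinary least squares.
    sqrt_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef


coef_bootstrap = fit_line(x[idx], y[idx])                               # duplicates kept
coef_weighted = fit_line(x[unique_idx], y[unique_idx], weights=counts)  # duplicates as weights
coef_dedup = fit_line(x[unique_idx], y[unique_idx])                     # duplicates dropped

print(coef_bootstrap, coef_weighted, coef_dedup)
```

The first two fits agree up to floating point error, while the third differs: keeping duplicates is equivalent to weighting each distinct point by its count, and dropping them gives a different model.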

For trees, this is a little less straightforward to visualize, but the duplicates will also affect which splits are chosen.
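The effect on splits can be illustrated with a single-split regression "stump". The sketch below is a simplified stand-in for what a regression tree does at one node (exhaustive search for the threshold with the smallest summed squared error); it is not scikit-learn's implementation, and the function `best_split` and the hand-picked data are hypothetical.

```python
import numpy as np


def best_split(x, y):
    """Return the threshold minimizing the summed squared error of a
    one-split stump, where each side predicts the mean of its targets."""
    values = np.unique(x)
    thresholds = (values[:-1] + values[1:]) / 2  # midpoints between distinct x values
    best_threshold, best_sse = None, np.inf
    for t in thresholds:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = t, sse
    return best_threshold


# Deduplicated data: four distinct points.
x_unique = np.array([1.0, 2.0, 3.0, 4.0])
y_unique = np.array([0.0, 0.0, 2.0, 4.0])

# Same distinct points, but the last one appears three times.
x_dup = np.array([1.0, 2.0, 3.0, 4.0, 4.0, 4.0])
y_dup = np.array([0.0, 0.0, 2.0, 4.0, 4.0, 4.0])

split_unique = best_split(x_unique, y_unique)
split_dup = best_split(x_dup, y_dup)
print(split_unique, split_dup)  # 2.5 3.5
```

With the duplicates, isolating the repeated point becomes the cheaper split, so the chosen threshold moves from 2.5 to 3.5 even though the set of distinct points is unchanged.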

MSavel · 20 January 2023 08:06

Thank you for your answer, it is now clear to me.