Rule of thumb for noise?

VictorVila · 17 March 2022 06:41

Hello,
Do you have a rule of thumb for telling noisy data apart from the rest?
For example, if a data point is more than one standard deviation away from the mean, we could call it noise and remove it from the dataset.
Thank you for this great course!

glemaitre58 · 17 March 2022 09:01

What you propose is known as outlier detection: you make some assumptions about the data distribution and reject the samples that are “out-of-distribution”.
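To make the distributional assumption concrete, here is a minimal sketch on synthetic Gaussian data (not taken from the course). For a normal distribution only about 68% of the points lie within one standard deviation of the mean, so the one-standard-deviation cutoff from the question would discard roughly a third of perfectly clean data. The helper `within_n_std` and the wider 3-standard-deviation cutoff are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, noise-free Gaussian sample (hypothetical data for illustration).
data = rng.normal(loc=10.0, scale=2.0, size=1000)


def within_n_std(values, n_std):
    """Boolean mask: True where a point is within n_std standard deviations of the mean."""
    z_scores = np.abs(values - values.mean()) / values.std()
    return z_scores <= n_std


mask_1 = within_n_std(data, 1.0)  # the cutoff proposed in the question
mask_3 = within_n_std(data, 3.0)  # a more conservative cutoff

print(f"kept with a 1-std cutoff: {mask_1.mean():.1%}")
print(f"kept with a 3-std cutoff: {mask_3.mean():.1%}")
```

Whatever cutoff is used, the rule is only as good as the assumption that the clean data are roughly Gaussian. The mean and standard deviation are also themselves pulled by extreme values.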

The scikit-learn documentation discusses the topic in more detail: 2.7. Novelty and Outlier Detection — scikit-learn 1.0.2 documentation

I also have a toy example that uses such an algorithm to “clean” the dataset before fitting a model: Customized sampler to implement an outlier rejections estimator — Version 0.9.0
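The idea of rejecting outliers before fitting can be sketched with scikit-learn alone. This is not the linked example, only a simplified stand-in: the synthetic data, the choice of `IsolationForest`, and `contamination=0.05` are all assumptions. `fit_predict` returns 1 for inliers and -1 for outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical data: a linear relationship with small Gaussian noise.
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.3, size=200)
y[:10] += 30.0  # corrupt ten targets by hand

# Flag outliers in the joint (X, y) space; contamination is a guess at the
# fraction of bad samples and has to be chosen by the user.
Xy = np.column_stack([X, y])
detector = IsolationForest(contamination=0.05, random_state=0)
is_inlier = detector.fit_predict(Xy) == 1

# Fit the model on the samples the detector kept.
model = LinearRegression().fit(X[is_inlier], y[is_inlier])
print(f"samples kept: {is_inlier.sum()} / {len(y)}")
print(f"estimated slope: {model.coef_[0]:.2f}")
```

Filtering by hand like this only makes sense on the training set. Wrapping the rejection step in a sampler lets it run inside a pipeline during `fit` only.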

In practice, it might be easier to use an estimator that is robust to outliers, either because of the loss it uses (e.g. HuberRegressor, or QuantileRegressor, which can provide an estimate of the median) or because its hyperparameters can be tuned (e.g. max_leaf_nodes of HistGradientBoosting) to avoid overfitting.
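As a small sketch of the robust-loss option, the snippet below fits an ordinary least-squares model and a `HuberRegressor` to the same corrupted data. The data are synthetic and the way the outliers are injected is an assumption made for the example. `HuberRegressor` is used with its default `epsilon`, and `max_iter` is raised only to give the solver more room to converge.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
# Hypothetical data: true slope 3 and intercept 1, with small Gaussian noise.
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=100)

# Corrupt the targets of the five largest X values so they pull the fit down.
corrupted = np.argsort(X.ravel())[-5:]
y[corrupted] -= 60.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(max_iter=1000).fit(X, y)

print(f"least-squares slope: {ols.coef_[0]:.2f}")
print(f"Huber slope:         {huber.coef_[0]:.2f}")
```

The Huber loss grows linearly rather than quadratically for large residuals, so the corrupted points have less influence on the fitted slope than they do with the squared loss.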

VictorVila · 17 March 2022 20:26

Thank you @glemaitre58 for this well-documented answer and for the practical advice. This subject goes far beyond what I expected, and I’m glad to discover that scikit-learn (and imbalanced-learn) have tools to deal with it.
Have a nice day!