Hello,
Do you have a rule of thumb that lets you tell noisy data apart from data that isn’t?
For example, if a data point is more than one standard deviation away from the mean, we could say that it’s noise and remove it from the dataset.
Thank you for this great course !
What you propose is known as outlier detection: you make some assumptions about the data distribution and reject the samples that are “out-of-distribution”.
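To make the rule of thumb from the question concrete, here is a minimal sketch using only the Python standard library. The helper name `split_by_zscore`, the toy values, and the thresholds are assumptions made for illustration; they are not taken from the course or from scikit-learn.

```python
import statistics


def split_by_zscore(values, threshold):
    """Split values into (kept, rejected) with a z-score cutoff.

    Hypothetical helper: a point is rejected when it lies more than
    `threshold` standard deviations away from the mean.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # All values are identical, so nothing can be flagged.
        return list(values), []
    kept, rejected = [], []
    for value in values:
        z_score = abs(value - mean) / std
        if z_score > threshold:
            rejected.append(value)
        else:
            kept.append(value)
    return kept, rejected


# Toy data with one obviously suspicious value (25.0).
with_outlier = [10.0, 10.5, 9.8, 10.2, 9.9, 10.1, 10.3, 9.7, 10.0, 25.0]
kept_outlier, rejected_outlier = split_by_zscore(with_outlier, threshold=2.0)
print(rejected_outlier)  # [25.0]

# Evenly spread data without any anomaly, using the 1-std rule.
clean = list(range(1, 11))
kept_clean, rejected_clean = split_by_zscore(clean, threshold=1.0)
print(rejected_clean)  # [1, 2, 9, 10]
```

The second call shows why a cutoff of one standard deviation is aggressive: it discards 4 perfectly ordinary points out of 10. Note also that the suspicious value inflates the mean and the standard deviation used to judge it, which is one reason dedicated outlier detectors are often preferred over this rule.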
The scikit-learn documentation discusses the topic in more detail: 2.7. Novelty and Outlier Detection — scikit-learn 1.0.2 documentation
I also have a toy example that uses such an algorithm to “clean” the dataset before fitting a model: Customized sampler to implement an outlier rejections estimator — Version 0.9.0
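As a rough idea of what such a cleaning step can look like, here is a sketch that uses only scikit-learn. It is not the linked example itself: the synthetic data, the choice of `IsolationForest` as the detector, and the `contamination` value are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): a linear relation y = 3 * x plus small noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=200)

# Corrupt the first 10 targets so that they no longer follow the relation.
y[:10] += 50.0

# Run the detector on the joint (X, y) space so that unusual targets can be
# flagged. fit_predict returns +1 for inliers and -1 for outliers.
Xy = np.column_stack([X, y])
detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(Xy)
inlier_mask = labels == 1

# Fit the model only on the samples that the detector kept.
model = LinearRegression().fit(X[inlier_mask], y[inlier_mask])
print(f"kept {inlier_mask.sum()} of {len(y)} samples")
print(f"slope estimated on the cleaned data: {model.coef_[0]:.2f}")
```

The filtering is done by hand here because a plain scikit-learn `Pipeline` cannot drop samples between steps; that is the gap a sampler such as the one in the linked example is meant to fill.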
In practice, it might be easier to use an estimator that is robust to outliers, either because of the loss it uses (e.g. HuberRegressor, or QuantileRegressor, which provides an estimate of the median) or by tuning its hyperparameters (e.g. max_leaf_nodes of the HistGradientBoosting estimators) to avoid overfitting.
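To illustrate the robust-loss option, here is a small sketch comparing `LinearRegression` with `HuberRegressor` on data containing a few corrupted targets. The synthetic data and the size of the corruption are assumptions made for illustration, and both estimators are left at their default settings.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Synthetic data (assumption): a linear relation y = 3 * x plus small noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=200)

# Corrupt 5% of the targets with a large positive shift.
y[:10] += 50.0

# Ordinary least squares uses a squared loss, so large residuals weigh heavily.
ols = LinearRegression().fit(X, y)

# The Huber loss is quadratic for small residuals and linear for large ones,
# which limits the influence of the corrupted samples. No sample is removed.
huber = HuberRegressor().fit(X, y)

print(f"OLS:   slope={ols.coef_[0]:.2f}, intercept={ols.intercept_:.2f}")
print(f"Huber: slope={huber.coef_[0]:.2f}, intercept={huber.intercept_:.2f}")
```

With this approach there is no threshold to choose for discarding points: every sample stays in the training set, and the loss decides how much each one counts.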
Thank you @glemaitre58 for this well-documented answer and for the practical advice. This subject goes far beyond what I expected, and I’m glad to discover that scikit-learn (and imbalanced-learn) have tools to deal with it.
Have a nice day !