The noise

Hello,
Is there a way to detect noise, and how can we avoid modeling just the noise in a dataset?
Thank you

There can be several kinds of noise, among which we can identify:

  • measurement imprecision from a physical sensor (e.g. temperature);
  • reporting errors by human collectors.

Those unpredictable data acquisition errors can happen either on the input features or on the target variable (in which case we often call it label noise).

Another source of noise is considering unrelated variables as input features, that is, variables with no strong statistical association with the target. Since our dataset is limited in size, the model may decide to use them to make its predictions, because on this small dataset they might appear related to the target variable just by chance. Note that this is a kind of “noise” only in the model predictions, not in the observed target variable itself.
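To make this concrete, here is a minimal, purely illustrative sketch (not from the MOOC material): on a small dataset, a flexible model can fit features that are pure random noise and look perfect on the training set, while cross-validation reveals that there is no real signal.

```python
# Illustrative sketch: purely random features can appear predictive on a
# small training set, as revealed by the gap between the training score
# and the cross-validated score.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_noise = rng.normal(size=(50, 20))   # 50 samples, 20 unrelated features
y = rng.normal(size=50)               # target independent of the features

model = DecisionTreeRegressor(random_state=0).fit(X_noise, y)
print("train R^2:", model.score(X_noise, y))                   # ~1.0: memorized noise
print("cv R^2:   ", cross_val_score(model, X_noise, y).mean()) # typically <= 0: no signal
```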

In practice, however, the most common source of “noise” is not necessarily real noise, but rather the absence of a measurement for a relevant feature whose variations have a deterministic impact on the target variable (and possibly also on the other observed features).

Since this missing/unobserved feature varies randomly from one sample to the next, it appears as if the target variable were changing because of a random perturbation or noise, even though no significant errors were made during the data collection process (besides not measuring the unobserved input feature).

An extreme case of this situation is revealed when the dataset contains duplicated samples with exactly the same input feature values but different values for the target variable.
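That extreme case is easy to check for with a simple group-by. Below is a hypothetical sketch (column names such as `surface`, `n_rooms` and `price` are placeholders) that finds feature combinations associated with more than one target value:

```python
# Find samples whose input features are identical but whose target values
# differ: noise that no model based on these features alone can explain.
import pandas as pd

df = pd.DataFrame({
    "surface": [50, 50, 80, 80],
    "n_rooms": [2, 2, 3, 3],
    "price":   [100_000, 120_000, 200_000, 200_000],
})

feature_cols = ["surface", "n_rooms"]
conflicting = (
    df.groupby(feature_cols)["price"]
      .nunique()
      .loc[lambda counts: counts > 1]
)
print(conflicting)  # feature combinations mapped to several target values
```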

Apart from this extreme case, it is hard to know for sure what should or should not qualify as noise, and which kind of “noise” introduced above is dominating, without access to a precise definition of the true (unknown) data-generating process. But in practice, the best way to make our predictive models robust to noise is to avoid overfitting by:

  • selecting models that are simple enough or with tuned hyper-parameters as explained in this module;
  • increasing the number of labeled samples in the training set;
  • pruning input variables that are unrelated to the target variable, as sketched after this list (not covered in this MOOC, see: automated Feature Selection and more manual Model inspection);
  • (advanced) choosing a model with a loss function that is adapted to, or at least tolerant of, the noise in the target induced by missing informative input measurements (e.g. quantile regression or specialized loss functions such as Tweedie regression).
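As a rough illustration of the second and third points (an assumption-laden sketch, not the MOOC's own code), one can combine automated feature selection with hyper-parameter tuning in a single cross-validated search:

```python
# Prune uninformative features (SelectKBest) and tune model complexity
# (Ridge alpha) jointly with cross-validation.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# 100 features, only 5 of which are actually related to the target.
X, y = make_regression(n_samples=200, n_features=100, n_informative=5,
                       noise=10.0, random_state=0)

pipeline = make_pipeline(SelectKBest(score_func=f_regression), Ridge())
param_grid = {
    "selectkbest__k": [5, 10, 50, 100],
    "ridge__alpha": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=5).fit(X, y)
print(search.best_params_)  # typically keeps only a few features
print(search.best_score_)   # cross-validated R^2 of the selected model
```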

Finally, there can also be bugs in the code in charge of performing the data collection or the database queries. Most of the time, those errors are not random and therefore do not really qualify as noise, but rather as an unwanted/unexpected systematic bias in the data acquisition process.


Thinking about this, I came up with an example:

Imagine you are trying to model house prices. One variable that will surely impact the price is the surface area, but the price is also affected by whether the seller is in a rush and decides to sell below the market price. A model will be able to make predictions based on the former variable but not on the latter, so the “seller’s rush” can only be regarded as noise and can never be isolated.
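A toy simulation of this example (all numbers and names are made up for illustration): the price depends on the observed surface and on an unobserved “seller's rush” discount, so a model using surface alone hits an irreducible error floor.

```python
# The unobserved rush_discount shows up as residual "noise" that no model
# based on surface alone can remove.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
surface = rng.uniform(30, 150, size=1_000)
rush_discount = rng.binomial(1, 0.2, size=1_000) * 30_000  # unobserved variable
price = 3_000 * surface - rush_discount

model = LinearRegression().fit(surface.reshape(-1, 1), price)
residuals = price - model.predict(surface.reshape(-1, 1))
print(residuals.std())  # remaining error caused by the unobserved variable
```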


@ArturoAmorQ I think both of our replies would benefit from being integrated one way or another in the content of the MOOC. For instance, we could have a dedicated section “What is noise in Machine Learning?” at the end of a notebook.

What do you think?

I love the idea! Should I tag it for session 3 or for this session?

Go for it! To make sure we remember this, create a GitHub issue mentioning this forum post.