Standard Scaling in different distributions

Hi, are there details, not only for StandardScaler but for other scaling methods as well, that we need to take into consideration when the data does not follow a Gaussian distribution?

Or does it not matter?

Thank you in advance.


Let’s say that we use StandardScaler to put all features on a comparable scale. It is not intended to change the shape of the distribution of the original features.
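A minimal sketch of that point: standard scaling only shifts and rescales a feature, so a skewed feature stays skewed after the transformation.

```python
# StandardScaler centers and rescales a feature but does not change the
# shape of its distribution: a skewed feature remains skewed.
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # right-skewed feature

X_scaled = StandardScaler().fit_transform(X)

print("mean/std after scaling:", X_scaled.mean(), X_scaled.std())
# Skewness is unchanged by a linear transformation such as standard scaling.
print("skew before:", skew(X.ravel()), "skew after:", skew(X_scaled.ravel()))
```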

However, normality is usually an assumption of some models (linear models, for instance), and not having a Gaussian distribution might not be optimal. That is indeed one of the reasons there are several other preprocessing methods, such as PowerTransformer or QuantileTransformer, that can transform a feature to follow a Gaussian distribution.
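For reference, here is a small sketch of those two transformers applied to the same skewed feature; both map it toward a Gaussian shape, unlike StandardScaler.

```python
# PowerTransformer and QuantileTransformer can reshape a skewed feature
# toward a Gaussian distribution.
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 1))  # heavily right-skewed

X_power = PowerTransformer(method="yeo-johnson").fit_transform(X)
X_quantile = QuantileTransformer(
    output_distribution="normal", n_quantiles=1000, random_state=0
).fit_transform(X)

print("skew original:", skew(X.ravel()))
print("skew power:   ", skew(X_power.ravel()))
print("skew quantile:", skew(X_quantile.ravel()))
```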

I don’t know if, in practice, we will always seek normality even with a model that is expecting it. I would like to know what insights @ogrisel has about this topic.


I don’t think linear models even assume normality of the input features. Linear regression generally assumes normality (and independence) of the residuals, not of the features.
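A quick sketch of that distinction, assuming a simple OLS fit: the features are non-Gaussian, but the residuals can still be checked for normality after fitting.

```python
# The normality assumption concerns the residuals, not the input features.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.exponential(size=(500, 1))                       # non-Gaussian feature
y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=500)    # Gaussian noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# D'Agostino-Pearson test: a large p-value means no evidence against normality.
stat, p_value = stats.normaltest(residuals)
print(f"normality test p-value for residuals: {p_value:.3f}")
```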

The presence of outliers in the input features can be a problem though, especially with a finite number of samples. It can cause StandardScaler to behave badly and probably causes unstable estimates of the decision function as well.
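To illustrate (RobustScaler is not mentioned above; it is used here only as a point of comparison): a single extreme outlier inflates the standard deviation used by StandardScaler and squashes the inliers, while a median/IQR-based scaler is much less affected.

```python
# One extreme outlier distorts StandardScaler (mean/std are not robust);
# RobustScaler (median/IQR) is less sensitive to it.
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
X[0, 0] = 1e4  # one extreme outlier

X_std = StandardScaler().fit_transform(X)
X_rob = RobustScaler().fit_transform(X)

# With StandardScaler the inflated std squashes the inliers near zero;
# RobustScaler keeps them on a useful scale.
print("inlier spread (StandardScaler):", X_std[1:].std())
print("inlier spread (RobustScaler):  ", X_rob[1:].std())
```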

But in general, it is rarely the case that the data generating process actually follows a linear model w.r.t. the original numerical feature space. It is quite common to perform non-linear feature engineering to give more expressiveness to the model (e.g. with splines and kernel expansions).
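As a sketch of that kind of non-linear feature engineering, here is a spline expansion feeding a linear model; the specific classes (SplineTransformer, Ridge) are just one possible choice for illustrating the idea.

```python
# Expanding a numerical feature with splines lets a linear model fit a
# non-linear relationship.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=500)  # non-linear target

# The spline expansion gives the (still linear) Ridge model enough
# expressiveness to capture the sine wave.
model = make_pipeline(SplineTransformer(degree=3, n_knots=10), Ridge(alpha=1e-3))
model.fit(X, y)
print("R^2 with spline features:", model.score(X, y))
```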
