Preprocessing: theoretical results

Hello,

Thanks a lot for your content!

I have a question about preprocessing: does it THEORETICALLY improve model performance?

If we take linear regression without preprocessing and assume that the X matrix is full rank, then X^T X is invertible. Provided the difference of scale doesn't cause rounding issues in the computation of the inverse, I don't see why preprocessing would improve the prediction.

Same with gradient descent: different scales will cause issues and delay convergence (because the gradient might over-focus on one direction), but assuming we do reach convergence, will it improve the prediction?

If the answer is no, but you have examples of other models that will suffer without preprocessing, I'm more than interested. I took regression as it's a model where we look at Euclidean distance (so it is sensitive to scale differences).

Thanks a lot,


If the model that you are playing with has a convex loss, the minimum of the loss, and thus the optimal model, will be exactly the same. However, you might not have the time to converge, as you are stating. This is indeed why we propose to do scaling with linear models: we don't expect the model to perform better in terms of statistical performance.
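To make this concrete, here is a minimal sketch (toy data, not from the course material): ordinary least squares is solved exactly, and scaling is only a linear reparametrization, so the predictions do not change.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with one feature on a much larger scale than the others.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X[:, 0] *= 1_000

raw = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# The coefficients differ (they live on different scales), but the fitted
# predictions are identical up to numerical precision.
print(np.allclose(raw.predict(X), scaled.predict(X)))  # True
```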

I took regression as it's a model where we look at Euclidean distance (so it is sensitive to scale differences)

A model using Euclidean distances is a model where features with different ranges will degrade the performance. I recall discussing this issue with @ogrisel while writing this specific sentence: we thought that a KNeighborsClassifier in a high-dimensional space (a lot of features) would have issues with unscaled features. In the same spirit, a KMeans algorithm would go sideways as well.
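Here is a rough sketch of that effect (toy data I made up, not the course notebooks): adding a single pure-noise feature on a huge scale is enough to hurt a KNeighborsClassifier unless the features are standardized.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
# A pure-noise feature on a huge scale dominates the Euclidean distance.
noise = rng.normal(scale=1_000, size=(X.shape[0], 1))
X = np.hstack([X, noise])

knn = KNeighborsClassifier()
print(cross_val_score(knn, X, y).mean())  # typically close to chance level
print(cross_val_score(make_pipeline(StandardScaler(), knn), X, y).mean())  # much higher
```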


To complement @glemaitre58's answer, it also depends on the kind of preprocessing. I assume you have the scaling of numerical features in mind, in which case Guillaume's answer should clarify your concerns.

But when preprocessing numerical features for linear models, there exist other kinds of preprocessing, for instance: QuantileTransformer, KBinsDiscretizer, SplineTransformer, Nystroem… All of those preprocessors transform numerical features in a non-linear manner and therefore change the expressivity of a pipeline that uses them in conjunction with a downstream linear model; they can therefore improve or degrade the model performance depending on the true (unknown) data-generating process and the size of the training set.
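For example, here is a hedged sketch (my own toy sine data) where a SplineTransformer increases the expressivity of a pipeline ending in a Ridge model, and in this case improves the cross-validated score because the true relationship is non-linear:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.1, size=300)  # non-linear target

linear = Ridge()
splines = make_pipeline(SplineTransformer(n_knots=10), Ridge())

print(cross_val_score(linear, X, y).mean())   # low R^2: a line cannot fit a sine
print(cross_val_score(splines, X, y).mean())  # much higher R^2
```

With a much smaller training set, the extra spline features could just as well overfit and degrade the score, which is the other side of the same coin.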


That is an interesting question.

Regarding the theory, without going into any proof, I would say (and I'm talking under the supervision of the staff) that scaling is statistically important as:

  • For linear regression problems, we want the errors to be homoscedastic, i.e. the noise has the same variance for every observation, for accurate inference.
    So homoscedastic means the variance is the same across all our observations (homo means “same” and scedastic means “scale”). Actually, I would add that we want not only the same scale but also the same shape.
    Heteroscedasticity may cause Ordinary Least Squares to produce biased variance estimates for the parameters/coefficients (not biased parameters themselves), which would lead to biased inference.
    A non-linear transformation, e.g. a log-transformation, can also improve homoscedasticity.

  • Algorithms that use distance metrics, such as SVM, KNN, K-Means, etc., obviously benefit from feature scaling, which helps prevent one feature from dominating the others. For example, the Euclidean distance is sensitive to the magnitude of the feature values.

Otherwise, scaling also helps the models converge faster, as illustrated by the sketch below. For gradient descent, it can help avoid getting stuck in local optima for non-convex functions.
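As an illustration of that convergence point (a sketch on toy data, not a course exercise), a LogisticRegression fitted on badly scaled features typically reaches a similar accuracy but needs more solver iterations:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X[:, 0] *= 1_000  # one feature on a much larger scale

unscaled = LogisticRegression(max_iter=10_000).fit(X, y)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10_000)).fit(X, y)

# Both reach a very similar training accuracy, but the badly conditioned
# unscaled problem typically needs many more solver iterations to converge.
print(unscaled.score(X, y), scaled.score(X, y))
print(unscaled.n_iter_, scaled[-1].n_iter_)
```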

We have (what I hope to be) a nice example in scikit-learn to show the problem of heteroscedasticity and asymmetry: Quantile regression — scikit-learn 1.0.2 documentation

As mentioned by @qdpham, OLS might provide a biased estimator, and using a quantile regression might give a more robust estimator.
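As a hedged sketch of that last point (toy data of my own, not the linked scikit-learn example), here is a comparison between an OLS fit and a median fit from QuantileRegressor on a target with skewed, heteroscedastic noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, QuantileRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 1))
# Skewed noise whose spread grows with X (heteroscedastic and asymmetric).
y = 2 * X[:, 0] + rng.exponential(scale=1 + X[:, 0])

ols = LinearRegression().fit(X, y)
median = QuantileRegressor(quantile=0.5, alpha=0, solver="highs").fit(X, y)

# The mean fit is pulled upward by the skewed noise, while the median fit
# tracks the central trend and is less sensitive to the large deviations.
print(ols.coef_, ols.intercept_)
print(median.coef_, median.intercept_)
```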