Necessity of preprocessing test data

cpumarfrohberg · 6 March 2022 15:55

hi,

When preprocessing data as in the current notebook (i.e. scaling), isn't it necessary to preprocess the test data as well?

If yes, is it right to assume that .transform(X_test) transforms the features of X_test using the mean and variance learned for each feature in .fit(X_train)?

Also, assuming train_test_split() were applied with a 50-50 split (unrealistic, I admit) between X_train and X_test: would applying the parameters learned from X_train affect the transformation of X_test?

Thanks in advance for the clarification!

glemaitre58 · 6 March 2022 17:24

If you scale the training set, you should scale the test set as well.

Exactly: you should use the statistics learned from the training set. Imagine a system running in production. If this system provides you with a single sample, there is no way to compute a meaningful mean or standard deviation from it. Therefore, you assume that the new sample comes from the same distribution as the training samples and apply the same transformation.
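To make this concrete, here is a minimal sketch using StandardScaler with a 50-50 split. The synthetic data, the random seeds, and the values in new_sample are assumptions chosen purely for illustration; they are not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature dataset (assumed, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 2))

# 50-50 split, as in the question.
X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)  # learns mean_ and scale_ from the training set only

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training statistics

# A single "production" sample has no useful statistics of its own,
# but it can still be transformed with the stored training statistics.
new_sample = np.array([[12.0, 7.0]])  # hypothetical incoming sample
new_sample_scaled = scaler.transform(new_sample)

# The transform is just (x - training mean) / training std, per feature.
manual = (new_sample - scaler.mean_) / scaler.scale_
print(np.allclose(new_sample_scaled, manual))
```

The size of the split does not change the mechanism: whatever fraction goes to X_test, transform() always applies the statistics stored during fit(X_train).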

All we can say is that we use the empirical mean and standard deviation from the training set, hoping that there is no distribution shift between the training and test sets.
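A rough sketch of what such a shift would look like, using NumPy only. The distributions, sample sizes, and the scale() helper are hypothetical and chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# One feature; all distributions below are assumed for illustration.
x_train = rng.normal(loc=5.0, scale=2.0, size=1000)
x_test_same = rng.normal(loc=5.0, scale=2.0, size=1000)     # same distribution
x_test_shifted = rng.normal(loc=8.0, scale=2.0, size=1000)  # shifted mean

# Statistics are estimated on the training set only.
train_mean = x_train.mean()
train_std = x_train.std()

def scale(x):
    """Standardize x with the training statistics (hypothetical helper)."""
    return (x - train_mean) / train_std

# Training data: mean is 0 by construction.
print("train:        ", scale(x_train).mean())
# Test data from the same distribution: close to 0, but not exactly 0.
print("test (same):  ", scale(x_test_same).mean())
# Shifted test data: clearly away from 0.
print("test (shift): ", scale(x_test_shifted).mean())
```

Without shift, the scaled test set is only approximately centered, which is expected and fine. With shift, the scaled values land far from the range the model saw during training.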