StandardScaler: fit to training set or entire dataset?

My understanding is that the standard scaler (StandardScaler) is fit to the training dataset only, and the transform is then applied to the entire dataset. Is this correct, or is the standard scaler fit to the entire dataset?

If it is fit to only the training set, does this change the relationship between observations within a variable within the entire dataset?

Alternatively, if it is fit to the entire dataset, I assume this would result in leakage, because the test set is not completely unseen when fitting the model. Is this problematic?

Many thanks for your help,
Ashlea


I’m not one of the organizers, but I’ll try to answer anyway, errors and omissions excepted :slightly_smiling_face: A transform is an estimator (an object with a fit method), and it’s fit to a dataset as explained here. In other words, the model state (which for StandardScaler is given by the arrays scaler.mean_ and scaler.scale_, i.e., a mean and a scale for each column/feature of the dataset) is learned from a dataset. This dataset is the training set, as you can see:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(data_train)  # learns mean_ and scale_ from the training set only

So there’s no data leakage: you learn the model state (the column means and scales) on the training set. Then, when you call predict on a test set, the means and scales are not recomputed from the data; the transform method is simply called with the fixed state learned from the training set. For example:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(data_train, target_train)

predicted_target = model.predict(data_test)

As explained in the lecture, inside predict:

The method transform of each transformer (here a single transformer) is called to preprocess the data. Note that there is no need to call the fit method for these transformers because we are using the internal model states computed when calling model.fit. The preprocessed data is then provided to the predictor that will output the predicted target by calling its method predict.

The StandardScaler internal state is thus learned on the training set, and applied in the same way to the training and the test set. So, at the same time, there’s no data leakage from train to test, and there’s no change in “the relationship between observations within a variable” (whatever that may exactly mean), because exactly the same transform is applied to all observations of a dataset. For all observations of a feature X, I subtract exactly the same mean and divide by exactly the same scale, irrespective of whether the observation belongs to the training set or the test set.
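To make this concrete, here is a minimal sketch (synthetic arrays standing in for the course’s data_train/data_test) checking that the fitted statistics stay frozen when transforming the test set:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
data_train = rng.normal(loc=5.0, scale=2.0, size=(100, 1))  # toy training feature
data_test = rng.normal(loc=5.0, scale=2.0, size=(20, 1))    # toy test feature

scaler = StandardScaler().fit(data_train)
mean_before, scale_before = scaler.mean_.copy(), scaler.scale_.copy()

# transform() only applies the stored statistics; it never re-estimates them,
# so train and test are shifted and scaled with exactly the same numbers.
train_scaled = scaler.transform(data_train)
test_scaled = scaler.transform(data_test)

assert np.allclose(scaler.mean_, mean_before)
assert np.allclose(scaler.scale_, scale_before)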


I also understood that this is what is happening: The mean and standard deviation are “learned” from the training set, and then used to apply the transformation on any other set (for instance, testing data).

It may be a desirable property that the transformation is “the same” for every data to which it is applied.
However, intuitively, this somehow feels “wrong” to me: I want to standardize any data set (for instance, testing data) using its own mean and standard deviation.

Using the training set’s parameters does not guarantee that the transformed testing data will be “correctly” standardized. By “correctly” I mean that I expect the transformed data to have mean zero and standard deviation one.
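For instance, a quick check on synthetic data (not the course dataset) makes this concrete: the test data transformed with the training statistics ends up close to, but not exactly at, mean zero and standard deviation one:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
data = rng.normal(loc=10.0, scale=3.0, size=(1000, 1))
data_train, data_test = data[:800], data[800:]

test_scaled = StandardScaler().fit(data_train).transform(data_test)

# Close to (0, 1) because train and test come from the same distribution,
# but not exactly (0, 1), since the statistics were estimated on the train set.
print(test_scaled.mean(), test_scaled.std())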

I would be interested in further comments about this.

Preprocessing a dataset the way you describe might lead to incorrect preprocessing. I attached a figure that shows the improper scaling:

In practice, it might not be as dramatic as in the example, because with many samples the means of the testing and training sets will be close. However, this is still a bad idea :slight_smile:
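The same effect the figure illustrates can be reproduced in a small sketch on synthetic data (hypothetical values, not the course dataset): re-fitting a scaler on a small test set forces its mean to zero and hides where that sample actually sits relative to the training distribution.

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
data_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 1))
# a small test sample whose empirical mean happens to sit to the right
data_test = rng.normal(loc=0.5, scale=1.0, size=(10, 1))

scaler = StandardScaler().fit(data_train)
test_with_train_stats = scaler.transform(data_test)              # consistent scaling
test_with_own_stats = StandardScaler().fit_transform(data_test)  # separate fit on test

# The separately fitted version is forced to mean 0, erasing the fact that
# this test sample really lies to the right of the training distribution.
print(test_with_train_stats.mean(), test_with_own_stats.mean())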


I’m not one of the organizers, but I’ll try to answer anyway, errors and omissions excepted

@AndreaPie just for completeness: it is great to see users answering other users’ questions, so don’t hesitate to keep doing it if you feel like it!

You don’t have to be one of the course creators to provide useful answers :wink:. In this particular case, I think your answer is very thorough!


Given that scaling the test set with the mean and std of the train set leads to wrong results, is there a way to “force” the pipeline predict method to call .fit_transform instead of .transform?

I think you misinterpreted the comment: scaling the testing data with the testing statistics is what leads to wrong results. Scaling with the training statistics is the right approach here, and the scikit-learn Pipeline will do the right thing for you.
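To see that this is indeed what the pipeline does, here is a small sketch (synthetic data; the step name "standardscaler" is the one make_pipeline derives from the class name) checking that predict leaves the fitted statistics untouched:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
data_train = rng.normal(size=(100, 2))
target_train = (data_train[:, 0] > 0).astype(int)
data_test = rng.normal(size=(20, 2))

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(data_train, target_train)

scaler = model.named_steps["standardscaler"]
mean_after_fit = scaler.mean_.copy()

model.predict(data_test)  # internally calls scaler.transform, never fit

# predict leaves the scaler's statistics untouched: they stay those of the train set.
assert np.allclose(scaler.mean_, mean_after_fit)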


Thanks! I’m flattered that one of the organizers found my answer to be good :relaxed:

Thanks for these graphical examples.

It made me realize that we want the (transformed) test data to belong to the same “space” as the (transformed) training data:
If one testing data point is at a given position in the original data space relative to other points, it should still be in the same relative position in the transformed data space.
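A tiny sketch (synthetic data) of that property: because standardization applies one fixed affine map per feature (subtract one mean, divide by one positive scale), the ordering of observations within a feature is unchanged by the transform:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
data_train = rng.normal(loc=3.0, scale=2.0, size=(50, 1))
data_test = rng.normal(loc=3.0, scale=2.0, size=(10, 1))

scaler = StandardScaler().fit(data_train)
all_raw = np.vstack([data_train, data_test])
all_scaled = scaler.transform(all_raw)

# The same affine map is applied to every sample, so relative positions
# (here, the ordering) of the observations are preserved.
assert np.array_equal(np.argsort(all_raw[:, 0]), np.argsort(all_scaled[:, 0]))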
