Scaling test data with the statistics computed on the train data

suforraxi · 25 October 2022 17:56

Hi all,

I was wondering what the potential drawbacks (if any) are of scaling the test data with the transformer fitted on the train data.

Usually the train set is much bigger than the test set, so its estimated statistics (mean/std) are probably closer to the population values than those of the test set. However, if the split is 50-50, could it be that we are not scaling the test data properly (i.e., that the train statistics introduce some bias)? What are the implications?

Thanks,
Matteo

ogrisel · 25 October 2022 18:52

The general rule is that if you want to evaluate the quality of a prediction pipeline on some test data, you should not use any of the test data to adjust any tunable parameter of any component of the pipeline; otherwise you might be cheating (data leakage) without realizing it.
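
For the scaling case, a minimal sketch of that rule could look like the following. It uses only the Python standard library; the `TrainFittedScaler` class is a hypothetical stand-in for a fitted transformer (it is not a library API), and the synthetic Gaussian data and the 50-50 split are assumptions made purely for illustration.

```python
import random
import statistics


class TrainFittedScaler:
    """Hypothetical minimal standard scaler (stand-in for a library transformer)."""

    def fit(self, values):
        # Learn the tunable parameters (mean and std) from the given data only.
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Apply the already-learned parameters; nothing is re-estimated here.
        return [(v - self.mean_) / self.std_ for v in values]


# Synthetic data (assumption): 200 draws from a Gaussian, split 50-50.
rng = random.Random(0)
data = [rng.gauss(10.0, 2.0) for _ in range(200)]
train, test = data[:100], data[100:]

scaler = TrainFittedScaler().fit(train)  # parameters come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)     # test data never influences mean_ / std_
```

With this setup the scaled train data has mean 0 and standard deviation 1 by construction, while the scaled test data will generally be slightly off from those values. That is expected: the test data is treated exactly like new, unseen data would be at prediction time. In scikit-learn the same pattern is to call `fit` on the train data and `transform` on the test data, or to put the scaler and the model in a `Pipeline` so that this happens automatically.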