Use of power_transform

Good morning,
I need some explanations about this remark in scikit-learn documentation about “power_transform”: " A common mistake is to apply it to the entire data before splitting into training and test sets. This will bias the model evaluation because information would have leaked from the test set to the training set."

Does it mean we should apply Box-Cox, for instance, only on the training data? And what if we want to make the target distribution more Gaussian: should we also do that only on the training data? That does not seem correct to me.

> Does it mean we should apply Box-Cox, for instance, only on the training data?

You should learn the Box-Cox transform on the training data and then apply it to the test data. Note that this is true for any pre-processing in general: fit the transformer on the training set only, then use the fitted transformer on both sets.
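A minimal sketch of that pattern with scikit-learn's `PowerTransformer` (the data here is just an illustrative lognormal sample, not from the original thread):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
# Skewed, strictly positive feature (Box-Cox requires positive values)
X = rng.lognormal(size=(100, 1))

X_train, X_test = train_test_split(X, random_state=0)

pt = PowerTransformer(method="box-cox")
pt.fit(X_train)                    # lambda is estimated from the training data only
X_train_t = pt.transform(X_train)
X_test_t = pt.transform(X_test)    # the same training lambda is reused on the test set
```

Using a `Pipeline` makes this automatic: calling `fit` on the pipeline fits the transformer on the training data only, and `predict` on test data reuses the fitted transformer.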

There is a bit more info about data leakage here: 10. Common pitfalls and recommended practices — scikit-learn 0.24.2 documentation. Also 📃 Solution for Exercise 01 — Scikit-learn course may be useful, it gives a particularly striking example of data leakage for feature selection.


Thank you very much for the time you took to answer my question.