M1 Wrap-Up quiz Q5 SimpleImputer question

Did I miss a prior lesson about SimpleImputer? Seems to be the first mention of it here…

Anyway, my question is: in a pipeline, should StandardScaler() come first (always?) or SimpleImputer()? If one order is preferred, why, and when does it not matter?

My cross-validation accuracy seems to be the same either way.

I had the same question, but thinking more about it, the order does not matter.

For example, if you consider the data 1, 2, NaN, NaN, NaN, 6, its mean (ignoring the NaN values) is 3.

And if you consider the data in which you replace NaN by the mean (3), the mean is still 3: 1, 2, 3, 3, 3, 6.

By the formulas for the mean and variance, the standardized data Z will still have a mean of 0 and a standard deviation of 1.
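If it helps, here is a minimal NumPy sketch of that arithmetic (the values are just the toy column from the example above):

```python
import numpy as np

# Toy column from the example above
x = np.array([1.0, 2.0, np.nan, np.nan, np.nan, 6.0])

print(np.nanmean(x))              # 3.0 -- mean computed while ignoring the NaNs
x_imputed = np.where(np.isnan(x), np.nanmean(x), x)
print(x_imputed)                  # [1. 2. 3. 3. 3. 6.]
print(x_imputed.mean())           # still 3.0 -- mean imputation leaves the mean unchanged

# After standardization, the transformed column has mean 0 and std 1
z = (x_imputed - x_imputed.mean()) / x_imputed.std()
print(z.mean(), z.std())          # ~0.0 and 1.0
```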

Let’s simplify things for the moment and assume that scikit-learn transformers omit missing values when computing statistics; that is indeed the case for StandardScaler.

Let’s discuss the pipeline make_pipeline(SimpleImputer(strategy="constant", fill_value=10_000), StandardScaler()). Here, I deliberately impute with an extreme value to illustrate the drawback of this approach: StandardScaler will use these imputed values when computing the mean and standard deviation, which can be an issue depending on the underlying distribution of the feature.
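For instance, here is a small sketch (on a toy one-column feature, just for illustration) showing how the extreme fill value leaks into the scaler's statistics:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy feature with missing values (illustrative data only)
X = np.array([[1.0], [2.0], [np.nan], [np.nan], [np.nan], [6.0]])

impute_then_scale = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=10_000),
    StandardScaler(),
)
impute_then_scale.fit(X)

# The scaler's statistics are computed on the already-imputed data, so the
# extreme fill value inflates both the mean and the standard deviation.
scaler = impute_then_scale.named_steps["standardscaler"]
print(scaler.mean_)   # ~5001.5 instead of 3.0
print(scaler.scale_)  # huge compared to the spread of the observed values
```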

On the contrary, make_pipeline(StandardScaler(), SimpleImputer(strategy="constant", fill_value=10_000)) has the advantage that feature values are scaled while omitting missing values, and only then imputed. Thus, the imputation strategy has no impact on the computation of the scaling statistics.
When possible, it is therefore the better practice, as the sketch below illustrates.
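Here is the reversed order on the same toy feature as before, showing that the scaler's statistics now ignore the missing entries:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Same toy feature as above
X = np.array([[1.0], [2.0], [np.nan], [np.nan], [np.nan], [6.0]])

scale_then_impute = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy="constant", fill_value=10_000),
)
scale_then_impute.fit(X)

# StandardScaler skips NaN when computing its statistics, so the mean and
# scale reflect only the observed values; imputation happens afterwards and
# does not influence them.
scaler = scale_then_impute.named_steps["standardscaler"]
print(scaler.mean_)   # 3.0 -- computed from 1, 2 and 6 only
print(scaler.scale_)  # std of the observed values only
```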

However, not all transformers in scikit-learn omit missing values (to my knowledge, all scalers do), so you might sometimes need the first ordering. That said, we are currently working in scikit-learn to make it possible to ignore missing values during preprocessing for the transformers that do not support it yet.


I forgot to mention that if your imputation is consistent with the feature distribution, it will have little impact on the scaling, and thus you will not observe any change in the test scores. It is just that it can sometimes go sideways, and my explanation above gives the reason why.