The issue comes from the scaling in this case:
```python
import pandas as pd

# Load the Ames housing data; "?" marks missing values in this CSV
ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
target_name = "SalePrice"
data = ames_housing.drop(columns=target_name)
target = ames_housing[target_name]

numerical_features = [
    "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
    "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
    "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
    "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
    "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]
data_numerical = data[numerical_features]

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Two pipelines that differ only in the order of imputation and scaling
model_imputer_first = make_pipeline(
    SimpleImputer(), StandardScaler(), LinearRegression()
).fit(data_numerical, target)
model_scaler_first = make_pipeline(
    StandardScaler(), SimpleImputer(), LinearRegression()
).fit(data_numerical, target)
```
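Before comparing the fitted transformers, it helps to know which of these numerical columns actually contain missing values. A quick check (for this dataset it should flag LotFrontage and MasVnrArea):

```python
# List the numerical columns that contain at least one missing value
data_numerical.columns[data_numerical.isna().any()].tolist()
```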
We can check that the means learned by the two StandardScaler steps are close:
```python
import numpy as np

# Compare the mean_ learned by the StandardScaler step in each pipeline
np.testing.assert_allclose(
    model_scaler_first[0].mean_,
    model_imputer_first[1].mean_,
)
```
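This passes, and it is expected: SimpleImputer defaults to strategy="mean", and replacing the missing entries of a column with that column's mean leaves the mean unchanged, while StandardScaler simply ignores NaNs when computing its statistics. A minimal illustration on LotFrontage, one of the columns with missing values:

```python
col = data_numerical["LotFrontage"]

# Mean-imputation does not change the column mean
assert np.isclose(np.nanmean(col), col.fillna(col.mean()).mean())
```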
and then the scale:
```python
# Now compare the scale_ (per-feature standard deviation) of each StandardScaler
np.testing.assert_allclose(
    model_scaler_first[0].scale_,
    model_imputer_first[1].scale_,
)
```
```
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-57-9b05a422bc3e> in <module>
----> 1 np.testing.assert_allclose(
      2     model_scaler_first[0].scale_,
      3     model_imputer_first[1].scale_
      4 )

    [... skipping hidden 1 frame]

/opt/conda/lib/python3.9/site-packages/numpy/testing/_private/utils.py in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision, equal_nan, equal_inf)
    840                                 verbose=verbose, header=header,
    841                                 names=('x', 'y'), precision=precision)
--> 842                 raise AssertionError(msg)
    843         except ValueError:
    844             import traceback

AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 2 / 24 (8.33%)
Max absolute difference: 2.25816052
Max relative difference: 0.10256683
 x: array([2.427464e+01, 9.977846e+03, 1.810038e+02, 4.559419e+02,
       1.612640e+02, 4.417156e+02, 4.385551e+02, 3.864553e+02,
       4.363789e+02, 4.860643e+01, 5.253004e+02, 8.154986e-01,...
 y: array([2.201648e+01, 9.977846e+03, 1.805073e+02, 4.559419e+02,
       1.612640e+02, 4.417156e+02, 4.385551e+02, 3.864553e+02,
       4.363789e+02, 4.860643e+01, 5.253004e+02, 8.154986e-01,...
```
Here the scale values of the first and third features (LotFrontage and MasVnrArea, the two numerical columns containing missing values) are slightly different. This is because the two orders estimate the standard deviation on different data: StandardScaler ignores NaNs, so scaling first uses only the observed values, whereas imputing first fills the missing entries with the column mean, which shrinks the standard deviation of those columns.
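We can reproduce both scale_ values for LotFrontage by hand (a minimal sketch; the expected numbers are read off the x and y arrays above):

```python
col = data_numerical["LotFrontage"].to_numpy()

# Scaler first: the standard deviation is computed on observed values only
print(np.nanstd(col))                                  # ~24.27, the x value

# Imputer first: mean-imputation shrinks the spread of the column
imputed = np.where(np.isnan(col), np.nanmean(col), col)
print(np.std(imputed))                                 # ~22.02, the y value
```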
So indeed, this little change can make the LinearRegression go sideways due to numerical issues (which is one of the reasons why we always advise using a Ridge model). I find it a pity that this completely changes the answers here, since that is not the goal of the exercise. We should solve this issue, and probably the easiest thing for this section is to be directive regarding the order of the transformers in the pipeline.
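For instance, the exercise statement could prescribe the imputer-first order explicitly. A sketch of what that could look like (Ridge and its default alpha are just one possible choice here, not necessarily what the exercise uses):

```python
from sklearn.linear_model import Ridge

# Being directive about the order: impute first, then scale, then fit
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    Ridge(alpha=1.0),
).fit(data_numerical, target)
```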