The issue comes from the scaling in this case:
```python
import pandas as pd

# Load the Ames housing data; "?" marks missing values in this CSV
ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
target_name = "SalePrice"
data = ames_housing.drop(columns=target_name)
target = ames_housing[target_name]

numerical_features = [
    "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
    "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
    "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
    "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
    "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]
data_numerical = data[numerical_features]

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Two pipelines that differ only in the order of imputation and scaling
model_imputer_first = make_pipeline(
    SimpleImputer(), StandardScaler(), LinearRegression()
).fit(data_numerical, target)
model_scaler_first = make_pipeline(
    StandardScaler(), SimpleImputer(), LinearRegression()
).fit(data_numerical, target)
```
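Before comparing the fitted transformers, it helps to know which of these numerical columns actually contain missing values. A quick check (for this dataset it should flag LotFrontage and MasVnrArea):

```python
# List the numerical columns that contain at least one missing value
data_numerical.columns[data_numerical.isna().any()].tolist()
```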
We can check that the means learned by the two StandardScaler steps are close:
```python
import numpy as np

# Compare the mean_ learned by the StandardScaler step in each pipeline
np.testing.assert_allclose(
    model_scaler_first[0].mean_,
    model_imputer_first[1].mean_,
)
```
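This passes, and it is expected: SimpleImputer defaults to strategy="mean", and replacing the missing entries of a column with that column's mean leaves the mean unchanged, while StandardScaler simply ignores NaNs when computing its statistics. A minimal illustration on LotFrontage, one of the columns with missing values:

```python
col = data_numerical["LotFrontage"]

# Mean-imputation does not change the column mean
assert np.isclose(np.nanmean(col), col.fillna(col.mean()).mean())
```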
and then the scale:
```python
# Now compare the scale_ (per-feature standard deviation) of each StandardScaler
np.testing.assert_allclose(
    model_scaler_first[0].scale_,
    model_imputer_first[1].scale_,
)
```
```
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-57-9b05a422bc3e> in <module>
----> 1 np.testing.assert_allclose(
      2     model_scaler_first[0].scale_,
      3     model_imputer_first[1].scale_
      4 )

    [... skipping hidden 1 frame]

/opt/conda/lib/python3.9/site-packages/numpy/testing/_private/utils.py in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision, equal_nan, equal_inf)
    840                                 verbose=verbose, header=header,
    841                                 names=('x', 'y'), precision=precision)
--> 842                 raise AssertionError(msg)
    843         except ValueError:
    844             import traceback

AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 2 / 24 (8.33%)
Max absolute difference: 2.25816052
Max relative difference: 0.10256683
 x: array([2.427464e+01, 9.977846e+03, 1.810038e+02, 4.559419e+02,
       1.612640e+02, 4.417156e+02, 4.385551e+02, 3.864553e+02,
       4.363789e+02, 4.860643e+01, 5.253004e+02, 8.154986e-01,...
 y: array([2.201648e+01, 9.977846e+03, 1.805073e+02, 4.559419e+02,
       1.612640e+02, 4.417156e+02, 4.385551e+02, 3.864553e+02,
       4.363789e+02, 4.860643e+01, 5.253004e+02, 8.154986e-01,...
```
Here the scale values of the first and third features (LotFrontage and MasVnrArea, the two numerical columns containing missing values) are slightly different. This is because the two orders estimate the standard deviation on different data: StandardScaler ignores NaNs, so scaling first uses only the observed values, whereas imputing first fills the missing entries with the column mean, which shrinks the standard deviation of those columns.
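We can reproduce both scale_ values for LotFrontage by hand (a minimal sketch; the expected numbers are read off the x and y arrays above):

```python
col = data_numerical["LotFrontage"].to_numpy()

# Scaler first: the standard deviation is computed on observed values only
print(np.nanstd(col))                                  # ~24.27, the x value

# Imputer first: mean-imputation shrinks the spread of the column
imputed = np.where(np.isnan(col), np.nanmean(col), col)
print(np.std(imputed))                                 # ~22.02, the y value
```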
So indeed, this little change can make the LinearRegression go sideways due to numerical issues (which is one of the reasons why we always advise using a Ridge model). I find it a pity that this completely changes the answers here, since that is not the goal of the exercise. We should solve this issue, and probably the easiest thing for this section is to be directive regarding the order of the transformers in the pipeline.
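For instance, the exercise statement could prescribe the imputer-first order explicitly. A sketch of what that could look like (Ridge and its default alpha are just one possible choice here, not necessarily what the exercise uses):

```python
from sklearn.linear_model import Ridge

# Being directive about the order: impute first, then scale, then fit
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    Ridge(alpha=1.0),
).fit(data_numerical, target)
```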