Q6: used the code from the solution, got the wrong answer!

The result I got for Q6 was:

The model using all features is performing better 5 times out of 10 than the model using only numerical features.

I used the code that was supplied in the solution in Q5 and Q6, verbatim. Literally cut and pasted it. Consequently I got Q6 wrong, and I don’t know why. There is no opportunity for me to learn from this.

Here’s my complete code:

import pandas as pd
ames_housing = pd.read_csv("../datasets/ames_housing_no_missing.csv")

target_name = "SalePrice"
data, target = ames_housing.drop(columns=target_name), ames_housing[target_name]
target = (target > 200_000).astype(int)

from sklearn.compose import make_column_selector as selector

num = selector(dtype_exclude=object)
cat = selector(dtype_include=object)

numerical_features = num(data)
categorical_features = cat(data)

data_numerical = data[numerical_features]

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

from sklearn import set_config
set_config(display='diagram')

model = make_pipeline(StandardScaler(), LogisticRegression())
cv_results_num = cross_validate(model, data_numerical, target, cv=10)
cv_results_num["test_score"].mean()

from sklearn.compose import make_column_transformer

categorical_features = data.columns.difference(numerical_features)

categorical_processor = OneHotEncoder(handle_unknown="ignore")
numerical_processor = StandardScaler()

preprocessor = make_column_transformer(
    (categorical_processor, categorical_features),
    (numerical_processor, numerical_features),
)
model = make_pipeline(preprocessor, LogisticRegression(max_iter=1_000))
cv_results_all = cross_validate(model, data, target, cv=10)
cv_results_all["test_score"].mean()

cv_results_num["test_score"] > cv_results_all["test_score"]

print("The model using all features is performing better "
      f"{sum(cv_results_num['test_score'] < cv_results_all['test_score'])} "
      "times out of 10 than the model using only numerical features.")

The solution does not define numerical_features and categorical_features in this way:

num = selector(dtype_exclude=object)
cat = selector(dtype_include=object)

numerical_features = num(data)
categorical_features = cat(data)

We provide the list of numerical features to use just before Question 5:

numerical_features = [
  "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
  "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
  "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
  "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
  "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

Using these features, your results should be in line with the expected outputs.
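To make the distinction concrete, here is a minimal sketch using a tiny synthetic DataFrame as a stand-in for the Ames data (the column values are made up for illustration): a dtype-based selector picks up every numerical column, while an explicit list like the one given before Q5 can be a strict subset of them.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Synthetic stand-in for the Ames data: the real CSV has many more columns.
data = pd.DataFrame({
    "LotArea": [8450, 9600],     # numerical, in the provided subset
    "YearBuilt": [2003, 1976],   # numerical, but NOT in the provided subset
    "MSZoning": ["RL", "RL"],    # categorical
})

# What the earlier notebooks taught: select *all* numerical columns by dtype.
all_numerical = selector(dtype_exclude=object)(data)

# What the exercise actually provides: an explicit (smaller) list.
provided_subset = ["LotArea"]

# The provided list is a strict subset of all numerical columns,
# so the two approaches feed different features to the model.
assert set(provided_subset) < set(all_numerical)
print(sorted(set(all_numerical) - set(provided_subset)))  # → ['YearBuilt']
```

Because the two feature sets differ, the cross-validation scores (and thus the fold-by-fold comparison asked for in Q6) differ as well.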

Thanks for your reply! OK, I see my mistake: it wasn’t clear to me that numerical_features is not the set of all the numerical features, but only a subset of them. The previous notebooks demonstrated how to separate the features into numerical and categorical ones, so I used what I had learned to do the same, as useful practice. The name numerical_features is perhaps somewhat misleading given what had been shown previously. At this stage we hadn’t done any nontrivial feature selection, so I had no reason to believe this had been done to define the members of numerical_features. Yes, I could have counted its members and compared the count to my answer in Q3, but I didn’t.

Feature request
Q5 would benefit from the following text replacement, or similar:

From: We consider the following numerical columns:
To: We consider the following subset of numerical columns:

This makes it explicit that not all numerical columns are used. Thanks!


I made the same mistake and found this really unclear in the instructions. It makes a big difference: when I used the columns specified, I got the correct answer.
