I am getting better results for the numerical model

It’s a bit weird, but my results are better with the numerical data only, unlike the results posted here. Can I share my solution, or shall I send it somewhere privately?
Thanks

I think it’s ok to share your solution if it leads to an interesting discussion where everybody can learn from it :slight_smile:

Great, so here is my solution.

# modules
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector

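# NOTE: data and target are assumed to be defined earlier in the
# exercise notebook (the Ames housing dataset used in this quiz)
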
# use the selector to get the categorical data
categorical_selector = selector(dtype_include=object)
numerical_selector = selector(dtype_exclude=object)

# select the data based on the selector
categorical_data = data[categorical_selector(data)]
numerical_data = data[numerical_selector(data)]

# Define two different preprocessors
categorical_processor = OneHotEncoder(handle_unknown="ignore")
numerical_processor = make_pipeline(SimpleImputer(strategy="median"),
                                    StandardScaler())
# Combine them into a full transformer
preprocessor = ColumnTransformer([("num", numerical_processor,
                                   numerical_data.columns),
                                  ("cat", categorical_processor,
                                   categorical_data.columns),
                                  ])
# Numerical model
numerical_model = make_pipeline(numerical_processor,
                                LogisticRegression(max_iter=1_000))
numerical_scores = cross_validate(numerical_model, numerical_data, target, cv=10)
print("The mean cross-validation accuracy is: "
      f"{numerical_scores['test_score'].mean():.3f} +/- {numerical_scores['test_score'].std():.3f}")

# Full model
full_model = make_pipeline(preprocessor,
                           LogisticRegression(max_iter=1_000))
full_scores = cross_validate(full_model, data, target, cv=10)
print("The mean cross-validation accuracy is: "
      f"{full_scores['test_score'].mean():.3f} +/- {full_scores['test_score'].std():.3f}")

Instead of using make_column_selector, you need to use the numerical_features list defined in Question 5:

numerical_features = [
  "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
  "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
  "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
  "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
  "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

This is a subset of the features that are not of the object data type (a restriction kept for historical reasons inherited from session 1 of the MOOC).
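
For reference, here is a minimal sketch of the corresponding model, reusing the imports and the data, target and numerical_features objects from above (the names subset_model and subset_scores are just illustrative):

# Model restricted to the quiz's predefined numeric subset
subset_model = make_pipeline(SimpleImputer(strategy="median"),
                             StandardScaler(),
                             LogisticRegression(max_iter=1_000))
subset_scores = cross_validate(subset_model, data[numerical_features],
                               target, cv=10)
print("The mean cross-validation accuracy is: "
      f"{subset_scores['test_score'].mean():.3f} +/- "
      f"{subset_scores['test_score'].std():.3f}")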

Thanks, Arturo. So the increase in model performance is not due to combining feature types, but rather to the small predictive contribution (compared to the others) of these specific features. Is my reading of this correct?

Let’s try to differentiate 4 possible models (see the sketch after the list):

  1. the model using the subset defined by numerical_features and numerical values only;
  2. the model using the make_column_selector and numerical values only;
  3. the model using the subset defined by numerical_features and both numerical and categorical values;
  4. the model using the make_column_selector and both numerical and categorical values.
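
Model 1 is the subset_model sketched earlier, and model 2 is the numerical_model from the original solution. A rough sketch of models 3 and 4, reusing the processors and selectors defined above (model_3 and the score names are just illustrative):

# Model 3: the quiz's numeric subset plus the categorical features
subset_preprocessor = ColumnTransformer([
    ("num", numerical_processor, numerical_features),
    ("cat", categorical_processor, categorical_selector(data)),
])
model_3 = make_pipeline(subset_preprocessor,
                        LogisticRegression(max_iter=1_000))
scores_3 = cross_validate(model_3, data, target, cv=10)

# Model 4: the selector-based numeric columns plus the categorical
# features, i.e. the full_model from the original solution
scores_4 = cross_validate(full_model, data, target, cv=10)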

I obtain the following accuracies:

  1. 0.89 +/- 0.01
  2. 0.92 +/- 0.01
  3. 0.92 +/- 0.03
  4. 0.92 +/- 0.02

Notice that models 2 to 4 overlap in terms of their standard deviations; none of them is particularly better. Also notice that the columns that differ between 1) and 2) can be obtained by running the following code:

numerical_selector = selector(dtype_exclude=object)
difference = list(set(numerical_selector(data)) - set(numerical_features))
print(difference)

The output will be:

['YearBuilt', 'OverallCond', 'YearRemodAdd', 'MSSubClass',
 'MoSold', 'FullBath', 'BsmtFullBath', 'BsmtHalfBath',
 'GarageYrBlt', 'YrSold', 'HalfBath', 'OverallQual']

The conclusion here is that adding the columns in difference has about the same effect as adding the categorical features. In any case, keep in mind that the aim of this exercise is to compare performance across folds, not just the mean accuracy.
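
For example, since both cross_validate calls in the solution above use cv=10 on the same target (so, with the default stratified splitter and no shuffling, the folds should line up), one can count on how many folds the full model wins:

import numpy as np

# Per-fold difference between the full model and the numerical-only model
diff = full_scores["test_score"] - numerical_scores["test_score"]
print(f"The full model is better on {(diff > 0).sum()} out of {len(diff)} folds")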

As a complement to my answer, the meanings of the variables in difference are the following:

OverallCond: Rates the overall condition of the house
YearBuilt: Original construction date
YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
MSSubClass: Identifies the type of dwelling involved in the sale
MoSold: Month Sold
FullBath: Full bathrooms above grade
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
GarageYrBlt: Year garage was built
YrSold: Year Sold
HalfBath: Half baths above grade
OverallQual: Rates the overall material and finish of the house

Notice that the number of full and half bathrooms is usually correlated with the area of the house (LotArea); the year of sale is nearly constant; and there are some other small issues that could lead to overfitting.
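
If you want to eyeball these observations yourself (assuming the data frame from the exercise), something like the following would do:

# Bathroom counts vs. living and lot area
print(data[["FullBath", "HalfBath", "GrLivArea", "LotArea"]].corr())
# The sale year only spans a few values, so it is close to constant
print(data["YrSold"].value_counts())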

Once again, great work, Arturo. Thanks so much.
