Q6 query

Poorna · 30 March 2022 06:01

Hello,
For Q6, I am getting only 4 cases where the model with only numeric columns performs worse than the model with all columns. Please find the code snippet below. Can you please help?

Below is the model with all columns:

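from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# data, target, num_data and cat_columns are defined earlier (not shown here)
ct = ColumnTransformer([("numerical", StandardScaler(), num_data.columns), ("categorical", OneHotEncoder(handle_unknown="ignore"), cat_columns.columns)])

pipe2 = Pipeline([("preprocessing", ct), ("logistic", LogisticRegression())])
cv1 = cross_validate(pipe2, data, target, cv=10)

# cv is presumably the cross-validation result of the numeric-only model (not shown here)
sum(cv["test_score"] < cv1["test_score"])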

ArturoAmorQ · 30 March 2022 08:37

How is num_data defined in your code? Are you using the provided numerical_features defined by

numerical_features = [
  "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
  "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
  "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
  "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
  "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

or are you using a column selector?

Poorna · 30 March 2022 10:16

Thanks for the reply. I am using a column selector based on dtypes.

ArturoAmorQ · 30 March 2022 12:03

For Question 6, we intended students to use the same numerical_features as defined in Question 5 (the list in my previous response). That list is only a subset of the features whose data type is not object (for historical reasons inherited from session 1 of the MOOC), so a dtype-based column selector can pick up extra columns and give different results.
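To illustrate the difference, here is a minimal sketch on a toy DataFrame. The values are made up, the explicit list is shortened to two columns, and the selector settings are an assumption, since the exact selector used above was not posted. It only shows that a dtype-based selector can return columns that an explicit list leaves out.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy stand-in for the housing data; all values are hypothetical.
toy = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250],
        "GrLivArea": [1710, 1262, 1786],
        "MSSubClass": [60, 20, 60],        # numeric dtype, not in the explicit list
        "YearBuilt": [2003, 1976, 2001],   # numeric dtype, not in the explicit list
        "Street": ["Pave", "Pave", "Grvl"],
    }
)

# Shortened stand-in for the provided numerical_features list.
explicit_features = ["LotArea", "GrLivArea"]

# Assumed dtype-based selection: every column with a numeric dtype.
selector = make_column_selector(dtype_include="number")
selected_by_dtype = selector(toy)

# Columns the selector adds on top of the explicit list.
extra = [col for col in selected_by_dtype if col not in explicit_features]
print(selected_by_dtype)
print(extra)
```

If the preprocessing uses different columns than the intended list, the two models being compared are no longer the ones the question has in mind, so the count of folds can differ.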

Poorna · 30 March 2022 13:53

OK, thanks for the clarification.