Q6 query

Hello,
For Q6, I am getting only 4 folds where the model with only numeric columns performs worse than the model with all columns. Please find the code snippet below. Can you please help?

Below is the model with all columns:

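from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

ct = ColumnTransformer([('numerical', StandardScaler(), num_data.columns), ('categorical', OneHotEncoder(handle_unknown='ignore'), cat_columns.columns)])

pipe2 = Pipeline([('preprocessing', ct), ('logistic', LogisticRegression())])
cv1 = cross_validate(pipe2, data, target, cv=10)

# cv: cross-validation results of the numeric-only model (defined earlier, not shown)
sum(cv['test_score'] < cv1['test_score'])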

How is num_data defined in your code? Are you using the provided numerical_features defined by

numerical_features = [
  "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
  "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
  "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
  "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
  "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

or are you using a column selector?

Thanks for the reply. I am using a column selector that picks the columns by dtype.

For Question 6 we intended students to use the same numerical_features as defined in Question 5 (the list in my previous response). That list is only a subset of the columns whose data type is not object (for historical reasons inherited from session 1 of the MOOC), so a dtype-based selector picks up additional columns and gives different results.
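To see why the two selections differ, here is a minimal sketch on a toy data frame. The frame, its values, and the shortened two-column list are assumptions made for illustration; only the column names come from the housing data. It shows that a dtype-based selector keeps every non-object column, including integer-coded ones that are not in the explicit list.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy stand-in for the housing data (values are made up for illustration).
data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "GrLivArea": [1710, 1262, 1786],
    "MSSubClass": [60, 20, 60],  # integer-coded column, not in the explicit list
    "Neighborhood": pd.Series(["CollgCr", "Veenker", "CollgCr"], dtype=object),
})

# Explicit list, shortened here to two of the columns from the list above.
numerical_features = ["LotArea", "GrLivArea"]

# dtype-based selection: keeps every column whose dtype is not object.
selected_by_dtype = make_column_selector(dtype_exclude=object)(data)

# Columns that the selector adds on top of the explicit list.
extra_columns = sorted(set(selected_by_dtype) - set(numerical_features))
print(extra_columns)
```

Running the same comparison on the real data with the full numerical_features list should show which extra columns your numeric-only model was using.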

OK, thanks for the clarification.