About Question 4

Hi,

Although I expected answer “b” to be the right one, I got it wrong because my results gave 5/10 rather than 9/10 like yours.

The only difference, as far as I can tell, is that I compare the “all categories” model with a new numerical model whose max_depth is fixed at 7, rather than with the model from the previous question (as you do), for which max_depth was optimized in each CV fold.
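For reference, here is a sketch of how I imagine the per-fold optimization from the previous question; the GridSearchCV wrapper and the depth grid are my assumptions, not necessarily your exact solution (data, target and numerical_columns as in my code below):

from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Sketch (assumed, not the official solution): an inner grid search tunes
# max_depth within each outer CV fold.
tuned_tree = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": range(1, 16)},
    cv=10,
)
results_tuned = cross_validate(
    tuned_tree, data[numerical_columns], target, cv=10, n_jobs=2
)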

Anyway, it seems strange to me.
Do you have any idea or explanation for this?

My code is below (with output).

Thank you for your help
And thank you very much for this MOOC

M.

************ CODE ***************

from sklearn.compose import make_column_selector as selector, make_column_transformer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

# data and target are assumed to be loaded as in the notebook.
numerical_columns_selector = selector(dtype_exclude=object)
numerical_columns = numerical_columns_selector(data)

categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)


# Here with your preprocessor; it gives the same results as mine.
preprocessor = make_column_transformer(
    (categorical_preprocessor, selector(dtype_include=object)),
    ("passthrough", selector(dtype_exclude=object)))


# Model using both categorical and numerical features.
model_all = make_pipeline(
    preprocessor, DecisionTreeRegressor(max_depth=7, random_state=0)
)
results_model_all = cross_validate(
    model_all, data, target, cv=10, return_estimator=True, n_jobs=2
)

# Model using only the numerical features, with max_depth fixed at 7.
model_num = DecisionTreeRegressor(max_depth=7, random_state=0)
results_model_num = cross_validate(
    model_num, data[numerical_columns], target, cv=10, return_estimator=True, n_jobs=2
)

print(
    "A tree model using both numerical and categorical features is better than a "
    "tree with optimal depth using only numerical features for "
    f"{sum(results_model_all['test_score'] > results_model_num['test_score'])} CV "
    "iterations out of 10 folds."
)

************ OUTPUT ***************
A tree model using both numerical and categorical features is better than a tree with optimal depth using only numerical features for 5 CV iterations out of 10 folds.

N.B.
with random_state=1: 5/10
with random_state=2: 3/10 !!!
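For reference, a sketch of one way to reproduce this check, reusing preprocessor, data, target and numerical_columns from the code above:

# Re-run the fold-wise comparison for a few tree seeds to see how
# unstable the win count is.
for seed in (0, 1, 2):
    tree_all = make_pipeline(
        preprocessor, DecisionTreeRegressor(max_depth=7, random_state=seed)
    )
    tree_num = DecisionTreeRegressor(max_depth=7, random_state=seed)
    scores_all = cross_validate(tree_all, data, target, cv=10)["test_score"]
    scores_num = cross_validate(
        tree_num, data[numerical_columns], target, cv=10
    )["test_score"]
    print(seed, (scores_all > scores_num).sum(), "out of 10")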

Hello,

The difference is that you select the numerical features using

numerical_columns = numerical_columns_selector(data)

whereas we use

numerical_features = [
    "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
    "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
    "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
    "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
    "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

which is a subset of the features that do not have an object data type (for historical reasons inherited from session 1 of the MOOC). This means that the last bullet point in the instructions for Q4 should not be there, as it is misleading.
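You can verify the subset relation directly, reusing the numerical_columns from your code (a quick sketch):

import numpy as np

# An empty array means every feature in the explicit list is also
# selected by dtype_exclude=object.
print(np.setdiff1d(numerical_features, numerical_columns))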

I am tagging this to be solved for session 3. Thanks for your feedback!

Hi,

Thank you for answering.
I should have checked the list size.

M.

Hi,

If the numerical columns are supposed to be selected with the given list, i.e. numerical_features, then the text of the question is confusing, as it suggests selecting the numerical columns with
from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
numerical_columns_2 = numerical_columns_selector(data)

and then the comparison

import numpy as np

np.setdiff1d(numerical_columns_2, numerical_features)

leads to

array(['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'GarageYrBlt',
       'HalfBath', 'MSSubClass', 'MoSold', 'OverallCond', 'OverallQual',
       'YearBuilt', 'YearRemodAdd', 'YrSold'], dtype='<U13')

Sorry, I just noticed it was already mentioned that the last bullet point is misleading.

I have the same problem: there is no indication that we had to select only the numerical columns given in the quiz.

This point is misleading: “numerical columns can be selected if they do not have an object data type. It will be the complement of the numerical columns”

So I got a wrong answer (though a correct one given the data fed to my model).