Hi,
Although I expected “b” to be the right answer, I got it wrong because my results gave 5/10 rather than 9/10 like yours.
The only difference I can see is that I compare the “all categories” model against a new numerical model with max_depth fixed at 7, rather than against the model from the previous question (as you do), for which max_depth was optimized on each CV fold.
Anyway, it seems strange to me.
Do you have any idea or explanation for this?
My code is below (with its output).
Thank you for your help,
And thank you very much for this MOOC
M.
************ CODE ***************
from sklearn.compose import make_column_selector as selector, make_column_transformer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

numerical_columns_selector = selector(dtype_exclude=object)
numerical_columns = numerical_columns_selector(data)

# Here with your preprocessor; same results as with mine.
categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = make_column_transformer(
    (categorical_preprocessor, selector(dtype_include=object)),
    ("passthrough", selector(dtype_exclude=object)),
)

# Model using both categorical and numerical features.
model_all = make_pipeline(
    preprocessor, DecisionTreeRegressor(max_depth=7, random_state=0)
)
results_model_all = cross_validate(
    model_all, data, target, cv=10, return_estimator=True, n_jobs=2
)

# Model using only the numerical features, with max_depth fixed at 7.
model_num = DecisionTreeRegressor(max_depth=7, random_state=0)
results_model_num = cross_validate(
    model_num, data[numerical_columns], target, cv=10, return_estimator=True, n_jobs=2
)

print(
    "A tree model using both numerical and categorical features is better than a "
    "tree with optimal depth using only numerical features for "
    f"{sum(results_model_all['test_score'] > results_model_num['test_score'])} CV "
    "iterations out of 10 folds."
)
************ OUTPUT ***************
A tree model using both numerical and categorical features is better than a tree with optimal depth using only numerical features for 5 CV iterations out of 10 folds.
N.B.:
with random_state=1: 5/10
with random_state=2: 3/10 !!!
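For comparison, here is a minimal sketch of what I understand the previous question's model to be: max_depth re-optimized by an inner grid search on each outer CV fold. The synthetic data and the max_depth range here are illustrative assumptions, standing in for the course dataset's numerical columns.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic data standing in for data[numerical_columns] and target.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# The inner grid search re-optimizes max_depth independently on each outer fold,
# unlike a tree with max_depth fixed at 7 for every fold.
inner_model = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": range(1, 11)},
    cv=3,
)
results_model_num_opt = cross_validate(inner_model, X, y, cv=10, n_jobs=2)
print(results_model_num_opt["test_score"])
```

With this variant, the per-fold test scores being compared come from a differently tuned tree on each fold, which could plausibly explain a different win count than the fixed-depth comparison above.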