Q5: depending on the preprocessor, the result falls outside the possible answers

For question 5, depending on how the data is preprocessed, the score can fall outside the proposed answers. Here is the code I used:

Full code snippet
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_validate

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)


# Variant 1: impute missing categories with the most frequent value, then ordinal-encode
categorical_preprocessor = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
# Variant 2: impute missing categories with a constant "unknown" value, then ordinal-encode
#categorical_preprocessor = make_pipeline(
#    SimpleImputer(strategy="constant", fill_value="unknown"),
#    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
# Variant 3: one-hot-encode the categories (no imputation step)
#categorical_preprocessor = make_pipeline(OneHotEncoder(handle_unknown="ignore"))
numerical_preprocessor = SimpleImputer()

preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns),
    ('numerical', numerical_preprocessor, numerical_columns)])

model = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))
cv_results = cross_validate(model, data, target, scoring="r2",
                            return_train_score=True, return_estimator=True,
                            cv=10, n_jobs=2, error_score="raise")
scores = cv_results["test_score"]
print("The mean cross-validation R2 score is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

The first variant gives
categorical_preprocessor = make_pipeline(SimpleImputer(strategy="most_frequent"),OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
0.756

The second variant gives
categorical_preprocessor = make_pipeline(SimpleImputer(strategy="constant",fill_value="unknown"),OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
0.742

And the third variant gives
categorical_preprocessor = make_pipeline(OneHotEncoder(handle_unknown="ignore"))
0.726

What I did: I first tried the first variant (0.76), saw that this answer was not among the proposed choices, then fell back to the one-hot-encoder strategy (which, as I later read on the forum, is the wrong approach), and since its result matched one of the proposed answers, I submitted it (0.72), which was wrong. I only thought of the middle variant afterwards.

Edit:
data.info() helps to see that the "most_frequent" strategy was probably not the best one. Maybe its use should be emphasized as early as module 1, in "tabular data exploration", to detect early how many values are missing.
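For reference, a minimal sketch of that kind of inspection (the toy DataFrame and the missing_value_report helper are made up for illustration; with the quiz data you would apply the same calls to data right after loading it):

import pandas as pd

def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of missing values, for the columns that have any."""
    missing = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": missing,
        "fraction_missing": (missing / len(df)).round(3),
    })
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy example; df.info() gives the same information in a less compact form.
toy = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None], "c": [1, 2, 3]})
toy.info()
print(missing_value_report(toy))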


Yep, we should revise this question. I thought that fixing the random_state would be enough for this session, but indeed changing the preprocessing has an impact.

We could be more directive. Normally, one should not use the OneHotEncoder with a tree-based model and should instead use the OrdinalEncoder. However, I am not sure that we specify the imputation method.


Hi,

the choice between make_pipeline and Pipeline gives two different test_score values, 0.72 vs. 0.74. What is the main reason for this small difference?

Have a nice day.

Probably one of the steps involves a stochastic process and its random state was not set.
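As a sanity check (a minimal sketch on synthetic data, not the quiz dataset): make_pipeline is just a shorthand for Pipeline with auto-generated step names, so with identical steps and a fixed random_state both constructions give the same scores; any remaining difference comes from the steps themselves.

from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Same steps and hyperparameters, only the way the pipeline is built differs.
shorthand = make_pipeline(SimpleImputer(), DecisionTreeRegressor(random_state=0))
explicit = Pipeline([("imputer", SimpleImputer()),
                     ("tree", DecisionTreeRegressor(random_state=0))])

for model in (shorthand, explicit):
    scores = cross_validate(model, X, y, scoring="r2", cv=5)["test_score"]
    print(f"{scores.mean():.3f} +/- {scores.std():.3f}")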

I made the same mistake because I used the one-hot-encoder; probably the instructions should be clearer.


Hello!
Why shouldn’t one use the OneHotEncoder with a tree-based model?
I’m curious to understand the rationale behind this principle.
Thanks in advance!

You can refer to the solution of the following exercise that you did in the first module: 📃 Solution for Exercise M1.05 — Scikit-learn course

In short, it does not improve the statistical performance of the model. However, it expands the number of features, and since a tree has to go through each feature individually, the computational cost is much higher.
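A small sketch of that feature expansion (the column names and categories below are made up for illustration, not taken from the quiz dataset):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame({
    "neighborhood": ["north", "south", "east", "west", "north"],
    "house_style": ["1story", "2story", "1story", "split", "2story"],
})

one_hot = OneHotEncoder(handle_unknown="ignore")
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# OneHotEncoder creates one column per category: 4 + 3 = 7 columns here,
# whereas OrdinalEncoder keeps one column per original feature: 2 columns.
print(one_hot.fit_transform(toy).shape)   # (5, 7)
print(ordinal.fit_transform(toy).shape)   # (5, 2)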

This has been fixed: we have explicitly added "use a OneHotEncoder" to the question.