Q5: depending on the preprocessor, the result falls outside the possible answers

For question 5, depending on how the data is preprocessed, the score can fall outside the proposed answers. Here is the code I used:

Full code snippet
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_validate

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)


# Variant 1: impute missing categories with the most frequent value, then ordinal-encode
categorical_preprocessor = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
# Variant 2: impute missing categories with a constant "unknown" value, then ordinal-encode
#categorical_preprocessor = make_pipeline(
#    SimpleImputer(strategy="constant", fill_value="unknown"),
#    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
# Variant 3: one-hot-encode the categories (no imputation step)
#categorical_preprocessor = make_pipeline(OneHotEncoder(handle_unknown="ignore"))
numerical_preprocessor = SimpleImputer()

preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns),
    ('numerical', numerical_preprocessor, numerical_columns)])

model = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))
cv_results = cross_validate(model, data, target, scoring="r2",
                            return_train_score=True, return_estimator=True,
                            cv=10, n_jobs=2, error_score="raise")
scores = cv_results["test_score"]
print("The mean cross-validation R2 score is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

The first variant gives
categorical_preprocessor = make_pipeline(SimpleImputer(strategy="most_frequent"),OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
0.756

The second variant gives
categorical_preprocessor = make_pipeline(SimpleImputer(strategy="constant",fill_value="unknown"),OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
0.742

And the third variant gives
categorical_preprocessor = make_pipeline(OneHotEncoder(handle_unknown="ignore"))
0.726

What I did: I first tried the first variant (0.76), saw that this answer was not among the proposed choices, then fell back to the one-hot-encoder strategy (which, as I later read on the forum, is the wrong approach), and since its result matched one of the proposed answers, I submitted it (0.72), which was wrong. I only thought of the middle variant afterwards.

Edit:
data.info() helps to see that the "most_frequent" strategy was probably not the best one. Maybe its use should be emphasized as early as module 1, in "tabular data exploration", to detect early how many values are missing.
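For reference, a minimal sketch of that kind of inspection (the toy DataFrame and the missing_value_report helper are made up for illustration; with the quiz data you would apply the same calls to data right after loading it):

import pandas as pd

def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of missing values, for the columns that have any."""
    missing = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": missing,
        "fraction_missing": (missing / len(df)).round(3),
    })
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy example; df.info() gives the same information in a less compact form.
toy = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None], "c": [1, 2, 3]})
toy.info()
print(missing_value_report(toy))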


Yep, we should revise this question. I thought that fixing the random_state would be enough for this session, but indeed changing the preprocessing has an impact.

We could be more directive. Normally, one should not use the OneHotEncoder with a tree-based model and should instead use the OrdinalEncoder. However, I am not sure that we specify the imputation method.


Hi,

the choice between make_pipeline and Pipeline gives two different test_score values, 0.72 vs. 0.74. What is the main reason for this small difference?

Have a nice day.

Probably one of the steps involves a stochastic process and its random state was not set.
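As a sanity check (a minimal sketch on synthetic data, not the quiz dataset): make_pipeline is just a shorthand for Pipeline with auto-generated step names, so with identical steps and a fixed random_state both constructions give the same scores; any remaining difference comes from the steps themselves.

from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Same steps and hyperparameters, only the way the pipeline is built differs.
shorthand = make_pipeline(SimpleImputer(), DecisionTreeRegressor(random_state=0))
explicit = Pipeline([("imputer", SimpleImputer()),
                     ("tree", DecisionTreeRegressor(random_state=0))])

for model in (shorthand, explicit):
    scores = cross_validate(model, X, y, scoring="r2", cv=5)["test_score"]
    print(f"{scores.mean():.3f} +/- {scores.std():.3f}")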

I made the same mistake because I used the one-hot-encoder; probably the instructions should be clearer.


Hello!
Why shouldn’t one use the OneHotEncoder with a tree-based model?
I’m curious to understand the rationale behind this principle.
Thanks in advance!

You can refer to the solution of the following exercise that you did in the first module: 📃 Solution for Exercise M1.05 — Scikit-learn course

In short, it does not improve the statistical performance of the model. However, it expands the number of features, and since a tree has to go through each feature individually, the computational cost is much higher.
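A small sketch of that feature expansion (the column names and categories below are made up for illustration, not taken from the quiz dataset):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame({
    "neighborhood": ["north", "south", "east", "west", "north"],
    "house_style": ["1story", "2story", "1story", "split", "2story"],
})

one_hot = OneHotEncoder(handle_unknown="ignore")
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# OneHotEncoder creates one column per category: 4 + 3 = 7 columns here,
# whereas OrdinalEncoder keeps one column per original feature: 2 columns.
print(one_hot.fit_transform(toy).shape)   # (5, 7)
print(ordinal.fit_transform(toy).shape)   # (5, 2)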

This has been fixed: we have explicitly added "use a OneHotEncoder" to the question.