Wrap-up quiz 5: OneHotEncoder vs OrdinalEncoder

Edited question for secrecy:

MD59MD asked whether the pipeline should be written with ordinal or one-hot encoding.

The question is indeed ambiguous. We should have mentioned that we expected you to use ordinal encoding for this part of the exercise. I will delete the question to avoid giving away the answer to other participants.

Copy of the original post:

from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Categorical columns: impute missing values with a constant, then one-hot encode.
categorical_processor = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore")
)
# Numerical columns: mean imputation only.
numerical_processor = SimpleImputer()

# Route each processor to the columns selected by dtype.
preprocessor = make_column_transformer(
    (categorical_processor, selector(dtype_include=object)),
    (numerical_processor, selector(dtype_exclude=object))
)
tree = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))
cv_results = cross_validate(
    tree, data, target, cv=10, return_estimator=True, n_jobs=2
)
cv_results["test_score"].mean()

OneHotEncoder yields a score of 0.72, whereas OrdinalEncoder gives a score of 0.74.
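
For reference, here is a minimal sketch of the ordinal-encoding variant, reusing the selectors and numerical processor defined in the snippet above. The handle_unknown="use_encoded_value" / unknown_value=-1 settings are my assumption to keep cross-validation from failing on categories unseen in a training fold; the quiz solution may configure the encoder differently.

# Ordinal-encoding variant of the categorical processor (sketch).
# Categories unseen during fit are mapped to -1 instead of raising an error.
categorical_processor = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
)
preprocessor = make_column_transformer(
    (categorical_processor, selector(dtype_include=object)),
    (numerical_processor, selector(dtype_exclude=object))
)
tree = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))
cv_results = cross_validate(
    tree, data, target, cv=10, return_estimator=True, n_jobs=2
)
cv_results["test_score"].mean()

Since the final model is a decision tree, which splits on thresholds rather than distances, the arbitrary integer codes produced by OrdinalEncoder are not a problem here, which is consistent with it reaching a slightly higher score than one-hot encoding.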

GitLab MR: (link requires sign-in)

Edit done in the repo.