ColumnTransformer constructor

Hi,

First of all, thank you very much for this excellent course!

My question is somewhat naive.
In the “One-hot encoding of categorical variables” section, I wrote

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore", sparse=False)
preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns),
    ('numerical', numerical_preprocessor, numerical_columns)]
)

In the correction, the preprocessor line does not include the numerical preprocessor. Why?

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns)],
    remainder="passthrough")

What I understood is that the numerical columns are left as they are, so they are not scaled. Am I wrong?

For me, the two ColumnTransformer instances are not the same, so the model and the results may differ (even so, my results are close to the expected ones).

I hope my question/observation is clear? (I’m not so fluent in English.)

Thank you in advance,

The question is really clear :slight_smile:

I will just recall the aim of the exercise: we wanted to see the impact of the OneHotEncoder and the StandardScaler on the results. The choice we made in the exercise was to modify one part at a time:

  • only modify the OneHotEncoder and leave the numerical columns as-is;
  • only modify the scaling and keep the OrdinalEncoder as in the previous course.

The results might vary a little if you did not follow this correction exactly, but that does not make yours incorrect. You should come to the same conclusions:

  • the OneHotEncoder does not improve the statistical performance much, but it is much more costly with a tree-based model;
  • scaling the data has no effect on the statistical or computational performance (scaling is pretty cheap compared to the rest of the processing).
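The second configuration from the bullet list above can be sketched like this (the toy DataFrame and parameter choices are mine, not the course's):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy data: one categorical column, one numerical column.
df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "age": [10.0, 20.0, 30.0]})

# Keep the OrdinalEncoder for the categorical columns and only change
# the treatment of the numerical columns, i.e. add scaling.
preprocessor = ColumnTransformer([
    ("ordinal-encoder",
     OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     ["color"]),
    ("standard-scaler", StandardScaler(), ["age"]),
])

X = preprocessor.fit_transform(df)
print(X.shape)  # one ordinal column plus one scaled column
```

This way each experiment changes exactly one ingredient, so any difference in the cross-validation results can be attributed to that ingredient alone.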

Thank you Guillaume for your reply.
I think I understood the purpose of the exercise :slightly_smiling_face:
My question was a bit “peripheral”. What I understood from this section: two parameters can influence the model, the encoder and the scaling. For this exercise, the aim is to see the influence of the encoder, so you test two configurations: (encoder 1, scaling) and (encoder 2, scaling).
The fact that you dropped the scaling in the second test is what confused me!

Maybe I’m complicating things :laughing:

Thank you again for your support and your pedagogy!

Naïma (alias Metssye)