Question is misleading

In question 4 of the quiz (quoted below for clarity) I ran into the following problem: the question is poorly phrased and does not say which preprocessor you should use.

If a OneHotEncoder is used, the right answer should be b) The statistical performance is slightly better, ~0.72 (the actual score being 0.7215692275718409).

The score only becomes ~0.74 with the ordinal encoding used in the solution, which is not required by the statement of the question.

This is common to other questions as well: the question is not phrased precisely, so the answer is ambiguous and could be either b) or c) depending on a choice that is arbitrary given the statement.


Question 4 (1 point possible)
Instead of using only the numerical dataset, you will now use the entire dataset available in the variable data.

Create a preprocessor by dealing separately with the numerical and categorical columns. For the sake of simplicity, we will assume the following:

- categorical columns can be selected if they have an object data type;
- numerical columns can be selected if they do not have an object data type; they are the complement of the categorical columns.
Do not optimize the max_depth parameter for this exercise.

Fix the random state of the tree by passing the parameter random_state=0.

Is the performance in terms of R² better when incorporating the categorical features, in comparison with the previous tree with the optimal depth?
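For reference, here is roughly what I ended up with for this question. This is only a sketch, not the official solution: the cross-validation setup, encoder options, and variable names are my own assumptions, and `data`/`target` are the ones loaded earlier in the quiz notebook.

```python
# Sketch of a OneHotEncoder-based answer (unofficial; cv and names assumed).
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

categorical_selector = make_column_selector(dtype_include=object)
numerical_selector = make_column_selector(dtype_exclude=object)

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical_selector),
    ("passthrough", numerical_selector),
)
model = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))

# Default scoring for a regressor is R²; this gives me roughly 0.72.
scores = cross_val_score(model, data, target, cv=10)
print(f"Mean R²: {scores.mean():.3f}")
```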

I had the exact same problem. I used OneHotEncoder and got an R² of about 0.72.

Also, I would be curious to know why the OrdinalEncoder performs better in this particular case?

That’s anyone’s guess, I think. My guess is that the ordinal encoder produces data that is easier for the tree to handle.

But yeah, it’s very annoying. Is there any way to complain?

Sorry about this; we knew about it from our beta tests, but designing quizzes is actually very hard. In particular, we find that questions with code asking “is this model better/worse” or giving an approximate performance are brittle, because people can write slightly different code than the one we had in mind, or are even subject to statistical fluctuations (the random_state value, for example).

Generally speaking, you are more than encouraged to create a topic for each confusing question you bump into. Please remember to be constructive in this kind of post :wink: :pray:!

We will try to improve the situation for the next MOOC session, or even during this session if we think it is required and doable.

As for why the OneHotEncoder performs worse in this case: maybe because it creates more variables, there is more overfitting (the default max_depth means the tree is grown very deep, i.e. splits are done until all leaves are pure or contain fewer than min_samples_split samples)?
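One quick way to look into this (just a sketch reusing the same `data`/`target` and column selectors as in the question above, not an official check) is to compare the number of encoded features and the depth of the fitted tree for each encoder:

```python
# Unofficial check of the "more variables, deeper tree" idea;
# assumes the same `data`/`target` as in the quiz notebook.
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

categorical_selector = make_column_selector(dtype_include=object)
numerical_selector = make_column_selector(dtype_exclude=object)

for name, encoder in [
    ("one-hot", OneHotEncoder(handle_unknown="ignore")),
    ("ordinal", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
]:
    preprocessor = make_column_transformer(
        (encoder, categorical_selector),
        ("passthrough", numerical_selector),
    )
    model = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))
    model.fit(data, target)
    tree = model[-1]
    print(f"{name}: {tree.n_features_in_} encoded features, depth {tree.get_depth()}")
```

If the hypothesis is right, the one-hot version should show many more columns (one per category) and likely a deeper tree than the ordinal version.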


An important related point that you encounter in one of the first exercises of the MOOC: 📃 Solution for Exercise M1.05 — Scikit-learn course

You should not use a OneHotEncoder with a tree-based model.


Hello,
I had the same problem when preprocessing with OneHotEncoder.
It is true that there are some useful remarks about this in the first exercises of this MOOC.
But it would have been nice to be reminded to have a look at Module 1’s exercises before doing this wrap-up quiz.
Suggestion: because the choice of preprocessor has a dramatic impact on the model’s score, it would be interesting to create a kind of review/module specifically on this point.
Thanks,
Emmanuel

Agreed, we will definitely tackle this for the next MOOC session!


A note about using OneHotEncoder has been added to the quiz.