Question 4 in WQ5 - variations in numerical result

Hello,
When I repeat the implementation, the score I get typically oscillates between 0.71 and 0.73. Once I even obtained 0.74, but it happened only once. These variations eventually led me to choose the “wrong” answer.

Here are my steps below; could you point out any mistake? Thanks in advance.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Scale, then impute the numerical features.
scaler_imputer_transformer = make_pipeline(StandardScaler(), SimpleImputer(strategy="mean"))

# Impute, then ordinal-encode the categorical features.
imputer_ordinal_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
)

preprocessor = ColumnTransformer(transformers=[
    ("cat-preprocessor", imputer_ordinal_transformer, categorical_columns),
    ("num-preprocessor", scaler_imputer_transformer, numerical_features),
])

model = make_pipeline(preprocessor, DecisionTreeRegressor())

cv_results = cross_validate(model, data, target, cv=10)

Did you compute the mean of the test score?

You are right: removing the random_state in the tree shows that there is a large variability there, which makes the question obsolete.

We should rework this question then.
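A minimal sketch of what this variability looks like, assuming model, data, and target are defined as in the post above (decisiontreeregressor is the step name that make_pipeline generates automatically):

# Re-run the cross-validation with a different seed for the tree each time
# to see how much the mean test score moves between runs.
for seed in range(5):
    model.set_params(decisiontreeregressor__random_state=seed)
    cv_results = cross_validate(model, data, target, cv=10)
    print(f"seed={seed}: mean test score = {cv_results['test_score'].mean():.3f}")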

Thanks for your reply.

Yes, with:
scores = cv_results["test_score"]
print(f"The tree accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")

I obtained a mean of 0.76 from the CV scores without scaling the numerical features and with a OneHotEncoder instead of an OrdinalEncoder.

If you are dealing with trees, you should not use OneHotEncoder, as explained in the first lecture on categorical encoding.
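For comparison, a rough sketch of what the one-hot variant would look like (assuming the same imports and column lists as in the first post); with trees, the ordinal encoding above is preferred, since one-hot encoding only inflates the number of features:

from sklearn.preprocessing import OneHotEncoder

# One-hot variant of the categorical pipeline, for comparison only;
# not recommended for tree-based models, as stated above.
onehot_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)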

As I mentioned earlier, using the proper pipeline can still lead to the behaviour reported by @camille-anne, simply because we did not pay enough attention to the variation due to randomness. We need to modify the question to take this issue into account.

I added a sentence asking to fix the random_state so that the expected results are obtained. However, we should still review this question.

See the commit on GitHub: FIX add to fix the random state (INRIA/scikit-learn-mooc@8f9a080)
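In code, the fix boils down to seeding the tree; a sketch (the exact seed value in the commit may differ):

# Fixing the random_state makes the cross-validation result reproducible.
model = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))
cv_results = cross_validate(model, data, target, cv=10)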

@lfarhi @MarieCollin Could you update FUN with https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/commit/8367feacb0fc2986216945264b075c5e35f1da77

It’s also fixed on the FUN platform.

Additionally, I found that the order of the preprocessors matters. In my solution, the ColumnTransformer was created with the numerical columns first and the categorical columns second, which gives a CV test mean of ~0.72. If you switch the order, so that the ColumnTransformer lists the categorical columns first and the numerical columns second, the CV test mean is ~0.74.

If you look at the answer/explanation of the question, you can replicate this result just by switching the order of the two lines below preprocessor = ...., as sketched below.
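Presumably this happens because the ColumnTransformer concatenates the transformed blocks in the order they are listed, so the tree sees the features in a different order, which can change how ties between equally good splits are broken. A sketch of the two orderings (same transformers as in the first post):

# Numerical block first: gives a CV test mean of ~0.72, as reported above.
preprocessor_num_first = ColumnTransformer(transformers=[
    ("num-preprocessor", scaler_imputer_transformer, numerical_features),
    ("cat-preprocessor", imputer_ordinal_transformer, categorical_columns),
])

# Categorical block first: gives a CV test mean of ~0.74, as reported above.
preprocessor_cat_first = ColumnTransformer(transformers=[
    ("cat-preprocessor", imputer_ordinal_transformer, categorical_columns),
    ("num-preprocessor", scaler_imputer_transformer, numerical_features),
])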

Edit: I see this was already discussed here: Extracting cross_validate scoring metric & order of ColumnTransformer - #3 by tanh_lines