Wrap-up quiz4 - Question10

Manuela_Moor · 1 July 2021 14:47

Hi, first of all I want to say “thank you” for this wonderful course. I am really excited to learn new things and this course is such a good opportunity to actually do so.

The reason I am opening this topic is the following. I have this issue with Question 10 from the quiz, i.e. I do not agree with the answer labelled as the correct one.
After giving my answer I cross-checked my code with the code given in the explanation and I get the exact same numbers. I do not want to write down any concrete numbers or answers here, but strictly mathematically speaking, the answer is not correct.
Maybe I am also missing something. If this is the case, please tell me what I did wrong (maybe there is a private chat where numbers can be written or so).
If this is indeed right, I want to emphasize that I don’t need the points, so don’t worry on that front. I just thought maybe there are other people with the same issue as mine. Then it would maybe help to slightly modify the answer by something found in the last line of the explanation?

glemaitre58 · 1 July 2021 20:29

Could you provide full details, with the code and the answer that is problematic? We will censor the message once we have found the issue. It will be easier to address the problem this way.

Manuela_Moor · 2 July 2021 07:14

Edit: we removed the parts of this message that would give away the quiz solution

Hi Guillaume

Of course, I will gladly share my code and numbers.

Ok, so here is what I got. After loading the data as suggested at the beginning of Q8:

Then loading:

Gives me this result:

Looking at the code in the explanation, it is exactly the same, with some different naming conventions giving the same result when I execute it.

Now doing the same with dummy:

Gives me the output:

Double-checking with the code you provide in the explanation (thinking that maybe using strategy="most_frequent" would change the result), I still get the same result, i.e.
Thanks again for the great course and also for taking the time to answer all of these questions (not only mine, I mean). You guys and gals are awesome.
Best
Manu

lesteve · 2 July 2021 08:45

Thanks a lot for your code. I think this is another case where our quiz is not robust to variations (random_state or slightly different code, for example).

I think we should try to tackle this for the second version.

Also, I edited your previous post to remove the code that is the solution (but the pedagogical team can still see it).

lesteve · 2 July 2021 08:46

Here is the original post (removed because it contains the answer to the question)

Hi Guillaume

Of course, I will gladly share my code and numbers.

Ok, so here is what I got. After loading the data as suggested at the beginning of Q8:

import pandas as pd
adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.select_dtypes(["integer", "floating"])
data = data.drop(columns=["education-num"])

Then loading:

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.dummy import DummyClassifier

And doing:

model = make_pipeline(StandardScaler(), LogisticRegression())
res = cross_validate(model, data, target, cv=10, return_estimator=True)
print(res['test_score'].mean())

Gives me this result:

0.7998445658834604

Looking at the code in the explanation, it is exactly the same, with some different naming conventions giving the same result when I execute it.

Now doing the same with dummy:

model_d = make_pipeline(StandardScaler(), DummyClassifier())
res_d = cross_validate(model_d, data, target, cv=10, return_estimator=True)
print(res_d['test_score'].mean(), res_d['test_score'].mean()+0.04)

Gives me the output:

0.7607182352166999 0.8007182352166999

Double-checking with the code you provide in the explanation (thinking that maybe using strategy="most_frequent" would change the result), I still get the same result, i.e.

0.7607182352166999.

If these numbers are indeed correct, then the logistic regression is not better by an amount of 0.04 or more. I know it is petty, but being a mathematician I take boundaries seriously.

As the number is indeed very close to an increase of 0.04 I think a tilde in the answer would solve the insecurity if other people have the same thoughts as I was having. I.e.
Better/Worse than a dummy classifier with an increase/decrease of ~0.04
This would have led me to check answer c), as the tilde suggests that the boundary of 0.04 does not have to be taken 100% seriously.
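The boundary argument above can be checked with a few lines of arithmetic. This is only a minimal sketch: the two means are copied from the outputs quoted in this post, and the variable names (logistic_mean, dummy_mean, threshold) are made up for illustration.

```python
# Mean test scores copied from the cross-validation outputs quoted in this post.
logistic_mean = 0.7998445658834604
dummy_mean = 0.7607182352166999

# Threshold named in the quiz answer (assumed here to be exactly 0.04).
threshold = 0.04

difference = logistic_mean - dummy_mean
print(f"difference = {difference:.4f}")

# Strict reading: is the improvement at least 0.04?
print("at least 0.04:", difference >= threshold)

# Loose reading (the "~0.04" wording): does it round to 0.04 at two decimals?
print("about 0.04:", round(difference, 2) == threshold)
```

With these numbers the strict comparison fails while the rounded one holds, which is the gap the tilde wording is meant to cover.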

Thanks again for the great course and also for taking the time to answer all of these questions (not only mine, I mean). You guys and gals are awesome.
Best
Manu

Manuela_Moor · 2 July 2021 09:42

Hi Loïc

Thank you for your reply.
Just one last thing: I don’t know how strict you are about deduction. If you want to be sure that the answer cannot be deduced from the censored message, maybe you also want to censor the part starting with
"If these numbers are indeed correct… " plus the complete following segment.
Best
Manu

lesteve · 2 July 2021 09:47

Thanks, I removed more stuff. We are not very strict, I think, but at the same time it’s better if people try to answer the question by themselves.

lesteve · 28 January 2022 16:47

The linear model quiz is being reworked and should be more stable.