As per the solution, I understand that one model should always be better, but when I run my code with RepeatedKFold (maybe not the right way at all?), the difference between the mean scores is less than 0.01, so the two models seem very similar. Can you clarify what is wrong in my logic?
from sklearn.model_selection import cross_val_score, RepeatedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# data and target are already loaded earlier in my notebook
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=1)

rf = RandomForestClassifier(n_estimators=300, n_jobs=2)
gbf = GradientBoostingClassifier(n_estimators=300)

scores_rf = cross_val_score(rf, data, target, scoring='balanced_accuracy', cv=cv, n_jobs=2)
print(scores_rf.mean())

scores_gbf = cross_val_score(gbf, data, target, scoring='balanced_accuracy', cv=cv, n_jobs=2)
print(scores_gbf.mean())
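To show what I mean, here is a self-contained version of my comparison that also reports the spread of the scores, not only the means (if the gap between the means is smaller than the fold-to-fold standard deviation, the models look indistinguishable to me). Note this is only a sketch: it uses a synthetic dataset from make_classification as a stand-in for my actual data/target, and smaller n_estimators and fewer CV repeats so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for my real data/target (hypothetical dataset)
data, target = make_classification(n_samples=300, n_features=20, random_state=1)

# Smaller CV than in my real run (5 splits x 2 repeats) to keep it fast
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=1)

rf = RandomForestClassifier(n_estimators=50, random_state=1)
gbf = GradientBoostingClassifier(n_estimators=50, random_state=1)

scores_rf = cross_val_score(rf, data, target, scoring='balanced_accuracy', cv=cv)
scores_gbf = cross_val_score(gbf, data, target, scoring='balanced_accuracy', cv=cv)

# Report mean +/- std over all CV folds for each model
print(f"RF : {scores_rf.mean():.3f} +/- {scores_rf.std():.3f}")
print(f"GBF: {scores_gbf.mean():.3f} +/- {scores_gbf.std():.3f}")
```

Is comparing the two means against the standard deviation like this a reasonable way to decide whether one model is really better?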