Final quiz M6 Question 5

As per the solution, I understand that one model is always better, but when running my code with a RepeatedKFold (maybe not the right way at all?),

the difference is less than 0.01, so the two models seem very similar. Can you clarify what is wrong in my logic?

from sklearn.model_selection import cross_val_score, RepeatedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# 10-fold cross-validation repeated 10 times
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=1)

rf = RandomForestClassifier(n_estimators=300, n_jobs=2)
gbf = GradientBoostingClassifier(n_estimators=300)

# data and target are loaded earlier, as in the quiz notebook
scores_rf = cross_val_score(rf, data, target, scoring='balanced_accuracy', cv=cv, n_jobs=2)
print(scores_rf.mean())

scores_gbf = cross_val_score(gbf, data, target, scoring='balanced_accuracy', cv=cv, n_jobs=2)
print(scores_gbf.mean())

Actually, the exercise does not require a RepeatedKFold; instead, it asks you to repeat a KFold cross-validation 10 times and check how many times the mean test score of one classifier is better than the other's.

This is a bit different from using RepeatedKFold, which only gives you a single collection of scores (and thus a single mean) over all the splits.
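A minimal sketch of that procedure, assuming `data` and `target` are available as in the quiz notebook (here I substitute a synthetic dataset from `make_classification`, and use smaller settings than the quiz to keep the run short):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Synthetic stand-in for the quiz data; replace with the real data/target.
data, target = make_classification(n_samples=200, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=50, n_jobs=2, random_state=0)
gbf = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Repeat a shuffled KFold 10 times with a different random_state each time,
# and count how often the random forest's mean test score is the higher one.
rf_wins = 0
for seed in range(10):
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    mean_rf = cross_val_score(
        rf, data, target, scoring="balanced_accuracy", cv=cv, n_jobs=2
    ).mean()
    mean_gbf = cross_val_score(
        gbf, data, target, scoring="balanced_accuracy", cv=cv, n_jobs=2
    ).mean()
    if mean_rf > mean_gbf:
        rf_wins += 1

print(f"Random forest wins {rf_wins} out of 10 repetitions")
```

The point is that the comparison happens once per repetition, so even a small difference in means can show up as one model winning nearly every time.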

I see, however, that we had a small mistake in the correction: we should use a different KFold with shuffling and pass a different random_state for each repetition. It does not change the answer.

I made the changes here: https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/commit/8fab1e2517a49d6b1a8a12fadb4a4adebe1478a1

@lfarhi @MarieCollin Could you make the changes in FUN?

It’s also fixed in FUN
