cv = KFold(n_splits=10...) vs cv = 10

I was wondering what the difference is between setting cv = KFold(n_splits=10, shuffle=True, random_state=seed) and simply passing cv = 10 inside cross_val_score?

I did cv = 10 for Question 5 and got GB outperforming RF each of the 10 times. I also notice the balanced accuracy scores were much closer than the answer’s.

Here is what I did:

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

gradientboost = GradientBoostingClassifier(n_estimators=300)
randomforest = RandomForestClassifier(n_estimators=300)

gb_mean_list = []
rf_mean_list = []

# Repeat the 10-fold cross-validation 10 times and store the mean
# balanced accuracy of each repetition for both models.
for _ in range(10):
    gb_score = cross_val_score(gradientboost, data, target,
                               cv=10, scoring="balanced_accuracy",
                               n_jobs=2)
    gb_mean_list.append(gb_score.mean())

    rf_score = cross_val_score(randomforest, data, target,
                               cv=10, scoring="balanced_accuracy",
                               n_jobs=2)
    rf_mean_list.append(rf_score.mean())

It also turns out that using cv = 10 produces a lower balanced accuracy of ~0.53 whereas the answer had balanced accuracies of around 0.59 - 0.61.

Because of this, I am not sure why Question 6 says HistGradientBoostingClassifier performs the best, since ~0.58 is lower than what both GB and RF achieved in the previous question.

cv=10 is equivalent to KFold(n_splits=10) for regression and StratifiedKFold(n_splits=10) for classification.
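You can verify this mapping yourself: `sklearn.model_selection.check_cv` resolves an integer `cv` into the splitter that `cross_val_score` would actually use under the hood. A small sketch with toy targets:

```python
import numpy as np
from sklearn.model_selection import check_cv

y_class = np.array([0, 1] * 10)     # toy classification target
y_reg = np.linspace(0.0, 1.0, 20)   # toy regression target

# With a classifier and a discrete target, cv=10 becomes StratifiedKFold.
print(type(check_cv(10, y_class, classifier=True)).__name__)
# → StratifiedKFold

# Otherwise it becomes a plain KFold.
print(type(check_cv(10, y_reg, classifier=False)).__name__)
# → KFold
```

Note that neither resulting splitter shuffles: both are created with their default shuffle=False.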

shuffle=True will shuffle the dataset before splitting it into folds.
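A minimal sketch of the difference: without shuffle, KFold takes contiguous chunks of the data in their original order, whereas shuffle=True permutes the indices (reproducibly, via random_state) before forming the folds.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(-1, 1)  # 12 dummy samples

# Default: test folds are contiguous blocks in the original order.
plain = KFold(n_splits=3)
print([test.tolist() for _, test in plain.split(X)])
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

# shuffle=True: each fold is a random subset of the indices.
shuffled = KFold(n_splits=3, shuffle=True, random_state=0)
print([test.tolist() for _, test in shuffled.split(X)])
```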

Yep, we are in the noise of the score distribution. We will have to rethink some of the quiz questions whose answers might change depending on the randomness.

Thank you for this explanation. I read the docs for KFold, but to make it clearer: does it mean that KFold, without shuffle, relies on the "natural randomness" of the dataset?

Rather than the "natural randomness of the data", one should rather say the "natural arbitrariness of the ordering of the data": depending on how the data was collected, the order can either be deterministic (e.g. ordered by event time) or result from some arbitrary process that could be considered random in one way or another.
