I was wondering: what is the difference between setting cv = KFold(n_splits=10, shuffle=True, random_state=seed) and simply passing cv = 10 to cross_val_score?
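To make sure we are talking about the same thing, here is a minimal sketch of the two options I mean (model, data, target and seed are just placeholders here):

from sklearn.model_selection import KFold, cross_val_score

# Option 1: an explicit splitter that shuffles the samples with a fixed seed
cv_kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
scores_kfold = cross_val_score(model, data, target, cv=cv_kfold,
                               scoring="balanced_accuracy")

# Option 2: just an integer (which, if I understand correctly, gives 10
# stratified folds without shuffling when the estimator is a classifier)
scores_int = cross_val_score(model, data, target, cv=10,
                             scoring="balanced_accuracy")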
I used cv = 10 for Question 5 and GB outperformed RF in each of the 10 repetitions. I also noticed that the balanced accuracy scores of the two models were much closer to each other than in the provided answer.
Here is what I did:
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# data and target are the features and labels prepared earlier in the exercise.
gradientboost = GradientBoostingClassifier(n_estimators=300)
randomforest = RandomForestClassifier(n_estimators=300)

gb_mean_list = []
rf_mean_list = []

# Repeat the 10-fold cross-validation ten times and store the mean
# balanced accuracy of each run for both models.
for _ in range(10):
    gb_score = cross_val_score(gradientboost, data, target,
                               cv=10,
                               scoring="balanced_accuracy",
                               n_jobs=2)
    gb_mean_list.append(gb_score.mean())

    rf_score = cross_val_score(randomforest, data, target,
                               cv=10,
                               scoring="balanced_accuracy",
                               n_jobs=2)
    rf_mean_list.append(rf_score.mean())
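For the record, this is roughly how I compared the two lists to arrive at "GB outperforming RF each of the 10 times" (the comparison itself was not in the snippet above, so this is just a sketch):

# Count in how many of the 10 repetitions GB's mean score beat RF's.
wins = sum(gb > rf for gb, rf in zip(gb_mean_list, rf_mean_list))
print(f"GB beat RF in {wins} out of {len(gb_mean_list)} repetitions")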
It also turns out that using cv = 10 produces a lower balanced accuracy of ~0.53, whereas the answer reports balanced accuracies of around 0.59 to 0.61.
Because of this, I am not sure why Question 6 says HistGradientBoostingClassifier performs best, as ~0.58 is lower than what both GB and RF achieved in the previous question.
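For context, this is roughly how I would evaluate HistGradientBoostingClassifier under the same setup so the numbers are comparable (a sketch; max_iter=300 is my own choice, not something from the exercise):

from sklearn.ensemble import HistGradientBoostingClassifier

# Same 10-fold CV and balanced accuracy scoring as above.
histgb = HistGradientBoostingClassifier(max_iter=300)
hgb_score = cross_val_score(histgb, data, target,
                            cv=10,
                            scoring="balanced_accuracy",
                            n_jobs=2)
print(hgb_score.mean())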