I was wondering: what is the difference between setting cv = KFold(n_splits=10, shuffle=True, random_state=seed) and simply passing cv = 10 inside cross_val_score?
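My understanding (this is my own sketch, not from the course material, using a toy dataset and LogisticRegression as stand-ins): for a classifier, an integer cv is turned into an *unshuffled* StratifiedKFold, so cv = 10 keeps the rows in their original order and preserves class proportions, while KFold(shuffle=True, ...) shuffles but drops the stratification.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Toy stand-in data (not the course dataset).
X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000)

scores_int = cross_val_score(clf, X, y, cv=10)
scores_strat = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10))
scores_kfold = cross_val_score(
    clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0)
)

# cv=10 and an unshuffled StratifiedKFold produce identical folds,
# hence identical scores; the shuffled KFold generally differs.
assert (scores_int == scores_strat).all()
```

If that is right, it would also mean that with cv = 10 the folds are exactly the same on every call, so repeating cross_val_score in a loop only measures the estimators' own randomness, not cross-validation variability.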
I used cv = 10 for Question 5 and got GB outperforming RF in each of the 10 repetitions. I also noticed that the balanced accuracy scores were much closer together than those in the answer.
Here is what I did:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

gradientboost = GradientBoostingClassifier(n_estimators=300)
randomforest = RandomForestClassifier(n_estimators=300)

gb_mean_list = []
rf_mean_list = []
for i in range(10):
    gb_score = cross_val_score(gradientboost,
                               data, target,
                               cv=10,
                               scoring="balanced_accuracy",
                               n_jobs=2)
    gb_mean_list.append(gb_score.mean())
    rf_score = cross_val_score(randomforest,
                               data, target,
                               cv=10,
                               scoring="balanced_accuracy",
                               n_jobs=2)
    rf_mean_list.append(rf_score.mean())
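If I wanted each repetition to actually use different folds, I suppose I could seed a shuffled KFold differently on each pass, something like the sketch below (my own guess at how to do it, shrunk down with a toy dataset and small n_estimators so it runs quickly; in the notebook X and y would be data and target):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# Toy stand-in for the course's `data` / `target`.
X, y = make_classification(n_samples=300, random_state=0)

gb_means = []
for seed in range(3):  # 3 repetitions just to keep the sketch fast
    # A differently-seeded shuffle each pass gives genuinely different folds.
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(
        GradientBoostingClassifier(n_estimators=50, random_state=0),
        X, y, cv=cv, scoring="balanced_accuracy",
    )
    gb_means.append(scores.mean())

# The spread of gb_means now reflects fold-to-fold variability as well,
# not just the estimator's internal randomness.
```

I'd be curious whether the course answer used something like this, since it might explain why its scores vary more between repetitions than mine did.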
It also turns out that using cv = 10 produces a lower balanced accuracy of about 0.53, whereas the answer reports balanced accuracies of around 0.59 to 0.61. Because of this, I am not sure why Question 6 says HistGradientBoostingClassifier performs the best, since ~0.58 is lower than what both GB and RF achieved in the previous question.