Is comparing fold-to-fold legitimate or just expedient?

When we compare cross-validation test scores fold-to-fold, is that a realistic comparison of the performance of two different models? For example, one model may win the fold-to-fold comparison, while the other model actually has more high scores overall; those scores just didn't happen to line up in the fold-to-fold pairing.
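To make the concern concrete, here is a toy illustration with made-up scores (not real results): one model wins most of the paired fold-to-fold comparisons, yet the other has the higher mean test score overall.

```python
import numpy as np

# Hypothetical cross-validation test scores for two models (illustration only).
scores_a = np.array([0.81, 0.82, 0.83, 0.84, 0.60])
scores_b = np.array([0.80, 0.81, 0.82, 0.83, 0.95])

# Fold-to-fold comparison: model A wins 4 of the 5 folds.
wins_for_a = np.sum(scores_a > scores_b)
print(f"A wins {wins_for_a} of {len(scores_a)} folds")

# Aggregate comparison: model B has the higher mean score (0.842 vs 0.780),
# despite losing most of the paired fold-to-fold comparisons.
print(f"mean A = {scores_a.mean():.3f}, mean B = {scores_b.mean():.3f}")
```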

It is certainly not the most robust way to compare two models. In a previous session of the mooc we asked learners to compare the distance between the score distributions of two models, i.e. to check whether the distance between the mean cross-validated scores was larger than their standard deviations (see this discussion in the mooc repo).
This is a more informative way to compare an arbitrary number of models, but it turned out to be confusing for an introductory course, and differences in how such distributions are sampled may lead to different conclusions in different setups.
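As a minimal sketch of that kind of comparison (the dataset, the two models, and the exact decision rule below are assumptions for illustration, not the mooc's reference solution):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data and models, just to have two score distributions to compare.
X, y = make_classification(n_samples=1_000, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

scores_a = cross_val_score(LogisticRegression(max_iter=1_000), X, y, cv=cv)
scores_b = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

mean_a, std_a = scores_a.mean(), scores_a.std()
mean_b, std_b = scores_b.mean(), scores_b.std()

# Only treat the difference as meaningful if it exceeds the spread of the
# score distributions; otherwise the two models are indistinguishable given
# the variability across cross-validation folds.
if abs(mean_a - mean_b) > max(std_a, std_b):
    print("The difference in mean scores looks meaningful.")
else:
    print("The difference is within the cross-validation noise.")
```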
