Overfitting

What should be the variance (score difference) between the train and the test set for overfitting?

The variance is not the difference between train score and test score. Which paragraph led you to this interpretation? Maybe it’s not clear.

In statistics, the variance of a random variable is the expected value of the squared difference between the variable and its own expected value. However, we tried to phrase the MOOC so that it gives methodological intuitions without requiring formal training in statistics.
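For reference, the definition above can be written as follows, where the symbol X is just a placeholder name for the random variable (it is not notation used in the MOOC):

```latex
\operatorname{Var}(X) = \mathbb{E}\left[\left(X - \mathbb{E}[X]\right)^{2}\right]
```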

If I rephrase the question as “how large should the difference between train score and test score be to consider that a model overfits?” then the answer is “it depends”.

It depends on the scale of the performance metrics:

  • if it’s accuracy, for instance, then the maximum score is 100% and the minimum score to expect from any useful model is the chance level, which depends on the number of data points in the majority class(es) relative to the total number of data points.

  • if you observe a model with a train accuracy of 78% and a test accuracy of 75%, you could say that the model is overfitting a bit because the test accuracy is a few percentage points lower than the train accuracy. But this is probably not the main problem: the train accuracy itself is not great, which means that underfitting is a much bigger problem for this model than overfitting.

  • also note that, for some tasks, given the features at hand, it is impossible to predict the value of the outcome variable with 100% accuracy. For instance, try to predict whether it will rain tomorrow with only the information about the rain of the past few days as input features. In this case we say that there is a large “irreducible error” (because we lack good predictive features). Whether the difference between train and test error represents significant overfitting then depends on how large it is relative to the irreducible error (the complement of the test accuracy of the best possible model). Unfortunately, this quantity cannot be estimated with standard scikit-learn tools in practical settings.
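To make the points above more concrete, here is a minimal sketch of how one might look at the train score, the test score and the chance level side by side. The synthetic dataset, the decision tree models, the 5-fold cross-validation and the helper name `summarize` are all assumptions chosen for illustration, not something prescribed by the MOOC.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary classification data (illustrative only).
# flip_y adds label noise, so 100% test accuracy is out of reach.
X, y = make_classification(
    n_samples=1_000,
    n_features=20,
    n_informative=5,
    weights=[0.7, 0.3],
    flip_y=0.1,
    random_state=0,
)


def summarize(model, X, y):
    """Return the mean train and test accuracy over 5-fold cross-validation."""
    cv_results = cross_validate(
        model, X, y, cv=5, scoring="accuracy", return_train_score=True
    )
    return cv_results["train_score"].mean(), cv_results["test_score"].mean()


# Chance level: a model that always predicts the majority class.
_, chance_level = summarize(DummyClassifier(strategy="most_frequent"), X, y)

# A fully grown tree can memorize the training set.
deep_train, deep_test = summarize(DecisionTreeClassifier(random_state=0), X, y)

# A depth-limited tree is much more constrained.
shallow_train, shallow_test = summarize(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y
)

print(f"Chance level (majority class): {chance_level:.3f}")
print(f"Deep tree:    train={deep_train:.3f}  test={deep_test:.3f}")
print(f"Shallow tree: train={shallow_train:.3f}  test={shallow_test:.3f}")
```

Reading the three lines together is more informative than looking at the train/test gap alone: a large gap with a near-perfect train score points to overfitting, while a small gap with a train score close to the chance level points to underfitting. Note that this sketch does not estimate the irreducible error.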

Thank you for your answer.