Q6 wrap quiz

Hi,
I’m a bit puzzled by the solution to Q6 of the wrap-up quiz. You say:

We see that the gap between train and test scores is large. In addition, the average score of on the training sets is good while the average scores on the testing sets is really bad. They are the signs of a overfitting model.

and these are the results I obtained:
[screenshot: wrap-up quiz Q6 results]

So I’m wondering what your definition of a large gap is. As you can see, the difference is about 0.17. In your mind, what is a small or a large difference?

Along the same lines, what is a good score for a training set? If, like here, my model still makes mistakes on 30% of my training set, should I consider that a good score?
And if my model is only about 70% accurate on my training set, can I really consider that 53% is really bad for the test set?

Sorry, but I have a problem with subjective terms such as small/large or good/bad when there is no definition of them in the cases where they are used.

Thanks for your help.

PS: this is the code I used:

import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features, then fit a k-nearest neighbors classifier
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
# 10-fold cross-validation, keeping the train scores to compare with the test scores
model_scores = cross_validate(model, data, target, cv=10,
                              scoring="balanced_accuracy", return_train_score=True)
scores = pd.DataFrame(model_scores)
print(f"mean test_scores is {scores['test_score'].mean():.3f} +/- {scores['test_score'].std():.3f}")
print(f"mean train_scores is {scores['train_score'].mean():.3f} +/- {scores['train_score'].std():.3f}")

In detail, the scores I obtained are:
[screenshot: per-fold train and test scores for the wrap-up quiz]

Specifically, to know whether a model underfits, generalizes, or overfits, we look at the distribution of the scores. You can consider the gap large here because the two distributions do not overlap much. A 50% accuracy classifier is as good/bad as guessing (supposing that you have a binary problem with balanced classes).

Here we compare two scores, and that is why we consider the training score relatively good compared to the testing score.

If you want to know whether a model is good in absolute terms, that is another story: the application will tell you whether a model performing at 70% accuracy is good enough.
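
Concretely, here is a minimal sketch of what I mean by looking at the distributions rather than only the means (it assumes the model_scores dictionary computed by the code in your first post):

# Per-fold scores, not only their means, to judge how much the two score
# distributions overlap (assumes `model_scores` from the code in the first post).
import pandas as pd

scores = pd.DataFrame(model_scores)[["train_score", "test_score"]]
print(scores.describe())               # quartiles of each distribution
scores.plot.hist(bins=10, alpha=0.5)   # visual check of the overlap (needs matplotlib)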

Yes, I agree with that statement, but it is quite different from what you wrote in the solution.
I am still wondering about the large gap, though. You say you considered the difference large because the distributions of the two scores do not overlap much. But let us imagine a totally unlikely test score of 0.68 +/- 0.01 and a train score of 0.72 +/- 0.01. These distributions do not overlap much either, but can we still say there is a large gap between them?

Thanks for helping my old brain :thinking:

I would not consider that a large gap in practice, because we are dealing with a difference of 4% +/- 1%. In the original example, we had a gap of about 20% accuracy.

Then, if you want a quantification of what is large, one could always run a statistical test between the score distributions and get a statistic on the difference between them. One of the reasons why we don’t go down this path is that it is quite complex and not very friendly for an introduction.

I am linking to a comment by @ogrisel that gives an additional example in scikit-learn using such statistical testing: Question 2 - suggestion - #4 by ogrisel
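
For illustration only, a naive paired t-test on the per-fold scores could look like the sketch below (assuming the model_scores dictionary from the code in the first post); keep in mind that cross-validation folds are not independent, which is exactly why the linked example uses a corrected test instead.

# Naive paired t-test between the per-fold train and test scores
# (assumes `model_scores` from the code in the first post).
# CV folds share training data, so this test is optimistic; the linked
# scikit-learn example uses a corrected variant for that reason.
from scipy.stats import ttest_rel

t_stat, p_value = ttest_rel(model_scores["train_score"], model_scores["test_score"])
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")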


I understand that doing statistical tests is beyond the scope of the course, but perhaps you could say something like this:

We see that the gap between the train and test scores is around 20%, which is quite large. In addition, the average score on the training sets is good enough (70%), while the average score on the testing sets is really bad, since 50% accuracy is what a coin flip would give. These are the signs of an overfitting model.

In the absence of statistical tests, I think it is better to use less affirmative statements, or to add a few quick explanations, to stay in scope for a friendly introduction.
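
For example, the coin-flip comparison could be backed by a quick chance-level baseline instead of a statistical test; here is a sketch, assuming the same data and target as in the quiz:

# Chance-level baseline: a DummyClassifier ignores the features, so its
# cross-validated score shows what "guessing" achieves on this target
# (assumes the quiz's `data` and `target`).
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

dummy = DummyClassifier(strategy="most_frequent")
dummy_results = cross_validate(dummy, data, target, cv=10,
                               scoring="balanced_accuracy")
print(f"chance level: {dummy_results['test_score'].mean():.3f}")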


Point taken. We will improve the correction soonish.

Thanks for the proposal, this is really helpful to improve the content.

For completeness, I have marked this as unsolved, to remember to tackle this useful feedback from Q6 wrap quizz - #5 by echidne

@glemaitre58
A 70% training score is good relative to a 52% testing score. However, 70% on its own is not a good score for training performance.

Please help me with this. In my opinion, it is equally correct to state that the model is not understanding the training data well, i.e. that it is underfitting, or even that it is both overfitting and underfitting at the same time.


It depends. You are probably thinking that 100% is the best performance you can get, and thus that 70% is far from optimal. However, you are forgetting that we cannot always reach 100% anyway, because of the Bayes error rate.

And no, you cannot overfit and underfit at the same time.
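
Here is a tiny synthetic illustration of that irreducible error (the numbers below are arbitrary, chosen only for the demo):

# flip_y randomizes the class of 20% of the samples, so part of the labels
# simply cannot be predicted: even a flexible model stays well below 100%
# accuracy on held-out data, no matter how much it is tuned.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, flip_y=0.2,
                           random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(f"test accuracy: {scores.mean():.2f}")   # capped by the label noise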

Thanks. Understood. The Bayes error could be 30% for this dataset, and in that case it’s a good model.

Yup. It can’t overfit and underfit at the same time. Got it.

I’m also confused regarding that question.

At first, seeing a score of 70% for the train set and 53% for the test set, I considered them “kind of” bad scores, so my first thought was that the model was underfitting.

One can argue that it is overfitting because there is a gap between the train and test scores (the train score being higher), but that was also the case in a previous exercise in a notebook.

In that exercise, the notebook said that for max_depth < 10 the model is underfitted (which includes the range max_depth in [5, 10[), where we can see similar behaviour and even similar values (not-so-good scores for both, with a gap between train and test).
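
For reference, a sketch of the kind of curve I mean, with a decision tree over max_depth (assuming some classification data and target, e.g. the quiz’s):

# Train/test scores of a decision tree as a function of max_depth, to see
# where the underfitting region ends and where the train/test gap widens
# (assumes the quiz's `data` and `target`).
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

max_depths = np.arange(1, 16)
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), data, target,
    param_name="max_depth", param_range=max_depths, cv=5)
for depth, tr, te in zip(max_depths, train_scores.mean(axis=1),
                         test_scores.mean(axis=1)):
    print(f"max_depth={depth:2d}  train={tr:.2f}  test={te:.2f}")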

Solved in GitLab.