CV SCORE - Basic mistake

blooridian · 4 July 2021 12:46

Hello,

I know I am late in this course…
Anyway, while completing the WrapUp Quiz M4 Question 10, I made an error.

cv_results = cross_validate(model, data, target,
                            cv=10, return_train_score=True,
                            return_estimator=True)

I calculated the score with:

aa = [cv_results['estimator'][i][-1].score(data, target) for i in range(10)]
print(f"mean score value = {np.mean(aa)}")

The result is completely different from the one I get with cv_results["test_score"].mean().

This difference is not visible when applied to the Dummy classifier.

So my question is: what does estimator.score(data, target) do?

In any case, reading the cv_results dict directly is a far better solution, I admit…
Thanks a lot

ThomasLoock · 4 July 2021 15:36

The problem is that you are scoring each estimator on the full dataset (data, target).
But each estimator was not trained on the full dataset: it was trained on a different training subset and evaluated on the corresponding held-out test subset, and therefore the test_score for each estimator is different. Otherwise all 10 test scores would be the same, which obviously would make no sense.
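
To make this concrete, here is a minimal sketch of how the values in cv_results["test_score"] can be reproduced by hand: each fitted estimator is scored only on its own held-out fold. The synthetic data and the scaler plus LogisticRegression pipeline are assumptions for illustration, not the quiz setup. An explicit, unshuffled KFold replaces cv=10 so that the same folds can be replayed (with an integer cv and a classifier, scikit-learn would pick StratifiedKFold instead).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is not the quiz dataset).
data, target = make_classification(n_samples=500, n_features=10, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())

# Explicit, unshuffled splitter so the exact same folds can be replayed below.
cv = KFold(n_splits=10)
cv_results = cross_validate(model, data, target, cv=cv, return_estimator=True)

# Score each fitted estimator only on the samples it did not see during training.
fold_scores = [
    est.score(data[test_idx], target[test_idx])
    for est, (_, test_idx) in zip(cv_results["estimator"], cv.split(data, target))
]

# The hand-computed scores should match what cross_validate reported.
print(np.allclose(fold_scores, cv_results["test_score"]))
```

Note that the whole fitted pipeline is scored here, so the held-out samples go through the same preprocessing that was fitted on the training fold.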

glemaitre58 · 5 July 2021 10:02

As an additional remark to @ThomasLoock's answer: the full dataset contains the samples each estimator was trained on, so you will potentially always get good predictions on that training part, and this will corrupt your estimate of the statistical performance.
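
A small sketch of this effect, under assumed conditions (synthetic noisy data and a fully grown decision tree, chosen only because it tends to memorize its training set): for accuracy, the full-data score of each estimator is just the sample-weighted blend of its train score and its test score, so it is pulled toward the train score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (assumption: this is not the quiz dataset).
data, target = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)
model = DecisionTreeClassifier(random_state=0)

# Explicit, unshuffled splitter so fold sizes can be recomputed below.
cv = KFold(n_splits=10)
cv_results = cross_validate(
    model, data, target, cv=cv, return_train_score=True, return_estimator=True
)

# The approach from the question: score every estimator on the full dataset.
full_scores = np.array([est.score(data, target) for est in cv_results["estimator"]])

# Accuracy on the full data is the weighted mix of train and test accuracy.
n_test = np.array([len(test_idx) for _, test_idx in cv.split(data)])
n_train = len(data) - n_test
blended = (
    n_train * cv_results["train_score"] + n_test * cv_results["test_score"]
) / len(data)

print(f"mean test score:      {cv_results['test_score'].mean():.3f}")
print(f"mean full-data score: {full_scores.mean():.3f}")
print(np.allclose(full_scores, blended))
```

With 10 folds, roughly 90% of the samples used in each full-data score are training samples, which is why that number looks much better than the cross-validated test score.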