F1-Scoring and Dummy Classifier

On the official scikit-learn website, one reads that:

`constant` always predicts a constant label that is provided by the user. **A major motivation of this method is F1-scoring, when the positive class is in the minority.**

Could you please elaborate on the bold part? That would be really helpful. Thanks :slight_smile:

In the “Classification metrics” lesson in Module 7 we will talk more in depth about class imbalance and how some scoring metrics can become misleading when the positive class is the minority, for example if the positive class is people carrying a rare disease.

The F1 score accounts for class prevalence slightly better than metrics such as accuracy, but it can still be misleading if the prevalence is high. A solution could be to swap the two classes, or better, to benchmark the classifier against the baseline defined by the DummyClassifier. Then the F1 score becomes informative again.
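As a small sketch of the point above (the data here is made up for illustration): on a dataset where the positive class is a small minority, a `DummyClassifier` that always predicts the majority class gets high accuracy, while its F1 score for the positive class collapses to zero. Comparing a real model's F1 against this baseline is what makes the score informative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Illustrative imbalanced labels: 950 negatives, 50 positives (5% prevalence),
# e.g. people carrying a rare disease. Features are irrelevant to the dummy.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)

# Baseline that always predicts the majority (negative) class.
baseline = DummyClassifier(strategy="constant", constant=0).fit(X, y)
pred = baseline.predict(X)

print(accuracy_score(y, pred))  # 0.95 -- looks good, but is meaningless here
print(f1_score(y, pred))        # 0.0  -- the positive class is never found
```

Any classifier worth keeping should beat this baseline's F1 by a clear margin, which is exactly what accuracy alone fails to reveal.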


I have never seen the imbalanced-learn library used in any sklearn documentation or in this MOOC to address imbalanced-dataset issues.

Does this library inherit from sklearn? Would you recommend using it to handle imbalanced datasets?

Does this library inherit from sklearn?

Yes, it’s based on the scikit-learn API.

Would you recommend using it to handle imbalanced datasets?

You can try! But keep in mind its efficacy may depend on the dataset and use case.