How to balance classifiers correctly

In general, when working on classification tasks, the data is almost always imbalanced. Which data oversampling method is good, and how does model training change in these situations?

I would refer to the following example from the imbalanced-learn documentation (version 0.9.0): Fitting model on imbalanced datasets and how to fight bias

That said, an issue when people try to deal with class imbalance is linked to the metric used to evaluate the model. Indeed, it might be better to evaluate such a model with the Brier score, or with a metric that takes into account the full decision function instead of only the decision at a specific threshold.
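For concreteness, here is a minimal sketch (my own illustration, not from the original post) on a synthetic imbalanced dataset, contrasting accuracy at the default 0.5 threshold with the Brier score, which evaluates the predicted probabilities themselves:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic 95%/5% binary problem.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# Accuracy only sees the hard decision at the default 0.5 threshold, so a
# model that mostly predicts the majority class can still score very high.
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# The Brier score (lower is better) scores the predicted probabilities
# directly, so it reflects the quality of the full decision function
# rather than the decision at a single threshold.
print("Brier score:", brier_score_loss(y_test, clf.predict_proba(X_test)[:, 1]))
```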

Resampling methods might only have the effect of moving the decision threshold (used to provide the hard prediction); in practice they would not change the decision function itself.
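A quick way to see this, assuming imbalanced-learn is installed (again a sketch, not from the original post): after random over-sampling, the hard predictions at the 0.5 threshold shift towards the minority class, while a ranking metric such as ROC-AUC, which depends only on the decision function, barely moves:

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression().fit(X_train, y_train)
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)
oversampled = LogisticRegression().fit(X_res, y_res)

for name, model in [("plain", plain), ("oversampled", oversampled)]:
    proba = model.predict_proba(X_test)[:, 1]
    # The hard predictions change a lot (many more predicted positives)...
    print(name, "predicted positives:", int((proba >= 0.5).sum()))
    # ...but the ranking of the samples, and hence ROC-AUC, is nearly the same.
    print(name, "ROC-AUC:", round(roc_auc_score(y_test, proba), 4))
```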

In terms of methods to use, I would only recommend random under-sampling or over-sampling with an additional ensemble on top (although this increases the computational cost by a lot). I would not recommend fancier methods such as SMOTE or NearMiss, since they do not scale and do not provide a significant improvement in practice.
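As a sketch of the "random resampling plus ensemble" recipe (assuming imbalanced-learn is installed), `BalancedBaggingClassifier` fits each member of a bagging ensemble on a randomly under-sampled bootstrap, which averages out the variance that under-sampling introduces:

```python
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Each of the 50 members is fitted on a balanced, randomly under-sampled
# bootstrap (the default base estimator is a decision tree); more members
# mean better averaging but a higher computational cost.
ensemble = BalancedBaggingClassifier(n_estimators=50, random_state=0)
ensemble.fit(X_train, y_train)
print("ROC-AUC:", roc_auc_score(y_test, ensemble.predict_proba(X_test)[:, 1]))
```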


> Indeed, it might be better to evaluate such a model with […] a metric that takes into account the full decision function instead of only the decision at a specific threshold.

What do you mean by that? Could you elaborate a little?

I meant specifically the ROC curve and the Precision-Recall curve.

Hence ROC-AUC and average precision (AP), respectively?

Yes exactly.
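For reference, both summary metrics are one-liners in scikit-learn, computed from the continuous scores rather than the hard 0/1 predictions (a minimal sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

# ROC-AUC summarises the ROC curve; average precision (AP) summarises the
# Precision-Recall curve. On heavily imbalanced data, AP is often the more
# informative of the two, since it focuses on the minority (positive) class.
print("ROC-AUC:", roc_auc_score(y_test, scores))
print("AP:", average_precision_score(y_test, scores))
```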