Dealing with imbalanced data

First, how do we deal with imbalanced data (here, imbalanced classes)?
Secondly, is it a problem to have an imbalanced feature vector, or does only the label need to be balanced?

Dealing with imbalanced data (for the target variable) is an important topic that we would like to address better in a future version of the MOOC. There are plenty of fancy methods implemented in imbalanced-learn. However, be aware that more sophisticated methods (e.g. SMOTE over-sampling and its derivatives) can be very computationally intensive and do not always yield better results. Here are a few personal recommendations:

  • make sure that you have at least a few thousand examples for the minority class (the class with the fewest examples): do not expect miracles if the minority class has only a few hundred examples. If the minority class has too few examples, then maybe machine learning is not the right tool to tackle your problem.

  • try BalancedBaggingClassifier / Regressor to wrap any traditional estimator instance into a model that is better able to deal with imbalanced data. This will automatically subsample the over-represented classes to train an ensemble of models on more balanced variations of the original training set, and have the models in the ensemble vote to get the final prediction.

  • always evaluate your model by computing the performance metrics (e.g. precision, recall, f1 score, balanced accuracy…) on a test set with the original class balance: you are free to rebalance the training set any way you like, but the test set should never be rebalanced. Otherwise the values of the performance metrics would not represent the “real world” performance of your model, because in the “real world” your model will not make predictions on rebalanced data.
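To make the last point concrete, here is a minimal scikit-learn-only sketch (the dataset, class ratio, and estimator are made up for illustration): the training set could be resampled freely, but the metrics are computed on an untouched, stratified test split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 90% / 10% class ratio (illustrative only).
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

# Stratify so the test set keeps the original class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Any resampling would happen on (X_train, y_train) only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on the untouched, original-balance test set.
y_pred = model.predict(X_test)
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("f1 score:", f1_score(y_test, y_pred))
```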

When you mention “imbalanced feature vector”, I assume you mean input categorical variables with imbalanced categories. Here I would just recommend considering collapsing rare categories (e.g. fewer than 5 or 10 occurrences in the training set) into related but more frequent categories (depending on your expert knowledge of the features), or into a single “rare” category created for this purpose.

Otherwise, if you keep the rare categorical values without collapsing them, they can generate many spurious dimensions when you one-hot encode the variable, and this can cause the model to overfit.
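A minimal sketch of the collapsing idea in plain Python (the threshold, function name, and toy categories are made up; in a real train/test workflow the counts should be computed on the training set only):

```python
from collections import Counter

def collapse_rare(values, min_count=10, rare_label="rare"):
    """Replace categories seen fewer than min_count times with a single label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else rare_label for v in values]

# Toy training column: "blue" and "red" are frequent, the rest are rare.
colors = ["blue"] * 50 + ["red"] * 30 + ["teal"] * 2 + ["mauve"] * 1
collapsed = collapse_rare(colors, min_count=10)
print(sorted(set(collapsed)))  # ['blue', 'rare', 'red']
```

Recent versions of scikit-learn can also do this automatically at encoding time via the `min_frequency` parameter of `OneHotEncoder`, which groups infrequent categories together.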


Thanks! Very detailed answer.

Thanks for this reply. I also had a similar question, but now I have a doubt about my practice: when using a pipeline that includes a RandomOverSampler step, am I also applying it on my test set?

Let’s say I have
model = make_pipeline(preprocessing, random_under_sampler, classifier)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In this case, is the RandomUnderSampler also applied to X_test? How should I manage this when using a pipeline (same concern if I use one of the imblearn “balanced” classifiers)?

Sorry if this is a simple and silly question.

Thanks

I assume you use imblearn.under_sampling.RandomUnderSampler and imblearn.pipeline.Pipeline provided by the imbalanced-learn project.

If this is the case, then all is fine because this Pipeline implementation is smart enough to only call the fit_resample method of the sampler at fit time and do nothing at predict time. Read its docstring in the online API documentation linked above for more details.

The sklearn.pipeline.Pipeline implementation does not support resampling. It assumes that the transformers always generate one transformed output sample for one input sample.

Thanks, yes, when sampling I am using the imblearn pipeline and not the scikit-learn one. Clear now. By the way, I find it really difficult to deal with imbalanced datasets (attrition rate, for example). I was not aware of the existence of the imblearn “balanced” classifiers. A first check suggests they do not give better results than sampling + a classical classifier, but I will keep searching.

Thanks again for help! Bye