Imbalance in male/female samples

Hello Authors,
in the data checks above we saw that there is an important imbalance in the data collection concerning the number of male and female samples.
However, this point is not commented on after the check.


Indeed. Once we start fitting pipelines on such a dataset, one thing we could do is decompose performance metrics (e.g. accuracy) per sub-group (e.g. male vs. female, younger vs. middle-aged vs. older groups, country of origin (US vs. non-US), “race” label, …) and check whether the prediction error is significantly worse for a given subgroup. This would help detect a potential fairness problem if we were to deploy a system making automated decisions based on the predictions of this model. fairlearn is a useful open source tool to conduct such a fairness analysis and potentially put mitigation strategies in place if the original model’s decisions are found to be systematically unfair to a given sub-group.
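To make the idea concrete, here is a minimal sketch of decomposing accuracy per sub-group with plain pandas, on a tiny made-up evaluation set (the column names `sex`, `y_true`, `y_pred` and the values are illustrative, not taken from the notebook; fairlearn's `MetricFrame` automates the same computation):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical predictions on a toy evaluation set.
df = pd.DataFrame({
    "sex":    ["Male", "Male", "Male", "Female", "Female", "Female"],
    "y_true": [1, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 1, 0, 1, 0],
})

# Decompose accuracy per sub-group instead of reporting one global number.
per_group = df.groupby("sex")[["y_true", "y_pred"]].apply(
    lambda g: accuracy_score(g["y_true"], g["y_pred"])
)
print(per_group)
```

On this toy data the "Male" group gets a much higher accuracy than the "Female" group, which is exactly the kind of gap such a per-group breakdown is meant to surface before deployment.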

We should probably add a note in the notebook to hint that, because of this imbalance, this dataset can potentially lead to fairness problems if it were used to train a component of an automated decision system deployed in a real-life setting.


A note was added: Add comment on fairness of the adult_census imbalance by ArturoAmorQ · Pull Request #597 · INRIA/scikit-learn-mooc · GitHub