Imbalance in male/female samples

Hello Authors,
in the data checks above we saw that there is an important imbalance in the data collection concerning the number of male and female samples.
However, this point is not commented on after the check.


Indeed. Once we start fitting pipelines on such a dataset, one thing we could do is decompose performance metrics (e.g. accuracy) per sub-group (e.g. male vs. female, younger vs. middle-aged vs. older groups, country of origin (US vs. non-US), “race” label, …) and check whether the prediction error is significantly worse for a given subgroup. This would help detect a potential fairness problem if we were to deploy a system making automated decisions based on the predictions of this model. fairlearn is a useful open source tool to conduct such a fairness analysis and potentially put mitigation strategies in place if the original model’s decisions are found to be systematically unfair to a given sub-group.
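To make the idea concrete, here is a minimal sketch of decomposing accuracy per sub-group with plain pandas, on a tiny made-up evaluation set (the column names `sex`, `y_true`, `y_pred` and the values are illustrative, not taken from the notebook; fairlearn's `MetricFrame` automates the same computation):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical predictions on a toy evaluation set.
df = pd.DataFrame({
    "sex":    ["Male", "Male", "Male", "Female", "Female", "Female"],
    "y_true": [1, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 1, 0, 1, 0],
})

# Decompose accuracy per sub-group instead of reporting one global number.
per_group = df.groupby("sex")[["y_true", "y_pred"]].apply(
    lambda g: accuracy_score(g["y_true"], g["y_pred"])
)
print(per_group)
```

On this toy data the "Male" group gets a much higher accuracy than the "Female" group, which is exactly the kind of gap such a per-group breakdown is meant to surface before deployment.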

We should probably add a note in the notebook to hint that, because of this imbalance, this dataset can potentially lead to fairness problems if it were used to train a component of an automated decision system deployed in a real-life setting.


A note was added: Add comment on fairness of the adult_census imbalance by ArturoAmorQ · Pull Request #597 · INRIA/scikit-learn-mooc · GitHub