Density argument in plt.hist

teorems · 29 June 2021 09:59

I have tried to understand how this works, but I can’t. In the exercise notebook you used density=True to plot the frequency distribution so that each histogram sums to 1, and the y axis ranges from 0 to 1. When we use the same argument in the test, it doesn’t seem to produce the same result in the plot: the y axis ranges from 0 to 800.

Can you also explain in detail how the ‘stratified’ strategy of the DummyClassifier works? Does it mean that for every observation there is a 0.75 probability of predicting the most frequent value? Thanks a lot

lesteve · 29 June 2021 13:33

About your first point: density=True normalizes the histogram so that the total area of the bars equals 1, not so that the bar heights (or the counts) sum to 1; see https://stackoverflow.com/a/59074477 for example. When the bins are much narrower than 1, the bar heights can therefore be far above 1. I think keeping raw counts would be less confusing for this plot. We should probably look at this, so I am going to tag it as priority-nice-to-hav.
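To make the normalization concrete, here is a minimal sketch using np.histogram, whose density argument follows the same convention as plt.hist. The data are hypothetical (random values on a small scale, chosen only so that the bins are much narrower than 1); they are not the data from the notebook or the test.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data on a small scale, so each bin is much narrower than 1
values = rng.normal(loc=0.0, scale=0.001, size=1_000)

# Raw counts and density-normalized heights on the same 20 bins
counts, edges = np.histogram(values, bins=20)
density, _ = np.histogram(values, bins=20, density=True)
widths = np.diff(edges)

# density=True divides each count by (total count * bin width)
manual_density = counts / (counts.sum() * widths)

area = float(np.sum(density * widths))   # total area of the bars
height_sum = float(np.sum(density))      # sum of the bar heights

print("area of the bars:", area)
print("sum of bar heights:", height_sum)
print("tallest bar:", density.max())
```

The area of the bars comes out as 1, while the individual heights are far larger than 1 because each bar is very narrow. If the goal is bars whose heights sum to 1, one option is to pass weights=np.ones(len(values)) / len(values) instead of density=True.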

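About your second point, the short answer is yes: stratified predicts classes at random while preserving the class proportions of the training set, so each prediction is the most frequent class with a probability equal to that class’s share of the training set. This is mentioned in the doc, for example.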
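As an illustration of the idea, here is a small sketch that mimics the stratified strategy with plain NumPy: each prediction is drawn independently from the empirical class distribution of the training labels. This is not scikit-learn's actual code, and the labels (75% class 0, 25% class 1) are hypothetical, chosen to match the 0.75 in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training labels: 75% class 0, 25% class 1
y_train = np.array([0] * 750 + [1] * 250)

# Empirical class proportions seen during "fit"
classes, counts = np.unique(y_train, return_counts=True)
priors = counts / counts.sum()

# "predict": one independent random draw per observation,
# ignoring the features entirely
n_test = 10_000
y_pred = rng.choice(classes, size=n_test, p=priors)

frac_majority = float(np.mean(y_pred == 0))
print("class proportions:", dict(zip(classes.tolist(), priors.tolist())))
print("fraction predicted as class 0:", frac_majority)
```

About 75% of the predictions land on the most frequent class, but which observations get it is random. If the test labels follow the same proportions, the expected accuracy of such a classifier is 0.75 * 0.75 + 0.25 * 0.25 = 0.625, which is lower than always predicting the most frequent class.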

lesteve · 2 February 2022 12:39

PR in: “Explain density argument in plt.hist” by ArturoAmorQ, Pull Request #557, INRIA/scikit-learn-mooc on GitHub.