Add formatting to decimal output

reshama · 14 June 2021 19:57

In the section “Fitting a scikit-learn model on numerical data”, sub-section “Preprocessing for numerical features”, can we add this option to round the numbers in the output of the .describe() function?

pd.options.display.float_format = "{:,.3f}".format
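To make the effect of this option concrete, here is a minimal sketch. The toy column and its values are made up for illustration, and `pd.option_context` is used instead of assigning to `pd.options` so that the setting only applies inside the `with` block.

```python
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
df = pd.DataFrame({"capital-gain": [0.0, 0.0, 1087.0, 99999.0]})

# Default rendering: six decimals, no thousands separator.
default_text = df.describe().to_string()

# option_context applies the formatter only inside the `with` block,
# so the global pandas settings are left untouched afterwards.
with pd.option_context("display.float_format", "{:,.3f}".format):
    formatted_text = df.describe().to_string()

print(default_text)
print(formatted_text)
```

With the formatter active, the maximum is rendered as 99,999.000 rather than 99999.000000.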

glemaitre58 · 14 June 2021 20:23

I am not in favour of adding this code. The reason is that we made an effort to remove any unnecessary code that tweaks pandas, matplotlib, NumPy behaviour because it could be intimidating for beginners.

In materials that require a strong knowledge of NumPy / pandas, I would certainly take the advice.

reshama · 14 June 2021 20:27

Got it.
I’ve personally found it difficult to read numbers when they are output in scientific notation like that.
But, I understand the reasoning here.

glemaitre58 · 14 June 2021 20:38

Another possibility would be to use an rc file, so that no code has to be written in the notebook.
For this MOOC session it might be difficult, but we should certainly look at this option for the next session.
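For context, an rc file is a configuration file that a tool reads at startup. The thread does not say which mechanism is meant. One possible reading, sketched below as an assumption, is an IPython startup script: IPython runs the Python files found in a profile's startup directory when a kernel starts. The file name is hypothetical, and whether the FUN platform allows such a file is not known from this discussion.

```python
# Hypothetical startup file, e.g. saved as
# ~/.ipython/profile_default/startup/00-pandas-display.py
# (assumption: the notebook kernel is IPython and reads this directory).
import pandas as pd

# Same formatter as proposed earlier in the thread:
# thousands separator and three decimals for every displayed float.
pd.set_option("display.float_format", "{:,.3f}".format)
```

With such a file in place, the notebook cells themselves would stay free of display tweaks.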

reshama · 14 June 2021 21:20

Not sure what an rc file is. But, I think that if I can’t easily know what the count is from the .describe() function, then it’s not so useful. Even though it is the default, I cannot read those numbers and make sense of them.

reshama · 14 June 2021 23:00

Reference (documentation here).

[Screenshots: default output, then formatted output]

glemaitre58 · 15 June 2021 06:31

Uhm there is something weird here still. Executing the notebook on FUN without tweaking anything, I am getting the following output by default.

                age  capital-gain  capital-loss  hours-per-week
count  36631.000000  36631.000000  36631.000000    36631.000000
mean      38.642352   1087.077721     89.665311       40.431247
std       13.725748   7522.692939    407.110175       12.423952
min       17.000000      0.000000      0.000000        1.000000
25%       28.000000      0.000000      0.000000       40.000000
50%       37.000000      0.000000      0.000000       40.000000
75%       48.000000      0.000000      0.000000       45.000000
max       90.000000  99999.000000   4356.000000       99.000000

glemaitre58 · 15 June 2021 06:37

@reshama Did you change anything in the notebook that could explain the change?

reshama · 15 June 2021 11:10

No, when I first ran the notebook, it gave me scientific notation.
Not sure why we are getting a different view, by default.

reshama · 21 June 2021 00:36

I assume I can only change my own notebook, right? I shouldn’t be able to change the main notebook in the MOOC.

glemaitre58 · 21 June 2021 06:53

From the FUN interface, you are indeed only modifying your own notebook. We are committing changes here: GitHub - INRIA/scikit-learn-mooc: scikit-learn-mooc

The python_scripts folder indeed contains all the lectures. Apart from the quizzes, the changes we make there take effect on FUN as soon as you revert the notebook to the original.

reshama · 22 June 2021 12:51

In section “Preprocessing for Numerical Features”, after resetting the notebook to original and running the notebook from the very top cell:

this code data_train.describe(), I see this:

this code data_train_scaled.describe(), shows this:

I wonder why the formatting differs between the two code cells.

glemaitre58 · 22 June 2021 13:07

Because after scaling the mean is really close to zero, but not exactly zero, due to numerical error. So pandas switches to scientific notation to be able to show us the small numbers (e.g. -1.2…e-16).
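A minimal sketch of this behaviour, assuming made-up values: the tiny number below stands in for the kind of residual mean that scaling can leave behind and is not taken from the MOOC data. A single very small non-zero value is enough to make pandas render the whole column in scientific notation.

```python
import pandas as pd

# Hypothetical summary values; -1.2e-16 stands in for a mean that is
# "almost zero" because of floating-point rounding error.
almost_zero = pd.Series([-1.2e-16, 1.0], index=["mean", "std"])
exactly_zero = pd.Series([0.0, 1.0], index=["mean", "std"])

# A tiny non-zero value cannot be shown with six fixed decimals,
# so pandas falls back to scientific notation for the whole column.
almost_zero_text = almost_zero.to_string()

# With an exact zero, the fixed-point rendering is kept.
exactly_zero_text = exactly_zero.to_string()

print(almost_zero_text)
print(exactly_zero_text)
```

This is consistent with data_train.describe() and data_train_scaled.describe() being displayed differently even though no display option was changed.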

lesteve · 2 February 2022 12:38

tracked in Add formatting to decimal output · Issue #556 · INRIA/scikit-learn-mooc · GitHub