Quiz M1.02, Q4 - Question re StandardScaler

Steve90 · 24 May 2021 01:10

StandardScaler shifts & scales each feature so that the mean is 0, so I understand that the points should end up clustered around the origin.

StandardScaler also gives a unit Std Dev. So how does one know what the range of values will be? The answer seems to imply that it will be [-3, 3]? This surely can’t be the case as the range in y is approx. [-2, 2] in diagram B … ???

From my limited understanding of StandardScaler in the scikit-learn docs, it seems its purpose is to produce a standard normal distribution.

So my question is, can we really guess/predict what range of values StandardScaler will produce? If so, how?

Thanks for clarifying!

Marc_In_Singapore · 24 May 2021 03:17

Where do you get this?

The answer seems to imply that it will be [-3, 3]

For a standard normal distribution, about 68%, 95%, and 99.7% of the data fall within 1, 2, and 3 standard deviations of the mean, respectively (and the data are centered on 0).
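A quick way to check those percentages is to standardize some simulated data and count. This is only a sketch: the feature is hypothetical (normal draws with mean 50 and standard deviation 10), and to keep it dependency-free it reproduces by hand the arithmetic StandardScaler applies to each feature by default (subtract the mean, divide by the population standard deviation) rather than calling scikit-learn.

```python
import random
import statistics

random.seed(0)

# Hypothetical feature: 100,000 draws from a normal distribution (mean 50, std 10).
values = [random.gauss(50.0, 10.0) for _ in range(100_000)]

# Same arithmetic StandardScaler applies to each feature by default:
# subtract the mean, then divide by the population standard deviation.
mean = statistics.fmean(values)
std = statistics.pstdev(values)
scaled = [(v - mean) / std for v in values]


def fraction_within(data, k):
    """Fraction of values whose absolute value is at most k."""
    return sum(abs(v) <= k for v in data) / len(data)


for k in (1, 2, 3):
    print(f"within +/-{k}: {fraction_within(scaled, k):.3f}")

# The extremes are not capped at 3: a few points land further out.
print(f"min={min(scaled):.2f}, max={max(scaled):.2f}")
```

The fractions should come out close to 0.68, 0.95 and 0.997, while the minimum and maximum fall outside [-3, 3], so the interval describes where most of the data sits, not a hard limit.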

Steve90 · 24 May 2021 04:23

Explanation of Solution: b)

By default, the StandardScaler transformer transforms the data by centering each feature around 0.0 on average and by scaling the resulting values so that they have a standard deviation of 1.0 on the training set. In practice, this means that each feature will have values ranging from -3 to 3 as depicted on the data transformed by preprocessing B.

Marc_In_Singapore · 24 May 2021 05:57

I see. I didn’t look at the solution.

@glemaitre58, I think the text should be changed from:

In practice, this means that each feature will have values ranging from -3 to 3 as depicted on the data transformed by preprocessing X.

to:

In practice, this means that each feature will have most of its values (95%) ranging from -2 to 2 as depicted on the data transformed by preprocessing X.

glemaitre58 · 24 May 2021 07:45

[-3, 3] was intended as the interval containing more or less 99.7% of the data (+/-3 sigma). We will improve the answer to be clearer about the meaning of these bounds.
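Note that the +/-3 sigma reading assumes the feature is roughly normal. Standardizing only shifts and rescales, so the shape of the distribution (and therefore the observed range) is unchanged. A minimal sketch, using three hypothetical simulated features and a hand-written `standardize` helper that stands in for StandardScaler's default behaviour:

```python
import random
import statistics

random.seed(0)


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


n = 100_000
# Hypothetical features with different shapes.
features = {
    "normal": [random.gauss(0.0, 1.0) for _ in range(n)],
    "uniform": [random.uniform(0.0, 1.0) for _ in range(n)],
    "exponential": [random.expovariate(1.0) for _ in range(n)],
}

ranges = {}
for name, values in features.items():
    scaled = standardize(values)
    ranges[name] = (min(scaled), max(scaled))
    print(f"{name:12s} min={ranges[name][0]:6.2f} max={ranges[name][1]:6.2f}")
```

All three end up with mean 0 and standard deviation 1, yet the uniform feature stays within about +/-1.73 (that is, +/-sqrt(3)), while the exponential one bottoms out near -1 and has a long tail well beyond 3.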

glemaitre58 · 24 May 2021 08:42

The answer has been corrected (we still need to synchronize with FUN). Thanks for the feedback.