StandardScaler only on positive features?

Hello,

I am not sure I understand an assertion in the quiz about StandardScaler: “transforms positive-only features into negative or positive values”.
I mean, it is logical in the sense that if we subtract the mean from negative values the effect will be worse than no transformation at all. But when we look at the plots of the previous question, we see that the y coordinate of the data can be negative. So I don’t really understand why such data were used in the previous question with the plot?

Thank you very much !

Geoffrey

Yes, but Q4 and Q5 are dissociated. In Q5, we don’t refer to the data of Q4. We are asking about a general statement regarding StandardScaler.

Thank you for your answer.

I’m sorry, I still have difficulties understanding. Q5 asks about a general statement, but if it is true in general it should be true in particular. In Q4 we talk about the same StandardScaler, right? So how could it work on values that are not necessarily positive (cf. the y coordinate can be negative) and then in Q5 say it is ONLY for positive features? Or am I totally misunderstanding the whole thing? (By the way, I’m French, sorry if my English is approximate.)

Geoffrey

Sorry, but I don’t understand the question. I will try to exemplify with a simple coding example. Let’s take a single feature with only positive values and scale it with a StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.reshape([1, 2, 3, 4, 5], (5, 1))  # one feature with only positive values
StandardScaler().fit_transform(data)

We will get the following output:

array([[-1.41421356],
       [-0.70710678],
       [ 0.        ],
       [ 0.70710678],
       [ 1.41421356]])

We subtract the mean and divide by the standard deviation. Thus, a feature with only positive values can get negative values after scaling.
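To see why, we can write the same computation by hand (a quick sketch reusing the data above; np.std computes the population standard deviation, the convention StandardScaler uses):

import numpy as np

data = np.reshape([1, 2, 3, 4, 5], (5, 1))
mean = np.mean(data)          # 3.0: every value below the mean becomes negative
(data - mean) / np.std(data)  # reproduces the StandardScaler output above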

Performing this operation on a feature containing negative and positive values will have a similar effect.

data = np.reshape([-2, -1, 0, -1, 2], (5, 1))  # a feature with negative and positive values
StandardScaler().fit_transform(data)
array([[-1.17953565],
       [-0.44232587],
       [ 0.29488391],
       [-0.44232587],
       [ 1.76930347]])

Subtracting the mean is equivalent to centering the data at the center of the feature space. Dividing by the standard deviation will dilate or contract the values around this center.
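We can check both properties on the scaled feature with a quick verification along these lines:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.reshape([-2, -1, 0, -1, 2], (5, 1))
scaled = StandardScaler().fit_transform(data)
print(scaled.mean())  # ~0.0 (up to floating-point error): the feature is centered
print(scaled.std())   # 1.0: the values are dilated/contracted to unit spread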

Thank you very much for the examples.

In your second example, indeed it “works” with negative values, but here I think the mean is “added” for negative numbers and not subtracted; otherwise -2 could not become -1.179…, right?

Then I really don’t understand why in the quiz Q5 it is said (as a correct answer) that the StandardScaler “transforms positive-only features”, whereas your last example shows it can also work with non-positive features. But maybe I misunderstood what is meant by a positive-only feature…

Anyway, thank you very much for the detailed answer.

Geoffrey

No, the mean is really subtracted. Let’s do it by hand with only NumPy:

mean = np.mean(data)   # data is the same array as above; the mean is -0.4
scale = np.std(data)   # population standard deviation, as StandardScaler uses
(data - mean) / scale  # identical to the StandardScaler output
array([[-1.17953565],
       [-0.44232587],
       [ 0.29488391],
       [-0.44232587],
       [ 1.76930347]])

OK, now I understand your question. I think the barrier here is the English formulation: “positive-only features” does not mean “only positive features”. You should read it as: “StandardScaler transforms features containing only positive values into features containing both negative and positive values.”
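To make this reading concrete, here is a small check on the first example (the input holds only positive values, the output holds both signs):

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.reshape([1, 2, 3, 4, 5], (5, 1))  # positive-only feature
scaled = StandardScaler().fit_transform(data)
print((data > 0).all())    # True: the input values are all positive
print((scaled < 0).any())  # True: the output now contains negative values
print((scaled > 0).any())  # True: ...and positive values as well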

Ah ok haha, thank you, I finally get it!

Yes, you’re right about the mean: in your last example the mean is negative, so when it is subtracted we get a double negative, which acts like an addition. That’s why -2 becomes “bigger” (-1.179…). For the last value, 2, subtracting the negative mean also increases it (2 - (-0.4) = 2.4); it ends up at 1.76 because we then divide by the standard deviation, which is greater than 1.
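Checking the last value by hand to be sure (with the same data as in your example):

import numpy as np

data = np.reshape([-2, -1, 0, -1, 2], (5, 1))
mean = np.mean(data)       # -0.4: the mean is negative
scale = np.std(data)       # about 1.3565
print((2 - mean) / scale)  # 2 - (-0.4) = 2.4, and 2.4 / 1.3565 ≈ 1.7693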

Thank you so much, I’m glad I finally got it!
