Different scale for categorical and numerical data?

Hello,

I think I missed something. In the lecture “Using numerical and categorical variables together”, we are using this pipeline:
[image: the pipeline from the lecture]

The OneHotEncoder handling categorical data outputs zeroes and ones, while the StandardScaler handling numerical data outputs values centred around zero. This means that categorical and numerical values are not in the same range.
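For reference, the pipeline looks roughly like this (reconstructed from memory, so the column names are just placeholders):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column names, just to show the structure
categorical_columns = ["workclass", "education", "marital-status"]
numerical_columns = ["age", "hours-per-week"]

preprocessor = ColumnTransformer([
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    ("standard-scaler", StandardScaler(), numerical_columns),
])

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
```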
Why didn’t we use a Normaliser instead for the numerical data? Or put the StandardScaler after the funnel (the ColumnTransformer), right before the LogisticRegression?

Maybe that LogisticRegression is not affected by scaling at all, but if that’s the case, why bother with a StandardScaler for numerical data in the first place?

Thanks in advance for your answers 🙂.

Hi Mirzon, when you normalize, you are basically transforming the data so that each sample (row) has a norm of 1. Standardizing, on the other hand, centers each column and removes its scale, which generally helps the algorithms converge faster. For more details check the documentation: 6.3. Preprocessing data — scikit-learn 0.24.2 documentation
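A quick toy example to show the difference (the numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# Normalizer rescales each *row* to unit (L2) norm
print(Normalizer().fit_transform(X))

# StandardScaler centers each *column* and divides by its standard deviation
print(StandardScaler().fit_transform(X))
```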

1 Like

Ok, sorry, the Normaliser was a bad example; I should have said MinMaxScaler.

But did you get the main point I was trying to make? That the ranges are different?

Thank you for your answer, by the way.

They are not exactly on the same scale but approximately on the same scale.

With the StandardScaler, most of the numerical values will be between -3 and 3, and with one-hot encoding all the values will be either 0 or 1.
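You can check this on a made-up column, for example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
ages = rng.normal(loc=40, scale=12, size=(1000, 1))  # a toy "age" column

scaled = StandardScaler().fit_transform(ages)
print(scaled.min(), scaled.max())
# roughly -3 and 3: values more than 3 standard deviations away from the
# mean are rare, while one-hot encoded columns only ever contain 0.0 and 1.0
```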

For logistic regression, having features approximately on the same scale is important for several reasons:

  • having some features with a range 10,000 times larger than another feature’s range can cause numerical stability or convergence problems (ill-conditioning), depending on the solver used
  • logistic regression in scikit-learn is regularized by default (this will be explained in the module on linear models). By default all the coefficients are regularized the same way. If features have widely differing scales, the regularization will typically only impact the features with a small scale (because their optimal coefficients would have to be larger than those of large-scale features to achieve a similar effect); see the sketch after this list.
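
Here is a small sketch of the second point with synthetic data (the feature values and names are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1_000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
y = (signal > 0).astype(int)

# Two similarly informative features, but on scales ~1 and ~0.0001
X = np.column_stack([
    signal + 0.5 * noise,
    (signal - 0.5 * noise) * 1e-4,
])

# Default LogisticRegression is L2-regularized. On the raw data the second
# feature would need a coefficient ~10000x larger to contribute as much,
# and the penalty prevents that, so it is effectively ignored.
# (max_iter is raised because the badly scaled problem also converges
# slowly -- the ill-conditioning from the first bullet point.)
raw = LogisticRegression(max_iter=10_000).fit(X, y)
print(raw.coef_ * X.std(axis=0))   # contribution per feature, tiny for the 2nd

# After standardization the penalty treats both features evenly.
scaled = LogisticRegression().fit(StandardScaler().fit_transform(X), y)
print(scaled.coef_)                # both coefficients of comparable size
```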
3 Likes

Ok, so approximation is good enough. Thank you for your answer!