Why drop education-num?

Since we’re trying our hand at modeling with numeric variables only, wouldn’t it be useful to include “education-num” in the set of independent variables? Or is there a good reason to drop it that I’m not aware of? :thinking:

The commented line inside the code cell refers to the notebook “First look at our dataset” where we mention that the “education-num” column carries the same information as the column "education".

For example, education-num=2 is equivalent to education='1st-4th'. In practice this means we can remove education-num without losing information. Note that having redundant (or highly correlated) columns can be a problem for some machine learning algorithms, for instance by making the coefficients of linear models unstable.
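For instance, here is a minimal sketch (assuming the adult census CSV used in the MOOC; adjust the file path to your setup) that checks the one-to-one mapping before dropping the column:

```python
import pandas as pd

# Path assumed; in the MOOC notebooks it is "../datasets/adult-census.csv".
adult_census = pd.read_csv("adult-census.csv")

# Each "education" category maps to exactly one "education-num" value,
# so the cross-tabulation has a single non-zero cell per row.
print(pd.crosstab(adult_census["education"], adult_census["education-num"]))

# Since the two columns are redundant, we can safely drop one of them.
adult_census = adult_census.drop(columns="education-num")
```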


You can also check some relevant discussion here: Remove "education" or "education-num"? - #2 by ogrisel


I read, and understood, the reasons why, when two variables are one and the same (one categorically coded, the other numerically coded), we must drop one of them (no extra predictive power, collinearity…). So far, so good.
But it’s in the context of this particular exercise that I don’t get it; I’ll try to make myself clear. Since we are trying to fit a classifier to numerical variables, we drop the categorical variables, education='1st-4th' included.
But if we are dropping that variable, why don’t we keep education-num in the set of numerical variables used to predict “class”? We’re leaving out information that could be useful for our purpose.
That decision is the one I don’t get. Thanks beforehand.

There is a subtle point here. A feature that contains only numbers is not necessarily a numerical feature. Here, we consider education to be made of categories, so education-num is a categorical feature as well and needs to be treated as a categorical variable. We define this more precisely and show the kind of preprocessing these types of data require in the next section of the MOOC.
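To make that concrete, here is a minimal sketch (with toy data; one-hot encoding is just one of the options covered later in the MOOC) of treating the integer codes as categories rather than as a plain number:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: "education-num" stores categories as integers.
df = pd.DataFrame({"education-num": [2, 9, 13, 2, 9]})

# Treat the integer codes as categories: one indicator column per
# category, instead of a single numeric column whose spacing between
# values carries no real meaning.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["education-num"]]).toarray()
print(encoder.categories_)  # [array([ 2,  9, 13])]
print(encoded)
```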


I’ll be patient, then. Thanks!

I kept education-num and of course got rid of education. At first I got a convergence warning when executing the fit() function:

```
/opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
```

I kept increasing the max_iter argument:

```python
model = LogisticRegression(max_iter=1000)
```

Eventually it converged, but I gained only a small amount of accuracy, reaching about 81.8% regardless of the order of magnitude of max_iter.
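Note that the warning also suggests scaling the data instead of raising max_iter. A minimal sketch of that alternative (the names data_train and target_train are assumed here, not taken from this thread):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Scaling the numerical features usually lets the lbfgs solver converge
# in far fewer iterations than simply raising max_iter.
model = make_pipeline(StandardScaler(), LogisticRegression())
# model.fit(data_train, target_train)
```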

We will see a better way to use the information about education when training models that combine both numerical and categorical features in the subsequent notebooks.
