Remove "education" or "education-num"?

stevestreet · 16 February 2022 06:24

from the notebook:

In practice that means we can remove "education-num" without losing information.

Wouldn’t it be better to get rid of “education” instead?
It is easier to encode the rank 14 > 9 than “Masters” > “HS-grad”, and the order has real meaning (i.e. more years of education).
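
To make the point concrete, here is a minimal sketch of how one might check that the two columns carry the same information, and how the numeric column gives the ordering for free. The rows are a small made-up sample in the style of the dataset, not the full data, and `is_one_to_one` is a hypothetical helper written for this illustration.

```python
# Toy rows in the style of the dataset: (education, education-num).
# This is an illustrative sample, not the full data.
rows = [
    ("HS-grad", 9),
    ("Masters", 14),
    ("Bachelors", 13),
    ("HS-grad", 9),
    ("Masters", 14),
]


def is_one_to_one(pairs):
    """Return True if every label maps to a single number and vice versa."""
    label_to_num = {}
    num_to_label = {}
    for label, num in pairs:
        # setdefault stores the first value seen and returns it afterwards,
        # so any later mismatch means the mapping is not one-to-one.
        if label_to_num.setdefault(label, num) != num:
            return False
        if num_to_label.setdefault(num, label) != label:
            return False
    return True


one_to_one = is_one_to_one(rows)

# Sorting the distinct pairs by the numeric code recovers the label order
# without hand-writing rules such as "Masters" > "HS-grad".
ranked_labels = [label for label, _ in sorted(set(rows), key=lambda pair: pair[1])]

print(one_to_one)
print(ranked_labels)
```

If the check passes on the real columns, dropping either one loses no information; the only question is which encoding is more convenient for the model.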

ogrisel · 16 February 2022 08:22

That’s also a possibility. Here we wanted to keep it simple by using a mapping where all string-valued features are treated as nominal categorical variables (no a priori ordering or quantitative interpretation of the values) and all numerically encoded features have a natural quantitative interpretation (e.g. adding values can be meaningful).

This choice can be questioned, and it’s perfectly fine to try either strategy, or even to not remove any feature at all, and use the cross-validation score to decide which choice leads to the best predictive model.
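
As a rough sketch of that comparison, assuming scikit-learn and pandas are available: the data below is synthetic and only mimics the two columns discussed here (the `age` column, the target, and the `make_model` helper are made up for illustration), so the printed scores say nothing about the real dataset. Non-numeric columns are one-hot encoded as nominal categories and numeric columns are scaled and used as quantities, following the mapping described above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical, NOT the real census dataset).
rng = np.random.default_rng(0)
levels = {"HS-grad": 9, "Bachelors": 13, "Masters": 14}
education = rng.choice(list(levels), size=300)
data = pd.DataFrame(
    {
        "education": education,
        "education-num": [levels[e] for e in education],
        "age": rng.integers(18, 70, size=300),
    }
)
# Made-up binary target, loosely tied to the education level plus noise.
target = (data["education-num"] + rng.normal(0, 2, size=300) > 11).astype(int)


def make_model():
    """Non-numeric columns -> one-hot (nominal); numeric columns -> scaled."""
    preprocessor = make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_exclude="number")),
        (StandardScaler(), make_column_selector(dtype_include="number")),
    )
    return make_pipeline(preprocessor, LogisticRegression(max_iter=1000))


# The three strategies from the discussion, expressed as columns to drop.
strategies = {
    "drop education-num": ["education-num"],
    "drop education": ["education"],
    "keep both": [],
}

scores = {}
for name, dropped in strategies.items():
    features = data.drop(columns=dropped)
    scores[name] = cross_val_score(make_model(), features, target, cv=5).mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")
```

On the real data, the same loop can be run with the full feature set and the model of your choice; small differences between the strategies should be read against the spread of the cross-validation scores, not just their means.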