Categorical feature with integer values

malberti · 23 February 2022 22:08

Suppose we have a categorical feature that takes integer values with no meaningful order.

Suppose also that we are interested in linear models. Even though such a feature takes numerical values, should it be one-hot encoded?

If so, I guess it is quite a nightmare to recognize such a feature among others in a new, hence unknown, data set.

ogrisel · 24 February 2022 09:30

Yes, for linear models, such integer-coded yet fundamentally nominal features should be one-hot encoded.
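To illustrate why, here is a minimal sketch assuming scikit-learn and NumPy are available. The data is synthetic and the feature name `zone_code` is hypothetical: the target mean per code is deliberately non-monotonic in the code, so a single slope over the raw integers cannot capture it, while one coefficient per category can.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "zone_code" is nominal (codes 0, 1, 2 carry no order).
rng = np.random.default_rng(0)
zone_code = rng.integers(0, 3, size=300)
zone_means = np.array([10.0, -5.0, 3.0])  # made-up per-category target means
y = zone_means[zone_code] + rng.normal(scale=0.1, size=300)
X = zone_code.reshape(-1, 1)

# Linear model on the raw integer codes: one slope forced through 0, 1, 2.
raw_model = LinearRegression().fit(X, y)

# Same linear model after one-hot encoding: one coefficient per category.
onehot_model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LinearRegression(),
).fit(X, y)

raw_r2 = raw_model.score(X, y)
onehot_r2 = onehot_model.score(X, y)
print(f"R^2 with raw integer codes: {raw_r2:.3f}")
print(f"R^2 with one-hot encoding:  {onehot_r2:.3f}")
```

On this toy data the one-hot version should fit almost perfectly while the raw-code version fits poorly. With real data the gap depends on how far the per-category effects are from a straight line in the codes.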

For tree-based models, it matters much less: trees split on thresholds instead of fitting a single slope, so they have a different inductive bias.
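As a rough sketch of the tree case, again assuming scikit-learn and NumPy and using made-up data with a hypothetical `zone_code` feature: a shallow decision tree fed the raw integer codes can isolate each category with two threshold splits, without any one-hot encoding.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: nominal codes 0, 1, 2 with non-monotonic target means.
rng = np.random.default_rng(0)
zone_code = rng.integers(0, 3, size=300)
zone_means = np.array([10.0, -5.0, 3.0])  # made-up per-category target means
y = zone_means[zone_code] + rng.normal(scale=0.1, size=300)
X = zone_code.reshape(-1, 1)

# Two levels of splits are enough to separate three integer codes.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

tree_r2 = tree.score(X, y)
print(f"R^2 of a depth-2 tree on raw integer codes: {tree_r2:.3f}")
```

With many categories the tree needs more splits to separate them, so the encoding can still affect how deep the tree has to grow.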

If so, I guess it is quite a nightmare to recognize such a feature among others in a new, hence unknown, data set.

In a real-life setting, the data scientist is not handed a fixed, finalized dataset as a CSV file with just a bunch of numbers in a table. The features have to be constructed by querying the logs or database records of some operational business, administrative, or scientific process, and the data scientist can inspect the meaning of each of them by talking to the operators of the data-generating process. It is actually the data scientist's role to build an understanding of how this data is generated and collected, to make sure that it is relevant for building an automated decision system that aims to optimize a given quantitative objective, for instance in business, public policy, or science.