Why do we encode categorical variables?

AnotherSailor · 18 February 2022 18:07

I may have missed the argument somewhere.
What is the advantage of encoding categorical variables as integers? I understand it when they are ordered, since that allows distance calculations.
But when they are unordered, why does it optimise the processing? I would have assumed that the models would be able to turn these strings into representations they can process themselves. I think R works that way.

glemaitre58 · 18 February 2022 20:43

There is no model in scikit-learn that does this magic. R has a specific type of variable (i.e. “Factor”) that does not exist in Python. Pandas fairly recently introduced a “categorical” data type that could one day be used similarly.

Therefore, we can only treat strings as plain Python strings, and a numerical algorithm does not know what to do with them. We thus need an extra step that encodes the strings as numerical values. In addition, the choice of encoding depends on the type of underlying predictive model and the assumptions it makes.
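
To make the encoding step concrete, here is a minimal sketch, not a recommendation for any particular model. The toy `colors` column is made up for illustration. It compares scikit-learn's `OrdinalEncoder`, which gives one integer code per category and therefore implies an order, with `OneHotEncoder`, which gives one binary column per category and implies no order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: a single unordered categorical feature.
colors = np.array([["red"], ["green"], ["blue"], ["green"]], dtype=object)

# Ordinal encoding: each category becomes one number.
# Categories are sorted alphabetically by default: blue=0, green=1, red=2.
ordinal = OrdinalEncoder().fit_transform(colors)

# One-hot encoding: each category becomes its own 0/1 column.
# The default output is a sparse matrix, so convert it for display.
onehot = OneHotEncoder().fit_transform(colors).toarray()

print(ordinal.ravel())  # one code per row
print(onehot)           # one column per category (blue, green, red)
```

As a rough rule of thumb, a model that treats feature values as magnitudes (a linear model, for instance) would read the integer codes as "blue < green < red", so one-hot encoding is usually the safer choice for unordered categories there, while tree-based models are often fine with integer codes.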