Why does OneHotEncoder default to float64?

Given that the values are going to be encoded to either a 0 or a 1, and nothing else, wouldn’t it be better to use int8?

To frame it in another way, what advantages does encoding to float64 offer?


This is because most downstream machine learning models will perform floating-point arithmetic operations on this data. For instance, LogisticRegression and Ridge both minimize a continuous function derived from the input feature values and the expected labels. Converting to uint8 only to reconvert to float64 right afterwards would be a waste of memory and CPU.
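Note that the default is only a default: OneHotEncoder exposes a `dtype` parameter, so if the encoded matrix is not being fed into a float-based estimator you can request a smaller integer dtype yourself. A minimal sketch (the example data is made up, the `dtype` parameter and NumPy dtypes are standard scikit-learn/NumPy API):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [["red"], ["green"], ["blue"], ["green"]]

# Default behaviour: float64 output, ready for float-based estimators.
enc_f64 = OneHotEncoder()
X_f64 = enc_f64.fit_transform(X)
print(X_f64.dtype)  # float64

# If the encoded matrix is only stored or inspected, a smaller integer
# dtype can be requested explicitly.
enc_i8 = OneHotEncoder(dtype=np.int8)
X_i8 = enc_i8.fit_transform(X)
print(X_i8.dtype)  # int8
```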

Note that more and more estimators in scikit-learn support float32 inputs in addition to float64, so in the future we might decide to make OneHotEncoder default to float32 instead.
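In the meantime, a float32 output can already be requested explicitly, assuming the downstream estimator accepts it (a small sketch using the same `dtype` parameter as above):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [["red"], ["green"], ["blue"]]

# Explicit float32 output; estimators that accept float32 can then work on
# this matrix without an intermediate float64 copy.
enc = OneHotEncoder(dtype=np.float32)
X_enc = enc.fit_transform(X)
print(X_enc.dtype)  # float32
```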


Thank you for the reply.