Given that the values are going to be encoded to either a 0 or a 1, and nothing else, wouldn’t it be better to use int8?
To put it another way, what advantages does encoding to float64 offer?
This is because most downstream machine learning models perform floating-point arithmetic on this data. For instance, LogisticRegression and Ridge both minimize a continuous objective function derived from the input feature values and the expected labels. Encoding to uint8 only to convert back to float64 immediately afterwards would waste both memory and CPU time.
Note that more and more estimators in scikit-learn support float32 inputs in addition to float64, so in the future we might decide to make OneHotEncoder default to float32 instead.
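For illustration, here is a minimal sketch (with made-up example data) showing that the encoded output is float64 by default, and that this can be overridden through OneHotEncoder's `dtype` parameter if memory is the main concern:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [["red"], ["green"], ["blue"], ["green"]]

# Default behaviour: the encoded matrix is float64, so downstream
# estimators such as LogisticRegression or Ridge can consume it directly.
enc = OneHotEncoder()
print(enc.fit_transform(X).dtype)  # float64

# The output dtype can be overridden, but estimators that work in
# float64 will convert the data back internally anyway.
enc_uint8 = OneHotEncoder(dtype=np.uint8)
print(enc_uint8.fit_transform(X).dtype)  # uint8

# float32 is also accepted, and some estimators can use it natively.
enc_f32 = OneHotEncoder(dtype=np.float32)
print(enc_f32.fit_transform(X).dtype)  # float32
```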
Thank you for the reply.