Object to categorical data type

Is there any advantage to changing the dtype of objects to categorical data type?

pandas provides additional features for this specific data type: see "Categorical data" in the pandas 1.4.1 documentation.

In the future, we can imagine using this specific data type to automatically distinguish categorical features from the others, in a transformer similar to the SuperVectorizer in dirty-cat (see dirty_cat.SuperVectorizer in the dirty-cat documentation).

Hello @glemaitre58, I thought the object dtype fell under the group of categorical features, which we have to encode into numerical features for machine learning modeling. Are the two different (dtype object and dtype category)? From the link you shared, it seems that there is a dtype called category. Isn't that the same thing as categorical features?

In NumPy, the object dtype refers to an array that, instead of storing the values themselves, stores a pointer to each value. In a way, this is very similar to a Python list. So, in practice, one can store numerical values in an object dtype array. For instance:

In [1]: import numpy as np

In [2]: np.array([1, 2, 3.0], dtype=object)
Out[2]: array([1, 2, 3.0], dtype=object)

However, this is not efficient. Memory-wise, one needs to store both the pointers and the values. Performance-wise, every computation goes through a so-called indirection: you first need to read the pointer that tells you where the value is located in memory, and that slows down any computation.
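To make the difference concrete, here is a minimal sketch comparing the object array above with a native float64 array built from the same values. The variable names are only illustrative:

```python
import numpy as np

values = [1, 2, 3.0]

# Object array: each slot holds a pointer to a Python object,
# so every element keeps its original Python type.
obj_arr = np.array(values, dtype=object)
element_types = [type(v).__name__ for v in obj_arr]

# Native array: the values are converted to float64 and stored
# directly in one contiguous buffer, with no pointer to follow.
num_arr = np.array(values, dtype=np.float64)

print(element_types)
print(obj_arr.dtype, num_arr.dtype)

# Both arrays support arithmetic, but the object array has to call
# back into Python for each element, which is the costly indirection.
print(obj_arr.sum(), num_arr.sum())
```

Timing the two `sum` calls on a large array (for example with `%timeit` in IPython) should show the cost of the indirection, although the exact ratio depends on the machine and the NumPy version.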

Now, about categories and the object dtype. Sometimes (even frequently), categories are encoded as strings. In general, you don't know the length of the strings in advance, and thus you cannot allocate a contiguous NumPy array of fixed-size elements to store the values. In this case, you use an array of object dtype that stores a pointer to each string.
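As a small illustration of why the string length matters, the sketch below compares an object array with NumPy's fixed-width unicode dtype, which needs the maximum length up front. The example values are made up:

```python
import numpy as np

colors = ["red", "green", "blue", "green"]

# Object array: stores pointers, so strings of any length fit.
obj_arr = np.array(colors, dtype=object)

# Fixed-width unicode array: NumPy sizes every element to the
# longest string seen at creation time (5 characters here).
fixed_arr = np.array(colors)
print(fixed_arr.dtype.kind, fixed_arr.itemsize)

# Assigning a longer string later shows the limitation:
# the fixed-width array silently truncates it, the object array does not.
obj_arr[0] = "turquoise"
fixed_arr[0] = "turquoise"
print(obj_arr[0], fixed_arr[0])
```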

Object dtype arrays have been available for ages. pandas created a specific data type, category, that is designed for categorical variables and provides additional features (if you are familiar with the R language, I assume that they tried to mimic the factor variable type). However, as of today, this dtype is not understood by NumPy, nor does it trigger any automatic processing in scikit-learn.
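Here is a short sketch of what the category dtype adds on top of an object column. The series content and the category order are assumptions chosen for the example:

```python
import numpy as np
import pandas as pd

s_obj = pd.Series(["low", "high", "medium", "low"], dtype=object)

# Converting to category stores each distinct value once,
# plus one small integer code per row.
s_cat = s_obj.astype("category")
categories = list(s_cat.cat.categories)  # inferred categories are sorted
codes = s_cat.cat.codes.tolist()
print(categories, codes)

# An ordered categorical knows that low < medium < high,
# which a plain object column of strings cannot express.
level = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
s_ord = s_obj.astype(level)
print(s_ord.min(), s_ord.max())
above_low = (s_ord > "low").tolist()

# NumPy has no category dtype: converting back gives an object array.
as_numpy = np.asarray(s_cat)
print(as_numpy.dtype)
```

The last line is the point made above: once the data leaves pandas, the category information is lost and you are back to an object array.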

Hope this sheds some light.
