Is this comment correct? Categorical features in adult census dataset?
The adult census contains some categorical data and we encode the categorical features using an
OrdinalEncoder
since tree-based models can work very efficiently with such a naive representation of categorical variables.
Since there are rare categories in this dataset we need to specifically encode unknown categories at prediction time in order to be able to use cross-validation. Otherwise some rare categories could only be present on the validation side of the cross-validation split and the
OrdinalEncoder
would raise an error when calling the itstransform
method with the data points of the validation set.