Encoding Categorical Variables and Data Cleaning

Personally, to encode categorical variables I use LabelEncoder, or I map the categorical features myself; I hope this approach is good for encoding.
Also, in data["native-country"].value_counts() I saw this ? among the values for native-country. My questions are: can this value be replaced with something else? What can we use to replace it? And won't it affect the model during training?

Thanks

1 Like

You should not use LabelEncoder: it is reserved for transforming the target y. In the past, due to limitations in OneHotEncoder, people advised using LabelEncoder to encode the data X. Nowadays this is no longer the case, and one should use OneHotEncoder, which works on X and can be used inside ColumnTransformer and Pipeline.
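A minimal sketch of that recommended pattern, on a small hypothetical frame with the same column names as adult_census ("workclass", "age"):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one categorical and one numerical column
X = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "age": [25, 38, 28, 44],
})
y = [0, 1, 0, 1]

# One-hot encode the categorical column, pass the numerical one through
preprocessor = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"])],
    remainder="passthrough",
)
model = make_pipeline(preprocessor, LogisticRegression())
model.fit(X, y)
print(model.predict(X))
```

Wrapping the encoder in a ColumnTransformer inside a Pipeline ensures the encoding is fitted only on the training data during cross-validation.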

This ? is a marker for missing values. Here, we consider ? as a separate category. Sometimes np.nan, pd.NA, etc. are used as markers instead. With this strategy, one might need to impute missing data before providing them to a predictive model if that model does not handle missing values.
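For instance, the ? marker can be converted into a genuine missing value with pandas (a sketch on a hypothetical column mimicking "native-country"):

```python
import numpy as np
import pandas as pd

# Hypothetical column with "?" used as a missing-value marker
s = pd.Series(["United-States", "?", "Mexico", "?"], name="native-country")

# Replace the marker by NaN so pandas and scikit-learn treat it as missing
s_clean = s.replace("?", np.nan)
print(s_clean.isna().sum())  # -> 2
```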

2 Likes

@glemaitre58 Thanks for the explanation, but I still have some questions. You said LabelEncoder is reserved for transforming the target y. How? y is the target feature, which is unknown so to speak. How then do I transform it? In the adult_census dataset we are working on, we have "class" as the target feature. How do I encode it?
With regards to ? as a missing value and how to handle it, are we going to see that play out in this course?

LabelEncoder is a transformer to encode labels. Its usage is:

In [1]: from sklearn.preprocessing import LabelEncoder

In [2]: LabelEncoder().fit_transform(["cat", "dog", "dog", "cat", "cat"])
Out[2]: array([0, 1, 1, 0, 0])

It means that you transform non-numerical labels into numerical labels in a classification setting.
Usually, you never have to use this transformer yourself, because all classifiers in scikit-learn use it internally to encode and decode the labels, and they return non-numerical predictions if that was your original input.
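You can check this internal behaviour directly: fitting a classifier on string labels works without any manual encoding (a sketch with hypothetical toy data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one numerical feature, non-numerical target
X = np.array([[0.0], [0.1], [1.0], [1.1]])
y = ["cat", "cat", "dog", "dog"]

clf = LogisticRegression().fit(X, y)
print(clf.classes_)              # the labels found and encoded internally
print(clf.predict([[0.05], [1.05]]))  # predictions come back as strings
```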

In the first version of the MOOC, we had examples. I don’t remember if we removed them to simplify the learning process. In a very condensed way, there are the following strategies:

  • use an imputer (cf. API Reference — scikit-learn 1.0.2 documentation) to replace missing values with another value. For categories, it only makes sense to consider a missing value as a category of its own, which is later encoded via the OneHotEncoder or OrdinalEncoder. For numerical features, there are multiple options, and in practice it is difficult to do better than simple imputation.
  • use a predictor that handles missing values (e.g. HistGradientBoosting).
3 Likes