LabelEncoder is a transformer to encode labels. Its usage is:
In [1]: from sklearn.preprocessing import LabelEncoder
In [2]: LabelEncoder().fit_transform(["cat", "dog", "dog", "cat", "cat"])
Out[2]: array([0, 1, 1, 0, 0])
It means that you transform non-numerical labels into numerical labels in a classification setting.
Usually, you do not have to use this transformer yourself, because scikit-learn classifiers encode and decode the labels internally. If you fit a classifier on non-numerical targets, its predictions come back as the same non-numerical labels.
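To make this concrete, here is a minimal sketch, assuming a made-up toy dataset and LogisticRegression as an arbitrary example classifier (neither comes from the discussion above). It only illustrates that string targets can be passed directly and that LabelEncoder is needed only when you want the integer codes yourself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Toy data, invented for illustration: one feature, string targets.
X = np.array([[0.0], [0.2], [0.9], [1.1], [0.1], [1.0]])
y = np.array(["cat", "cat", "dog", "dog", "cat", "dog"])

# No manual encoding: the classifier accepts the string labels as-is.
clf = LogisticRegression().fit(X, y)

# The original labels are stored (sorted) in `classes_`, and `predict`
# returns labels taken from that array rather than integer codes.
print(clf.classes_)
predictions = clf.predict(np.array([[0.05], [1.05]]))
print(predictions)

# LabelEncoder is still handy if you need the integer codes explicitly,
# and `inverse_transform` maps them back to the original labels.
encoder = LabelEncoder().fit(y)
codes = encoder.transform(y)
decoded = encoder.inverse_transform(codes)
```
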
In the first version of the MOOC, we had examples. I don’t remember if we removed them to simplify the learning process. In a very condensed way, there are the following strategies to handle missing values:
- use an imputer (cf. API Reference — scikit-learn 1.0.2 documentation) to replace missing values with another value. For categorical features, it only makes sense to treat a missing value as a category of its own, which will later be encoded via the OneHotEncoder or OrdinalEncoder. For numerical features, there are multiple options, and in practice it is difficult to do better than a simple imputation.
- use a predictor that handles missing values (e.g. HistGradientBoosting).