It does. If you have rare categories, you will get into trouble with pd.get_dummies. Let's look at an example:
import numpy as np
import pandas as pd
X_train = np.array(["A", "B", "C"])
X_test = np.array(["D", "B", "C"])
pd.get_dummies(X_train)
A B C
0 1 0 0
1 0 1 0
2 0 0 1
pd.get_dummies(X_test)
B C D
0 0 0 1
1 1 0 0
2 0 1 0
Using get_dummies, we can see that the encodings of the training and testing data are not the same because we have the column names. However, scikit-learn will internally convert these dataframes into NumPy arrays (machine learning boils down to numerical processing).
Here, the two arrays have exactly the same shape, so any scikit-learn classifier would accept them. However, the encoding is wrong.
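To make the problem concrete, here is a minimal sketch showing that the two dummy-encoded arrays have identical shapes while their columns mean different things:

```python
import numpy as np
import pandas as pd

X_train = np.array(["A", "B", "C"])
X_test = np.array(["D", "B", "C"])

train_encoded = pd.get_dummies(X_train)
test_encoded = pd.get_dummies(X_test)

# Both arrays have shape (3, 3), so a classifier would accept them...
print(train_encoded.to_numpy().shape)  # (3, 3)
print(test_encoded.to_numpy().shape)   # (3, 3)

# ...but the first column means "A" at train time and "B" at test time.
print(list(train_encoded.columns))  # ['A', 'B', 'C']
print(list(test_encoded.columns))   # ['B', 'C', 'D']
```

Once the column names are gone, nothing downstream can notice the mismatch.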
Using a OneHotEncoder within a scikit-learn Pipeline lets you detect such inconsistencies: by default it raises an error when a new category appears during testing (you can ignore unknown categories with the parameter handle_unknown="ignore"), and it guarantees that encoding data with fewer categories produces the same array shape.
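For instance, with the default handle_unknown="error", the encoder refuses to transform data containing an unseen category:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array(["A", "B", "C"]).reshape(-1, 1)
X_test = np.array(["D", "B", "C"]).reshape(-1, 1)

# Default handle_unknown="error": an unseen category at transform
# time raises a ValueError instead of silently shifting columns.
encoder = OneHotEncoder().fit(X_train)
try:
    encoder.transform(X_test)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

This turns a silent encoding bug into an explicit failure.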
Currently, scikit-learn lets you build a DataFrame with column names after transform using get_feature_names:
from sklearn.preprocessing import OneHotEncoder
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
encoder = OneHotEncoder(handle_unknown="ignore", sparse=False).fit(X_train)
X_train_encoded = pd.DataFrame(
encoder.transform(X_train),
columns=encoder.get_feature_names(input_features=["Col_1"])
)
X_train_encoded
Col_1_A Col_1_B Col_1_C
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
X_test_encoded = pd.DataFrame(
encoder.transform(X_test),
columns=encoder.get_feature_names(input_features=["Col_1"])
)
X_test_encoded
Col_1_A Col_1_B Col_1_C
0 0.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
We are currently working on making it easier to propagate column names through a Pipeline in scikit-learn to reduce the boilerplate.
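Putting the pieces together, here is a minimal sketch of the encoder used inside a Pipeline with a classifier (the toy labels are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X_train = np.array(["A", "B", "C", "A"]).reshape(-1, 1)
y_train = np.array([0, 1, 1, 0])  # hypothetical labels
X_test = np.array(["D", "B", "C"]).reshape(-1, 1)

# The encoder is fitted only on the training data inside the pipeline;
# with handle_unknown="ignore", the unseen category "D" is encoded as
# an all-zero row instead of shifting the remaining columns.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(),
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions.shape)  # (3,)
```

Because both encoding and fitting live in one estimator, the train/test preprocessing can never drift apart.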
Handling data outside of a scikit-learn Pipeline makes it easy to fall into some common pitfalls: 10. Common pitfalls and recommended practices — scikit-learn 1.0.dev0 documentation (data leakage, inconsistency in preprocessing).