OneHotEncoder vs pd.get_dummies

Hello,

I have a concern about using OneHotEncoder. I understand that when passing a DataFrame, the result of this transformation becomes a np.array. I then find it difficult to be sure about the transformation made, since the columns are no longer identified.

For that reason I tend to prefer using pandas.get_dummies, because we still get a DataFrame with a clear column encoding (feature name + category).

Are there any drawbacks to using pd.get_dummies instead of OneHotEncoder?

In addition, I understand that the same transformation (one-hot encoding) is to be fitted on the train set, but also applied on the test set or fresh data. For this I also prefer the pandas solution, for better control (even if it does not solve the issue of potential missing categories).

Thanks for your feedback & good practice sharing… and for correcting me if I misunderstood one point (still a beginner…). Bye & Enjoy!

There are. If you have rare categories, you will get into trouble with pd.get_dummies. Let's look at an example:

import numpy as np
import pandas as pd

X_train = np.array(["A", "B", "C"])
X_test = np.array(["D", "B", "C"])
pd.get_dummies(X_train)

   A  B  C
0  1  0  0
1  0  1  0
2  0  0  1
pd.get_dummies(X_test)

   B  C  D
0  0  0  1
1  1  0  0
2  0  1  0

Using get_dummies, we can see that the encodings of the training and testing data are not the same, because we have the column names. However, scikit-learn will internally convert these dataframes into NumPy arrays (machine learning boils down to numerical processing).

Here, the two arrays have exactly the same dimensions, so any classifier in scikit-learn would run without complaint. However, the encoding is wrong.
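To make this concrete, here is a minimal sketch (the labels y_train are made up, purely for illustration) reusing the arrays above:

from sklearn.linear_model import LogisticRegression

y_train = np.array([0, 1, 1])  # made-up labels, only for this sketch

# Both dummy matrices have shape (3, 3), so no error is raised...
clf = LogisticRegression().fit(pd.get_dummies(X_train), y_train)
# ...even though column 0 means "A" at train time but "B" at test time.
clf.predict(pd.get_dummies(X_test))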

Using a OneHotEncoder within a scikit-learn Pipeline allows you to detect such inconsistencies: it can raise an error when a new category shows up at test time (or ignore it, via the handle_unknown parameter), and it guarantees that encoding data with fewer categories still results in the same array shape.
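For instance, a quick sketch of the default behaviour (handle_unknown="error"), still reusing the arrays above:

from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2D input, hence the reshape
encoder = OneHotEncoder().fit(X_train.reshape(-1, 1))
encoder.transform(X_test.reshape(-1, 1))
# raises ValueError: Found unknown categories ['D'] in column 0 during transform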

Currently, scikit-learn allows you to rebuild a DataFrame with column names after transform, using get_feature_names:

from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2D inputs, hence the reshape
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)

# handle_unknown="ignore" encodes unseen categories as all zeros
encoder = OneHotEncoder(handle_unknown="ignore", sparse=False).fit(X_train)
X_train_encoded = pd.DataFrame(
    encoder.transform(X_train),
    columns=encoder.get_feature_names(input_features=["Col_1"])
)
X_train_encoded

   Col_1_A  Col_1_B  Col_1_C
0      1.0      0.0      0.0
1      0.0      1.0      0.0
2      0.0      0.0      1.0
X_test_encoded = pd.DataFrame(
    encoder.transform(X_test),
    columns=encoder.get_feature_names(input_features=["Col_1"])
)
X_test_encoded

   Col_1_A  Col_1_B  Col_1_C
0      0.0      0.0      0.0
1      0.0      1.0      0.0
2      0.0      0.0      1.0

Note that the unknown category "D" is encoded as all zeros in the first row, instead of silently shifting the columns.

We are currently working on making it easier to get column names in a Pipeline in scikit-learn, to reduce this boilerplate.

Handling data outside of a scikit-learn Pipeline is prone to common pitfalls such as data leakage and inconsistent preprocessing: see 10. Common pitfalls and recommended practices — scikit-learn 1.0.dev0 documentation.
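As a minimal sketch (with the same made-up y_train as above), keeping the encoder and the model in a single Pipeline ensures that the transformation fitted on the train set is the one applied to the test set and any fresh data:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(),
)
model.fit(X_train, y_train)  # the encoder is fitted on the train set only
model.predict(X_test)        # the same fitted encoding is reused here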


Dear Guillaume, thanks for this very clear reply. I have already experienced the drawback you mention, and you have convinced me to use scikit-learn's OneHotEncoder. I have to admit that I may be quite old-fashioned and not so happy to put everything in a "black box" Pipeline… but looking more into the details, I see that some OneHotEncoder outputs could help to check data consistency.

So thanks again for the details provided.

In addition: with some solvers I had sparse matrix issues after one-hot encoding (solved with xa = xa.todense()) on the train, test and application sets. I hope this is a good way to handle it…

Finally, I have some issues with the RandomUnderSampler solutions (provided by you, I believe…), but we are just in the first week, so I will probably understand later in this course.

To conclude: I would like to thank you first for this MOOC, but also more generally for the tools you and your team are providing to the community. I do not work in the IT industry, so I honestly had no specific knowledge of INRIA's activities before my previous Python MOOC. Since then I am a big fan… and I still believe your activities & results are not known well enough… Bye

There is a parameter sparse=True/False in OneHotEncoder to force the output type.
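For example, a quick sketch reusing the reshaped X_train from above:

from sklearn.preprocessing import OneHotEncoder

OneHotEncoder().fit_transform(X_train)              # SciPy sparse matrix (the default)
OneHotEncoder(sparse=False).fit_transform(X_train)  # dense NumPy array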

I agree that Pipelines look magical at first, but they will be your best friends. Once you know how to create and train them, you will become efficient at inspecting them (6.1. Pipelines and composite estimators — scikit-learn 0.24.2 documentation).
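For instance, reusing the model pipeline sketched earlier (make_pipeline names each step after its lowercased class name), the fitted steps can be inspected directly:

model.named_steps["onehotencoder"].categories_  # categories learned from the train set
model["logisticregression"].coef_               # coefficients of the final estimator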

We still need to make Pipeline more friendly for some cases, but it already provides some great features that help avoid basic errors 🙂