[Offtop] Categorical collinear features

PvPDantes · 11 April 2022 00:50

Hello!
As I understand from Exercise M4.04, it is better to use OneHotEncoder with drop='first' to reduce the collinear features produced by OneHotEncoder itself. Judging from the task description, it would also be better to use it in the current Wrap-up quiz. Am I right?

ogrisel · 12 April 2022 09:42

It’s ok to drop one column for binary categorical features. However, for non-binary categorical features, dropping one column introduces an asymmetry between the categories: the dropped category becomes the implicit reference level, which matters especially for penalized linear models, and this inductive bias might not be wanted. So I am not sure we should make any general statement on whether it’s always a good idea to drop one of the one-hot encoded columns.
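To make the difference concrete, here is a minimal sketch comparing drop="if_binary" with drop="first". The toy data is made up for illustration and is not taken from the exercise or the quiz.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one binary column and one three-category column.
X = np.array(
    [
        ["yes", "red"],
        ["no", "green"],
        ["yes", "blue"],
        ["no", "red"],
    ],
    dtype=object,
)

# drop="if_binary" drops a column only for features with exactly two categories.
enc_if_binary = OneHotEncoder(drop="if_binary")
X_if_binary = enc_if_binary.fit_transform(X).toarray()

# drop="first" drops the first category (in sorted order) of every feature.
enc_first = OneHotEncoder(drop="first")
X_first = enc_first.fit_transform(X).toarray()

print(X_if_binary.shape)  # (4, 4): 1 column for yes/no + 3 for the colour
print(X_first.shape)      # (4, 3): 1 column for yes/no + 2 for the colour
```

With drop="first", "blue" disappears from the encoded colour feature only because it comes first alphabetically, which is the asymmetry mentioned above.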

Maybe it’s ok to have collinear but more symmetric features, as long as you use some regularization.
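As a rough sketch of that idea, the snippet below keeps all one-hot columns and fits a Ridge model on synthetic data. The data, the category effects and alpha=1.0 are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)

# Synthetic data (hypothetical): one categorical feature with three levels.
colours = rng.choice(["red", "green", "blue"], size=60).reshape(-1, 1)
effect = {"red": 1.0, "green": 2.0, "blue": 3.0}
y = np.array([effect[c] for c in colours.ravel()]) + 0.1 * rng.randn(60)

# Keep every category: the encoded columns sum to 1 in each row,
# so they are collinear with the intercept.
X_full = OneHotEncoder().fit_transform(colours).toarray()
row_sums = X_full.sum(axis=1)

# The L2 penalty still gives a unique solution despite the collinearity.
model = Ridge(alpha=1.0).fit(X_full, y)
print(model.coef_, model.intercept_)
```

In this setup each category gets its own coefficient and all of them are shrunk in the same way, so no category is singled out as the reference level.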

Note: maybe we should add an option in scikit-learn to drop the most frequent category instead of the first in lexicographical order.
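In the meantime, something similar can be done by hand, since the drop parameter also accepts one explicit category per feature. The following is only a sketch on made-up data; the most_frequent helper list is a hypothetical name, not a scikit-learn feature.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "red" is the most frequent category.
X = np.array(
    [["red"], ["green"], ["red"], ["blue"], ["red"], ["green"]],
    dtype=object,
)

# Find the most frequent category of each column by hand.
most_frequent = []
for j in range(X.shape[1]):
    values, counts = np.unique(X[:, j], return_counts=True)
    most_frequent.append(values[np.argmax(counts)])

# Pass one category per feature to `drop` instead of "first".
enc = OneHotEncoder(drop=np.array(most_frequent, dtype=object))
X_enc = enc.fit_transform(X).toarray()

print(most_frequent)   # ['red']
print(enc.categories_)
print(X_enc.shape)     # (6, 2): only "blue" and "green" columns remain
```

Rows belonging to the dropped category are then encoded as all zeros, so the most common category acts as the reference level.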