Hello!
As I understand from Exercise M4.04, it is better to use OneHotEncoder with drop='first' to reduce the collinear features produced by OneHotEncoder itself. Going by the task description, it would also be better to use it in the current Wrap-up quiz. Am I right?
It’s ok to drop one column for binary categorical features. However, for non-binary categorical features, dropping one column introduces an asymmetry between the categories, especially when using penalized linear models: the dropped category becomes the unpenalized baseline while the others are shrunk toward it, and this inductive bias might not be wanted. So I am not sure we should make any general statement on whether it’s always a good idea to drop one one-hot encoded feature.
Maybe it’s ok to have collinear but more symmetric features, as long as you use some regularization.
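To make the options concrete, here is a minimal sketch comparing the `drop` settings of `OneHotEncoder` on made-up toy data (the array `X` and its categories are invented for illustration, not taken from the exercise or the quiz). `drop="if_binary"` drops a column only for features with exactly two categories, which matches the binary case discussed above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): column 0 is binary, column 1 has three categories.
X = np.array(
    [["yes", "red"], ["no", "green"], ["yes", "blue"], ["no", "red"]],
    dtype=object,
)

shapes = {}
for drop in (None, "first", "if_binary"):
    encoder = OneHotEncoder(drop=drop)
    # The encoder returns a sparse matrix by default; densify it to inspect.
    encoded = encoder.fit_transform(X).toarray()
    shapes[str(drop)] = encoded.shape
    print(f"drop={drop!r}: shape={encoded.shape}, drop_idx_={encoder.drop_idx_}")
```

With `drop=None` all five indicator columns are kept (collinear but symmetric), `drop="first"` removes one column per feature, and `drop="if_binary"` removes a column only from the binary feature, so the three-category feature stays symmetric.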
Note: maybe we should add an option in scikit-learn to drop the most frequent category instead of the first in lexicographical order.
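In the meantime, one possible workaround is to pass an explicit list of categories to `drop`, which `OneHotEncoder` accepts with one entry per feature. The sketch below computes the most frequent category of each column by hand on made-up data (`X` and the `most_frequent` helper list are hypothetical names for illustration); it is only a sketch of the idea, not a built-in option.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): "b" dominates column 0, "x" dominates column 1.
X = np.array(
    [["a", "x"], ["b", "y"], ["b", "y"], ["c", "x"], ["b", "x"]],
    dtype=object,
)

# Find the most frequent category of each column.
most_frequent = []
for column in X.T:
    values, counts = np.unique(column, return_counts=True)
    most_frequent.append(values[np.argmax(counts)])

# `drop` takes one category per feature; that category's column is removed.
encoder = OneHotEncoder(drop=most_frequent)
encoded = encoder.fit_transform(X).toarray()
print(most_frequent, encoded.shape)
```

In a real pipeline the frequencies should be computed on the training set only, so that the dropped category does not depend on the test data.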