Does the encoding happen per split?

christonikos · 2 November 2022 19:02

if the sample ends up in the test set during splitting then the classifier would not have seen the category during training and will not be able to encode it.

Am I right to assume, based on the sentence above, that the encoding of categorical features happens per fold during cross-validation? If so, does using the ‘ignore’ option in the categorical encoder mean “ignore the fold that contains the non-encoded value”? In that case, should we increase the number of folds significantly to avoid skipping many folds?

Thanks a lot

ArturoAmorQ · 3 November 2022 17:08

The encoding does happen per split. But the handle_unknown="ignore" option does not mean the fold is ignored. It means that if an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros (the zero vector).
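To make the per-split behaviour concrete, here is a minimal sketch that fits a fresh encoder on the training part of each split by hand. The toy column and the variable names are made up for illustration; in practice a pipeline passed to a cross-validation helper is refitted on each training split in the same spirit.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import OneHotEncoder

# Toy column (illustrative data): "snake" appears only once,
# so it ends up in exactly one test fold.
X = np.array([["cat"], ["dog"], ["cat"], ["dog"], ["cat"], ["snake"]])

learned_categories = []  # categories seen during fit, one entry per split
encoded_tests = []       # encoded test part, one entry per split

for train_idx, test_idx in KFold(n_splits=3).split(X):
    encoder = OneHotEncoder(handle_unknown="ignore")
    encoder.fit(X[train_idx])  # fit on the training part of this split only
    encoded = encoder.transform(X[test_idx]).toarray()

    learned_categories.append(encoder.categories_[0].tolist())
    encoded_tests.append(encoded.tolist())
    print(learned_categories[-1], encoded_tests[-1])
```

No split is skipped: in the split where "snake" is only in the test part, the encoder simply learns two categories and the unknown row is still transformed.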

For instance, imagine the one-hot encoder is fitted on a categorical variable that only contains “cat” and “dog”. It will then allocate a 2-dimensional space:
“cat” → [1, 0]
“dog” → [0, 1]
If it then encounters a “snake” at test time:
“snake” → [0, 0]
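The same cat/dog/snake example can be reproduced with scikit-learn's `OneHotEncoder`. This is only a small sketch; the arrays `X_train` and `X_test` are illustrative names, not taken from the course material.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: "snake" never appears during fit.
X_train = np.array([["cat"], ["dog"], ["cat"]])
X_test = np.array([["dog"], ["snake"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# The default output is a sparse matrix; convert it to a dense array for display.
encoded_test = encoder.transform(X_test).toarray()

print(encoder.categories_)  # the categories learned during fit
print(encoded_test)         # the "snake" row contains only zeros
```

With the default `handle_unknown="error"`, the same `transform` call would raise an error instead of returning a row of zeros.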

For a linear model this means that whenever the [0, 0] vector is multiplied by the learned weights, the result is zero, so this feature contributes nothing to the prediction for that sample. It is effectively “ignored”.
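A tiny numeric sketch of that argument, using NumPy only. The weights and intercept below are made-up numbers standing in for whatever a linear model would have learned for the two columns [cat, dog].

```python
import numpy as np

# Hypothetical coefficients for the columns [cat, dog], plus an intercept.
weights = np.array([1.5, -0.7])
intercept = 0.2

encoded = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
    "snake": np.array([0.0, 0.0]),  # unknown category: zero vector
}

# Contribution of the one-hot feature to the linear score, per category.
contributions = {name: float(vec @ weights) for name, vec in encoded.items()}
predictions = {name: value + intercept for name, value in contributions.items()}

print(contributions)
print(predictions)
```

For “snake” the contribution is exactly zero, so the score falls back to the intercept (plus whatever the other features contribute in a real model).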

christonikos · 3 November 2022 21:11

Once again, fantastic explanation, Arturo. Thanks.

Nouridine-dino · 12 December 2022 11:06

Thank you @ArturoAmorQ, I was about to ask this question. Now I get it.