Coef shape problem

efenaux · 24 March 2022 17:45

During cross-validation I get this warning message:

UserWarning: Found unknown categories in columns [7] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(

For the corresponding estimator, coef_ has shape (104,), while it has shape (105,) for the others.
I then excluded the (104,) estimator to build the dataframe of coefficients.
Is there a way to get the same shape for all estimators?

ArturoAmorQ · 24 March 2022 19:02

Could you provide a code snippet that raises the error? It should contain how you transform your categorical variables so that we can help you debug your code.

efenaux · 24 March 2022 21:07

Of course, here it is

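import numpy as np
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# `data` and `target` are loaded earlier (not shown here).
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)

preprocessor = make_column_transformer(
    (OneHotEncoder(sparse=False, drop="if_binary", handle_unknown="ignore",
                   dtype=np.int32), categorical_columns),
    (StandardScaler(), numerical_columns),
    verbose_feature_names_out=True,
)

model = make_pipeline(
    preprocessor, LogisticRegression(max_iter=1000))

# return_estimator=True keeps the fitted pipeline of each fold
cv_results = cross_validate(model, data, target,
                            cv=10, scoring="accuracy",
                            return_train_score=True,
                            return_estimator=True)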

And the workaround I used:
coefs = []
for est in cv_results["estimator"]:
    # keep only the folds whose coefficient vector has the full length
    if est[-1].coef_[0].shape[0] == 105:
        coefs.append(est[-1].coef_[0])
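
Instead of discarding the folds with a shorter coefficient vector, the coefficients can be aligned by feature name. The sketch below is not taken from this thread: the feature names and values are made up for illustration. In the real pipeline the names for each fold could come from est[:-1].get_feature_names_out() and the values from est[-1].coef_[0].

```python
import pandas as pd

# Hypothetical per-fold coefficients, indexed by feature name.
# The first fold never saw the rare category, so it has one entry fewer.
fold_coefs = [
    pd.Series([0.5, -0.2], index=["country_a", "age"]),
    pd.Series([0.4, 0.1, -0.3], index=["country_a", "country_rare", "age"]),
]

# Building a DataFrame from a list of Series aligns them on their index:
# a feature missing in a fold becomes NaN, filled here with 0.
coef_table = pd.DataFrame(fold_coefs).fillna(0.0)
print(coef_table)
```

Filling with 0 treats a feature that was absent from a fold as having no effect in that fold. Keeping the NaN values instead makes the gap visible in later plots or summaries.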

ArturoAmorQ · 25 March 2022 13:46

The parameter drop="if_binary" in your OneHotEncoder is causing the problem: there is always one fold where one of the categories is dropped while it is kept in the other folds. For the moment, you can solve the problem by setting drop=None (i.e. the default value).

My guess is that the problem occurs when trying to encode the data['native-country'] == "Holand-Netherlands" sample, as discussed in this forum comment.
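
Another option, not proposed in this thread, is to pass the full list of categories to the encoder through its categories parameter, so that every fold produces the same columns. The sketch below uses a small made-up dataset (toy, toy_target, and the column names are hypothetical) where one category appears in a single row, and compares the coefficient shapes with learned and with fixed categories.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: the category "rare" occurs only in the first row.
rng = np.random.RandomState(0)
countries = ["a", "b", "c", "d"] * 10
countries[0] = "rare"
toy = pd.DataFrame({"country": countries, "age": rng.normal(size=40)})
toy_target = np.array([0, 1] * 20)


def coef_shapes(encoder):
    """Cross-validate a pipeline and return the coef_ shape of each fold."""
    model = make_pipeline(
        make_column_transformer(
            (encoder, ["country"]),
            (StandardScaler(), ["age"]),
        ),
        LogisticRegression(max_iter=1000),
    )
    # Unshuffled KFold: the "rare" row lands in the test set of the first fold.
    cv_results = cross_validate(
        model, toy, toy_target, cv=KFold(n_splits=5), return_estimator=True
    )
    return [est[-1].coef_[0].shape for est in cv_results["estimator"]]


# Categories learned per fold: the first fold never sees "rare" in training.
shapes_learned = coef_shapes(OneHotEncoder(handle_unknown="ignore"))

# Categories fixed up front from the whole column: same columns in every fold.
all_categories = [sorted(toy["country"].unique())]
shapes_fixed = coef_shapes(OneHotEncoder(categories=all_categories))

print(shapes_learned)
print(shapes_fixed)
```

With fixed categories, the column of the rare category is all zeros in the fold that does not contain it during training, so its coefficient carries no information for that fold.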

glemaitre58 · 26 March 2022 09:07

To be clear, scikit-learn emits a UserWarning so that you are aware of this rare-category problem, but it should not raise an error.

In the next scikit-learn release, we will have a dedicated strategy for these infrequent categories, enabled by setting handle_unknown="infrequent_if_exist" (cf. the sklearn.preprocessing.OneHotEncoder page of the scikit-learn 1.1.dev0 documentation).