Wrap-up quiz 4 - Question 13

I have a problem with the model and cv_results: cv_results["test_score"].mean() returns nan and I don't know why.

The data is loaded the same way as in Q12:

import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.drop(columns=["class", "education-num"])

And my predictive model is

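import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.impute import SimpleImputer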

preprocessor = ColumnTransformer([
    ('categorical', OneHotEncoder(), categorical_columns),
    ('numerical', StandardScaler(), numerical_columns)
])

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

cv_results = cross_validate(model, data, target,
                           cv=10, return_estimator=True)

where the column lists passed to the transformer are built with

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)

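By default, cross_validate uses error_score=np.nan, so a fold that fails is scored as nan and the mean becomes nan. The best way to debug this is to pass error_score="raise" to cross_validate so that the underlying error is raised.
You will get the following message: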

ValueError: Found unknown categories [' Holand-Netherlands'] in column 7 during transform

It means that the category " Holand-Netherlands" shows up at predict time (in a test fold) although it was never seen during fit. It is indeed a rare category, so some training folds do not contain it.
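The behaviour described above can be reproduced on a small example. This is only a sketch on made-up data: the "country" column, the "rare" category and the toy target are assumptions for illustration and are not the adult census data.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the category "rare" appears in exactly one row,
# so the fold that tests on this row never sees it during fit.
data = pd.DataFrame({"country": ["A", "B"] * 10})
data.loc[0, "country"] = "rare"
target = np.array([0, 1] * 10)

model = make_pipeline(OneHotEncoder(), LogisticRegression())

# Default error_score=np.nan: the failing fold is scored as nan (with a warning).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    cv_results = cross_validate(model, data, target, cv=5)
mean_score = cv_results["test_score"].mean()
print(mean_score)

# error_score="raise": the underlying exception is raised instead.
try:
    cross_validate(model, data, target, cv=5, error_score="raise")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

A single nan fold is enough to make the mean nan, which is why the problem only becomes visible once the error is raised.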

The trick here is to set the handle_unknown option of OneHotEncoder, so that a category unseen during fit is encoded as all zeros instead of raising an error:

OneHotEncoder(handle_unknown="ignore")
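As a quick check of what this option does, here is a minimal sketch on made-up data (the "country" column and its values are assumptions for illustration only). It first looks at the encoder alone, then at a small cross-validated pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Encoder alone: "C" is not seen during fit, so its row is encoded as zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(pd.DataFrame({"country": ["A", "B", "A"]}))
encoded = encoder.transform(pd.DataFrame({"country": ["C", "A"]})).toarray()
print(encoded)

# Hypothetical toy data with a category that appears in a single row.
data = pd.DataFrame({"country": ["A", "B"] * 10})
data.loc[0, "country"] = "rare"
target = np.array([0, 1] * 10)

# With handle_unknown="ignore", no fold fails during scoring.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"), LogisticRegression()
)
cv_results = cross_validate(model, data, target, cv=5, error_score="raise")
mean_score = cv_results["test_score"].mean()
print(mean_score)
```

In the original model, the same option would go on the OneHotEncoder inside the ColumnTransformer.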