Decision Boundary Display blocked - Help needed :)

Hello everyone,

I’m stuck when trying to display the decision boundary figure for this exercise.

The code is very simple:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import DecisionBoundaryDisplay

adult_census = pd.read_csv("https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/main/datasets/adult-census-numeric.csv")
data = adult_census.drop(columns="class")
target = adult_census["class"]

model = KNeighborsClassifier(n_neighbors=5)
model.fit(data, target)

DecisionBoundaryDisplay.from_estimator(estimator=model, X=data, response_method='predict')

This is the error I receive, but my data does not contain any NaN values at all.
I even tried a SimpleImputer to replace 0 values using the 'most_frequent' strategy, and it still does not work.
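For reference, this is roughly what that attempt looked like (a sketch, the exact parameters are illustrative):

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# sketch: impute 0 values with the most frequent value, then fit the k-NN model
imputed_model = make_pipeline(
    SimpleImputer(missing_values=0, strategy="most_frequent"),
    KNeighborsClassifier(n_neighbors=5),
)
imputed_model.fit(data, target)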

Any idea why it does not work?

Input X contains NaN.
KNeighborsClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

However, when using a HistGradientBoostingClassifier, the decision boundary displays fine and returns this figure.

Question: Why does the decision boundary display 'capital-gain' & 'age' and not 'capital-loss' & 'hours-per-week'?

Hello @Ngofgeu,

Remember that DecisionBoundaryDisplay makes a plot in 2 dimensions and therefore expects the input data X to be of shape (n_samples, 2). What you can do instead is train the model on a pair of features, as follows:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import DecisionBoundaryDisplay

adult_census = pd.read_csv("https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/main/datasets/adult-census-numeric.csv")
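# keep only two features so the display can build its 2D grid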
data = adult_census[["age", "capital-gain"]]
target = adult_census["class"]

model = KNeighborsClassifier(n_neighbors=5)
model.fit(data, target)

DecisionBoundaryDisplay.from_estimator(estimator=model, X=data, response_method='predict')

Keep in mind that training the model on a subset of the features will likely result in a decision boundary that does not represent that of the full model trained on all features. Also, there is nothing special about 'age' and 'capital-gain': you choose which pair gets plotted when you select the training columns, as shown in the sketch below.
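For instance, here is a quick sketch (assuming matplotlib is available) that draws one display per feature pair:

import matplotlib.pyplot as plt

for pair in [["age", "capital-gain"], ["capital-loss", "hours-per-week"]]:
    # train a fresh model on just this pair of features
    pair_model = KNeighborsClassifier(n_neighbors=5).fit(adult_census[pair], target)
    display = DecisionBoundaryDisplay.from_estimator(
        estimator=pair_model, X=adult_census[pair], response_method='predict'
    )
    display.ax_.set_title(f"{pair[0]} / {pair[1]}")
plt.show()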

The reason HistGradientBoostingClassifier appears to bypass this constraint is indeed a bug: the display seems to select the first two features as plotting variables while filling the other columns with arbitrary values. Since HistGradientBoostingClassifier accepts NaN natively (as your error message points out), it does not complain, whereas KNeighborsClassifier raises the "Input X contains NaN" error.
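You can reproduce the asymmetry directly (a minimal sketch; filling the extra columns with NaN is an assumption based on your error message, the actual filler values in the bug may differ):

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X_train = adult_census.drop(columns="class")
grid = X_train.copy()
# simulate the display padding the non-plotted columns
grid[["capital-loss", "hours-per-week"]] = np.nan

hgb = HistGradientBoostingClassifier().fit(X_train, target)
hgb.predict(grid)  # works: NaN is handled natively

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, target)
knn.predict(grid)  # raises ValueError: Input X contains NaN.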

We will take a look at this issue. Thanks for reporting!

Thanks a lot Arturo!! Very appreciated, and happy to have reported a small bug here :grin:


For information, I just opened a PR to address this issue:
