Decision Boundary Display blocked - Help needed :)

Hello everyone,

I’m stuck when trying to display the decision boundary figure for this exercise.

The code is very simple:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import DecisionBoundaryDisplay

adult_census = pd.read_csv("https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/main/datasets/adult-census-numeric.csv")
data = adult_census.drop(columns="class")
target = adult_census["class"]

model = KNeighborsClassifier(n_neighbors=5)
model.fit(data, target)

DecisionBoundaryDisplay.from_estimator(estimator=model, X=data, response_method='predict')

This is the error I receive, but my data does not contain any NaN values at all.
I even tried a SimpleImputer to replace 0 values using the 'most_frequent' strategy, and it still does not work.
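For reference, this is roughly what that attempt looked like (a sketch, the exact parameters are illustrative):

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# sketch: impute 0 values with the most frequent value, then fit the k-NN model
imputed_model = make_pipeline(
    SimpleImputer(missing_values=0, strategy="most_frequent"),
    KNeighborsClassifier(n_neighbors=5),
)
imputed_model.fit(data, target)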

Any idea why it does not work?

Input X contains NaN.
KNeighborsClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

However, when using a HistGradientBoostingClassifier, the decision boundary displays fine and returns this figure.

Question: Why does the decision boundary display 'capital-gain' & 'age' and not 'capital-loss' & 'hours-per-week'?

Hello @Ngofgeu,

Remember that DecisionBoundaryDisplay makes a plot in 2 dimensions and therefore expects the input data X to be of shape (n_samples, 2). What you can do instead is train the model on a pair of features, as follows:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import DecisionBoundaryDisplay

adult_census = pd.read_csv("https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/main/datasets/adult-census-numeric.csv")
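# keep only two features so the display can build its 2D grid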
data = adult_census[["age", "capital-gain"]]
target = adult_census["class"]

model = KNeighborsClassifier(n_neighbors=5)
model.fit(data, target)

DecisionBoundaryDisplay.from_estimator(estimator=model, X=data, response_method='predict')

Keep in mind that training the model on a subset of the features will likely result in a decision boundary that does not represent that of the full model trained on all features. Also, there is nothing special about 'age' and 'capital-gain': you choose which pair gets plotted when you select the training columns, as shown in the sketch below.
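For instance, here is a quick sketch (assuming matplotlib is available) that draws one display per feature pair:

import matplotlib.pyplot as plt

for pair in [["age", "capital-gain"], ["capital-loss", "hours-per-week"]]:
    # train a fresh model on just this pair of features
    pair_model = KNeighborsClassifier(n_neighbors=5).fit(adult_census[pair], target)
    display = DecisionBoundaryDisplay.from_estimator(
        estimator=pair_model, X=adult_census[pair], response_method='predict'
    )
    display.ax_.set_title(f"{pair[0]} / {pair[1]}")
plt.show()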

The reason HistGradientBoostingClassifier appears to bypass this constraint is indeed a bug: the display seems to select the first two features as plotting variables while filling the other columns with arbitrary values. Since HistGradientBoostingClassifier accepts NaN natively (as your error message points out), it does not complain, whereas KNeighborsClassifier raises the "Input X contains NaN" error.
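You can reproduce the asymmetry directly (a minimal sketch; filling the extra columns with NaN is an assumption based on your error message, the actual filler values in the bug may differ):

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X_train = adult_census.drop(columns="class")
grid = X_train.copy()
# simulate the display padding the non-plotted columns
grid[["capital-loss", "hours-per-week"]] = np.nan

hgb = HistGradientBoostingClassifier().fit(X_train, target)
hgb.predict(grid)  # works: NaN is handled natively

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, target)
knn.predict(grid)  # raises ValueError: Input X contains NaN.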

We will take a look at this issue. Thanks for reporting!

Thanks a lot Arturo!! Very appreciated, and happy to have reported a small bug here :grin:


For information, I just opened a PR to address this issue:
