Categorical features in adult census dataset?

Is the following notebook comment correct? Does the adult census dataset contain categorical features?

The adult census contains some categorical data and we encode the categorical features using an OrdinalEncoder since tree-based models can work very efficiently with such a naive representation of categorical variables.

Since there are rare categories in this dataset, we need to specifically encode unknown categories at prediction time in order to be able to use cross-validation. Otherwise, some rare categories could be present only on the validation side of the cross-validation split, and the OrdinalEncoder would raise an error when calling its transform method with the data points of the validation set.
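A minimal sketch of the behaviour described in the comment above, on toy data rather than the real dataset. The arrays and the sentinel value -1 are assumptions chosen for illustration; handle_unknown and unknown_value are existing OrdinalEncoder parameters.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for one categorical column (values are illustrative only).
train = np.array([["United-States"], ["Mexico"], ["United-States"]], dtype=object)
# The second category never appears in the training split.
valid = np.array([["Mexico"], ["Holand-Netherlands"]], dtype=object)

# Default behaviour: a category unseen during fit makes transform raise.
strict_encoder = OrdinalEncoder().fit(train)
try:
    strict_encoder.transform(valid)
    raised = False
except ValueError:
    raised = True

# Map unseen categories to a sentinel value instead of raising.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)
encoded_valid = encoder.transform(valid)
```

With this setting, the category seen only on the validation side is encoded as the sentinel, so cross-validation can proceed.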

Yep, adult census is the first dataset seen in the first chapter: First look at our dataset (Scikit-learn course)

education and native-country are examples of categorical features.

Your screenshot looks a bit weird. We do have a numerical-only dataset, but it is named adult-census-numeric.csv.

Which notebook is this?

This is the notebook in Module 6 > Ensemble method using bootstrapping > Random Forest.

The file used is adult-census.csv

This is indeed bizarre. We use the same dataset in the very first activity, M1 > Tabular data exploration > First look at our dataset, and it does indeed have categorical features:

I ran it via the FUN interface and it looks fine. Could you try loading the dataset in a sandbox notebook? Maybe you have overwritten the file?

The two screenshots were in fact also taken while running in the FUN interface (as you can see from the URLs). The file with both numerical and categorical features is correctly loaded in M1, but M6 wrongly loads the numerical-only file.

Same error from the sandbox – see pic below. Quite puzzling.

I am almost sure that something overwrites your file then. One thing that you can do is to overwrite the file again by loading the file from GitHub:

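import pandas as pd
# Download the original file from the course repository on GitHub.
df = pd.read_csv("https://github.com/INRIA/scikit-learn-mooc/raw/master/datasets/adult-census.csv")
# Overwrite the local copy, without writing the index as an extra column.
df.to_csv("../datasets/adult-census.csv", index=False)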
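To confirm that a file contains categorical features, one option is to list its non-numeric columns. This is only a sketch: non_numeric_columns is a hypothetical helper, and the two small frames are made-up stand-ins that borrow column names from the dataset.

```python
import pandas as pd

def non_numeric_columns(df):
    """Return the names of the columns whose dtype is not numeric."""
    return df.select_dtypes(exclude="number").columns.tolist()

# Made-up stand-ins: one frame with a categorical column, one numeric only.
mixed = pd.DataFrame({"age": [25, 38], "education": ["11th", "HS-grad"]})
numeric_only = pd.DataFrame({"age": [25, 38], "hours-per-week": [40, 50]})

mixed_columns = non_numeric_columns(mixed)
numeric_columns = non_numeric_columns(numeric_only)
```

Calling non_numeric_columns(pd.read_csv("../datasets/adult-census.csv")) on the restored file should then return a non-empty list; an empty list would suggest the numeric-only file is still in place.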

Thanks. Works!