Categorical features in adult census dataset?

Marc_In_Singapore · 24 June 2021 10:53

Is this comment correct? Categorical features in adult census dataset?

The adult census contains some categorical data and we encode the categorical features using an OrdinalEncoder since tree-based models can work very efficiently with such a naive representation of categorical variables.

Since there are rare categories in this dataset, we need to specifically encode unknown categories at prediction time in order to be able to use cross-validation. Otherwise, some rare categories could be present only on the validation side of the cross-validation split, and the OrdinalEncoder would raise an error when calling its transform method with the data points of the validation set.
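
To illustrate the behaviour described in this comment, here is a minimal sketch on toy data. The country values are only examples, and the sentinel -1 is an arbitrary choice for unknown_value; this is not the notebook's exact code.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy training fold: the rare category "Holand-Netherlands" is absent here.
X_train = np.array([["United-States"], ["Mexico"], ["United-States"]], dtype=object)
# Toy validation fold containing a category never seen during fit.
X_valid = np.array([["Mexico"], ["Holand-Netherlands"]], dtype=object)

# Default behaviour (handle_unknown="error"): transform fails on unseen categories.
strict_encoder = OrdinalEncoder().fit(X_train)
try:
    strict_encoder.transform(X_valid)
    raised = False
except ValueError:
    raised = True

# Map unseen categories to a sentinel value instead (-1 is an assumption).
tolerant_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
tolerant_encoder.fit(X_train)
encoded = tolerant_encoder.transform(X_valid)

print(raised)
print(encoded)
```

With the tolerant encoder, every fold of a cross-validation can be transformed even when a rare category only shows up on the validation side.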

glemaitre58 · 24 June 2021 12:56

Yep, adult census is the first dataset seen in the first chapter: First look at our dataset (Scikit-learn course)

education or native-country are some categorical features.

This is a bit weird from your screenshot. We indeed have a numerical only dataset but the name is adult-census-numeric.csv.

Which notebook is this?

Marc_In_Singapore · 24 June 2021 13:48

This is the notebook in Module 6 > Ensemble method using bootstrapping > Random Forest.

The file used is adult-census.csv

Marc_In_Singapore · 24 June 2021 13:57

This is indeed bizarre. We use the same dataset in the very first activity, M1 > Tabular data exploration > First look at our dataset, and there it does have categorical features:

glemaitre58 · 24 June 2021 16:58

I ran it via the FUN interface and it looks fine. Could you try to load the dataset in a sandbox notebook? Maybe you have overwritten the file?
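
A quick way to run that check in a sandbox notebook could look like the sketch below. The helper name categorical_columns is hypothetical, and the two in-memory rows are made-up stand-ins so that the snippet runs without the course files; in the sandbox you would read "../datasets/adult-census.csv" instead.

```python
import io

import pandas as pd


def categorical_columns(df):
    """Return the names of the columns that are not numeric."""
    return df.select_dtypes(exclude="number").columns.tolist()


# Stand-in for pd.read_csv("../datasets/adult-census.csv"); the rows are toy data.
sample_csv = io.StringIO(
    "age,education,native-country,hours-per-week\n"
    "25,11th,United-States,40\n"
    "38,HS-grad,United-States,50\n"
)
df = pd.read_csv(sample_csv)

print(df.dtypes)
print(categorical_columns(df))
```

If the list comes back empty for the real file, the file on disk only holds numerical columns and is probably not the original adult-census.csv.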

Marc_In_Singapore · 24 June 2021 23:04

The two screenshots were actually also taken while running in the FUN interface (as you can see from the URLs). The file with both numerical and categorical features is correctly loaded in M1, but M6 wrongly loads the numerical-only file.

Same error from the sandbox – see pic below. Quite puzzling.

glemaitre58 · 25 June 2021 09:35

I am almost sure that something overwrites your file then. One thing that you can do is to overwrite the file again by loading the file from GitHub:

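import pandas as pd
# Download a fresh copy of the dataset from the course repository
df = pd.read_csv("https://github.com/INRIA/scikit-learn-mooc/raw/master/datasets/adult-census.csv")
# Overwrite the local file, without writing the index as an extra column
df.to_csv("../datasets/adult-census.csv", index=False)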

Marc_In_Singapore · 25 June 2021 10:46

Thanks. Works!