I don't fully understand step 4

EtienneJouin · 21 February 2022 23:05

Hello,
this course is very challenging and interesting, thank you.

I think I got the concepts and methods, but I have some difficulty fully grasping step 4:

numerical_columns = [
    "age", "education-num", "capital-gain", "capital-loss",
    "hours-per-week"]

This is ok, I understand the purpose.

categorical_columns = [
    "workclass", "education", "marital-status", "occupation",
    "relationship", "race", "sex", "native-country"]

Same logic, and I understand it just as well.

all_columns = numerical_columns + categorical_columns + [target_column]

OK, we combine 3 lists of columns, one of which holds a single column.

adult_census = adult_census[all_columns]

That's where I'm stuck: it seems so redundant to me!
Is this necessary as part of the data preparation, i.e. grouping the columns by usage before passing the data to the functions? Are these names mandatory?

Thank you for your help

ArturoAmorQ · 22 February 2022 08:52

Hello,

Thanks for your kind words! You are right: it is redundant to rebuild the dataset by grouping the columns by usage. But the goal of this notebook is to show the syntax for selecting a subset of columns, how lists of column names can be concatenated, and finally how to identify categorical, numerical and target columns by hand. This last point is explored in more detail in future notebooks.

So identifying which columns are numerical and which are categorical is mandatory. Choosing these particular variable names and rebuilding the dataset are things we do for didactic purposes, but they are not necessary.
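
To make the column selection syntax concrete, here is a minimal sketch. It assumes a small made-up DataFrame rather than the real adult census data, and the variable names (`df`, `num_cols`, `cat_cols`, `target`, `selected`, `subset`) are hypothetical, chosen only to show that the names themselves are arbitrary:

```python
import pandas as pd

# Hypothetical toy data standing in for the adult census dataset.
df = pd.DataFrame({
    "age": [25, 38, 52],
    "workclass": ["Private", "State-gov", "Private"],
    "hours-per-week": [40, 50, 45],
    "fnlwgt": [226802, 89814, 336951],  # a column we do not want to keep
    "class": ["<=50K", "<=50K", ">50K"],
})

# These variable names are arbitrary; any other names would work as well.
num_cols = ["age", "hours-per-week"]
cat_cols = ["workclass"]
target = "class"

# "+" concatenates plain Python lists of column names.
selected = num_cols + cat_cols + [target]

# Indexing with a list of names keeps only those columns, in that order.
subset = df[selected]
print(subset.columns.tolist())
# ['age', 'hours-per-week', 'workclass', 'class']
```

In this sketch the selection does real work only because it drops the unwanted `fnlwgt` column and reorders the rest. If the list already contained every column of the dataframe, the step would change nothing but the column order.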

EtienneJouin · 22 February 2022 18:16

Thank you for this clear response