Question 4 in WQ5 is misleading

You say:

Instead of using only the numerical dataset (which was the variable data_numerical ), use the entire dataset available in the variable data .

So, as numerical features, I used the numerical_features list you created at the start of the quiz as the argument to the preprocessor:

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(transformers=[
    ("cat-preprocessor", imputer_ordinal_transformer, categorical_columns),
    ("num-preprocessor", scaler_imputer_transformer, numerical_features),
])
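For completeness, here is a sketch of how these two transformers could be defined; this is only my assumption of the intended steps, not necessarily exactly what the quiz uses:

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Assumed definitions: impute then ordinal-encode the categorical columns,
# impute then scale the numerical columns.
imputer_ordinal_transformer = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
)
scaler_imputer_transformer = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
)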

But in doing that I obtained a score of only ~0.72, since only part of the numerical column names are present in the numerical_features you defined.
:sob:

So I propose you modify the first sentence to:

Instead of using only a part of the numerical dataset (which was the variable data_numerical ), use the entire dataset available in the variable data .

I corrected this question today for this specific reason:

Create a preprocessor by dealing separately with the numerical and categorical columns. For the sake of simplicity, we will define the categorical columns as the columns with an object data type while all other columns will be considered as numerical columns.

We also added the following:

Fix the random state of the tree by passing the parameter random_state=0

https://inria.github.io/scikit-learn-mooc/trees/trees_wrap_up_quiz.html

I can see it on FUN. Maybe you need to refresh the page since we updated the question recently.
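To make the intent concrete, here is a minimal sketch of the kind of pipeline the corrected question asks for, using the variable data from the quiz. Only the column selection (object dtype vs. everything else) and the fixed random_state come from the question text; the particular imputers and encoder below are my own assumption:

from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

# Categorical columns = object dtype; all other columns are treated as numerical.
categorical_columns = selector(dtype_include=object)(data)
numerical_columns = selector(dtype_exclude=object)(data)

preprocessor = ColumnTransformer(transformers=[
    ("cat-preprocessor",
     make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OrdinalEncoder(handle_unknown="use_encoded_value",
                                  unknown_value=-1)),
     categorical_columns),
    ("num-preprocessor", SimpleImputer(strategy="mean"), numerical_columns),
])

# Fix the random state of the tree as stated in the question.
model = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))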

You wrote:

Instead of using only the numerical dataset (which was the variable data_numerical), use the entire dataset available in the variable data.

Create a preprocessor by dealing separately with the numerical and categorical columns. For the sake of simplicity, we will define the categorical columns as the columns with an object data type while all other columns will be considered as numerical columns.

Sorry, but when I read those three sentences I really thought that the numerical dataset to be used was the variable data_numerical, i.e. data[numerical_features], and therefore that numerical_features had to be passed as the argument to the preprocessor, since it was presented as THE numerical dataset. So I assumed all the numerical columns were in the numerical_features list.

If you modify the text as I proposed, there will be no more ambiguity.

@lfarhi @MarieCollin Could you update FUN with https://gitlab.inria.fr/learninglab/mooc-scikit-learn/mooc-scikit-learn-coordination/-/commit/e7647e2763b0cc7d031a89c7293f91152d487ed8

I rephrased it: FIX make question more explicit · INRIA/scikit-learn-mooc@bbf837b · GitHub

It’s better, but I think it could be clearer as:

Instead of using only a subset of the numerical data (which we previously used to create the variable data_numerical)

BTW, why did you not use the entire numerical dataset at the beginning of the exercise?

Some of these features are categorical features encoded with numerical values.
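For instance (taking MSSubClass as an assumed example; it also shows up in the list further below), such a column has an integer dtype but its values are type codes rather than quantities:

import pandas as pd

ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")

# The dtype is numerical, but the values are building-type codes,
# so this column is categorical despite being encoded with numbers.
print(ames_housing["MSSubClass"].dtype)
print(ames_housing["MSSubClass"].value_counts())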

This is also corrected on FUN now.

import pandas as pd
ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
target_name = "SalePrice"
data = ames_housing.drop(columns=target_name)

numerical_features = [
    "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
    "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
    "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
    "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
    "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
numerical_columns = numerical_columns_selector(data)

If I run:

[el for el in numerical_features if el not in numerical_columns] 

the output is:

[]

and if I run:

[el for el in numerical_columns if el not in numerical_features]

the output is:

['Id',
 'MSSubClass',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'GarageYrBlt',
 'MoSold',
 'YrSold']

So the numerical_features list you created at the start of the quiz is a subset of the numerical columns, i.e. of the columns that do not have dtype='object', even though some of them are indeed categorical features.

The problem comes from the fact that data scientists (and statisticians) use “numerical data” to mean “quantitative data”, whereas “categorical data” can be numerical too :wink:

That also means that some categorical data (and datetime data) will be processed as numerical by the model. Would it not have been better to change the dtypes of these features? Or to process them as they are, together with the other categorical data?

Pandas provides a category dtype that is perfect for this purpose. However, there is no automatic way to detect such columns: one needs to define which columns should be considered categorical. Indeed, I would recommend using this approach because it is more explicit when passing data around. I assume that in the long run, we will take advantage of this data type in scikit-learn (or a side project) to have some more automatic preprocessing.
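A minimal sketch of that recommendation, assuming MSSubClass is one of the manually identified categorical columns:

from sklearn.compose import make_column_selector as selector

# Explicitly mark the numerically-encoded categorical columns with the
# pandas "category" dtype (the list of columns has to be chosen manually).
data = data.astype({"MSSubClass": "category"})

# Downstream code can then select categorical columns by dtype instead of
# relying on "object" only.
categorical_columns = selector(dtype_include=["object", "category"])(data)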

Regarding the dates, one might want to go beyond just a categorical encoding: usually, such data encode some kind of periodicity that could be useful. For instance, @ogrisel has been working in scikit-learn on a concise example showing that periodicity can be useful and modeled: https://github.com/scikit-learn/scikit-learn/pull/20281
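As a simple illustration of that idea (not the approach in the linked PR, just a sketch): a periodic feature such as MoSold can be mapped onto the unit circle so that month 12 ends up close to month 1:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def month_to_circle(month):
    # month: array-like of shape (n_samples, 1) with values in 1..12
    month = np.asarray(month, dtype=float)
    return np.hstack([np.sin(2 * np.pi * month / 12),
                      np.cos(2 * np.pi * month / 12)])

month_encoder = FunctionTransformer(month_to_circle)
mo_sold_encoded = month_encoder.fit_transform(data[["MoSold"]])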