Can you comment on the use of the 'remainder' parameter of ColumnTransformer in M1.05 and the lesson before?

In M1.05 and the lesson before, you use the parameter 'remainder' and set it to "passthrough" when only categorical data are processed, but leave it at its default value ("drop") when numerical data are processed too.

preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns)],
    remainder="passthrough")
preprocessor = ColumnTransformer([
    ('numerical', StandardScaler(), numerical_columns),
    ('categorical', OrdinalEncoder(handle_unknown="use_encoded_value",
                                   unknown_value=-1),
     categorical_columns)])

I tested using remainder with the "drop" value for categorical data processing, and the accuracy of the model decreased.
Why does it make a difference whether non-specified columns are dropped or passed through?
Do we have to set remainder to "passthrough" each time we process only a part of our data, or is this just for categorical data?

Thanks for your answers

The use of remainder is independent of the type of data (categorical or numerical). It is just a quick way to deal with the columns that are not preprocessed by the specified transformers (e.g. the columns not in categorical_columns or numerical_columns). There are two ways to handle these unspecified columns:

  • drop: drop them and do not use these data later on in the pipeline;
  • passthrough: let them pass as-is and concatenate them with the preprocessed columns.
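A minimal sketch of the difference, using a made-up toy dataframe (the column names here are hypothetical, not from the course material): the only thing that changes is how many columns come out of the transformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "red"],  # categorical, listed in the transformer
    "size": ["S", "M", "L"],          # categorical, listed in the transformer
    "price": [1.0, 2.0, 3.0],         # numerical, NOT listed
})

# remainder="drop" (the default): unlisted columns disappear.
drop_ct = ColumnTransformer(
    [("categorical", OrdinalEncoder(), ["color", "size"])],
    remainder="drop",
)
print(drop_ct.fit_transform(df).shape)  # (3, 2): only the encoded columns

# remainder="passthrough": unlisted columns are concatenated as-is.
pass_ct = ColumnTransformer(
    [("categorical", OrdinalEncoder(), ["color", "size"])],
    remainder="passthrough",
)
print(pass_ct.fit_transform(df).shape)  # (3, 3): encoded columns + "price"
```

So with "drop", any information in the unlisted columns is simply lost to the model, which explains the drop in accuracy you observed.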

A follow-up question could be: what if I want to pass some of the columns as-is, but not all of them? In this case, remainder is not enough. You will need an additional entry. For instance, let's say that you have a dataframe with columns from "col_A" to "col_Z":

preprocessor = ColumnTransformer([
    ('numerical', StandardScaler(), ["col_A", "col_B"]),
    ('categorical', OrdinalEncoder(), ["col_C", "col_D"]),
    ('untouched', "passthrough", ["col_E", "col_F"])],
    remainder="drop",
)

The dataframe given after transform will contain "col_A", "col_B" scaled, "col_C", "col_D" encoded, and "col_E", "col_F" as-is. The columns from "col_G" to "col_Z" will be dropped.
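A runnable sketch of the example above, with toy random data for the 26 hypothetical columns "col_A" through "col_Z", confirming that only the six listed columns survive:

```python
import string
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy dataframe with columns col_A ... col_Z (5 rows of random values).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {f"col_{letter}": rng.normal(size=5) for letter in string.ascii_uppercase}
)

preprocessor = ColumnTransformer([
    ('numerical', StandardScaler(), ["col_A", "col_B"]),
    ('categorical', OrdinalEncoder(), ["col_C", "col_D"]),
    ('untouched', "passthrough", ["col_E", "col_F"])],
    remainder="drop",
)

# 2 scaled + 2 encoded + 2 passed-through = 6 columns; col_G..col_Z dropped.
print(preprocessor.fit_transform(df).shape)  # (5, 6)
```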


Just to be sure I have understood well.

When I do:

preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns)],
    remainder="passthrough")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

I'm applying the HistGradientBoostingClassifier to the preprocessed categorical data and to the numerical data left as-is. And when I do:

preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns)],
    remainder="drop")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

I'm applying the HistGradientBoostingClassifier to the preprocessed categorical data only?

Am I right?

Yes exactly :slight_smile:

Actually, to be precise, it is not only numerical data but all data not in categorical_columns (in this case, it should be equivalent).
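You can check this yourself by inspecting how many features each preprocessor forwards to the classifier. This is a hypothetical toy dataset, not the course data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({
    "workclass": ["a", "b", "a", "c"],  # categorical
    "education": ["x", "x", "y", "y"],  # categorical
    "age": [25, 32, 47, 51],            # numerical, not listed
})
categorical_columns = ["workclass", "education"]
categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)

with_passthrough = ColumnTransformer(
    [("categorical", categorical_preprocessor, categorical_columns)],
    remainder="passthrough",
)
with_drop = ColumnTransformer(
    [("categorical", categorical_preprocessor, categorical_columns)],
    remainder="drop",
)

# passthrough: the classifier receives 3 features (2 encoded + "age" as-is).
print(with_passthrough.fit_transform(X).shape)  # (4, 3)
# drop: only the 2 encoded categorical features reach the classifier.
print(with_drop.fit_transform(X).shape)  # (4, 2)
```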

Thanks, it's clearer in my mind now :slight_smile: