Can you comment on the use of the 'remainder' parameter of ColumnTransformer in M1.05 and the lesson before?

In M1.05 and the lesson before, you use the parameter 'remainder' and set it to "passthrough" when only categorical data are processed, but leave it at its default value ("drop") when numerical data are processed too.

preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns)],
    remainder="passthrough")
preprocessor = ColumnTransformer([
    ('numerical', StandardScaler(), numerical_columns),
    ('categorical', OrdinalEncoder(handle_unknown="use_encoded_value",
                                   unknown_value=-1),
     categorical_columns)])

I tested using remainder with the "drop" value for categorical data processing, and the accuracy of the model decreased.
Why does it make a difference whether non-specified columns are dropped or passed through?
Do we have to set remainder to "passthrough" each time we process only a part of our data, or is this just for categorical data?

Thanks for your answers

The use of remainder is independent of the type of data (categorical or numerical). It is just a quick way to deal with the columns that are not preprocessed by the specified transformers (e.g. the columns not in categorical_columns or numerical_columns). There are two ways to handle these unspecified columns:

  • drop: drop them and do not use these data later on in the pipeline;
  • passthrough: let them pass as-is and concatenate them with the preprocessed columns.
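A minimal sketch of the difference, using a made-up toy dataframe (the column names here are hypothetical, not from the course material): the only thing that changes is how many columns come out of the transformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "red"],  # categorical, listed in the transformer
    "size": ["S", "M", "L"],          # categorical, listed in the transformer
    "price": [1.0, 2.0, 3.0],         # numerical, NOT listed
})

# remainder="drop" (the default): unlisted columns disappear.
drop_ct = ColumnTransformer(
    [("categorical", OrdinalEncoder(), ["color", "size"])],
    remainder="drop",
)
print(drop_ct.fit_transform(df).shape)  # (3, 2): only the encoded columns

# remainder="passthrough": unlisted columns are concatenated as-is.
pass_ct = ColumnTransformer(
    [("categorical", OrdinalEncoder(), ["color", "size"])],
    remainder="passthrough",
)
print(pass_ct.fit_transform(df).shape)  # (3, 3): encoded columns + "price"
```

So with "drop", any information in the unlisted columns is simply lost to the model, which explains the drop in accuracy you observed.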

A follow-up question could be: what if I want to pass some of the columns as-is, but not all of them? In this case, remainder is not enough. You will need an additional entry. For instance, let's say that you have a dataframe with columns from "col_A" to "col_Z":

preprocessor = ColumnTransformer([
    ('numerical', StandardScaler(), ["col_A", "col_B"]),
    ('categorical', OrdinalEncoder(), ["col_C", "col_D"]),
    ('untouched', "passthrough", ["col_E", "col_F"])],
    remainder="drop",
)

The dataframe given after transform will contain "col_A", "col_B" scaled, "col_C", "col_D" encoded, and "col_E", "col_F" as-is. The columns from "col_G" to "col_Z" will be dropped.
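A runnable sketch of the example above, with toy random data for the 26 hypothetical columns "col_A" through "col_Z", confirming that only the six listed columns survive:

```python
import string
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy dataframe with columns col_A ... col_Z (5 rows of random values).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {f"col_{letter}": rng.normal(size=5) for letter in string.ascii_uppercase}
)

preprocessor = ColumnTransformer([
    ('numerical', StandardScaler(), ["col_A", "col_B"]),
    ('categorical', OrdinalEncoder(), ["col_C", "col_D"]),
    ('untouched', "passthrough", ["col_E", "col_F"])],
    remainder="drop",
)

# 2 scaled + 2 encoded + 2 passed-through = 6 columns; col_G..col_Z dropped.
print(preprocessor.fit_transform(df).shape)  # (5, 6)
```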


Just to be sure I have understood well.

When I do:

preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns)],
    remainder="passthrough")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

I'm applying the HistGradientBoostingClassifier to the preprocessed categorical data and to the numerical data left as-is. And when I do:

preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns)],
    remainder="drop")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

I'm applying the HistGradientBoostingClassifier to the preprocessed categorical data only?

Am I right?

Yes exactly :slight_smile:

Actually, to be precise, it is not only numerical data but all data not in categorical_columns (in this case, it should be equivalent).
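You can check this yourself by inspecting how many features each preprocessor forwards to the classifier. This is a hypothetical toy dataset, not the course data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({
    "workclass": ["a", "b", "a", "c"],  # categorical
    "education": ["x", "x", "y", "y"],  # categorical
    "age": [25, 32, 47, 51],            # numerical, not listed
})
categorical_columns = ["workclass", "education"]
categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)

with_passthrough = ColumnTransformer(
    [("categorical", categorical_preprocessor, categorical_columns)],
    remainder="passthrough",
)
with_drop = ColumnTransformer(
    [("categorical", categorical_preprocessor, categorical_columns)],
    remainder="drop",
)

# passthrough: the classifier receives 3 features (2 encoded + "age" as-is).
print(with_passthrough.fit_transform(X).shape)  # (4, 3)
# drop: only the 2 encoded categorical features reach the classifier.
print(with_drop.fit_transform(X).shape)  # (4, 2)
```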

Thanks, it's clearer in my mind now :slight_smile: