'passthrough' causing an error

navneethc · 3 March 2022 22:28

In the section Scaling numerical features, I thought the exercise asked me to transform the numerical features alone while retaining the categorical, so I set up the pipeline as follows:

[Assume I have run the previous cells in order.]

from sklearn.preprocessing import StandardScaler

numerical_preprocessor = StandardScaler()

preprocessor = ColumnTransformer([
    ('numerical', numerical_preprocessor, numerical_columns)],
    remainder="passthrough")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

start = time.time()
cv_results = cross_validate(model, data, target, error_score='raise')
elapsed_time = time.time() - start

scores = cv_results["test_score"]

print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f} "
      f"with a fitting time of {elapsed_time:.3f}")

This gives me a long traceback ending with:

ValueError: could not convert string to float: ' State-gov'

Why is the transformer being applied to categorical columns? What am I missing?

glemaitre58 · 4 March 2022 10:16

Setting remainder="passthrough" means that all columns not listed in any of the transformers are sent through as-is. So the categorical columns are passed on unchanged, and their strings are never encoded. The error is then raised by the classifier, which cannot convert those strings to floats.

I assume that for the numerical part, we asked to keep remainder="drop", which is indeed the default.
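
To see what passthrough does in isolation, here is a minimal sketch on a small made-up DataFrame (the `toy` data and its values are assumptions for illustration, not the course dataset). It only inspects the output of the ColumnTransformer, before any classifier is involved:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one numerical and one categorical column.
toy = pd.DataFrame({
    "age": [25, 38, 52],
    "workclass": [" State-gov", " Private", " Private"],
})

# Only "age" is given to StandardScaler; "workclass" falls into the remainder.
preprocessor = ColumnTransformer(
    [("numerical", StandardScaler(), ["age"])],
    remainder="passthrough",
)
transformed = preprocessor.fit_transform(toy)

# First column: scaled ages. Second column: the original, untouched strings.
print(transformed)
```

The second column still holds the raw strings, so StandardScaler never touched it. Any downstream estimator that expects numbers will fail on that column.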

navneethc · 4 March 2022 14:28

Exactly, they should not be encoded, yet StandardScaler is being applied to the categorical columns.

The instruction for this section is:

Let’s write a similar pipeline that also scales the numerical features using StandardScaler (or similar):

glemaitre58 · 4 March 2022 14:42

No, the categorical columns are left as-is and not passed to any transformer. This is the meaning of passthrough.

So basically, you need to do both:

encode categorical columns using an OrdinalEncoder
and scale the numerical columns using a StandardScaler

In this case, there will be no remainder to pass through, because every column will be either numerical or categorical.