Fitting a model with "unprocessed" data

AWNeto · 26 May 2021 01:16

The following statements were taken from the lecture HANDLING CATEGORICAL DATA:

Then, we can send the raw dataset straight to the pipeline. Indeed, we do not need to make any manual preprocessing (calling the transform or fit_transform methods) as it will be handled when calling the predict method.

The notebook then proceeds to call the fit method to train an ML model on an unscaled, unprocessed train set:

_ = model.fit(data_train, target_train)

Why can we do that? Shouldn’t we pre-process the data before fitting a model?

ThomasLoock · 26 May 2021 05:38

Hi AWNeto,
the model is a pipeline:

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

and the preprocessor is a transformer:

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard-scaler', numerical_preprocessor, numerical_columns)
    ])

When you call model.fit(), the data goes through the pipeline: the preprocessor is fitted on the raw data and transforms it, and then the logistic regression is trained on the transformed data.
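As a rough illustration, the sketch below compares the pipeline with doing the same steps by hand. The toy data and column names are made up (this is not the course dataset), and I am assuming categorical_preprocessor is a OneHotEncoder and numerical_preprocessor is a StandardScaler, as the step names suggest.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (hypothetical values, for illustration only).
data_train = pd.DataFrame({
    "age": [25, 38, 52, 44, 31, 60],
    "workclass": ["Private", "State-gov", "Private",
                  "Self-emp", "Private", "State-gov"],
})
target_train = np.array([0, 0, 1, 1, 0, 1])


def make_preprocessor():
    # Assumed stand-ins for categorical_preprocessor / numerical_preprocessor.
    return ColumnTransformer([
        ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
        ("standard-scaler", StandardScaler(), ["age"]),
    ])


# Option 1: the pipeline receives the raw data and preprocesses it internally.
model = make_pipeline(make_preprocessor(), LogisticRegression(max_iter=500))
_ = model.fit(data_train, target_train)

# Option 2: the same steps done manually.
preprocessor = make_preprocessor()
data_train_transformed = preprocessor.fit_transform(data_train)
classifier = LogisticRegression(max_iter=500)
_ = classifier.fit(data_train_transformed, target_train)

# Both routes should learn the same coefficients.
same_coefficients = np.allclose(model[-1].coef_, classifier.coef_)
print(same_coefficients)
```

So the preprocessing still happens; the pipeline just does the fit_transform call for you before training the final estimator.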

glemaitre58 · 26 May 2021 08:12

Here “transformed” means that:

categorical data are one-hot encoded;
numerical data are scaled.
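
To see what these two transformations produce, here is a small sketch on made-up data. The column names and values are assumptions for illustration, and sparse_threshold=0 is only there to force a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values, not the course dataset).
data = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0],
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
})

preprocessor = ColumnTransformer(
    [
        ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
        ("standard-scaler", StandardScaler(), ["age"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

transformed = preprocessor.fit_transform(data)

# Output columns follow the order of the transformers:
# first the one-hot columns (one per category), then the scaled numerical column.
one_hot_part = transformed[:, :3]
scaled_age = transformed[:, 3]

print(transformed)
print(one_hot_part.sum(axis=1))            # exactly one active category per row
print(scaled_age.mean(), scaled_age.std())  # centered, unit standard deviation
```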

Bottom line: when you use a pipeline containing transformers, scikit-learn applies the necessary transformations for you, both when calling fit and when calling predict.
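
The same holds at prediction time, as the lecture statement quoted in the question says. A minimal sketch, again with made-up data and assuming a OneHotEncoder and a StandardScaler as the two preprocessors: predict on the raw test set should match transforming the data by hand with the already fitted preprocessor and then predicting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy train/test data (hypothetical values, for illustration only).
data_train = pd.DataFrame({
    "age": [25, 38, 52, 44, 31, 60],
    "workclass": ["Private", "State-gov", "Private",
                  "Self-emp", "Private", "State-gov"],
})
target_train = np.array([0, 0, 1, 1, 0, 1])
data_test = pd.DataFrame({
    "age": [28, 57],
    "workclass": ["Private", "Self-emp"],
})

preprocessor = ColumnTransformer([
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ("standard-scaler", StandardScaler(), ["age"]),
])
model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
_ = model.fit(data_train, target_train)

# The pipeline transforms the raw test data before predicting.
predictions = model.predict(data_test)

# Manual equivalent: reuse the fitted steps stored inside the pipeline.
fitted_preprocessor, fitted_classifier = model[0], model[-1]
manual_predictions = fitted_classifier.predict(
    fitted_preprocessor.transform(data_test)
)

print(np.array_equal(predictions, manual_predictions))
```

Note that the test data is only transformed, never fitted: the encoder categories and the scaling statistics come from the training set.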