Pipeline or make_pipeline + GridSearchCV

Hi Teachers,
I have a few questions about Pipeline/make_pipeline. I understood your explanations, but in them we had one step per type of variable, such as a StandardScaler for numerical features and an encoder for categorical ones. What if I need more than one step, for instance (for numerical variables) an imputer followed by a standard scaler?
Using Pipeline seems more explicit for creating the steps; with make_pipeline I tried and did not succeed.

Another related question: is it possible to combine a GridSearchCV with a pipeline?

When you have several steps, you provide the order in which you want the sequence of transformers to happen. For instance:

make_pipeline(StandardScaler(), SimpleImputer(), LogisticRegression())

Here, you specify to first apply the scaler, then the imputer and finally train/predict with a logistic regression.

The difference between Pipeline and make_pipeline is that with Pipeline you choose the name of each step yourself, while make_pipeline assigns the names automatically (using the lowercased class name). So I used make_pipeline above. The equivalent with Pipeline is:

Pipeline([
    ("scaler", StandardScaler()),
    ("imputer", SimpleImputer()),
    ("classifier", LogisticRegression()),
])
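To make the naming difference concrete, here is a small sketch (not from the course material) that builds both variants and prints their step names:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline derives each step name from the lowercased class name
auto = make_pipeline(StandardScaler(), SimpleImputer(), LogisticRegression())
print(list(auto.named_steps))
# → ['standardscaler', 'simpleimputer', 'logisticregression']

# Pipeline lets you pick the step names yourself
manual = Pipeline([
    ("scaler", StandardScaler()),
    ("imputer", SimpleImputer()),
    ("classifier", LogisticRegression()),
])
print(list(manual.named_steps))
# → ['scaler', 'imputer', 'classifier']
```

These step names matter later: they are the prefix used when addressing hyperparameters in a grid search.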

Usually it is the opposite: you provide a pipeline as the base model in the grid-search:

model = make_pipeline(StandardScaler(), LogisticRegression())
search_cv = GridSearchCV(model, param_grid={"logisticregression__C": [0.1, 1, 10]})
search_cv.fit(X, y)
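Here is the same idea as a runnable sketch on synthetic data (the dataset and parameter values are purely illustrative, not from the course):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# toy classification data standing in for a real dataset
X, y = make_classification(n_samples=200, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())
# parameter names follow the "<step name>__<parameter>" convention
search_cv = GridSearchCV(
    model, param_grid={"logisticregression__C": [0.1, 1, 10]}, cv=5
)
search_cv.fit(X, y)
print(search_cv.best_params_)
```

Note that the key `"logisticregression__C"` is the auto-generated step name from make_pipeline, a double underscore, and the estimator's parameter.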

Hi glemaitre58,
Thanks for the reply, I got your explanation. Still, one question remains: can the steps you described be applied to numerical and categorical processing separately? We did some exercises where we split the steps for each type of variable; can I apply the same idea in the way you explained?

Thanks for your attention
Fabio C. Lima

In this case, you will use a ColumnTransformer, whose transformers parameter is a list of tuples that can contain Pipelines.


Thanks again for your time. I've been enjoying this MOOC very much; most explanations are clear and straightforward. Is it too much to ask for an example of make_pipeline with several steps for numeric and categorical features?

Thanks

import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")

target_name = "class"
target = adult_census[target_name]

data = adult_census.drop(columns=[target_name])

from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

categorical_preprocessor = make_pipeline(
    # use a string fill value: OneHotEncoder expects a single dtype per column,
    # and mixing an integer placeholder with string categories raises an error
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)
numerical_preprocessor = make_pipeline(
    StandardScaler(), SimpleImputer()
)

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard-scaler', numerical_preprocessor, numerical_columns)])



from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
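To connect this back to the original grid-search question: once the preprocessing and the model live in one pipeline, they can be tuned together with GridSearchCV. A minimal self-contained sketch follows, using a toy DataFrame in place of the census data (column names, fill values, and the parameter grid are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data with one missing value per column type
data = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 43, 38, 29, 60],
    "hours": [40, 35, 50, 45, np.nan, 40, 38, 20],
    "workclass": ["Private", "State", np.nan, "Private",
                  "Private", "State", "Private", "State"],
})
target = pd.Series([0, 1, 0, 1, 1, 0, 0, 1])

numerical_preprocessor = make_pipeline(SimpleImputer(), StandardScaler())
categorical_preprocessor = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)
preprocessor = ColumnTransformer([
    ("num", numerical_preprocessor, ["age", "hours"]),
    ("cat", categorical_preprocessor, ["workclass"]),
])
model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

# hyperparameters are still addressed through the step names
param_grid = {"logisticregression__C": [0.1, 1, 10]}
search_cv = GridSearchCV(model, param_grid=param_grid, cv=2)
search_cv.fit(data, target)
print(search_cv.best_params_)
```

The same `"__"` naming convention even reaches inside the ColumnTransformer, e.g. a key like `"columntransformer__num__simpleimputer__strategy"` would tune the numerical imputer.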

Thanks glemaitre58,
After reading your reply carefully, I see my previous mistake. Sorry for the English in my questions about nested cross-validation; I was referring to the inner and outer steps, but you have already answered them. I should choose the inner and outer procedures with care, just as you explained to us in this module.

OK great 🙂