Pipeline or make_pipeline + GridSearchCV

Hi Teachers,
I have a few questions about Pipeline/make_pipeline. I understood your explanations, but in the examples there was only one step for each type of variable, such as StandardScaler for the numerical variables and an encoder for the categorical ones. What if I need more than one step, for example an imputer followed by a standard scaler for the numerical variables?
Pipeline seems more explicit for creating steps; I tried with make_pipeline without success.

Another related question: is it possible to put a GridSearchCV inside the pipeline?

When you have several steps, you pass them in the order in which they should be applied. For instance:

make_pipeline(StandardScaler(), SimpleImputer(), LogisticRegression())

Here, you specify to first apply the scaler, then the imputer and finally train/predict with a logistic regression.
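For the order asked about in the question (imputing first, then scaling), a minimal sketch could look like the following. The toy array is made up for illustration, and SimpleImputer is left at its default mean strategy.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value per column.
X = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0], [3.0, 20.0]])

# Steps are applied left to right: impute (mean by default), then scale.
numerical_preprocessor = make_pipeline(SimpleImputer(), StandardScaler())
X_transformed = numerical_preprocessor.fit_transform(X)

print(np.isnan(X_transformed).any())  # checks whether any missing value is left
print(X_transformed.mean(axis=0))     # column means after scaling
```

Swapping the two arguments of make_pipeline swaps the order in which the transformers run.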

The difference between Pipeline and make_pipeline is that Pipeline lets you choose the name of each step, while make_pipeline assigns the names automatically (it uses the lowercased class name). I used make_pipeline above; the equivalent with Pipeline is:

Pipeline([
    ("scaler", StandardScaler()),
    ("imputer", SimpleImputer()),
    ("classifier", LogisticRegression()),
])
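To see the naming difference concretely, one can inspect the `steps` attribute of both objects. This is only a small sketch; the variable names `auto_named` and `hand_named` are mine.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline derives each step name from the lowercased class name.
auto_named = make_pipeline(StandardScaler(), SimpleImputer(), LogisticRegression())

# Pipeline takes explicit (name, estimator) tuples.
hand_named = Pipeline([
    ("scaler", StandardScaler()),
    ("imputer", SimpleImputer()),
    ("classifier", LogisticRegression()),
])

auto_names = [name for name, _ in auto_named.steps]
hand_names = [name for name, _ in hand_named.steps]
print(auto_names)  # ['standardscaler', 'simpleimputer', 'logisticregression']
print(hand_names)  # ['scaler', 'imputer', 'classifier']
```

These names matter later because they are the prefixes used to address parameters in a grid search.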

Regarding the grid search, it is usually the opposite: you pass the pipeline as the base model of the grid search:

model = make_pipeline(StandardScaler(), LogisticRegression())
search_cv = GridSearchCV(model, param_grid={"logisticregression__C": [0.1, 1, 10]})
search_cv.fit(X, y)
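A self-contained version of the snippet above might look like this. The data come from make_classification and are only there to make the sketch runnable, and `cv=3` is an arbitrary choice of mine.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, only here so that the example runs on its own.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())

# Tunable parameters are addressed as "<step name>__<parameter>".
tunable = sorted(model.get_params().keys())
print([name for name in tunable if name.startswith("logisticregression__")][:3])

search_cv = GridSearchCV(
    model, param_grid={"logisticregression__C": [0.1, 1, 10]}, cv=3
)
search_cv.fit(X, y)
print(search_cv.best_params_)
```

Calling `model.get_params()` is a convenient way to find the exact key to use in `param_grid`.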

Hi glemaitre58,
Thanks for the reply. I got your explanation, but one question remains: can the steps you described be applied separately to numerical and categorical preprocessing? We did some exercises where we split the “steps” by type of variable. Can I apply the same idea there?

Thanks for your attention
Fabio C. Lima

In this case, you will use a ColumnTransformer, where the parameter transformers is a list of tuples, each of which can contain a Pipeline.

Thanks again for your time. I’ve been enjoying this MOOC very much; most of the explanations are clear and straightforward. Is it too much to ask for an example of make_pipeline with several steps for numerical and categorical features?


import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")

target_name = "class"
target = adult_census[target_name]

data = adult_census.drop(columns=[target_name])

from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

categorical_preprocessor = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=-1), OneHotEncoder(handle_unknown="ignore")
)
numerical_preprocessor = make_pipeline(
    StandardScaler(), SimpleImputer()
)

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard-scaler', numerical_preprocessor, numerical_columns)])

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
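Tying this back to the grid-search question: once pipelines are nested inside a ColumnTransformer, their parameters can still be tuned by chaining the names with `__`. The sketch below is an assumption-laden illustration, not the MOOC's own code: the small DataFrame is a made-up stand-in for the census data, the transformer names `"categorical"` and `"numerical"` are mine, and the numerical branch imputes before scaling.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data: two numerical columns and one categorical column.
rng = np.random.RandomState(0)
n_samples = 120
data = pd.DataFrame({
    "age": rng.randint(18, 80, size=n_samples).astype(float),
    "hours-per-week": rng.randint(10, 60, size=n_samples).astype(float),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp"], size=n_samples),
})
data.loc[::10, "age"] = np.nan  # inject a few missing values
target = (data["hours-per-week"] > 35).astype(int)

preprocessor = ColumnTransformer([
    ("categorical", make_pipeline(OneHotEncoder(handle_unknown="ignore")), ["workclass"]),
    ("numerical", make_pipeline(SimpleImputer(), StandardScaler()), ["age", "hours-per-week"]),
])
model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

# Key layout: outer step, transformer name, inner step, parameter.
param_grid = {
    "columntransformer__numerical__simpleimputer__strategy": ["mean", "median"],
    "logisticregression__C": [0.1, 1, 10],
}
search_cv = GridSearchCV(model, param_grid=param_grid, cv=3)
search_cv.fit(data, target)
print(search_cv.best_params_)
```

If a key is misspelled, `fit` raises an error about an invalid parameter, so `model.get_params().keys()` is worth checking first.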

Thanks glemaitre58,
After carefully reading your reply I understood my previous mistake. Sorry for my English in the questions related to nested cross-validation: I was referring to the inner and outer steps, but you already answered my questions. I should choose the inner and outer “process” with caution, in the same way you explained to us in this module.

OK great :slight_smile: