Data set

ldziej · 26 June 2021 10:52

Excuse me, I have some doubts about creating data tables and making pipelines.
For instance, could you please tell me the difference between:

data, target = fetch_california_housing(as_frame=True, return_X_y=True)

and

ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
target_name = "SalePrice"
data = ames_housing.drop(columns=target_name)
target = ames_housing[target_name]

Also, the difference between:
make_pipeline
and
Pipeline

And:
Is the CV in RidgeCV and others (i.e. GridSearchCV) separate from the cross-validation on the data set?
Thank you so much.

glemaitre58 · 27 June 2021 13:36

ldziej:

data, target = fetch_california_housing(as_frame=True, return_X_y=True)

and

ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
target_name = "SalePrice"
data = ames_housing.drop(columns=target_name)
target = ames_housing[target_name]

fetch_california_housing is a function implemented in scikit-learn that fetches the California housing dataset from the internet, but it is specialized for this dataset only. The parameter as_frame=True makes sure that pandas objects are returned, while return_X_y=True means that two variables, X and y, are returned: the data and the target, respectively. Otherwise, a dictionary-like object is returned that contains the data and target along with some additional metadata.
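To see what these two parameters change without downloading anything, here is a minimal sketch. It swaps in load_diabetes, a loader bundled with scikit-learn that accepts the same as_frame and return_X_y parameters; that substitution is my assumption for illustration, not the dataset from the question.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Stand-in for fetch_california_housing: load_diabetes ships with
# scikit-learn, so no internet access is needed (illustrative choice).

# With both flags set, we get a (data, target) tuple of pandas objects.
data, target = load_diabetes(as_frame=True, return_X_y=True)
print(type(data).__name__, type(target).__name__)

# Without return_X_y, we get a dictionary-like object holding the data,
# the target, and additional metadata such as the description.
bunch = load_diabetes(as_frame=True)
print(sorted(bunch.keys()))
```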

pd.read_csv will read any CSV file given its path. We then need to split the resulting dataframe ourselves to get the X and y variables.
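The manual split can be tried on a tiny in-memory CSV, as in the sketch below. The rows are toy values I made up so the example runs without the real house_prices.csv file, and the name houses is also just for illustration.

```python
import io

import pandas as pd

# Toy CSV standing in for a file on disk (invented values);
# "?" marks a missing entry, as in the na_values="?" call above.
csv_text = """LotArea,YearBuilt,SalePrice
8450,2003,208500
9600,?,181500
11250,2001,223500
"""

houses = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Split the single dataframe into the data (X) and the target (y).
target_name = "SalePrice"
data = houses.drop(columns=target_name)
target = houses[target_name]

print(data)
print(target)
```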

When using Pipeline, you define the steps with a specific name:

model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

So here, we define the names of the steps ("scaler" and "classifier"). With `make_pipeline`, we don't provide these names; they are generated automatically from the lowercased class names:

```python
model = make_pipeline(StandardScaler(), LogisticRegression())
```

In this case, the StandardScaler step will be called "standardscaler" and the LogisticRegression step will be called "logisticregression".
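A quick way to check the step names of both variants is to inspect the steps attribute. The sketch below also shows where the names come into play, namely when setting a parameter of a step with the name__parameter syntax; the variable names named and auto are only illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit step names with Pipeline.
named = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
# Automatically generated step names with make_pipeline.
auto = make_pipeline(StandardScaler(), LogisticRegression())

named_steps = [name for name, _ in named.steps]
auto_steps = [name for name, _ in auto.steps]
print(named_steps)  # ['scaler', 'classifier']
print(auto_steps)   # ['standardscaler', 'logisticregression']

# The step name is the prefix used to reach a parameter of that step.
named.set_params(classifier__C=0.1)
auto.set_params(logisticregression__C=0.1)
```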

For RidgeCV and GridSearchCV, the "CV" in the name means that an internal cross-validation takes place when calling fit, in order to select the hyperparameters. Passing these estimators to cross_validate therefore results in two levels of cross-validation: an inner cross-validation done by the model itself and an outer cross-validation done by cross_validate.
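Here is a minimal sketch of the two levels, assuming a synthetic regression problem from make_regression and an arbitrary grid of alphas (both are my choices for illustration). Each outer fold fits its own RidgeCV, which runs its internal cross-validation on that fold's training part and stores the selected alpha in alpha_.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_validate

# Synthetic data, only to keep the example self-contained.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Inner cross-validation: RidgeCV picks the best alpha during fit.
alphas = np.logspace(-3, 3, 7)
model = RidgeCV(alphas=alphas)

# Outer cross-validation: cross_validate evaluates the whole procedure.
cv_results = cross_validate(model, X, y, cv=5, return_estimator=True)

# One fitted RidgeCV per outer fold, each with its own selected alpha.
chosen_alphas = [est.alpha_ for est in cv_results["estimator"]]
print(cv_results["test_score"])
print(chosen_alphas)
```

The selected alpha may differ from one outer fold to the next, since each inner search only sees the training part of its fold.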