Wrap Up Quiz Error

Hello,
I am trying to complete the 01 wrap-up quiz.
I am currently on question 6, but I cannot work out why I am getting this error: ‘Found input variables with inconsistent numbers of samples: [24, 14, 60]’

This occurs when I try to split the numerical_features data and target data into train and test data, or when I try to run the cross-validation.

I cannot work out what I am missing or doing wrong in the code.
Please help.

Code:

import pandas as pd
ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
ames_housing = ames_housing.drop(columns="Id")

target_name = "SalePrice"
data = ames_housing.drop(columns=target_name)
target = ames_housing[target_name]
goal = (target > 200_000).astype(int)

numerical_features = [
  "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
  "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
  "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
  "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
  "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

data_train, data_test, target_train, target_test = train_test_split(
    numerical_features, target, random_state=42, test_size=0.25)

model = make_pipeline(
    StandardScaler(), SimpleImputer(strategy="mean"), LogisticRegression())
model

You are passing numerical_features, which is only the list of feature names to select, so train_test_split sees 24 names rather than the rows of the dataset. You need to pass the data restricted to the numerical features instead, meaning data_numerical = data[numerical_features]
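To make the difference concrete, here is a minimal sketch on a small made-up DataFrame (the values, the 20-row size, and the two-column feature list are assumptions for illustration, not the quiz data). It shows the list of names triggering the same kind of error, and the column selection fixing it:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the quiz data: all values below are made up.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "LotArea": rng.integers(5_000, 15_000, size=20),
    "GrLivArea": rng.integers(800, 2_500, size=20),
    "Street": ["Pave"] * 20,  # a non-numerical column we want to leave out
})
target = pd.Series(rng.integers(100_000, 300_000, size=20), name="SalePrice")

numerical_features = ["LotArea", "GrLivArea"]  # a list of column *names*

# Passing the list of names: its length (2) does not match len(target) (20).
try:
    train_test_split(numerical_features, target, random_state=42)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Selecting the columns first gives a DataFrame with one row per sample.
data_numerical = data[numerical_features]
data_train, data_test, target_train, target_test = train_test_split(
    data_numerical, target, random_state=42, test_size=0.25
)
print(data_train.shape, data_test.shape)
```

In the quiz notebook the same selection applies to the full list of 24 feature names, with data being the DataFrame loaded from the CSV file.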

Ahh ok, thank you.
Feel silly now but I am new to coding haha :slight_smile:

I’ve hit another problem… the cross-validation does run, but I get this warning:

‘/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_split.py:666: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
warnings.warn((“The least populated class in y has only %d”.’

when running the code:
from sklearn.model_selection import cross_validate

cv_result = cross_validate(model, data_numerical, target, cv=5)
cv_result

What does this warning mean, and how would I rectify it?

The quiz binarizes the original target, turning the regression problem into a classification problem. By passing the variable target to cross_validate, you are still using the original continuous sale prices, so the classifier treats each distinct price as its own class, and some prices occur only once, which is what the warning reports. I think that you should use goal instead of target in your cross_validate call.
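As a rough sketch of that change, the example below uses a small synthetic dataset (the 100 rows, the two feature columns, and the price formula are all assumptions made up for illustration) with the same pipeline and the same 200_000 threshold as in the thread. It first checks how often the rarest price occurs in target, then cross-validates against goal:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical features (made-up values).
n_samples = 100
living_area = np.linspace(800, 2_500, n_samples)
lot_area = np.linspace(5_000, 15_000, n_samples)
lot_area[::10] = np.nan  # a few missing values for the imputer to fill
data_numerical = pd.DataFrame({"GrLivArea": living_area, "LotArea": lot_area})

# Continuous prices (regression target) and the binarized version.
target = pd.Series(100 * living_area + 35_000, name="SalePrice")
goal = (target > 200_000).astype(int)

# Every price is distinct here, so the rarest "class" in target has 1 member.
print(target.value_counts().min())
# goal only has two classes, each with many members.
print(goal.value_counts())

model = make_pipeline(
    StandardScaler(), SimpleImputer(strategy="mean"), LogisticRegression()
)
cv_result = cross_validate(model, data_numerical, goal, cv=5)
print(cv_result["test_score"])
```

In the quiz notebook, only the last call needs to change: pass goal instead of target as the third argument to cross_validate.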