M1 wrap up Quiz Q6 program error

fhi62 · 24 May 2021 17:02

Hi,
I got an error running my program.
Can somebody help? I suppose I cannot post my code here.

"The mean cross-validation accuracy is: nan +/- nan
/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:615: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)…"

glemaitre58 · 24 May 2021 17:09

Can you provide the full snippet of code to understand why it fails?
Also add error_score="raise" in cross_validate(...) or cross_val_score(...) to get the full traceback and provide this information as well.
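As a minimal sketch of that suggestion: the toy data and the column names "a" and "b" below are made up for illustration, and the pipeline is broken on purpose. With error_score="raise", cross_validate should stop at the first failing fit and surface the underlying exception instead of replacing the score with nan.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real dataset (hypothetical column names).
data = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
                     "b": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]})
target = pd.Series([0, 1, 0, 1, 0, 1])

# Deliberately broken: a DataFrame is passed where column names are expected.
bad_columns = data[["a"]]
model = make_pipeline(
    ColumnTransformer([("scale", StandardScaler(), bad_columns)]),
    LogisticRegression(),
)

error_message = None
try:
    # error_score="raise" re-raises the fit error instead of scoring nan.
    cross_validate(model, data, target, cv=2, error_score="raise")
except Exception as exc:
    error_message = str(exc)
    print(type(exc).__name__, error_message)
```

Without error_score="raise", the same call only emits a FitFailedWarning per fold and reports nan scores, which makes the root cause harder to spot.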

fhi62 · 24 May 2021 17:14

Not sure that was allowed:

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate


ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
ames_housing = ames_housing.drop(columns="Id")

target_name = "SalePrice"
data, target = ames_housing.drop(columns=target_name), ames_housing[target_name]
target = (target > 200_000).astype(int)

# select the numerical columns
numerical_columns = ["LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
  "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
  "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
  "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
  "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal"]

data_numeric = data[numerical_columns]
# select the categorical columns

categorical_columns = data.drop(columns=["LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF", "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces", "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal"])

#prepare numeric data replacing empty by most frequent

#Imp_most=SimpleImputer(missing_values=np.nan, strategy='most_frequent')
#data_numeric=Imp_most.fit_transform(data_numeric)
scaler_imputer_transformer = make_pipeline(StandardScaler(), SimpleImputer(missing_values=np.nan, strategy='most_frequent'))
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")

#preprocessor = ColumnTransformer([('one-hot-encoder', categorical_preprocessor, categorical_columns),('standard-scaler', numerical_preprocessor, numerical_columns)])
preprocessor = ColumnTransformer([('one-hot-encoder', categorical_preprocessor, categorical_columns),('standard-scaler', scaler_imputer_transformer, numerical_columns)])
# split into train & test sets (useful?)

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42)

#treat with logistic regression

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
#test phase before calculating details


from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42)



# do crossvalidation
cv_results = cross_validate(model, data, target, cv=5)
scores = cv_results["test_score"]
print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

glemaitre58 · 24 May 2021 17:17

Can you provide the full traceback as well.

fhi62 · 24 May 2021 17:18

you are super fast

The mean cross-validation accuracy is: nan +/- nan
/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:615: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/opt/conda/lib/python3.9/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py", line 505, in fit_transform
    self._validate_remainder(X)
  File "/opt/conda/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py", line 324, in _validate_remainder
    self._has_str_cols = any(_determine_key_type(cols) == 'str'
  File "/opt/conda/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py", line 324, in <genexpr>
    self._has_str_cols = any(_determine_key_type(cols) == 'str'
  File "/opt/conda/lib/python3.9/site-packages/sklearn/utils/__init__.py", line 268, in _determine_key_type
    raise ValueError(err_msg)
ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed

  warnings.warn("Estimator fit failed. The score on this train-test"

glemaitre58 · 24 May 2021 17:25

The error is the following:

ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed

It is raised by the ColumnTransformer. It means that the column specification you provided is wrong. Looking at your code, we have:

numerical_columns = ["LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
  "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
  "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
  "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
  "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal"]

categorical_columns = data.drop(columns=["LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1",
 "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF", "GrLivArea", 
"BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces", "GarageCars", "GarageArea",
 "WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch", "PoolArea", 
"MiscVal"])

While numerical_columns is a list of column names, categorical_columns is not: data.drop(columns=[...]) returns a dataframe, not the names of the categorical columns.

I would suggest the solution proposed there: How did you build the non-numerical columns? - #9 by aigle81
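For illustration only, and not necessarily what the linked post does, here is a small sketch of two ways to end up with a list of names rather than a dataframe. The toy frame and the "Street" and "LotShape" columns are assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame standing in for the housing data (values are made up).
data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "GrLivArea": [1710, 1262, 1786],
    "Street": ["Pave", "Pave", "Grvl"],
    "LotShape": ["Reg", "Reg", "IR1"],
})
numerical_columns = ["LotArea", "GrLivArea"]

# What the original code did: this is a DataFrame, not a list of names.
dropped = data.drop(columns=numerical_columns)
print(type(dropped).__name__)  # DataFrame

# Option 1: take the column names of that DataFrame.
categorical_columns = dropped.columns.tolist()

# Option 2: select columns by dtype instead of listing names by hand.
selected = make_column_selector(dtype_exclude="number")(data)

print(categorical_columns)
print(selected)
```

Either list can then be passed to the ColumnTransformer in place of the dataframe.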

fhi62 · 24 May 2021 17:48

Thanks a lot, I will adjust.
I also changed the sequence of operations to obtain this:

Is it OK?

glemaitre58 · 24 May 2021 17:55

It looks meaningful

fhi62 · 24 May 2021 18:03

Thanks a lot. I just realized this is tougher than I thought, but going deeper brings a lot of value.

nanfuka · 2 July 2021 05:44

nanfuka · 2 July 2021 05:45

I keep getting a similar error:
"No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed"

ThomasLoock · 2 July 2021 11:10

Hi,
there's an error in your construction of the preprocessor.
You use "data_numeric" but you meant to use "numerical_features", didn't you?

nanfuka · 3 July 2021 05:34

I changed numeric_features to data_numeric because that was the label I had used earlier for the numerical columns.

glemaitre58 · 5 July 2021 09:10

@nanfuka you need to pass the name of the columns and not the filtered data. To be explicit:

data = pd.read_csv(...)
categorical_columns = ["Col_A", "Col_B"]
data_categorical = data[categorical_columns]

You need to provide categorical_columns to the ColumnTransformer, not the data itself (data_categorical).
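A runnable sketch of the same idea, assuming a small made-up dataframe in place of the CSV and a hypothetical numerical column "Col_N" added for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data in place of pd.read_csv(...).
data = pd.DataFrame({
    "Col_A": ["x", "y", "x", "z"],
    "Col_B": ["u", "u", "v", "v"],
    "Col_N": [1.0, 2.0, 3.0, 4.0],  # hypothetical numerical column
})
categorical_columns = ["Col_A", "Col_B"]
numerical_columns = ["Col_N"]

# Pass the lists of names; the ColumnTransformer selects the data itself.
preprocessor = ColumnTransformer([
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"),
     categorical_columns),
    ("standard-scaler", StandardScaler(), numerical_columns),
])

transformed = preprocessor.fit_transform(data)
print(transformed.shape)  # (4, 6): 3 + 2 one-hot columns plus 1 scaled column
```

Passing data[categorical_columns] in place of categorical_columns here would reproduce the "No valid specification of the columns" error.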

nanfuka · 10 July 2021 12:40

Noted