Q6 Help ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Hello, I suppose I have a problem with the categorical data, but:

  • How can I debug it? What data are involved?

  • I think I have a problem with the replacement of missing values, but I don't know where.

Thank you for your help.

My code is here:

# ------------- Processing of categorical data
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
data_categorical = data[categorical_columns]

# replacement of missing values

imp_freq = SimpleImputer(strategy="most_frequent")
imp_freq.fit(data_categorical)

# Replace missing values in the original data frame for the categorical columns
data[categorical_columns] = imp_freq.transform(data_categorical)

# ----------- Processing of numerical data
numerical_features = [
  "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
  "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
  "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
  "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
  "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]
numerical_preprocessor = StandardScaler()
data_numerical = data[numerical_features]

# building pipeline and cross validation
preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard-scaler', numerical_preprocessor, numerical_features)])
model_global = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
cv_results = cross_validate(model_global, data, target, cv=5)
print(cv_results)
scores = cv_results["test_score"]
print(f"The mean global cross-validation accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")

This is not replacing the missing values; it is only learning which value is the most frequent. In a ColumnTransformer, you can pass a transformer that is a scikit-learn Pipeline rather than a single transformer.

For instance you can define:

numerical_preprocessor = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy="mean"),
)

and pass it to the ColumnTransformer as you did. The values will be scaled (ignoring missing values) and then imputed. This processing will be applied only to the numerical_features.
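For reference, a minimal sketch of how such a pipeline plugs into the ColumnTransformer, reusing the variable names from the code above (the plain OneHotEncoder used as categorical_preprocessor here is only an assumption, since it is not defined in the posted snippet):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numerical branch: scale the values, then impute the remaining NaN with the mean
numerical_preprocessor = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy="mean"),
)

# categorical branch: assumed here to be a plain one-hot encoder
categorical_preprocessor = OneHotEncoder()

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard-scaler', numerical_preprocessor, numerical_features)])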

Thank you for your answer. I tried this for the categorical data, but now I have this problem:

“ValueError: Input contains NaN, infinity or a value too large for dtype('float64')”.

An extract of the code is:

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder())])

numerical_preprocessor = StandardScaler()

# ---------------------building pipeline and cross validation

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_transformer, categorical_columns),
    ('standard-scaler', numerical_preprocessor, numerical_features)])

model_global = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
cv_results = cross_validate(model_global, data, target, cv=5)

It means that you still have some missing values.

You are not handling missing data for the numerical features. I would double-check whether there are missing values in those numerical columns, as sketched below. I think we gave the following instruction for the numerical data:

Now create a predictive model that uses these numerical columns as input data. Your predictive model should be a pipeline composed of a standard scaler, a mean imputer (cf. sklearn.impute.SimpleImputer(strategy="mean")) and a sklearn.linear_model.LogisticRegression.
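For illustration, a small sketch of how to check those columns for missing values and build the pipeline described in the instruction (model_numerical is just a name chosen here; the other names come from the code above):

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# count the missing values per numerical column to see where the NaN come from
print(data[numerical_features].isna().sum())

# standard scaler + mean imputer + logistic regression, as in the instruction
model_numerical = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy="mean"),
    LogisticRegression(max_iter=500),  # max_iter increased as in the original code
)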

Hi! I had exactly the same error as you (after using the same code as you). After an hour of unsuccessful searching I gave up and looked up the answer. It turns out that changing the way I defined categorical_columns solved the problem.
For some reason that I still don't understand, defining the categorical columns using selector doesn't work (it also gives you fewer columns than the method used in the solutions).
The instructions said we should use all features that were not in numerical_features, so you have to take all the columns of data minus the numerical_features, as in the sketch below.
Too bad we didn't see the method used in the solution in the course. I ended up losing a lot of time thinking that my pipeline was wrong…
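In code, the approach described above amounts to something like this sketch (data and numerical_features are the variables from the original post):

# keep every column of data that is not in the numerical feature list
categorical_columns = [
    col for col in data.columns if col not in numerical_features
]
data_categorical = data[categorical_columns]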


I'm sorry, but I can't get it to work. I followed your advice; my pipeline now has a first branch for the categorical features and a second one for the numerical features, but I get an error on each run of the cross-validation:

ValueError: Found unknown categories ['Metal', 'Membran'] in column 14 during transform

It means that you encounter categories during testing that you did not encounter during training. One strategy is to encode these categories with only zeros. To achieve this behaviour, you need to create the one-hot encoder with the parameter OneHotEncoder(handle_unknown="ignore").
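A minimal sketch of the categorical transformer with that parameter, based on the pipeline posted earlier in the thread:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # encode categories unseen during fit as all-zero rows instead of raising an error
    ('encoder', OneHotEncoder(handle_unknown="ignore"))])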

This parameter was presented in the notebook “Encoding of categorical variables” in the section “Evaluate our predictive pipeline”


Hi Olga, I am reassured to know that I am not the only one sweating over this exercise… I am still looking for the solution… that's how we learn…

EXCELLENT, it works!!! Thank you Guillaume.