Size error on question 12

MDer · 23 March 2022 21:35

Unfortunately, I could not plot the coef_ magnitudes because the number of feature names did not match the number of coefficients in coef_. I can’t find my mistake. Here is my code:

numerical_selector = selector(dtype_exclude=object)
categorical_selector = selector(dtype_exclude=object)
preprocessor=ColumnTransformer([
    ('onehotencoder',OneHotEncoder(handle_unknown="ignore"),categorical_selector(data)),
    ('standarscaler',StandardScaler(),numerical_selector(data))
])
model=make_pipeline(preprocessor,LogisticRegression(max_iter=1000))

cv_result=cross_validate(model,data,target,cv=10,return_estimator=True)

categorical_columns=categorical_selector(data)
numerical_columns=numerical_selector(data)
preprocessor.fit(data)
feature_names = (preprocessor.named_transformers_["onehotencoder"]
                             .get_feature_names_out(categorical_columns)).tolist()
feature_names += numerical_columns

coefs = [est[-1].coef_ for est in cv_result["estimator"]]

weights = pd.DataFrame(coefs[0], columns=feature_names)

Here I got an error :

ValueError: Shape of passed values is (1, 394), indices imply (1, 396)

and I checked :
coefs[0].shape => (1, 394)
len(feature_names) => 396

What’s wrong ???

Thanks for any help !
Marc

glemaitre58 · 24 March 2022 09:47

The second line is not correct: the categorical selector should use dtype_include=object, so that the object columns are treated as categorical variables.
As written, you are one-hot encoding the numerical values, and the number of categories (which is not meaningful here) differs between splits.
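To make the fix concrete, here is a minimal sketch on a made-up three-row frame (the column names and values are placeholders, not the quiz data). It only shows that the two selectors in the original post return the same columns, and that switching the categorical one to dtype_include=object separates them.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame standing in for the quiz data.
data = pd.DataFrame({
    "age": [25, 32, 47],
    "hours": [40.0, 35.5, 50.0],
    "workclass": pd.Series(["Private", "State-gov", "Private"], dtype=object),
})

# Selector from the original post: it excludes the object columns,
# so it returns the numerical columns, not the categorical ones.
buggy_categorical_columns = selector(dtype_exclude=object)(data)

# Corrected pair: exclude object for numerical, include object for categorical.
numerical_columns = selector(dtype_exclude=object)(data)
categorical_columns = selector(dtype_include=object)(data)

print(buggy_categorical_columns)
print(numerical_columns)
print(categorical_columns)
```

With the buggy pair, the one-hot encoder receives the numerical columns, so every distinct number becomes its own category.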

ArturoAmorQ · 24 March 2022 09:51

Here you seem to be using the same dtype_exclude to select the numerical and categorical variables in your first two lines:

numerical_selector = selector(dtype_exclude=object)
categorical_selector = selector(dtype_exclude=object)

Also, you don’t need to pass the data when defining the preprocessor:

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    (StandardScaler(), numerical_columns),
)

ArturoAmorQ · 24 March 2022 09:51

Oh! I just noticed @glemaitre58 just provided his answer as well, sorry for the duplicate.

MDer · 25 March 2022 16:21

Thanks a lot!
I didn’t notice that error; I assumed it was a shape problem and did not go back to the pipeline, since it raised no error.

Thanks !
Marc

pritamdodeja · 4 April 2022 16:00

Hello, for one of the cross validation runs, I’m getting a shape of (105,) for the coefficients, which is messing things up for me. Here is the code I’ve written:


import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_selector = selector(dtype_include=object)
numerical_selector = selector(dtype_include=np.number)
numerical_columns = numerical_selector(data)
categorical_columns = categorical_selector(data)

numerical_preprocessor = StandardScaler()
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
preprocessor = ColumnTransformer(
    [
        ("numerical_preprocessor", numerical_preprocessor, numerical_columns),
        ("categorical_preprocessor", categorical_preprocessor, categorical_columns),
    ],
    n_jobs=-1,
)
categorical_and_numerical_model = make_pipeline(
    preprocessor, LogisticRegression(max_iter=500)
)
linear_cross_validate_results_all = cross_validate(
    estimator=categorical_and_numerical_model,
    X=data,
    y=target,
    cv=10,
    n_jobs=-1,
    return_estimator=True,
    error_score="raise",
)
list_of_estimators_all = [
    pipeline[-1] for pipeline in linear_cross_validate_results_all["estimator"]
]
list_of_coefficients_new = [estimator.coef_[0] for estimator in list_of_estimators_all]
list_of_coefficients_new[7].shape

ArturoAmorQ · 5 April 2022 09:16

If you run the following line of code after the snippet you provided:

pd.DataFrame(list_of_coefficients_new).shape

you will get the correct shape (10, 106). The reason is that pandas pads the shorter coefficient vector with one NaN, which you can confirm with:

pd.DataFrame(list_of_coefficients_new).isna().sum()

This should not be a problem when plotting or evaluating the cross-validated scores to answer the rest of the quiz. In the next scikit-learn release, there will be a dedicated strategy for these infrequent categories, enabled by setting handle_unknown="infrequent_if_exist", as mentioned in this forum comment.