Extracting cross_validate scoring metric & order of ColumnTransformer

(1) How do I extract the metric of cross_validate?

I am new to Python, and throughout the assignments I have been struggling to dig through object structures to find the data I need. dir() is not as informative as str() in R, which is what I am used to.

When I was doing Q4, I realised I got a different answer when I set scoring = 'r2' than when I didn't specify it, but I don't know what default is used.

(2) Why does the order of ColumnTransformer affect results?

I realised by chance that I couldn't get 0.74 because I had the order of the numerical and categorical transformers switched. I am not sure, but I suppose this affected the results because when it is numerical then categorical, the numerical_transformer is only applied to the numerical columns, as opposed to all columns when the categorical columns are transformed/encoded first?

preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', categorical_transformer, categorical_features),
        ('numerical', numerical_transformer, numerical_features)
    ])

Thanks!!

If the predictor is a classifier, accuracy is used by default; if the predictor is a regressor, the R² score is used (cross_validate falls back on the estimator's own score method). It is something you have to know in scikit-learn, because there is indeed no way to extract this information from the results. However, you can always pass the metric to the scoring parameter explicitly to be sure.
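Here is a small sketch (toy data, not the assignment's dataset) showing that, for a regressor, passing scoring="r2" reproduces the default behaviour:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, random_state=0)
model = DecisionTreeRegressor(random_state=0)

# default scoring: the regressor's own .score method, i.e. R²
default_scores = cross_validate(model, X, y, cv=5)["test_score"]
# explicit scoring: the same metric requested by name
r2_scores = cross_validate(model, X, y, cv=5, scoring="r2")["test_score"]

print(np.allclose(default_scores, r2_scores))  # True: same metric either way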

I would need to see the last predictor, but this is most probably linked to randomness.
Some estimators use randomness during the learning phase. Sometimes we set random_state to be sure that we get deterministic behaviour. However, changing the order of the processing gives rise to a different dataframe (the same data, but with the columns in a different order), and thus the randomness will play out differently even if the seed has been fixed.
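To illustrate, here is a small sketch with hypothetical toy data (not the course dataset): two identical columns guarantee exactly tied candidate splits, and which one the tree picks depends on the seeded random ordering over column positions.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
a = rng.normal(size=200)
# "a" and "a_copy" are identical, so their candidate splits tie exactly
X = pd.DataFrame({"a": a, "a_copy": a, "noise": rng.normal(size=200)})
y = a + 0.1 * rng.normal(size=200)

tree_1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
tree_2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(
    X[["noise", "a_copy", "a"]], y)

# The arrays of feature indices used at each node (-2 marks a leaf) may
# differ: the seeded search visits columns by position, so reordering the
# columns changes which of the tied splits wins. On real data with many
# tied splits, this can propagate into slightly different trees and scores.
print(tree_1.tree_.feature)
print(tree_2.tree_.feature)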

Here is my full code, which gives a test score of 0.720 +/- 0.087. Swapping the order of the transformers (see the commented-out block below) gives the anticipated answer of 0.742.

from sklearn.tree import DecisionTreeRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.compose import ColumnTransformer, make_column_selector as selector

categorical_features = selector(dtype_include=object)(data)
numerical_features = selector(dtype_exclude=object)(data)

numerical_transformer = SimpleImputer()

categorical_transformer = make_pipeline(SimpleImputer(strategy="constant", fill_value="missing"),
                                        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))

preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', numerical_transformer, numerical_features),
        ('categorical', categorical_transformer, categorical_features)
    ])

# transformers=[
#     ('categorical', categorical_transformer, categorical_features),
#     ('numerical', numerical_transformer, numerical_features)
# ])

preprocessor

# final pipeline
tree = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))

# CV fit
cv_results = cross_validate(tree,
                            data, target,
                            cv=10,
                            # scoring='r2',
                            return_train_score=True,
                            return_estimator=True,
                            n_jobs=2)

print(f"Tree regressor train score: {cv_results['train_score'].mean():.3f} +/- {cv_results['train_score'].std():.3f}")
print(f"Tree regressor test score: {cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")

I didn’t know that changing the order of the columns would affect the results even though the data are the same! Do all models pick a ‘starting point’ and go through the columns arbitrarily, from left to right in a dataframe, or in some other more targeted fashion?

And is there a function in sklearn similar to bake() or juice() in tidymodels to view the transformed dataset?

Thanks!

Decision trees will explore features in a random fashion. This will not be the case with some other models, such as logistic regression with a non-stochastic solver like LBFGS, for instance.
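For contrast, here is a quick sketch (toy data) of the deterministic case: with a non-stochastic solver such as LBFGS, logistic regression should recover the same coefficients regardless of column order, up to the matching permutation.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])

clf_1 = LogisticRegression(solver="lbfgs").fit(X, y)
clf_2 = LogisticRegression(solver="lbfgs").fit(X[["f3", "f2", "f1", "f0"]], y)

# Coefficients should match once realigned to the original column order
print(np.allclose(clf_1.coef_, clf_2.coef_[:, ::-1]))  # expected: True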


You can use your preprocessor with fit and transform (I am not 100% sure, because I never used tidymodels, but it looks like what it does).
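For example, a minimal sketch reusing the preprocessor and data defined above: fit() plays the role of prep(), and transform() the role of bake()/juice().

import pandas as pd

transformed = preprocessor.fit_transform(data)

# get_feature_names_out() maps the columns back to readable names; it needs
# a recent scikit-learn version (1.1+ for pipelines containing SimpleImputer)
feature_names = preprocessor.get_feature_names_out()
print(pd.DataFrame(transformed, columns=feature_names).head())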
