Extracting cross_validate scoring metric & order of ColumnTransformer

(1) How do I extract the metric of cross_validate?

I am new to Python, and throughout the assignments I have been struggling to dig through object structures to find the data I need. dir() is not as informative as str() in R, which is what I am used to.

When I was doing Q4, I realised I got a different answer when I set scoring = 'r2' than when I didn't specify it, but I don't know what default is used.

(2) Why does the order of ColumnTransformer affect results?

I realised by chance that I couldn't get 0.74 because I had the order of the numerical and categorical transformers switched. I am not sure, but I suppose this affected the results because when it is numerical then categorical, the numerical_transformer is only applied to the numerical columns, as opposed to all columns when the categorical columns are transformed/encoded first?

preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', categorical_transformer, categorical_features),
        ('numerical', numerical_transformer, numerical_features)
    ])

Thanks!!

If the predictor is a classifier, accuracy is used by default; if the predictor is a regressor, the R² score is used (cross_validate falls back on the estimator's own score method). It is something you have to know in scikit-learn, because there is indeed no way to extract this information from the results. However, you can always pass the metric to the scoring parameter explicitly to be sure.
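Here is a small sketch (toy data, not the assignment's dataset) showing that, for a regressor, passing scoring="r2" reproduces the default behaviour:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, random_state=0)
model = DecisionTreeRegressor(random_state=0)

# default scoring: the regressor's own .score method, i.e. R²
default_scores = cross_validate(model, X, y, cv=5)["test_score"]
# explicit scoring: the same metric requested by name
r2_scores = cross_validate(model, X, y, cv=5, scoring="r2")["test_score"]

print(np.allclose(default_scores, r2_scores))  # True: same metric either way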

I would need to see the last predictor, but this is most probably linked to randomness.
Some estimators use randomness during the learning phase. Sometimes we set random_state to be sure that we get deterministic behaviour. However, changing the order of the processing gives rise to a different dataframe (the same data, but with the columns in a different order), and thus the randomness will play out differently even if the seed has been fixed.
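To illustrate, here is a small sketch with hypothetical toy data (not the course dataset): two identical columns guarantee exactly tied candidate splits, and which one the tree picks depends on the seeded random ordering over column positions.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
a = rng.normal(size=200)
# "a" and "a_copy" are identical, so their candidate splits tie exactly
X = pd.DataFrame({"a": a, "a_copy": a, "noise": rng.normal(size=200)})
y = a + 0.1 * rng.normal(size=200)

tree_1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
tree_2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(
    X[["noise", "a_copy", "a"]], y)

# The arrays of feature indices used at each node (-2 marks a leaf) may
# differ: the seeded search visits columns by position, so reordering the
# columns changes which of the tied splits wins. On real data with many
# tied splits, this can propagate into slightly different trees and scores.
print(tree_1.tree_.feature)
print(tree_2.tree_.feature)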

Here is my full code, which gives a test score of 0.720 +/- 0.087. Swapping the order of the transformers (see the commented-out block below) gives the anticipated answer of 0.742.

from sklearn.tree import DecisionTreeRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.compose import ColumnTransformer, make_column_selector as selector

categorical_features = selector(dtype_include=object)(data)
numerical_features = selector(dtype_exclude=object)(data)

numerical_transformer = SimpleImputer()

categorical_transformer = make_pipeline(SimpleImputer(strategy="constant", fill_value="missing"),
                                        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))

preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', numerical_transformer, numerical_features),
        ('categorical', categorical_transformer, categorical_features)
    ])

# transformers=[
#     ('categorical', categorical_transformer, categorical_features),
#     ('numerical', numerical_transformer, numerical_features)
# ])

preprocessor

# final pipeline
tree = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=0))

# CV fit
cv_results = cross_validate(tree,
                            data, target,
                            cv=10,
                            # scoring='r2',
                            return_train_score=True,
                            return_estimator=True,
                            n_jobs=2)

print(f"Tree regressor train score: {cv_results['train_score'].mean():.3f} +/- {cv_results['train_score'].std():.3f}")
print(f"Tree regressor test score: {cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")

I didn’t know that changing the order of the columns would affect the results even though the data are the same! Do all models pick a ‘starting point’ and go through the columns arbitrarily, from left to right in a dataframe, or in some other more targeted fashion?

And is there a function in sklearn similar to bake() or juice() in tidymodels to view the transformed dataset?

Thanks!

Decision trees will explore features in a random fashion. This will not be the case with some other models, such as logistic regression with a non-stochastic solver like LBFGS, for instance.
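For contrast, here is a quick sketch (toy data) of the deterministic case: with a non-stochastic solver such as LBFGS, logistic regression should recover the same coefficients regardless of column order, up to the matching permutation.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])

clf_1 = LogisticRegression(solver="lbfgs").fit(X, y)
clf_2 = LogisticRegression(solver="lbfgs").fit(X[["f3", "f2", "f1", "f0"]], y)

# Coefficients should match once realigned to the original column order
print(np.allclose(clf_1.coef_, clf_2.coef_[:, ::-1]))  # expected: True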


You can use your preprocessor with fit and transform (I am not 100% sure, because I never used tidymodels, but it looks like what it does).
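For example, a minimal sketch reusing the preprocessor and data defined above: fit() plays the role of prep(), and transform() the role of bake()/juice().

import pandas as pd

transformed = preprocessor.fit_transform(data)

# get_feature_names_out() maps the columns back to readable names; it needs
# a recent scikit-learn version (1.1+ for pipelines containing SimpleImputer)
feature_names = preprocessor.get_feature_names_out()
print(pd.DataFrame(transformed, columns=feature_names).head())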
