How to work with the dictionary from cross_validate

Hi,
I tried to make a parallel plot to inspect the effect of the hyperparameter values, as an exercise from the notebook “Evaluation and hyperparameter tuning”, just after the command cv_results = cross_validate(model_grid_search, ...).
However, when I use the command cv_results[column_results] as in the previous notebook, I get the error message: unhashable type: ‘list’.
If I try to extract the parameter values using cv_results['estimator']['classifier__learning_rate'], I get the error message: list indices must be integers or slices, not str.
Any help handling the dictionary obtained after cv_results = cross_validate(model_grid_search, ...) would be welcome.
Thanks
Stéphane

Dear pedagogical team, since this is the last day the forum is open, I was hoping you would be willing to answer the above question before it closes. I take this opportunity to thank you all for this enlightening MOOC. Best regards. Stéphane

If you run a cell with just

cv_results['estimator']

you will get a list whose length equals the number of folds of the outer cross-validation. A list can be accessed with integer indices, for instance

cv_results['estimator'][0]

will output the estimator from the first fold of the outer cross-validation, in this case a GridSearchCV. Then you can access all the attributes and methods of that estimator directly, for example:

cv_results['estimator'][0].best_params_

outputs the best parameters found by the GridSearchCV in the first fold of the outer cross-validation. This is a dictionary, which can now be accessed with the respective keys:

cv_results['estimator'][0].best_params_['classifier__learning_rate']

Notice that instead of manually iterating through folds, the last cell of the “Evaluation and hyperparameter tuning” notebook uses a for loop:

for cv_fold, estimator_in_fold in enumerate(cv_results["estimator"]):
    print(
        f"Best hyperparameters for fold #{cv_fold + 1}:\n"
        f"{estimator_in_fold.best_params_}"
    )
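
As a side note, all of the above assumes that cross_validate was called with return_estimator=True; otherwise the "estimator" key is not present in cv_results. A minimal sketch of that setup (the pipeline, parameter grid and variable names are assumptions roughly matching the notebook) would be:

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import Pipeline

# Sketch only: the notebook also includes a preprocessing step, and `data` and
# `target` are assumed to be already loaded.
model = Pipeline([("classifier", HistGradientBoostingClassifier())])
param_grid = {
    "classifier__learning_rate": (0.01, 0.1, 1),
    "classifier__max_leaf_nodes": (3, 10, 30),
}
model_grid_search = GridSearchCV(model, param_grid=param_grid, cv=4)

# return_estimator=True is what makes cv_results["estimator"] available
cv_results = cross_validate(
    model_grid_search, data, target, cv=5, return_estimator=True
)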

I hope that answers the question.

Thank you, it is very helpful. Just one last thing: I’d like to visualize all the results from the grid search using a parallel plot. In other words, I am not interested in the best parameters only, but in all the test scores from all the combinations of parameters mapped by the grid search. How can I do that?

One way to do it is

import plotly.express as px
from collections import defaultdict

cv_results_to_plot = defaultdict(list)

for estimator in cv_results["estimator"]:
    # each estimator is the GridSearchCV fitted in one outer fold; its
    # cv_results_ holds the inner mean test score of each parameter combination
    inner_cv_results = estimator.cv_results_
    for param_idx, params in enumerate(inner_cv_results["params"]):
        cv_results_to_plot["learning_rate"].append(params["classifier__learning_rate"])
        cv_results_to_plot["max_leaf_nodes"].append(params["classifier__max_leaf_nodes"])
        cv_results_to_plot["mean_test_score"].append(inner_cv_results["mean_test_score"][param_idx])

fig = px.parallel_coordinates(
    cv_results_to_plot,
    color="mean_test_score",
    dimensions=["learning_rate", "max_leaf_nodes", "mean_test_score"],
    color_continuous_scale=px.colors.diverging.Tealrose,
)
fig.show()
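
If you prefer working with dataframes, a roughly equivalent sketch (assuming the same cv_results as above) builds the table with pandas, since each estimator.cv_results_ is a dict of arrays with one "param_<name>" column per hyperparameter:

import pandas as pd

frames = []
for estimator in cv_results["estimator"]:
    inner_cv_results = pd.DataFrame(estimator.cv_results_)
    frames.append(
        inner_cv_results[
            [
                "param_classifier__learning_rate",
                "param_classifier__max_leaf_nodes",
                "mean_test_score",
            ]
        ]
    )
all_inner_results = pd.concat(frames, ignore_index=True)
# all_inner_results can then be passed to px.parallel_coordinates as above,
# using the "param_..." column names as dimensions.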

Many thanks, you have been very helpful.
Cheers
Stéphane

You could also access the test scores of each inner cross-validation split of the grid search as follows:

for cv_fold, estimator_in_fold in enumerate(cv_results["estimator"]):
    print(f"Outer CV fold {cv_fold}")
    # n_splits_ is the number of inner cross-validation splits of the grid search
    for i in range(estimator_in_fold.n_splits_):
        split_key = f"split{i}_test_score"
        print(f"{estimator_in_fold.cv_results_[split_key]}")

As mentioned in the notebooks, best_params_ could be a different combination of hyperparameters in each outer CV fold. In that case, we can deploy all the models/estimators found by the outer cross-validation loop and make them vote to get the final predictions, which makes sense too.
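
A hypothetical sketch of that voting idea (X_new stands for new samples to predict and is not defined in this thread) could be:

import pandas as pd

# Each GridSearchCV was refit on the training data of its outer fold, so every
# one of them can predict on the new samples.
predictions = pd.DataFrame(
    {cv_fold: estimator.predict(X_new)
     for cv_fold, estimator in enumerate(cv_results["estimator"])}
)
# Majority vote: the most frequent predicted class across the estimators.
final_predictions = predictions.mode(axis=1)[0]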