Doing the exercise with nested cross-validation

I tried to do the exercise in a more generic way and find the best model through nested cross-validation.

So, after I did all the preprocessing steps, this is my training-validation-test step:

import pandas as pd
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV, cross_validate

params = {
    "logisticregression__C": loguniform(0.001, 10),
    "columntransformer__standard_scaler__with_mean": (True, False),
    "columntransformer__standard_scaler__with_std": (True, False),
}

# Inner CV: random search over the hyperparameters of the pipeline `model`
model_randomised_search = RandomizedSearchCV(model, param_distributions=params, cv=2)
# Outer CV: evaluate the tuned model on 5 different splits of the full dataset
cv_results = cross_validate(model_randomised_search, data, target, cv=5, n_jobs=2, return_estimator=True)
pd.DataFrame(cv_results)

When the last line of my script runs, it shows me a dataframe with 5 rows whose five test_score values are all around 85%; however, when I run model_randomised_search.best_params_ it raises AttributeError: 'RandomizedSearchCV' object has no attribute 'best_params_'.
I cannot understand why I receive such an error.

P.S. I also did not do any train-test split since I am using nested cross-validation.

To access model_randomised_search.best_params_, you would need to have called model_randomised_search.fit(X, y) at some point. Here, you instead used cross-validation, which means that model_randomised_search itself was never fitted; instead, 5 clones of it (because cv=5) were fitted on the different data splits.

Since you passed return_estimator=True, these 5 fitted models are returned in cv_results and you can access them with cv_results["estimator"]. Therefore, you can access the best estimator found on each fold with cv_results["estimator"][fold_idx], where fold_idx varies between 0 and 4.
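As a side note (a minimal sketch, not part of the exercise): if you only wanted best_params_ without any outer cross-validation, you could fit the search object directly, at the cost of losing the uncertainty estimate provided by the outer loop:

# Fitting the search directly populates best_params_, but gives a single point estimate only
model_randomised_search.fit(data, target)
print(model_randomised_search.best_params_)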

So, if I understood correctly, when I define the RandomizedSearchCV object

model_randomised_search = RandomizedSearchCV(my_model, param_distributions=params, cv=4)

and then call its fit function:

model_randomised_search.fit(data_train, target_train)

the model embedded in the RandomizedSearchCV object (my_model) is trained on data_train (let’s assume I have split the data into data_train and data_test) with several combinations of parameters. The best combination is selected (for this particular training set) by keeping the combination leading to the best mean cross-validated score.
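(Just to check my understanding, I assume that after this fit I could inspect the inner cross-validated scores that drive this selection, something like:)

# Mean inner-CV score of every sampled combination, plus the winning one
print(model_randomised_search.cv_results_["mean_test_score"])
print(model_randomised_search.best_score_)   # mean CV score of best_params_
print(model_randomised_search.best_params_)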

So, if I want to have a nested cross-validation, then I should run the fit function after every outer CV iteration; otherwise the best_params_ would belong to the model fitted on the manually split data_train. So, if what I am saying is correct, how can I find the best_params_ for each outer CV iteration? And if I am wrong, could you explain to me how to write a complete nested cross-validation?

Indeed, you should not call fit yourself but instead pass model_randomised_search to cross_validate, so that you get an uncertainty estimate and not only a single point estimate.

I am not sure what you mean by "manually split data_train".
Using the figure that you attached, you can see that the outer CV will define 5 models. These models correspond to the grouped blue “testing samples”. Each of these models has a set of hyperparameters that was tuned on data_train by selecting the combination with the highest mean score on the “validation samples”.

The proper nested cross-validation is exactly what you proposed earlier:

params = {
    "logisticregression__C": loguniform(0.001, 10),
    "columntransformer__standard_scaler__with_mean": (True, False),
    "columntransformer__standard_scaler__with_std": (True, False),
}

# Inner CV (hyperparameter search) with 3 splits, outer CV (evaluation) with 5 splits
model_randomised_search = RandomizedSearchCV(model, param_distributions=params, cv=3)
cv_results = cross_validate(model_randomised_search, data, target, cv=5, n_jobs=2, return_estimator=True)
cv_results = pd.DataFrame(cv_results)
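The test_score column of this dataframe gives you the generalization performance together with its variability, for instance (a small sketch using the names above):

print(
    f"Generalization score: {cv_results['test_score'].mean():.3f} "
    f"+/- {cv_results['test_score'].std():.3f}"
)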

However, as I mentioned, there is not a single combination of hyperparameters but instead a collection of them, one per split of the outer CV. To access them, you just need to write:

# One fitted RandomizedSearchCV per outer split (cv=5 in cross_validate)
n_outer = 5
for fold_idx in range(n_outer):
    print(cv_results["estimator"][fold_idx].best_params_)

An important thing to understand is that best_params_ might change depending on the fold. Indeed, sometimes there is not a single best combination of hyperparameters but several equally good ones. Thanks to the randomness of the CV splits, you get an overview of these best combinations and you can then study how they vary.
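For instance (a small sketch building on the cv_results dataframe above), you can gather the winning parameters of each outer fold into a single table to see this variability at a glance:

# One row per outer fold, one column per tuned hyperparameter
best_params_per_fold = pd.DataFrame(
    [est.best_params_ for est in cv_results["estimator"]]
)
print(best_params_per_fold)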
