Question on Wrap-Up Quiz 3

This is my code for Question 3. The question asks us to evaluate the mean_test_score for the different sets of parameters given in the answer options.

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import StandardScaler


all_preprocessors = [
    None,
    StandardScaler(),
    MinMaxScaler(),
    QuantileTransformer(n_quantiles=100),
    PowerTransformer(method="box-cox"),
]

from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

model5 = make_pipeline(KNeighborsClassifier())

param_grid = {'preprocessor': all_preprocessors, 'classifier__n_neighbors': (5, 51, 101)}

model5_grid_search = GridSearchCV(model5, param_grid=param_grid, scoring='balanced_accuracy', cv=10)

There is an error when I run the following code:

model5_grid_search.fit(data, target)

The error is (it is a bit long, so I just paste the ValueError statement):

ValueError: Invalid parameter classifier for estimator Pipeline(steps=[('kneighborsclassifier', KNeighborsClassifier())]). Check the list of available parameters with `estimator.get_params().keys()`.

May I know what my error is about?

Secondly, when do we need to evaluate model performance (particularly for a classification problem) using test_score, mean_test_score and std_test_score?

You forgot to add the preprocessor in the make_pipeline. Regarding the naming of the different steps, you can use model.get_params() to know which strings to provide in param_grid. For instance, make_pipeline automatically generates the names from the class names. Thus, you will not have a parameter named classifier__n_neighbors but instead kneighborsclassifier__n_neighbors.

You need to use Pipeline instead of make_pipeline if you want to choose the names of the steps.
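For example, here is a minimal sketch comparing the two ways of naming the steps (the parameter names are the ones get_params() would report):

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# make_pipeline derives the step names from the class names (lower-cased)
auto_named = make_pipeline(StandardScaler(), KNeighborsClassifier())
print(sorted(auto_named.get_params().keys()))
# ... contains 'kneighborsclassifier__n_neighbors'

# Pipeline lets you choose the step names yourself
hand_named = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", KNeighborsClassifier()),
])
print(sorted(hand_named.get_params().keys()))
# ... contains 'classifier__n_neighbors'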

You will need to look at the mean and standard deviation of the scores to know whether the scores overlap (considering the std. dev.). Here the key is:

Let us consider that a model is significantly better than another if its mean test score is better than the mean test score of the alternative by more than the standard deviation of its test score.
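To make the vocabulary in your second question concrete, here is a small sketch of where each of these quantities comes from; data and target are assumed to be the quiz dataset:

from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline(steps=[("preprocessor", StandardScaler()),
                        ("classifier", KNeighborsClassifier())])

# cross_validate returns one test_score per fold
cv_results = cross_validate(model, data, target, cv=10)
print(cv_results["test_score"])             # array of 10 fold scores

# GridSearchCV.cv_results_ aggregates the fold scores per parameter combination
grid = GridSearchCV(model, param_grid={"classifier__n_neighbors": [5, 51, 101]},
                    cv=10).fit(data, target)
print(grid.cv_results_["mean_test_score"])  # mean over the 10 folds, per combination
print(grid.cv_results_["std_test_score"])   # std over the 10 folds, per combination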


Hi, thanks for the reply.

I have made some changes to the code; however, there is still an error.

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
 
 
all_preprocessors = [
    None,
    StandardScaler(),
    MinMaxScaler(),
    QuantileTransformer(n_quantiles=100),
    PowerTransformer(method="box-cox"),
]

data_train, data_test, target_train, target_test = train_test_split(data, target, random_state=42)
 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

model5 = make_pipeline(all_preprocessors, KNeighborsClassifier())

The error is:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-9-573ee05904a1> in <module>
          5 from sklearn.pipeline import make_pipeline
          6 
    ----> 7 model5 = make_pipeline(all_preprocessors, KNeighborsClassifier())

    /opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py in make_pipeline(memory, verbose, *steps)
        727     p : Pipeline
        728     """
    --> 729     return Pipeline(_name_estimators(steps), memory=memory, verbose=verbose)
        730 
        731 

    /opt/conda/lib/python3.9/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
         61             extra_args = len(args) - len(all_args)
         62             if extra_args <= 0:
    ---> 63                 return f(*args, **kwargs)
         64 
         65             # extra_args > 0

    /opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py in __init__(self, steps, memory, verbose)
        116         self.memory = memory
        117         self.verbose = verbose
    --> 118         self._validate_steps()
        119 
        120     def get_params(self, deep=True):

    /opt/conda/lib/python3.9/site-packages/sklearn/pipeline.py in _validate_steps(self)
        166             if (not (hasattr(t, "fit") or hasattr(t, "fit_transform")) or not
        167                     hasattr(t, "transform")):
    --> 168                 raise TypeError("All intermediate steps should be "
        169                                 "transformers and implement fit and transform "
        170                                 "or be the string 'passthrough' "

    TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '[None, StandardScaler(), MinMaxScaler(), QuantileTransformer(n_quantiles=100), PowerTransformer(method='box-cox')]' (type <class 'list'>) doesn't

You don’t want to pass all the preprocessors, but instead a preprocessor step with a specific name so that you can change it with param_grid:

model = Pipeline(steps=[
     ("preprocessor", None), ("classifier", KNeighborsClassifier())
])

and then define param_grid:

param_grid = {
    "preprocessor": all_preprocessors,
    "classifier__n_neighbors": [1, 2, 3, 4, 5]
}
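As a follow-up sketch, the grid-search from the quiz can then be built on top of this pipeline; data, target and all_preprocessors are assumed to come from the quiz notebook, and model/param_grid are as defined just above:

from sklearn.model_selection import GridSearchCV

model_grid_search = GridSearchCV(
    model,                        # the Pipeline with the "preprocessor" placeholder
    param_grid=param_grid,        # swaps the preprocessor and n_neighbors
    scoring="balanced_accuracy",  # the metric requested in the quiz
    cv=10,
)
model_grid_search.fit(data, target)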

I got it, but all my mean_test_score and std_test_score values are 0.943 +/- 0.000, regardless of whether I use no preprocessor with the default n_neighbors or StandardScaler with n_neighbors=5.

Here is the code with StandardScaler and n_neighbors=5; I can change the preprocessor and the n_neighbors to test:

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
#from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

#data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.2, random_state=42)

all_preprocessors = [
    None,
    StandardScaler(),
    MinMaxScaler(),
    QuantileTransformer(n_quantiles=100),
    PowerTransformer(method="box-cox")]

model6 = Pipeline(steps=[('preprocessor', StandardScaler()), ('classifier', KNeighborsClassifier(n_neighbors=5))])

param_grid = {'preprocessor': all_preprocessors, 'classifier__n_neighbors': [5, 51, 101]}

model6_grid_search = GridSearchCV(model6, param_grid=param_grid, n_jobs=4, scoring='balanced_accuracy', cv=10)
 
model6_cv = cross_validate(model6_grid_search, data, target, cv=10)
 
model6_cv_result = model6_cv['test_score'].mean()
print(f'The cross-validation accuracy score with no preprocessor and n_neighbours=5 is {model6_cv_result.mean():.3f} +/- '
       f'{model6_cv_result.std():.3f}')

Besides, may I know whether the cv=10 in the GridSearchCV and cross_validate steps must be the same? My understanding is that the cv in each step can be different, since it is the number of folds used by GridSearchCV and cross_validate respectively.

You are computing the standard deviation of the mean score and not the standard deviation of all scores returned by the cross-validation.
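Concretely, a sketch of the fix, with model6_cv being the cross_validate result from your snippet above:

scores = model6_cv["test_score"]        # the 10 individual fold scores
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")

# whereas model6_cv["test_score"].mean() is a single number,
# so taking .std() of it always gives 0.000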

As you mentioned, they can be different. There is nothing forcing you to have the same number of folds in the inner and outer cross-validation since they are independent.
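For instance, a sketch with different fold counts for the inner grid-search and the outer evaluation; data and target are again assumed to be the quiz dataset, and model6/param_grid are as you defined them above:

from sklearn.model_selection import GridSearchCV, cross_validate

# inner loop: 5-fold grid-search to select the hyper-parameters
inner_search = GridSearchCV(model6, param_grid=param_grid,
                            scoring="balanced_accuracy", cv=5)

# outer loop: 10-fold cross-validation to evaluate the tuned model
outer_cv = cross_validate(inner_search, data, target, cv=10)
print(outer_cv["test_score"])  # 10 scores, one per outer fold

Note that for this particular quiz question the outer cross_validate is not required; fitting the grid-search alone is enough, as pointed out further down.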

Ok, noted. However, after removing the .mean() so that the mean and standard deviation are computed in the print statement, every preprocessing choice still gives me the same mean and standard deviation, 0.943 +/- 0.036.

Looking back at the correct answers b and d, I cannot get option d, as both sets of code give me the same mean and standard deviation.

Code for n_neighbors=51 and StandardScaler:

all_preprocessors = [
    None,
    StandardScaler(),
    MinMaxScaler(),
    QuantileTransformer(n_quantiles=100),
    PowerTransformer(method="box-cox")]

model6 = Pipeline(steps=[('preprocessor', StandardScaler()), ('classifier', KNeighborsClassifier(n_neighbors=51))])

param_grid = {'preprocessor': all_preprocessors, 'classifier__n_neighbors': [5, 51, 101]}

model6_grid_search = GridSearchCV(model6, param_grid=param_grid, n_jobs=4, scoring='balanced_accuracy', cv=10)
 
model6_cv = cross_validate(model6_grid_search, data, target, cv=10)
 
model6_cv_result = model6_cv['test_score']
print(f'The cross-validation accuracy score with standardScaler preprocessor and n_neighbours=51 is {model6_cv_result.mean():.7f} +/- '
      f'{model6_cv_result.std():.7f}')

Code for n_neighbors=101 and StandardScaler:

all_preprocessors = [
    None,
    StandardScaler(),
    MinMaxScaler(),
    QuantileTransformer(n_quantiles=100),
    PowerTransformer(method="box-cox")]

model6 = Pipeline(steps=[('preprocessor', StandardScaler()), ('classifier', KNeighborsClassifier(n_neighbors=101))])

param_grid = {'preprocessor': all_preprocessors, 'classifier__n_neighbors': [5, 51, 101]}

model6_grid_search = GridSearchCV(model6, param_grid=param_grid, n_jobs=4, scoring='balanced_accuracy', cv=10)
 
model6_cv = cross_validate(model6_grid_search, data, target, cv=10)
 
model6_cv_result = model6_cv['test_score']
print(f'The cross-validation accuracy score with standardScaler preprocessor and n_neighbours=101 is {model6_cv_result.mean():.7f} +/- '
      f'{model6_cv_result.std():.7f}')

What might have gone wrong in here?

I just paste the question asked in the wrap-up:

Use sklearn.model_selection.GridSearchCV to study the impact of the choice of the preprocessor and the number of neighbors on the 10-fold cross-validated balanced_accuracy metric. We want to study the n_neighbors in the range [5, 51, 101] and preprocessor in the range all_preprocessors.

Let us consider that a model is significantly better than another if its mean test score is better than the mean test score of the alternative by more than the standard deviation of its test score.

We do not ask for wrapping the grid-search in an external cross-validation here: we don’t want to get an evaluation of the best model but rather compare the results between the different models. In short, you only need to fit the grid-search and look at the fitted attribute cv_results_ to make the requested analysis.

You do not need different code for each combination. You vary the number of neighbors of the classifier by adding it to the param_grid. Checking the cv_results_ attribute of the grid-search, you will see the performance of each combination of parameters.
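A possible sketch of that inspection, using pandas to sort the combinations; model6_grid_search is the grid-search you defined earlier, and data/target come from the quiz notebook:

import pandas as pd

model6_grid_search.fit(data, target)

results = pd.DataFrame(model6_grid_search.cv_results_)
# keep only the columns needed for the comparison
columns = ["param_preprocessor", "param_classifier__n_neighbors",
           "mean_test_score", "std_test_score"]
print(results[columns].sort_values(by="mean_test_score", ascending=False))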

I am able to get this table. I hope I am doing the correct thing now.

May I know how, given the mean and standard deviation, we decide that one model is better than another: should we compare mean - standard deviation? The smaller the standard deviation the better, as the values do not stray too far from the mean.

So, in option (D), the model with n_neighbors=51 and StandardScaler is significantly better than the model with n_neighbors=101 and StandardScaler:

n_neighbors=51 and StandardScaler: 0.927273 - 0.051731 = 0.875542

n_neighbors=101 and StandardScaler: 0.821162 - 0.075347 = 0.745815

Thus, n_neighbors=51 and StandardScaler is better than n_neighbors=101 and StandardScaler.

For option (C), the model with n_neighbors=5 and StandardScaler is significantly better than the model with n_neighbors=51 and StandardScaler:

n_neighbors=5 and StandardScaler: 0.953939 - 0.041800 = 0.912139
n_neighbors=51 and StandardScaler: 0.927273 - 0.051731 = 0.875542
In this case, it should also be correct, but this option is not selected as one of the solutions.

You have to look at the overlap of the distributions, thus:
0.953939 +/- 0.041800 and 0.927273 +/- 0.051731

So here, 0.927273 + 0.051731 is bigger than 0.953939, which means that the improvement is not significant.
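As a small numeric sketch of that overlap check, using the values quoted above:

# option (C): n_neighbors=5 vs n_neighbors=51, both with StandardScaler
mean_5, std_5 = 0.953939, 0.041800
mean_51, std_51 = 0.927273, 0.051731

# the upper bound of the lower mean reaches above the higher mean,
# so the two score distributions overlap
print(mean_51 + std_51 > mean_5)   # True -> improvement is not significant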

Thank you very much for your explanation. 🙂

Sorry to reopen the question, but I think you really have to be careful with the use of “significantly better” in your questions.
Here you say to @Alvin19:

You have to look at the overlap of the distributions, thus:
0.953939 +/- 0.041800 and 0.927273 +/- 0.051731

And I agree with you that the two distributions are indeed overlapping, and therefore there is no significant difference between the two means (at least in the absence of an appropriate statistical test).

But if I take question D and compare the results @Alvin obtained, then the two distributions (0.927 +/- 0.051 vs 0.821 +/- 0.075) are also overlapping, since 0.821 + 0.075 > 0.927 - 0.051.
It’s even worse with the results I got, since I obtained 0.956 +/- 0.027 vs 0.918 +/- 0.032.

I think the correct approach (to stay in line with your solution) would be not to talk about overlapping distributions but to stick to your definition of “significantly better” in the question: mean_worst_score + std_worst_score < mean_best_score.
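A sketch of that formulation, applied to the numbers reported earlier in this thread:

def significantly_better(mean_best, mean_worst, std_worst):
    # formulation proposed above: mean_worst + std_worst must stay below mean_best
    return mean_worst + std_worst < mean_best

# option (C): n_neighbors=5 vs n_neighbors=51 (both with StandardScaler)
print(significantly_better(0.953939, 0.927273, 0.051731))  # False

# option (D): n_neighbors=51 vs n_neighbors=101 (both with StandardScaler)
print(significantly_better(0.927273, 0.821162, 0.075347))  # True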

I think the way to know which one is significantly better than the other is by looking at the mean_test_score first and then at the std_test_score.

We can compare the mean_test_score values first (to one decimal digit); if one is higher than the other, we can label it as significantly better. If both mean_test_score values are similar, then we look at the std_test_score (the smaller, the better).