Parameter_tuning_ex03

During the randomised search to find the best hyperparameters for logistic regression, I found the best params to be the ones with no centering and no scaling:

{'columntransformer__num_preprocessor__with_mean': False, 'columntransformer__num_preprocessor__with_std': False, 'logisticregression__C': 0.8360860766334274}

However, in the solution the best hyperparameters are said to be ones with scaled features.

Did I just get unlucky, and is this just a consequence of the random search, which can miss some good hyperparameter combinations?
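
For reference, the setup was roughly the following (a minimal sketch rather than my exact code; the cat_preprocessor branch, the search ranges, n_iter and max_iter are assumptions inferred from the parameter keys above):

    from scipy.stats import loguniform
    from sklearn.compose import ColumnTransformer, make_column_selector
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # ColumnTransformer with a scaler on the numerical columns; the step name
    # "num_preprocessor" matches the parameter keys shown above.
    preprocessor = ColumnTransformer([
        ("num_preprocessor", StandardScaler(),
         make_column_selector(dtype_include="number")),
        ("cat_preprocessor", OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_exclude="number")),
    ])
    model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

    # Randomised search over centering, scaling and the regularisation parameter C.
    param_distributions = {
        "columntransformer__num_preprocessor__with_mean": [True, False],
        "columntransformer__num_preprocessor__with_std": [True, False],
        "logisticregression__C": loguniform(0.001, 10),
    }
    model_random_search = RandomizedSearchCV(
        model, param_distributions=param_distributions, n_iter=20, random_state=0)
    # model_random_search.fit(data, target)  # data/target: the census features and labels
    # model_random_search.best_params_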

I have also observed something a bit puzzling in the same practice exercise.

In my first attempt I had gone down the lazy path and computed numerical_columns another way, using list(set(df.columns) - set(categorical_columns)),

so only the order of the columns changes between the two approaches:

# like the solution
numerical_columns
→ ['age', 'capital-gain', 'capital-loss', 'hours-per-week']
# some other order
numerical_columns1
→  ['age', 'hours-per-week', 'capital-loss', 'capital-gain']
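
For completeness, this is roughly how the two lists were obtained (a sketch; df and categorical_columns are the names used above). Since Python sets do not preserve order, the set-difference version comes back in an arbitrary order:

    from sklearn.compose import make_column_selector

    # like the solution: a selector keeps the DataFrame's original column order
    numerical_columns = make_column_selector(dtype_include="number")(df)

    # the lazy path: a set difference, whose element order is arbitrary
    numerical_columns1 = list(set(df.columns) - set(categorical_columns))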

And the funny thing is that I am getting rather different results out of the hyperparameter tuning, depending on which of the two approaches I take:

  • approach using a selector (like the solution):

    {'columntransformer__standard-scale__with_mean': False,
     'columntransformer__standard-scale__with_std': False,
     'logisticregression__C': 1.9965094313871186}
    
  • columns in another order:

    {'columntransformer__standard-scale__with_mean': True,
     'columntransformer__standard-scale__with_std': False,
     'logisticregression__C': 0.6266593675236942}
    

I acknowledge that we have no idea how exactly a LogisticRegression works, nor what C means exactly, but should we be concerned about these results here?

Thanks for sharing your results, Thierry.

In terms of what the C parameter is in the LogisticRegression arguments: it is the regularisation parameter which, from my understanding, adds a penalty to reduce overfitting of the model to the training set. So a lower C value (stronger regularisation) will prevent the model from overfitting the data too much.
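
Concretely, in scikit-learn C is the inverse of the regularisation strength, so (a tiny illustration, not the exercise code):

    from sklearn.linear_model import LogisticRegression

    # Smaller C -> stronger penalty on the coefficients -> less overfitting;
    # larger C -> weaker penalty -> the model can fit the training set more closely.
    strongly_regularised = LogisticRegression(C=0.01)
    weakly_regularised = LogisticRegression(C=100)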

I think the difference in your results is just due to random variation, and I wouldn't read too much into it since the values for C are not that different (i.e. not orders of magnitude apart). In fact, if you look at model.cv_results_ you can see the different cross-validation runs and how the value of C changes. I think this might also answer my own question, since I have found that the test scores don't vary too much even for different model parameters:

    with_mean  with_std         C  mean_test_score  std_test_score
14      False     False  2.612979         0.848792        0.005604
15       True      True  0.250704         0.848177        0.005119
17      False     False  0.861115         0.847871        0.005347
 3      False      True  0.243126         0.847870        0.005268
 4       True      True  7.790413         0.847564        0.006152
 7       True     False  9.475534         0.847461        0.005586
11       True     False  1.403738         0.847461        0.006013
13       True     False  4.628385         0.847256        0.005436
 1       True     False  0.065847         0.847051        0.006345
 6       True      True  0.058459         0.846847        0.005852
 9       True      True  2.137585         0.846745        0.006131
19      False     False   0.09238         0.846335        0.005361
 8      False      True  0.101718         0.845925        0.005646
 0       True      True  0.023776         0.842240        0.004636
12      False      True  0.017854         0.840602        0.005042
 5       True      True  0.017263         0.839885        0.005432
18      False     False  0.011741         0.839783        0.006341
16      False      True  0.013222         0.838452        0.006854
 2      False     False  0.008776         0.834357        0.005470
10      False      True  0.001022         0.782555        0.002566
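
In case it helps, the table above can be assembled along these lines (a sketch; the param_... column names depend on the step names in your own pipeline, and model is the fitted RandomizedSearchCV):

    import pandas as pd

    cv_results = pd.DataFrame(model.cv_results_)
    # map the long auto-generated parameter columns to the short names shown above
    short_names = {
        "param_columntransformer__num_preprocessor__with_mean": "with_mean",
        "param_columntransformer__num_preprocessor__with_std": "with_std",
        "param_logisticregression__C": "C",
    }
    (cv_results[list(short_names) + ["mean_test_score", "std_test_score"]]
     .rename(columns=short_names)
     .sort_values("mean_test_score", ascending=False))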

Yep, there is something fishy there. We are investigating it in the following thread: Exercise M3.02: scaling