Parameter_tuning_ex03

During the randomised search to find the best hyperparameters for logistic regression, I found the best params to be the ones with no centering and no scaling:

{'columntransformer__num_preprocessor__with_mean': False, 'columntransformer__num_preprocessor__with_std': False, 'logisticregression__C': 0.8360860766334274}

However, in the solution the best hyperparameters are said to be ones with scaled features.

Did I just get unlucky, and is this just a consequence of the random search, which can miss some good hyperparameter combinations?
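
For reference, the setup was roughly the following (a minimal sketch rather than my exact code; the cat_preprocessor branch, the search ranges, n_iter and max_iter are assumptions inferred from the parameter keys above):

    from scipy.stats import loguniform
    from sklearn.compose import ColumnTransformer, make_column_selector
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # ColumnTransformer with a scaler on the numerical columns; the step name
    # "num_preprocessor" matches the parameter keys shown above.
    preprocessor = ColumnTransformer([
        ("num_preprocessor", StandardScaler(),
         make_column_selector(dtype_include="number")),
        ("cat_preprocessor", OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_exclude="number")),
    ])
    model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

    # Randomised search over centering, scaling and the regularisation parameter C.
    param_distributions = {
        "columntransformer__num_preprocessor__with_mean": [True, False],
        "columntransformer__num_preprocessor__with_std": [True, False],
        "logisticregression__C": loguniform(0.001, 10),
    }
    model_random_search = RandomizedSearchCV(
        model, param_distributions=param_distributions, n_iter=20, random_state=0)
    # model_random_search.fit(data, target)  # data/target: the census features and labels
    # model_random_search.best_params_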

I have also observed something a bit puzzling in the same practice exercise.

In my first attempt I had gone down the lazy path and computed numerical_columns another way, using list(set(df.columns) - set(categorical_columns)),

so only the order of the columns changes between the two approaches:

# like the solution
numerical_columns
→ ['age', 'capital-gain', 'capital-loss', 'hours-per-week']
# some other order
numerical_columns1
→  ['age', 'hours-per-week', 'capital-loss', 'capital-gain']
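
For completeness, this is roughly how the two lists were obtained (a sketch; df and categorical_columns are the names used above). Since Python sets do not preserve order, the set-difference version comes back in an arbitrary order:

    from sklearn.compose import make_column_selector

    # like the solution: a selector keeps the DataFrame's original column order
    numerical_columns = make_column_selector(dtype_include="number")(df)

    # the lazy path: a set difference, whose element order is arbitrary
    numerical_columns1 = list(set(df.columns) - set(categorical_columns))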

And the funny thing is that I am getting rather different results out of the hyperparameter tuning, depending on which of the two approaches I take:

  • approach using a selector (like the solution):

    {'columntransformer__standard-scale__with_mean': False,
     'columntransformer__standard-scale__with_std': False,
     'logisticregression__C': 1.9965094313871186}
    
  • columns in another order:

    {'columntransformer__standard-scale__with_mean': True,
     'columntransformer__standard-scale__with_std': False,
     'logisticregression__C': 0.6266593675236942}
    

I acknowledge that we have no idea how exactly a LogisticRegression works, nor what C means exactly, but should we be concerned about these results here?

Thanks for sharing your results, Thierry.

In terms of what the C parameter is in the LogisticRegression arguments: it is the regularisation parameter which, from my understanding, adds a penalty to reduce overfitting of the model to the training set. So a lower C value (stronger regularisation) will prevent the model from overfitting the data too much.
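
Concretely, in scikit-learn C is the inverse of the regularisation strength, so (a tiny illustration, not the exercise code):

    from sklearn.linear_model import LogisticRegression

    # Smaller C -> stronger penalty on the coefficients -> less overfitting;
    # larger C -> weaker penalty -> the model can fit the training set more closely.
    strongly_regularised = LogisticRegression(C=0.01)
    weakly_regularised = LogisticRegression(C=100)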

I think the difference in your results is just due to random variation, and I wouldn't read too much into it since the values for C are not that different (i.e. not orders of magnitude apart). In fact, if you look at model.cv_results_ you can see the different cross-validation runs and how the value of C changes. I think this might also answer my own question, since I have found that the test scores don't vary too much even for different model parameters:

    with_mean  with_std         C  mean_test_score  std_test_score
14      False     False  2.612979         0.848792        0.005604
15       True      True  0.250704         0.848177        0.005119
17      False     False  0.861115         0.847871        0.005347
 3      False      True  0.243126         0.847870        0.005268
 4       True      True  7.790413         0.847564        0.006152
 7       True     False  9.475534         0.847461        0.005586
11       True     False  1.403738         0.847461        0.006013
13       True     False  4.628385         0.847256        0.005436
 1       True     False  0.065847         0.847051        0.006345
 6       True      True  0.058459         0.846847        0.005852
 9       True      True  2.137585         0.846745        0.006131
19      False     False   0.09238         0.846335        0.005361
 8      False      True  0.101718         0.845925        0.005646
 0       True      True  0.023776         0.842240        0.004636
12      False      True  0.017854         0.840602        0.005042
 5       True      True  0.017263         0.839885        0.005432
18      False     False  0.011741         0.839783        0.006341
16      False      True  0.013222         0.838452        0.006854
 2      False     False  0.008776         0.834357        0.005470
10      False      True  0.001022         0.782555        0.002566
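
In case it helps, the table above can be assembled along these lines (a sketch; the param_... column names depend on the step names in your own pipeline, and model is the fitted RandomizedSearchCV):

    import pandas as pd

    cv_results = pd.DataFrame(model.cv_results_)
    # map the long auto-generated parameter columns to the short names shown above
    short_names = {
        "param_columntransformer__num_preprocessor__with_mean": "with_mean",
        "param_columntransformer__num_preprocessor__with_std": "with_std",
        "param_logisticregression__C": "C",
    }
    (cv_results[list(short_names) + ["mean_test_score", "std_test_score"]]
     .rename(columns=short_names)
     .sort_values("mean_test_score", ascending=False))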

Yep, there is something fishy there. We are investigating it in the following thread: Exercise M3.02: scaling