Exercise M1.04

peerherholz · 27 April 2021 14:31

Ahoi hoi folks,

there’s a fair chance that this was only a problem for me, but I think the task was not entirely clear and could be explained in more detail.

HTH, cheers, Peer

lesteve · 28 April 2021 04:58

Yeah, I agree that this would need a bit more guidance to help people tackle the exercise.

Maybe it is just a matter of reusing the text we already have and splitting it into multiple cells.

To do so, let’s try to use OrdinalEncoder to preprocess the categorical
variables. This preprocessor is assembled in a pipeline with
LogisticRegression. The statistical performance of the pipeline can be
evaluated as usual by cross-validation and then compared to the score
obtained when using OneHotEncoder or to some other baseline score.

Because OrdinalEncoder can raise errors if it sees an unknown category at
prediction time, you can set the handle_unknown and unknown_value
parameters.
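To make the unknown-category point concrete, here is a minimal sketch. The toy arrays and the choice of `unknown_value=-1` are assumptions for illustration, not taken from the exercise; it only shows how `handle_unknown="use_encoded_value"` maps an unseen category to a fixed code instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): "purple" never appears at fit time.
X_train = np.array([["red"], ["green"], ["blue"]])
X_test = np.array([["green"], ["purple"]])

# Without handle_unknown, transform() would raise on "purple".
# Here unseen categories are mapped to -1 (an arbitrary choice).
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(X_train)

# Known categories are sorted, so blue=0, green=1, red=2.
encoded = encoder.transform(X_test)
print(encoded)
```

Note that `unknown_value` must not collide with a code already used for a known category, which is why a negative integer is a common pick.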

The multiple cells could be:

- define an OrdinalEncoder + handle_unknown
- define a LogisticRegression
- use a pipeline to combine both
- now estimate statistical performance with cross-validation and reflect on it
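The cells above could look roughly like the sketch below. The synthetic `data` and `target`, the column names, `max_iter=500`, and `cv=5` are all assumptions made so the snippet runs on its own; the actual exercise would use the course dataset instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic categorical data (hypothetical stand-in for the course dataset).
rng = np.random.RandomState(0)
n_samples = 200
data = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=n_samples),
    "size": rng.choice(["S", "M", "L"], size=n_samples),
})
target = (data["color"] == "red").astype(int).to_numpy()

# Cell 1: the encoder, tolerant of categories unseen during fit.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Cell 2: the classifier.
classifier = LogisticRegression(max_iter=500)

# Cell 3: combine both in a pipeline.
model = make_pipeline(encoder, classifier)

# Cell 4: evaluate with cross-validation.
cv_results = cross_validate(model, data, target, cv=5)
scores = cv_results["test_score"]
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Swapping `OrdinalEncoder(...)` for `OneHotEncoder(handle_unknown="ignore")` in the same pipeline gives the comparison score mentioned in the exercise text.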

glemaitre58 · 6 May 2021 12:21

Solved in https://github.com/INRIA/scikit-learn-mooc/pull/334

lfarhi · 10 May 2021 15:41