Exercise M1.04

Ahoi hoi folks,

there’s fair chance that this was only a problem for me, but I think the task was not entirely clear and could be explained in more detail.

HTH, cheers, Peer

Yeah I agree that this would need a bit more guidance to help people tackling the exercise.

Maybe it is just a matter of reusing the text we already have and split it in multiple cells.

To do so, let’s try to use OrdinalEncoder to preprocess the categorical
variables. This preprocessor is assembled in a pipeline with
LogisticRegression. The statistical performance of the pipeline can be
evaluated as usual by cross-validation and then compared to the score
obtained when using OneHotEncoder or to some other baseline score.

Because OrdinalEncoder can raise errors if it sees an unknown category at
prediction time, you can set the handle_unknown and unknown_value
parameters.

The multiple cells could be:

  • define an OrdinalEncoder + handle_unknown
  • define a LogisticRegression
  • use a pipeline to combine both
  • now estimate statistical performance with cross-validation and reflect on it

Solve in https://github.com/INRIA/scikit-learn-mooc/pull/334