Group as an extra feature?

mbomben · 4 January 2023 14:53

Dear all,
would it be a good strategy to include the group as an extra feature instead of using a k-fold shuffle with non-overlapping groups, please?

Many thanks in advance and best regards,
Marco Bomben

lesteve · 5 January 2023 07:07

You can read this answer from a previous MOOC session: https://mooc-forums.inria.fr/moocsl/t/quiz-m7-01-strange-choice-of-exemple/4327/11

mbomben · 6 January 2023 08:42

I cannot log in to that forum, even though I can log in to this MOOC. What should I do, please?

Best regards,
Marco Bomben

lesteve · 9 January 2023 07:46

Here is a copy and paste of the original message, not sure why you can’t see it …

Including the hospital id as a categorical variable will not help your model predict any better on data from future unseen hospitals, quite the opposite.

The goal here is to ensure that you measure the right kind of generalization performance in your CV loop.

If you want to build an optical character recognition system and claim that its performance is not significantly impacted by the writing style, you need to evaluate it on data written by people who are not part of the training set.
If you build a system to diagnose some disease on MRI images and claim that its prediction error is not significantly impacted by the model or manufacturer of the MRI device, you need to evaluate it on medical images recorded on devices from manufacturers that are not present in the training set.

What is important here is the nature of the claim you are making with the result of the CV procedure.
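
To make the point about the CV loop concrete, here is a minimal sketch using scikit-learn. The data is made up for illustration (100 samples from 5 hypothetical hospitals, random features and labels), and the helper `shared_groups` is not part of scikit-learn. It only counts, for each fold, how many groups appear in both the training and the test indices.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical data: 100 samples coming from 5 hospitals (20 samples each).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)
groups = np.repeat(np.arange(5), 20)  # hospital id of each sample


def shared_groups(cv, **split_kwargs):
    """Return, for each fold, the number of groups seen in both train and test."""
    return [
        len(set(groups[train]) & set(groups[test]))
        for train, test in cv.split(X, y, **split_kwargs)
    ]


# Shuffled K-fold ignores the groups: the same hospital ends up on both sides.
shuffled_overlap = shared_groups(KFold(n_splits=5, shuffle=True, random_state=0))

# GroupKFold keeps each hospital entirely in train or entirely in test.
grouped_overlap = shared_groups(GroupKFold(n_splits=5), groups=groups)

print("shuffled KFold, shared groups per fold:", shuffled_overlap)
print("GroupKFold, shared groups per fold:", grouped_overlap)
```

With `GroupKFold`, every test fold contains only hospitals the model never saw during training, which is the situation you need if the claim is about generalization to new hospitals. With the shuffled `KFold`, the score only supports a claim about new samples from hospitals already seen.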