Quiz M7.01: strange choice of example

Hi,
I'll give some indications about the correct answer, so please feel free to delete this after answering.
You ask which cross-validation strategy not to choose for a dataset containing patient records coming from 10 hospitals, where the goal is to predict whether a subject has a disease or not.

Since I suppose you were talking about a particular disease, I assumed that the classes were imbalanced and that we should prefer stratified cross-validation over the others.
But in the solution you say we should pay more attention to the fact that patients come from 10 different hospitals.
Is grouping really more important than imbalanced classes?
When I look at your example in the lesson, grouping had a very weak impact on the prediction.
If grouping really is so important in this case, why not consider the hospitals as features among others and use them to predict the illness of the patients?
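
To make concrete what I mean, here is a minimal sketch of stratified cross-validation on synthetic data (the class ratio and all values are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 90 healthy (0), 10 diseased (1).
X = np.random.RandomState(0).randn(100, 5)
y = np.array([0] * 90 + [1] * 10)

# StratifiedKFold preserves the class ratio in every fold, so each
# test fold is guaranteed to contain some diseased patients.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    print("positive rate in test fold:", y[test_idx].mean())
```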

Thanks in advance for your answers.

I was also thinking that the different hospitals would not necessarily matter, because each hospital would see patients with a variety of profiles.

Hospitals really matter. To give a concrete example: depending on the brand of the MRI scanner and its type (1.5T vs. 3T), you will see huge variation in image quality. It becomes even more relevant when datasets are acquired with different modalities, which makes the variation even broader. Such cases are commonly seen; for instance, one that we worked on: IMPAC - Imaging-psychiatry challenge: predicting autism

One is not more important than the other :slight_smile: Both are important if you want to be sure to get a good estimate of the performance of the model.

Yes, the problem was quite an easy one. Imagine that you get people from all around the world who have different ways of writing the number 7 (US vs. French style). Then it might become slightly more challenging, depending on how much data you could collect. It will get worse as the problem at hand gets more difficult.

To take into consideration both balancing and grouping, scikit-learn will provide a new cross-validation object in the next release: sklearn.model_selection.StratifiedGroupKFold — scikit-learn 1.0.dev0 documentation
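
A minimal usage sketch on synthetic data (this assumes scikit-learn >= 1.0, where this class becomes available; the hospital ids are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = rng.choice([0, 1], size=200, p=[0.8, 0.2])  # imbalanced classes
groups = rng.randint(0, 10, size=200)           # 10 hypothetical hospitals

# Each hospital ends up entirely on one side of every split, while the
# class ratio in each test fold stays close to the global ratio.
cv = StratifiedGroupKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y, groups):
    print("test hospitals:", np.unique(groups[test_idx]),
          "| positive rate:", round(float(y[test_idx].mean()), 2))
```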


Given a dataset containing records from subjects from 10 different hospitals, we would like to predict if a subject has a disease or not. Which cross-validation strategy we should not use?

In the question you never mention it was about images!!! How can we guess?

The vast majority of data is collected by different writers, with different tools, on different days, etc.
Do you think that we always have to use grouping strategies?

I tried to give a concrete example with images to help you!!! (I am not sure why we are using so many of these exclamation marks, other than that it pisses me off). However, this is just an additional example to make this forum topic as didactic as possible.

Indeed, it could be anything: different machines that analyze blood components or whatever you think is good.

Whenever you can, yes.

I apologize about the ‘!!!’
It's due to my Mediterranean tendency to emphasize, but it was not meant to upset you.

Point taken. I did several MOOCs on scikit-learn and read papers using it, and I never encountered grouping strategies, so it's new to me.

Just something additional on this topic after reading another of your questions/remarks. Basically, you want to take groups into account when you are aware (which is sometimes difficult) that they might ease the problem at hand.

The problem is that you might never know what kind of variety is lacking from your dataset. If the goal is to build a system that can make good predictions on any kind of group, such as hospitals, it's better to measure the ability to generalize across such groups by using group-aware cross-validation.

For instance, it could be the case that different hospitals have significant biases in their patient populations (age, socio-economic background, genetics, devices, diagnostic and treatment habits of the healthcare providers…). Analyzing the error structure of inter-hospital CV makes it possible to detect these kinds of biases, which can potentially have a significant impact on the performance and robustness of your modeling pipeline.
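
As a sketch, such an analysis could look like the following, assuming you have one hospital id per sample (the data here is synthetic; the spread of scores is only meaningful on real data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Stand-in data: in a real setting, X and y would be the patient
# records and `hospitals` the hospital id of each sample.
rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = rng.randint(0, 2, size=300)
hospitals = rng.randint(0, 10, size=300)

# One score per held-out hospital; a large spread between these
# scores is a symptom of hospital-specific biases.
scores = cross_val_score(
    LogisticRegression(), X, y,
    groups=hospitals, cv=LeaveOneGroupOut(),
)
print("per-hospital scores:", np.round(scores, 2))
```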

Is there perhaps a guiding heuristic for deciding what to specify as a grouping versus, say, including it as a categorical variable? The one that immediately comes to mind is when one is trying to generalise past the limited groups/categories currently available in the modelled data, but I was wondering if there are potentially more ambiguous situations?

Am I correct to think that bootstrap-based classifiers or regressors should be robust to that kind of unknown sampling bias?

Including the hospital id as a categorical variable will not help your model predict any better on data from future unseen hospitals; quite the opposite.

The goal here is to ensure that you measure the right kind of generalization performance in your CV loop.

  • If you want to build an optical character recognition system and claim that its performance is not significantly impacted by the writing style, you need to evaluate it on data written by people who are not part of the training set.

  • If you build a system to diagnose some disease on MRI images and claim that its prediction error is not significantly impacted by the model or manufacturer of the MRI device, you need to evaluate it on medical images recorded on devices from manufacturers that are not present in the training set.

What is important here is the nature of the claim you are making with the result of the CV procedure (see the sketch below).
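
As a sketch of how a group-aware splitter backs such a claim, here is GroupKFold on a toy OCR-style dataset (writer names and values are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

writers = np.array(["alice", "alice", "bob", "bob",
                    "carol", "carol", "dan", "dan"])
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# GroupKFold never puts the same writer on both sides of a split, so
# each test score measures generalization to unseen writers.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=writers):
    print("train:", sorted(set(writers[train_idx])),
          "| test:", sorted(set(writers[test_idx])))
```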

No model can ever be 100% immune to dataset bias. Bagging models are generally better calibrated than the base estimators they are built upon, but this is another matter.

But if a training set does not cover enough combinations of features that are important to make good predictions on the test set, no data-driven procedure will be able to fix that.
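
A tiny sketch of why bootstrapping cannot help here (the hospital ids are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
# The training set only contains hospitals 1, 2 and 3.
train_hospitals = np.repeat([1, 2, 3], 50)

# A bootstrap sample draws rows with replacement from the SAME
# training set: it can re-weight hospitals 1-3, but it can never
# create samples from an unseen hospital.
boot_idx = rng.randint(0, len(train_hospitals), size=len(train_hospitals))
print("hospitals in bootstrap sample:", np.unique(train_hospitals[boot_idx]))
```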

OK. I naively thought that resampling the data, as with bootstrapping, would erase sampling bias. I found several authors saying that bootstrapping is the solution to avoid sampling bias.
I understand now that grouping allows us to bring hidden data structures to light, and that bootstrapping (since it resamples the data) would sweep that kind of problem under the rug and should be avoided for that reason.

Unrelated to what has been said, but regarding this quiz question: the way I was reading it, I thought there was just one answer.

“Which cross-validation strategy we should not use”
I think it should read:
“which cross-validation strategy(s) should we not use”

I got this question wrong either way, but the wording made me think there was only one answer, not multiple.


I think this quiz problem has been fixed; the wording is now:

Which cross-validation strategies are the most adequate to evaluate the ability of the model to make good predictions on patients from unseen hospitals?

The rest is more a general discussion without that many actionable points …

So removing the priority-mooc-v2 for this one.