Shuffling training data

I would like to know: when dealing with i.i.d. data, is there a case where shuffling the training data could hurt the generalization performance of a trained model?

If the data generating process is really i.i.d., then shuffling the data before training and evaluation should not change the generalization performance of the resulting model in expectation.

However, in practice it can be hard to make sure that the data is really i.i.d. (that is, that there is no “distribution shift”). One way to detect non-i.i.d. data is to pick an index approximately at the middle of the dataset’s original order and fit a model that predicts whether a sample comes before or after that index, using cross-validation. If this auxiliary classifier achieves an accuracy significantly better than chance (for instance a ROC AUC significantly larger than 0.5), then we can conclude that the samples after the reference index do not follow the same distribution as the samples before it.
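Here is a minimal sketch of that check with scikit-learn, assuming a feature matrix `X` kept in its original row order; the random `X`, the choice of classifier, and all parameters below are illustrative assumptions, not anything prescribed by the approach itself:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data: in practice X would be your real feature matrix,
# kept in its ORIGINAL (unshuffled) order.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))

# Label each sample by whether it falls before or after the middle index.
midpoint = X.shape[0] // 2
y_position = (np.arange(X.shape[0]) >= midpoint).astype(int)

# Cross-validated ROC AUC of the auxiliary "first half vs. second half"
# classifier; shuffled folds avoid purely positional train/test splits.
clf = HistGradientBoostingClassifier(random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(clf, X, y_position, cv=cv, scoring="roc_auc")
print(f"mean ROC AUC: {auc.mean():.3f}")

# An AUC near 0.5 is consistent with i.i.d. data; an AUC clearly above
# 0.5 suggests the two halves follow different distributions.
```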

See for instance: 🕵UMP Adversarial Validation | Kaggle

Thanks for answering my question. I hadn’t read about this approach to dealing with non-i.i.d. data (I’m new to DS, ML, CS…) and I find it very clever. That’s a really good reference you gave me to better understand it. Thanks again!