Inner workings of nested cross validation to avoid data leakage

Hi!

I wanted to better understand how the inner and outer steps of nested cross-validation are organized so as to prevent data leakage.

So, for instance: does a train/test split happen first to reserve the test data for the outer CV, with the train portion then split again for the inner CV? And are the inner CV splits made based on the array indices? Is that how it works?

Thanks in advance!

The outer CV splits the data into a train and a test set and provides only the train set to the inner CV, which itself makes another train/test split. Since it is a cross-validation scheme, the inner CV repeats this train/test split several times on the outer training data, and the outer CV in turn repeats the whole process.
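The description above can be sketched with scikit-learn's own API. This is a minimal illustration, not the library's internals: the dataset, parameter grid, and fold counts are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# The inner CV only ever sees an outer training fold: cross_validate
# calls search.fit on each outer train split, and GridSearchCV then
# re-splits that training data for its own (inner) cross-validation.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)
results = cross_validate(search, X, y, cv=outer_cv)

# One score per outer fold, each computed on data the inner CV never saw.
print(results["test_score"])
```

Because the outer test fold is held back before the inner search ever runs, the hyperparameter selection cannot leak information from it.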

Maybe I didn’t phrase it in the best way, but I wasn’t looking for a general description - I was trying to understand the specifics without having to browse through the implementation.
Is there a way I could delete the post and ask again?

You can always edit your initial post. This is completely fine.

For the specifics, we are indeed using indices. cross_validate and the SearchCV estimators call cv.split(X, y), which returns a generator yielding the train and test indices. For instance, you can have a look at the KFold.split() documentation: sklearn.model_selection.KFold — scikit-learn 1.0.2 documentation
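To make the index-based splitting concrete, here is a small sketch (the toy array is made up for illustration) showing that KFold.split yields integer index arrays rather than the data itself:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(5, 2)  # toy data: 5 samples, 2 features
cv = KFold(n_splits=5)

# Each iteration yields two integer arrays: positions of the train
# and test samples. The data is only indexed later, e.g. X[train_idx].
for train_idx, test_idx in cv.split(X):
    print(train_idx, test_idx)
```

Passing these index arrays around (instead of copies of the data) is what lets the inner CV operate purely on the outer training subset.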