Inner workings of nested cross validation to avoid data leakage

Hi!

I wanted to better understand how the inner and outer steps of nested cross-validation are organized so as to prevent data leakage.

So, for instance: does a train/test split happen first to reserve the test data for the outer CV, with the train portion then split again for the inner CV? And are the inner CV splits made based on the array indices? Is that how it works?

Thanks in advance!

The outer CV splits the data into a train and a test set and provides only the train set to the inner CV, which itself makes another train/test split. Since it is a cross-validation scheme, the inner CV repeats this train/test split several times on the outer training data, and the outer CV in turn repeats the whole process.
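The description above can be sketched with scikit-learn's own API. This is a minimal illustration, not the library's internals: the dataset, parameter grid, and fold counts are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# The inner CV only ever sees an outer training fold: cross_validate
# calls search.fit on each outer train split, and GridSearchCV then
# re-splits that training data for its own (inner) cross-validation.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)
results = cross_validate(search, X, y, cv=outer_cv)

# One score per outer fold, each computed on data the inner CV never saw.
print(results["test_score"])
```

Because the outer test fold is held back before the inner search ever runs, the hyperparameter selection cannot leak information from it.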

Maybe I didn’t phrase it in the best way, but I wasn’t looking for a general description - I was trying to understand the specifics without having to browse through the implementation.
Is there a way I could delete the post and ask again?

You can always edit your initial post. This is completely fine.

For the specifics, we are indeed using indices. cross_validate and the SearchCV estimators call cv.split(X, y), which returns a generator yielding the train and test indices. For instance, you can have a look at the KFold.split() documentation: sklearn.model_selection.KFold — scikit-learn 1.0.2 documentation
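To make the index-based splitting concrete, here is a small sketch (the toy array is made up for illustration) showing that KFold.split yields integer index arrays rather than the data itself:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(5, 2)  # toy data: 5 samples, 2 features
cv = KFold(n_splits=5)

# Each iteration yields two integer arrays: positions of the train
# and test samples. The data is only indexed later, e.g. X[train_idx].
for train_idx, test_idx in cv.split(X):
    print(train_idx, test_idx)
```

Passing these index arrays around (instead of copies of the data) is what lets the inner CV operate purely on the outer training subset.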