Splitting a dataset in memory rather than in local storage?

So sklearn's train-test split is done entirely in memory, after the whole dataset has been loaded into memory?

Once the dataset is loaded in memory, the train-test split creates arrays of indices that correspond to locations inside the full dataset and that are designated for training or testing, respectively.
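To make the index idea concrete, here is a minimal NumPy-only sketch of an index-based split. It is an illustration of the description above, not sklearn's actual implementation; the toy array `X`, the seed, and `n_test = 3` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed, only for reproducibility
X = np.arange(20).reshape(10, 2)     # toy dataset, already in memory

# Shuffle the row indices, then cut them into two groups.
indices = rng.permutation(len(X))
n_test = 3                           # assumed test size for this sketch
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Select the rows of each subset using the index arrays.
X_train, X_test = X[train_idx], X[test_idx]
print(X_train.shape, X_test.shape)
```

Nothing here touches the disk: the index arrays and both subsets are ordinary in-memory arrays.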

Does that answer the question?

Actually, train_test_split does create copies, for the same reason that NumPy advanced indexing creates copies; see "Copies and views" in the NumPy v1.23 Manual. Maybe it does not create a copy if you use shuffle=False, but I am not 100% sure.
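One way to check this yourself is `np.shares_memory`, which reports whether two arrays overlap in memory. The snippet below is a rough sketch on a toy array (the array, `test_size`, and `random_state` are assumptions for the example). The `shuffle=False` case is printed without a stated result, since the outcome may depend on the sklearn version.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)     # toy dataset, already in memory

# Default (shuffled) split: the subsets are built by index selection.
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)
print("shuffled split shares memory with X:", np.shares_memory(X, X_train))

# Plain NumPy comparison: a basic slice is a view, while advanced
# (integer-array) indexing returns a copy.
view = X[:7]
picked = X[np.arange(7)]
print("slice shares memory:", np.shares_memory(X, view))
print("advanced indexing shares memory:", np.shares_memory(X, picked))

# The unshuffled case the comment is unsure about: check it on your version.
X_head, X_tail = train_test_split(X, test_size=0.3, shuffle=False)
print("unshuffled split shares memory with X:", np.shares_memory(X, X_head))
```

Either way, the split happens in memory and no files are written.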

Hard to tell exactly, but I think the question was just whether the split happens in memory or not (maybe the user thought it was creating files, I'm not sure). So maybe edit your answer to remove the mention of copies?

Thanks Arturo for the edit, let's see whether that answers the user's question!
