Validation and learning curves

Hey, while going through the "Effect of the sample size in cross-validation" module,
I could not gather what the n_jobs parameter in learning_curve() does and why it was set to 2.
Also, what purpose does plt.xscale("log") serve, and why are we scaling the x-axis at all?

n_jobs parallelizes the operations. It corresponds to the number of CPU cores that you want to make available for this computation.

For learning_curve, you want to train multiple models, each fitted on a training set of a different size. Each training can be done independently, which is a typical case of an "embarrassingly parallel" workload: Embarrassingly parallel - Wikipedia (scikit-learn handles the parallelization for you).

So here, you can pass n_jobs=2 and expect a speed-up of almost 2x. If your machine has more cores, you can increase this number. We set it to 2 because that is the number of cores made available on the server where the notebooks are run.
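As a minimal sketch (using a hypothetical toy dataset, not the one from the module), the call looks like this: learning_curve fits one model per training-set size per cross-validation fold, and n_jobs controls how many of those fits run in parallel:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Hypothetical toy data, just to illustrate the call
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 5 different training-set sizes
    cv=5,                                   # 5-fold cross-validation
    n_jobs=2,                               # run the independent fits on 2 CPU cores
)

# One row per training-set size, one column per CV fold
print(train_sizes)         # absolute number of samples used at each step
print(test_scores.shape)   # (5, 5)
```

Since each of the 5 sizes x 5 folds = 25 fits is independent, n_jobs=2 can keep two cores busy throughout, which is where the near-2x speed-up comes from.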

plt.xscale("log") is to get a logarithmic x-axis. You can remove it to see the difference in the plot. Without the log scale, it would be harder to see the changes happening at low sample sizes compared to high sample sizes. This is just for visualization purposes.


Thank you so much for the speedy reply 🙂