Very slow cross-validation and accuracy reproducibility

In the solution of exercise M1.05, in the section dealing with “Scaling numerical features”, the cross-validation accuracy is evaluated when the pipeline scales the numerical features.
I wrote the exact same code, but the accuracy reported in the example is mean=0.874 +/- 0.003, slightly different from mine, mean=0.8733 +/- 0.003.
Is this something I should be concerned about?

It seems to me I wrote the exact same code, so I expected to get the exact same value.
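
For context, here is a rough, hypothetical sketch of the kind of code involved (it is not the exact solution code; synthetic data stands in for the dataset used in the exercise), just to show how the mean and standard deviation of the cross-validation scores are obtained:

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the numerical columns of the exercise's dataset.
data, target = make_classification(n_samples=5_000, n_features=10, random_state=0)

# Pipeline that scales the numerical features before fitting the classifier.
model = make_pipeline(StandardScaler(), HistGradientBoostingClassifier())

cv_results = cross_validate(model, data, target, cv=5)
scores = cv_results["test_score"]
print(f"mean={scores.mean():.3f} +/- {scores.std():.3f}")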

Also, the computational time is quite different (25 s instead of the 5 s reported in the example), but I thought it could be caused by different hardware (although shouldn't the Jupyter server be the same?).

Nope, the difference is insignificant.
We might see slightly non-deterministic behaviour since we don't set random_state when initializing the HistGradientBoostingClassifier.
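
If you want repeated runs to give identical scores, a minimal sketch of the change (assuming nothing else varies, e.g. the data and the scikit-learn version) is to set it explicitly:

from sklearn.ensemble import HistGradientBoostingClassifier

# With a fixed random_state, repeated cross-validation runs on the same data
# and the same scikit-learn version give identical scores.
model = HistGradientBoostingClassifier(random_state=0)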

> Also, the computational time is quite different (25 s instead of the 5 s reported in the example), but I thought it could be caused by different hardware (although shouldn't the Jupyter server be the same?).

Yes, it will depend on the machine and processor that you get at the time of the exercise.

I am experiencing the same slowness, seeing 48 s, 51 s and 173 s on 3 runs within the hosted notebooks.

Clearly it’s the relative slowness that matters here.

It might make sense to add a short word of warning at the beginning of the exercise (or of the solution) to encourage people to be patient, especially if they have seen the 5 s reported in the example, because one might easily get the feeling that their code is wrong and interrupt the kernel :slight_smile:

Indeed, there is probably something unexpected in the configuration of the JupyterHub that makes the notebooks run more slowly than we anticipated when writing this code. We will investigate.

For information, I opened an issue to track this problem:

In the meantime, you can insert a cell at the beginning of the notebook such as:

import os

# Limit OpenMP and OpenBLAS to a single thread each, to avoid oversubscription
# in the shared JupyterHub containers.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

then restart the kernel and execute all the cells from the beginning again; that should make this cross-validation run much faster (less than 6 s).
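
If you want to double-check that the limit is actually taken into account, one possible sketch (assuming threadpoolctl is available, which it normally is since scikit-learn depends on it) is:

from threadpoolctl import threadpool_info

# Each entry describes one native thread pool (OpenMP, OpenBLAS, ...);
# after the workaround above, 'num_threads' should be 1 for all of them.
for pool in threadpool_info():
    print(pool["user_api"], pool["num_threads"])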

OK, the configuration has been fixed to allow up to 4 CPUs in the JupyterHub containers. Note that the Jupyter nodes are shared with up to 10 other users, so from time to time you might experience a slight slow-down, but most of the time this cell should now take less than 5 s to execute.

The new configuration will come into effect once your current session expires (or after you manually stop your Jupyter Hub server using https://cloud-mooc.inria.fr/hub/home if you do not want to wait).

Regarding the original question about the accuracy, I confirm that there might be very small random effects because we did not set random_state explicitly for this model.

Thanks Benoit for the fix.

Side-comment: if you still see the problem, you may need to restart your Jupyter server.

To make sure you have the fix, execute this code:

import os

# Inspect the OpenMP thread limit set in the container environment.
os.environ['OMP_NUM_THREADS']

The output should be '4' (environment variables are strings).