Very slow cross-validation and accuracy reproducibility

In the solution of exercise M1.05, in the section dealing with “Scaling numerical features”, the cross-validation accuracy is evaluated when the pipeline scales the numerical features.
I wrote the exact same code, but the accuracy reported in the example is mean=0.874 +/- 0.003, slightly different from mine, mean=0.8733 +/- 0.003.
Is this something I should be concerned about?

It seems to me I wrote the exact same code, so I expected to get the exact same value.
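
For context, here is a rough, hypothetical sketch of the kind of code involved (it is not the exact solution code; synthetic data stands in for the dataset used in the exercise), just to show how the mean and standard deviation of the cross-validation scores are obtained:

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the numerical columns of the exercise's dataset.
data, target = make_classification(n_samples=5_000, n_features=10, random_state=0)

# Pipeline that scales the numerical features before fitting the classifier.
model = make_pipeline(StandardScaler(), HistGradientBoostingClassifier())

cv_results = cross_validate(model, data, target, cv=5)
scores = cv_results["test_score"]
print(f"mean={scores.mean():.3f} +/- {scores.std():.3f}")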

Also, the computational time is quite different (25 s instead of the 5 s reported in the example), but I thought it could be caused by different hardware (although shouldn't the Jupyter server be the same?).

Nope, the difference is insignificant.
We might see slightly non-deterministic behaviour since we don't set random_state when initializing the HistGradientBoostingClassifier.
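
If you want repeated runs to give identical scores, a minimal sketch of the change (assuming nothing else varies, e.g. the data and the scikit-learn version) is to set it explicitly:

from sklearn.ensemble import HistGradientBoostingClassifier

# With a fixed random_state, repeated cross-validation runs on the same data
# and the same scikit-learn version give identical scores.
model = HistGradientBoostingClassifier(random_state=0)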

> Also, the computational time is quite different (25 s instead of the 5 s reported in the example), but I thought it could be caused by different hardware (although shouldn't the Jupyter server be the same?).

Yes, it will depend on the machine and processor that you get at the time of the exercise.

I am experiencing the same slowness, seeing 48 s, 51 s and 173 s on 3 runs within the hosted notebooks.

Clearly it’s the relative slowness that matters here.

It might make sense to add a short word of warning at the beginning of the exercise (or of the solution) to encourage people to be patient, especially if they have seen the 5 s reported in the example, because one might easily get the feeling that their code is wrong and interrupt the kernel :slight_smile:

Indeed, there is probably something unexpected in the configuration of the JupyterHub that makes the notebooks run more slowly than we anticipated when writing this code. We will investigate.

For information, I opened an issue to track this problem:

In the meantime, you can insert a cell at the beginning of the notebook such as:

import os

# Limit OpenMP and OpenBLAS to a single thread each, to avoid oversubscription
# in the shared JupyterHub containers.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

then restart the kernel and execute all the cells from the beginning again; that should make this cross-validation run much faster (less than 6 s).
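
If you want to double-check that the limit is actually taken into account, one possible sketch (assuming threadpoolctl is available, which it normally is since scikit-learn depends on it) is:

from threadpoolctl import threadpool_info

# Each entry describes one native thread pool (OpenMP, OpenBLAS, ...);
# after the workaround above, 'num_threads' should be 1 for all of them.
for pool in threadpool_info():
    print(pool["user_api"], pool["num_threads"])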

OK, the configuration has been fixed to allow up to 4 CPUs in the JupyterHub containers. Note that the Jupyter nodes are shared with up to 10 other users, so from time to time you might experience a slight slow-down, but most of the time this cell should now take less than 5 s to execute.

The new configuration will come into effect once your current session expires (or after you manually stop your Jupyter Hub server using https://cloud-mooc.inria.fr/hub/home if you do not want to wait).

Regarding the original question about the accuracy, I confirm that there might be very small random effects because we did not set random_state explicitly for this model.

Thanks Benoit for the fix.

Side-comment: if you still see the problem, you may need to restart your Jupyter server.

To make sure you have the fix, execute this code:

import os

# Inspect the OpenMP thread limit set in the container environment.
os.environ['OMP_NUM_THREADS']

The output should be '4' (environment variables are strings).