Discretizer drastically reduced fit time?

In the lesson “speeding-up gradient_boosting”, you claim that using the KBinsDiscretizer in the pipeline before the GradientBoostingRegressor drastically reduced the fit time.
But:

  • without the discretizer the fit time is 6.727 seconds on your server and 5.870 seconds on my computer

  • with the discretizer the fit time is 4.273 seconds on your server and 3.667 seconds on my computer

So the improvement is only ~2 seconds, a mere 30%. I would not call that a drastic reduction.
If you can confirm the fit times I obtained, and that this is not a strange bug, I think you should not use such a superlative and should simply say that the discretizer reduced the fit time. It’s less sexy but more accurate.

PS: for me, the truly drastic improvement comes from the use of the HistGradientBoostingRegressor at the end of the lesson.
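For anyone who wants to reproduce the comparison, here is a minimal sketch of what I timed. The dataset and parameters are my own assumptions, not necessarily the exact ones from the lesson, and absolute times will of course differ between machines:

```python
import time

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

data, target = fetch_california_housing(return_X_y=True)

# Plain gradient boosting: split finding sorts the raw feature values.
gbrt = GradientBoostingRegressor()
start = time.time()
gbrt.fit(data, target)
print(f"Without discretizer: {time.time() - start:.3f} s")

# Discretizing first leaves far fewer unique values to sort at each split.
binned_gbrt = make_pipeline(
    KBinsDiscretizer(n_bins=256, encode="ordinal", strategy="quantile"),
    GradientBoostingRegressor(),
)
start = time.time()
binned_gbrt.fit(data, target)
print(f"With discretizer:    {time.time() - start:.3f} s")
```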


Hi, I think it depends on the number of samples in the dataset. The performance difference is greater with very large datasets.

You are probably right, but the lesson says:

Here, we see that the fit time has been drastically reduced

so the “drastically” applies to the example in the lesson.

Thanks for your comment. I agree we should try to avoid unclear words like “drastically”; I’ll tag this one for the next MOOC session.

I am not sure KBinsDiscretizer + GradientBoostingRegressor would be that impressive on very large datasets: it still uses an n log(n) sorting operation internally at each split. HistGradientBoostingRegressor, on the other hand, should be much faster on large datasets because it uses histograms instead of sorting internally. The complexity is therefore linear in the number of samples.
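To make the contrast concrete, here is a minimal sketch timing HistGradientBoostingRegressor on the same kind of data as above (assuming scikit-learn >= 1.0, where the estimator is no longer experimental; the dataset choice is just an assumption and the times are illustrative):

```python
import time

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor

data, target = fetch_california_housing(return_X_y=True)

# Features are binned once up front; each split then scans at most
# `max_bins` histogram buckets instead of sorting n raw values.
hgbrt = HistGradientBoostingRegressor()
start = time.time()
hgbrt.fit(data, target)
print(f"HistGradientBoostingRegressor: {time.time() - start:.3f} s")
```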


Thanks @ogrisel :blush:
My guess was about KBinsDiscretizer + GradientBoostingRegressor vs GradientBoostingRegressor, not HistGradientBoostingRegressor.
Theoretically, KBinsDiscretizer should make HistGradientBoostingRegressor do fewer splits, but from what you tell me the computational cost on large datasets could still be high, right?
I agree with you that HistGradientBoostingRegressor is the most efficient choice in this case.

:+1:
As Oscar Wilde said:

Everything in moderation, including moderation.

That works when you eat or drink, and also when you are writing a MOOC :rofl:


HistGradientBoostingRegressor already bins numerical data internally. You can control the number of bins with max_bins (the default of 255, plus one bin reserved for missing values, works well in practice). Using KBinsDiscretizer is therefore redundant.
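For example, something like the following sketch (the value shown is just the default made explicit):

```python
from sklearn.ensemble import HistGradientBoostingRegressor

# No KBinsDiscretizer step needed: the estimator bins the data itself.
hgbrt = HistGradientBoostingRegressor(max_bins=255)  # 255 is the default
```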

Sorry, I wrote that wrong; this is the right question:

“Theoretically KBinsDiscretizer should make GradientBoostingRegressor do fewer splits, but from what you tell me the computational cost on large datasets could be high, right?”


Yes, GradientBoostingRegressor would still do the n log(n) sort operations, which are expensive when n is large.


Solved in Fix wording in HGBDT notebook by glemaitre · Pull Request #508 · INRIA/scikit-learn-mooc · GitHub