Discretizer drastically reduced fit time?

In the lesson “speeding-up gradient_boosting”, you claim that using the KBinsDiscretizer in the pipeline before the GradientBoostingRegressor drastically reduced the fit time.
But:

  • without the discretizer the fit time is 6.727 seconds on your server and 5.870 seconds on my computer

  • with the discretizer the fit time is 4.273 seconds on your server and 3.667 seconds on my computer

So the improvement is only ~2 seconds, a mere 30%. I would not call that a drastic reduction.
If you can confirm the fit times I obtained, and that this is not a strange bug, I think you should not use such a superlative and should simply say that the discretizer reduced the fit time. It’s less sexy but more accurate.

PS: for me, the truly drastic improvement comes from the use of the HistGradientBoostingRegressor at the end of the lesson.
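For anyone who wants to reproduce the comparison, here is a minimal sketch of what I timed. The dataset and parameters are my own assumptions, not necessarily the exact ones from the lesson, and absolute times will of course differ between machines:

```python
import time

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

data, target = fetch_california_housing(return_X_y=True)

# Plain gradient boosting: split finding sorts the raw feature values.
gbrt = GradientBoostingRegressor()
start = time.time()
gbrt.fit(data, target)
print(f"Without discretizer: {time.time() - start:.3f} s")

# Discretizing first leaves far fewer unique values to sort at each split.
binned_gbrt = make_pipeline(
    KBinsDiscretizer(n_bins=256, encode="ordinal", strategy="quantile"),
    GradientBoostingRegressor(),
)
start = time.time()
binned_gbrt.fit(data, target)
print(f"With discretizer:    {time.time() - start:.3f} s")
```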


Hi, I think it depends on the number of samples in the dataset. The performance difference is greater with very large datasets.

You are probably right, but the lesson says:

Here, we see that the fit time has been drastically reduced

so the “drastically” applies to the example in the lesson.

Thanks for your comment. I agree we should try to avoid unclear words like “drastically”; I’ll tag this one for the next MOOC session.

I am not sure KBinsDiscretizer + GradientBoostingRegressor would be that impressive on very large datasets: it still uses an n log(n) sorting operation internally at each split. HistGradientBoostingRegressor, on the other hand, should be much faster on large datasets because it uses histograms instead of sorting internally. The complexity is therefore linear in the number of samples.
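To make the contrast concrete, here is a minimal sketch timing HistGradientBoostingRegressor on the same kind of data as above (assuming scikit-learn >= 1.0, where the estimator is no longer experimental; the dataset choice is just an assumption and the times are illustrative):

```python
import time

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor

data, target = fetch_california_housing(return_X_y=True)

# Features are binned once up front; each split then scans at most
# `max_bins` histogram buckets instead of sorting n raw values.
hgbrt = HistGradientBoostingRegressor()
start = time.time()
hgbrt.fit(data, target)
print(f"HistGradientBoostingRegressor: {time.time() - start:.3f} s")
```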


Thanks @ogrisel :blush:
My guess was about KBinsDiscretizer + GradientBoostingRegressor vs GradientBoostingRegressor, not HistGradientBoostingRegressor.
Theoretically, KBinsDiscretizer should make HistGradientBoostingRegressor do fewer splits, but from what you tell me the computational cost on large datasets could still be high, right?
I agree with you that HistGradientBoostingRegressor is the most efficient choice in this case.

:+1:
As Oscar Wilde said:

Everything in moderation, including moderation.

That works when you eat or drink, and also when you are writing a MOOC :rofl:


HistGradientBoostingRegressor already bins numerical data internally. You can control the number of bins with max_bins (the default of 255, plus one bin reserved for missing values, works well in practice). Using KBinsDiscretizer is therefore redundant.
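For example, something like the following sketch (the value shown is just the default made explicit):

```python
from sklearn.ensemble import HistGradientBoostingRegressor

# No KBinsDiscretizer step needed: the estimator bins the data itself.
hgbrt = HistGradientBoostingRegressor(max_bins=255)  # 255 is the default
```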

Sorry, I wrote that wrong; this is the right question:

“Theoretically KBinsDiscretizer should make GradientBoostingRegressor do fewer splits, but from what you tell me the computational cost on large datasets could be high, right?”


Yes, GradientBoostingRegressor would still do the n log(n) sort operations, which are expensive when n is large.


Solved in Fix wording in HGBDT notebook by glemaitre · Pull Request #508 · INRIA/scikit-learn-mooc · GitHub