
ENH Replace loss module HGBT #20811

Merged: 155 commits merged into scikit-learn:main on Jan 11, 2022

Conversation

@lorentzenchr (Member) commented Aug 22, 2021

Reference Issues/PRs

Follow-up of #20567.

What does this implement/fix? Explain your changes.

This PR replaces the losses of HGBT in sklearn/ensemble/_hist_gradient_boosting with the new common loss module of #20567.
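As a hedged illustration (not the actual sklearn/_loss implementation, which is private Cython code), the kind of interface such a common loss module exposes can be sketched for half squared error, a loss whose hessian is constant:

```python
import numpy as np

# Hypothetical sketch of a common loss interface. For half squared error:
#   loss(y, raw) = 0.5 * (raw - y)**2
#   gradient     = raw - y
#   hessian      = 1  (constant, so a flag like constant_hessian would be True)
def half_squared_error_loss(y_true, raw_prediction):
    return 0.5 * (raw_prediction - y_true) ** 2

def half_squared_error_gradient(y_true, raw_prediction):
    return raw_prediction - y_true

y = np.array([1.0, 2.0])
raw = np.array([1.5, 1.0])
assert np.allclose(half_squared_error_loss(y, raw), [0.125, 0.5])
assert np.allclose(half_squared_error_gradient(y, raw), [0.5, -1.0])
```

A gradient boosting fit loop only needs these per-sample gradients (and hessians), which is what makes a shared loss module reusable across estimators.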

Any other comments?

Similar to #19089, but only HGBT.
This PR is based on 7bee26b.
Edit: #20567 was merged 🚀

@lorentzenchr (Member, Author) commented Dec 5, 2021

I can't reproduce the CircleCI failure locally: /home/circleci/project/doc/whats_new/v1.1.rst:338: WARNING: Title underline too short.

@lorentzenchr (Member, Author) commented Dec 11, 2021

Marking as high priority because it enables #21800 very easily.

(resolved review thread: sklearn/_loss/tests/test_loss.py)
(resolved review thread: sklearn/_loss/loss.py)
if self._loss.constant_hessian:
    self._loss.gradient(
        y_true=y_train,
        raw_prediction=raw_predictions.T,
@thomasjpfan (Member) commented Dec 18, 2021

Interesting, I thought there would be some CPU cache performance issue with either "C" or "F" order (depending on which axis prange iterates over).
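The cache concern raised above comes down to memory layout: whether traversing a given axis touches contiguous memory. A minimal sketch with NumPy (a hypothetical illustration, not code from the PR):

```python
import numpy as np

# Two arrays with identical values but different memory layout.
# Row-wise traversal of the C-ordered array touches contiguous memory;
# the same traversal of the F-ordered array strides through memory.
a_c = np.arange(12, dtype=np.float64).reshape(3, 4, order="C")
a_f = np.asfortranarray(a_c)

assert np.array_equal(a_c, a_f)
assert a_c.flags["C_CONTIGUOUS"] and a_f.flags["F_CONTIGUOUS"]

# A row slice of the C-ordered array is contiguous (stride 8 bytes per
# float64 element); the F-ordered row jumps 24 bytes between elements.
assert a_c.strides == (32, 8)
assert a_f.strides == (8, 24)
```

Whether this matters in practice depends on which axis the parallel loop iterates over, which is presumably why no slowdown was observed here.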

@thomasjpfan (Member) left a comment

Overall looks good. There is some overhead in remembering that raw_predictions has shape (n_trees_per_iteration, n_samples) while gradient and hessian have shape (n_samples, n_trees_per_iteration).

Was there a reason for BaseLoss to prefer (n_samples, n_trees_per_iteration) over the reverse?

@lorentzenchr (Member, Author) commented Dec 18, 2021

Overall looks good. There is some overhead in remembering that raw_predictions has shape (n_trees_per_iteration, n_samples) while gradient and hessian have shape (n_samples, n_trees_per_iteration).

Was there a reason for BaseLoss to prefer (n_samples, n_trees_per_iteration) over the reverse?

TBH, that's just my preferred way of looking at it. This way, raw_predictions, X, y, predict() and predict_proba all have samples on the first axis (axis=0).

I could try to make it consistent and change all HGBT functions to use raw_prediction.shape=(n_samples, n_trees_per_iteration).

Edit: 27e818f is an attempt to do so. @thomasjpfan what do you think?
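The samples-first convention argued for above can be sketched as follows (a hypothetical illustration; the array names mirror the discussion, not the actual sklearn internals):

```python
import numpy as np

# Samples-first convention: every per-sample array puts samples on axis 0,
# so X, y, raw_prediction, gradient, and predict_proba output all align.
n_samples, n_trees_per_iteration = 5, 3  # e.g. a 3-class classifier

X = np.zeros((n_samples, 2))
y = np.zeros(n_samples)
raw_prediction = np.zeros((n_samples, n_trees_per_iteration))
gradient = np.zeros_like(raw_prediction)

# All sample-indexed arrays agree on axis 0, so no transposes are needed
# when passing them between the estimator and the loss module.
assert (
    X.shape[0] == y.shape[0] == raw_prediction.shape[0]
    == gradient.shape[0] == n_samples
)
```

With the reverse convention, raw_prediction would need a transpose at every boundary between the boosting loop and the loss functions.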

@thomasjpfan (Member) commented Dec 18, 2021

Edit: 27e818f is an attempt to do so. @thomasjpfan what do you think?

Yea, I think everything is much nicer now. (The diff is surprisingly smaller than I thought it would be.)

@thomasjpfan (Member) left a comment

LGTM

(resolved review thread: sklearn/_loss/loss.py)
@jjerphan (Member) left a comment

Thank you for this PR, @lorentzenchr.

Here are a few comments.

(resolved review thread: sklearn/_loss/loss.py)
@jjerphan (Member) left a comment

LGTM.

Edit: #22173 has been opened for a follow-up.

@lorentzenchr (Member, Author) commented Jan 11, 2022

2 approvals 🎉
It seems customary that a reviewer merges. Is there a passage in the developer guidelines that I missed, or is it just a custom?

@thomasjpfan thomasjpfan changed the title [MRG] ENH Replace loss module HGBT ENH Replace loss module HGBT Jan 11, 2022
(resolved review thread: doc/whats_new/v1.1.rst)
@thomasjpfan (Member) left a comment

I did one more pass. This looks good for merging. Thank you for working on this!

@thomasjpfan thomasjpfan merged commit 4e974e0 into scikit-learn:main Jan 11, 2022
31 checks passed
@thomasjpfan (Member) commented Jan 11, 2022

Is there a passage in the developer guideline that I missed or is it just custom to do so?

I think it's mostly a custom. I do not see it documented anywhere.

@lorentzenchr lorentzenchr deleted the loss_module_hgbt branch Jan 11, 2022