Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Adding estimators_samples_ attribute to forest models #26736

Merged
merged 51 commits into from
Nov 3, 2023

Conversation

adam2392
Copy link
Member

@adam2392 adam2392 commented Jun 30, 2023

Reference Issues/PRs

Fixes: #26716

What does this implement/fix? Explain your changes.

  • Adds the estimators_samples_ property to the BaseForest class, which allows any forest-method to recompute the indices of the training samples per tree in the forest

Any other comments?

Note this is essentially very similar to Bagging* except we only care about the samples considered rather than samples and feature indices.

This is useful for example for potentially allowing the trees to get closer to enabling things like i) honesty and ii) analysis of the trained samples used for the tree.

Signed-off-by: Adam Li <adam2392@gmail.com>
@github-actions
Copy link

github-actions bot commented Jun 30, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ab3be64. Link to the linter CI: here

Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 marked this pull request as ready for review June 30, 2023 19:21
@adam2392
Copy link
Member Author

Kay this is ready for review!

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @adam2392!

doc/whats_new/v1.4.rst Show resolved Hide resolved
doc/whats_new/v1.4.rst Outdated Show resolved Hide resolved
sklearn/ensemble/_forest.py Outdated Show resolved Hide resolved
sklearn/ensemble/_forest.py Show resolved Hide resolved
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @adam2392. A few additional comments.

min_dependency_substitutions.rst Outdated Show resolved Hide resolved
sklearn/ensemble/_forest.py Outdated Show resolved Hide resolved
adam2392 and others added 3 commits August 15, 2023 10:45
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@ogrisel
Copy link
Member

ogrisel commented Oct 20, 2023

This attribute is already public on Bagging* models. I don't really see the point in implementing something private that we don't use internally in scikit-learn.

The goal of this attribute is to make it easier to inspect the result of the learning process via public API.

+1 for exposing it as a public attribute.

@ogrisel
Copy link
Member

ogrisel commented Oct 20, 2023

@adam2392 maybe you can clarify what kind of use you make of this attribute?

@adam2392
Copy link
Member Author

@adam2392 maybe you can clarify what kind of use you make of this attribute?

@ogrisel I plan on using this attribute for inspecting the training/testing samples used in fitting each individual tree. Right now, this is really hard to do as the fitting the forest process is entirely internal.

More broadly, this would allow someone to implement "honest trees" (ref: #19710) very easily.

  • fit each tree/forest
  • for each tree, get out-of-bag samples using the estimators_indices_ property and use those to re-fit the leaves
  • each tree is now "honest"

adam2392 and others added 2 commits October 24, 2023 09:16
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@adam2392
Copy link
Member Author

Hi I resolved the few docstring comments, but left #26736 (comment) and #26736 (comment) as those seem like they would require changing the behavior in Bagging* too.

Happy to do so though if you think it's warranted.

@adam2392 adam2392 requested a review from ogrisel October 24, 2023 13:19
@adam2392
Copy link
Member Author

Hi @ogrisel just wanted to gently follow-up here to prevent it getting lost.

Is there anything related to the current feature that you think should be still changed?

And a follow-up, do you think the deprecation and renaming of the estimators_samples_ -> training_samples_ API is warranted for a new GH issue for both Bagging* and Forest*?

@glemaitre glemaitre self-requested a review November 3, 2023 13:44
@glemaitre glemaitre changed the title [ENH] Adding estimators_samples_ for forest models ENH Adding estimators_samples_ attribute to forest models Nov 3, 2023
sklearn/ensemble/_forest.py Show resolved Hide resolved
sklearn/ensemble/_forest.py Show resolved Hide resolved
sklearn/ensemble/_forest.py Show resolved Hide resolved
sklearn/ensemble/_forest.py Show resolved Hide resolved
sklearn/ensemble/_forest.py Show resolved Hide resolved
sklearn/ensemble/tests/test_forest.py Outdated Show resolved Hide resolved
sklearn/ensemble/tests/test_forest.py Outdated Show resolved Hide resolved
sklearn/ensemble/tests/test_forest.py Outdated Show resolved Hide resolved
sklearn/ensemble/tests/test_forest.py Outdated Show resolved Hide resolved
sklearn/ensemble/tests/test_forest.py Show resolved Hide resolved
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM on my side.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM on my side.

adam2392 and others added 3 commits November 3, 2023 12:40
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>

assert isinstance(estimator_samples, list)
assert len(estimators_samples) == len(estimators)
assert estimators_samples[0].dtype == np.int32
Copy link
Member Author

@adam2392 adam2392 Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @glemaitre . One part of the if statement led to it being np.int64 instead of np.int32

@adam2392
Copy link
Member Author

adam2392 commented Nov 3, 2023

Otherwise LGTM on my side.

Thanks for the review! I've addressed the comments left by you. Lmk if there's anything I missed.

@glemaitre glemaitre self-requested a review November 3, 2023 18:29
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I will merge once the doc build passed.

@glemaitre glemaitre merged commit 3737909 into scikit-learn:main Nov 3, 2023
18 of 25 checks passed
@glemaitre
Copy link
Member

The failures were transient due to GitHub.

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…rn#26736)

Signed-off-by: Adam Li <adam2392@gmail.com>
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:ensemble Waiting for Second Reviewer First reviewer is done, need a second one!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding estimators_samples_ property to ensemble of tree methods such as RandomForest and ExtraTrees
5 participants