FIX VotingClassifier handles class weights #19753
Conversation
Hey @MaxwellLZH, thanks for the PR. Could you write some unit tests for the changes?

Hi @azihna, I've added a test case to check.
Thank you for working on this @MaxwellLZH!

My suggestion in #18550 (comment) would not work for nested meta-estimators. The following would error:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA

log_reg = LogisticRegression(multi_class='multinomial', random_state=1)
rf = RandomForestClassifier(n_estimators=50, random_state=1,
                            class_weight={1: 2, 2: 3})
rf_pipe = make_pipeline(PCA(), rf)

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

eclf1 = VotingClassifier(estimators=[
    ('lr', log_reg), ('rf', rf_pipe)], voting='hard')
eclf1 = eclf1.fit(X, y)
```
Given the current inheritance structure, we most likely need to:

- override `_validate_estimators`;
- use `get_params` to look for every key that ends with `class_weight`;
- if a key ends with `class_weight`, clone the estimator and update it to the new mapping (a sketch follows below).
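For concreteness, a minimal sketch of that idea (not the PR's actual code: `_remap_class_weights` is a hypothetical helper name, and `label_encoder` stands in for the meta-estimator's fitted `LabelEncoder`, e.g. `self.le_` in `VotingClassifier`):

```python
from collections.abc import Mapping

from sklearn.base import clone


def _remap_class_weights(estimators, label_encoder):
    """Re-key dict-valued ``class_weight`` params from user-facing labels
    to the integer encoding used by ``label_encoder``."""
    remapped = []
    for name, clf in estimators:
        if clf == 'drop':  # VotingClassifier allows dropping an estimator
            remapped.append((name, clf))
            continue
        already_cloned = False
        for key, value in clf.get_params(deep=True).items():
            # Only remap explicit mappings; skip None and strings
            # such as 'balanced'.
            if not key.endswith('class_weight') or not isinstance(value, Mapping):
                continue
            if not already_cloned:
                # Clone once so the user's original estimator is untouched.
                clf = clone(clf)
                already_cloned = True
            encoded = {
                label_encoder.transform([label])[0]: weight
                for label, weight in value.items()
            }
            # Nested keys such as 'randomforestclassifier__class_weight'
            # are handled by set_params as well.
            clf.set_params(**{key: encoded})
        remapped.append((name, clf))
    return remapped
```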
I am trying to think of ways this could lead to unexpected or surprising behavior. There is a weird case where a third-party estimator can name a parameter `class_weight` that has a completely different meaning than ours. In that case, the automatic mapping may break their code.
sklearn/ensemble/_voting.py
```python
        continue
    is_clone = False
    for k, v in clf.get_params(deep=True).items():
        if k.endswith('class_weight') and v is not None:
```
`v` can be a string. For example, `LogisticRegression`'s `class_weight` can be `'balanced'`. We would need to guard against it. I also prefer to use the following style so we have fewer indented lines:
```python
if not k.endswith('class_weight') or not isinstance(v, Mapping):
    continue
if not already_cloned:
    clf = clone(clf)
    already_cloned = True
...
```
Hi Thomas, I've updated the code to guard against string argument values.
> I am trying to think of ways this could lead to unexpected or surprising behavior. There is a weird case where a third-party estimator can name a parameter `class_weight` that has a completely different meaning than ours. In that case, the automatic mapping may break their code.
Hi Thomas, I'm thinking maybe we could check whether the estimator is a third-party one, and if so, raise a warning if it has a `class_weight` parameter. WDYT?
Looking at this again, we clearly define `class_weight` in the glossary, so we can assume third-party estimators use the same definition.

The underlying issue with passing information from a meta-estimator to inner estimators is very common in sklearn. In this case, we are trying to use the classes information from the meta-estimator to re-encode the `class_weight` in the inner estimator.

I am unable to think of a good alternative to this PR without adding more API. I'm thinking broadly to see if we want to apply the same approach to other meta-estimators that have the same issue (such as `StackingClassifier`).
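To make that mismatch concrete (an illustrative example, not from the PR): `VotingClassifier` label-encodes the target before fitting the inner estimators, so a dict-valued `class_weight` keyed by the original labels no longer matches the target the inner estimator actually sees, and the keys must be re-encoded:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(["cat", "cat", "dog", "dog"])
class_weight = {"cat": 2, "dog": 3}

# The meta-estimator fits a LabelEncoder on y and passes the encoded
# target to the inner estimators.
le = LabelEncoder().fit(y)
print(le.transform(y))  # [0 0 1 1]

# The inner estimator therefore needs the weights keyed by the encoded
# labels, not by the original ones.
encoded = {int(le.transform([k])[0]): v for k, v in class_weight.items()}
print(encoded)  # {0: 2, 1: 3}
```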
I would propose to go a little further here. We would need a common test to solve the problem for all ensemble methods where the problem exists. I propose the following test that could be added to the common tests (imports included here for completeness):

```python
import pytest

from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC


@pytest.mark.parametrize(
    "classifier",
    [
        StackingClassifier(
            estimators=[
                ("lr", LogisticRegression()),
                ("svm", LinearSVC()),
                ("rf", RandomForestClassifier()),
            ]
        ),
        VotingClassifier(
            estimators=[
                ("lr", LogisticRegression()),
                ("svm", LinearSVC()),
                ("rf", RandomForestClassifier()),
            ]
        ),
        BaggingClassifier(base_estimator=LogisticRegression()),
        AdaBoostClassifier(base_estimator=LogisticRegression()),
    ],
)
def test_ensemble_support_class_weight(classifier):
    """Check that nested `class_weight` are encoded by the meta-estimators."""
    iris = load_iris()
    X, y = iris.data, iris.target_names[iris.target].astype(object)
    class_weight = {
        target_name: weight + 1
        for weight, target_name in enumerate(iris.target_names)
    }
    classifier = clone(classifier)
    if classifier._required_parameters:
        # Meta-estimators taking a list of (name, estimator) pairs.
        for nested_estimator in getattr(classifier, "estimators"):
            nested_estimator[1].set_params(class_weight=class_weight)
    else:
        # Meta-estimators taking a single base_estimator.
        classifier.set_params(
            base_estimator=classifier.base_estimator.set_params(
                class_weight=class_weight
            )
        )
    classifier.fit(X, y)
```

Thus, we should make the same type of fix in each of these meta-estimators. @thomasjpfan WDYT?
I am okay with something like this. With this, any meta-estimator that encodes the target must re-encode `class_weight` as well.
@MaxwellLZH would you be able to carry on with this PR?
I've opened another PR, #22039, to work on all the ensemble classifiers; closing this one.
Reference Issues/PRs
This is a fix for issue #18550.
What does this implement/fix? Explain your changes.
We look into the base estimators and encode their `class_weight` values if they are provided.
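A hypothetical usage sketch of the intended behavior (assuming the fix is in place; without it, fitting errors because the string-keyed weights no longer match the label-encoded target the inner estimator receives):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array(["a", "a", "a", "b", "b", "b"])

# class_weight is keyed by the user-facing labels; the meta-estimator
# is expected to translate the keys to its internal integer encoding.
rf = RandomForestClassifier(n_estimators=10, class_weight={"a": 2, "b": 3})
eclf = VotingClassifier(
    estimators=[("lr", LogisticRegression()), ("rf", rf)], voting="hard"
)
eclf.fit(X, y)
print(eclf.predict([[1, 1]]))
```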