FIX VotingClassifier handles class weights #19753

Open · wants to merge 10 commits into base: main

Conversation

MaxwellLZH (Contributor)

Reference Issues/PRs

This is a fix for issue #18550.

What does this implement/fix? Explain your changes.

We inspect the base estimators and re-encode their class_weights if they are provided.

azihna (Contributor) commented Apr 1, 2021

hey @MaxwellLZH, thanks for the PR. Could you write some unit tests on the changes?

MaxwellLZH (Contributor, Author)

> hey @MaxwellLZH, thanks for the PR. Could you write some unit tests on the changes?

Hi @azihna, I've added a test case to check that VotingClassifier works with string labels.

thomasjpfan (Member) left a comment

Thank you for working on this @MaxwellLZH !

My suggestion in #18550 (comment) would not work for nested meta-estimators. The following would error:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA

log_reg = LogisticRegression(multi_class='multinomial', random_state=1)
rf = RandomForestClassifier(n_estimators=50, random_state=1,
                             class_weight={1: 2, 2: 3})
rf_pipe = make_pipeline(PCA(), rf)

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

eclf1 = VotingClassifier(estimators=[
        ('lr', log_reg), ('rf', rf_pipe)], voting='hard')
eclf1 = eclf1.fit(X, y)

Given the current inheritance structure, we most likely need to

  1. override _validate_estimators
  2. use get_params to look for every key that ends with class_weight
  3. If there is a key that ends with class_weight clone the estimator, and update to the new mapping.

(Review threads on sklearn/ensemble/_voting.py and sklearn/ensemble/tests/test_voting.py marked outdated and resolved.)
thomasjpfan (Member) left a comment

I am trying to think of ways this could lead to unexpected or surprising behavior. There is a weird case where a third-party estimator names a parameter class_weight with a completely different meaning than ours. In that case, the automatic mapping may break their code.

sklearn/ensemble/_voting.py

        continue
    is_clone = False
    for k, v in clf.get_params(deep=True).items():
        if k.endswith('class_weight') and v is not None:
thomasjpfan (Member)

v can be a string. For example, LogisticRegression's class_weight can be 'balanced'. We would need to guard against it. I also prefer the following style so we have fewer indented lines:

if not k.endswith('class_weight') or not isinstance(v, Mapping):
    continue

if not already_cloned:
    clf = clone(clf)
    already_cloned = True
...
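A quick self-contained check of the point above (illustrative only; note that `Mapping` comes from `collections.abc`, an import the snippet leaves implicit): string values such as 'balanced' and None are not mappings, so the `isinstance(v, Mapping)` guard skips them and only dict-valued class_weight params are re-encoded.

```python
from collections.abc import Mapping

# Typical class_weight values: a dict needs re-encoding, while the
# string 'balanced' and None must be passed through unchanged.
values = [{1: 2, 2: 3}, "balanced", None]
needs_reencoding = [isinstance(v, Mapping) for v in values]
print(needs_reencoding)  # prints [True, False, False]
```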

MaxwellLZH (Contributor, Author)

Hi Thomas, I've updated the code to guard against string argument values.

MaxwellLZH (Contributor, Author)

> I am trying to think of ways this could lead to unexpected or surprising behavior. There is a weird case where a third-party estimator names a parameter class_weight with a completely different meaning than ours. In that case, the automatic mapping may break their code.

Hi Thomas, I'm thinking maybe we could check whether the estimator is a third-party one, and if so, raise a warning if it has a class_weight parameter. WDYT?

thomasjpfan (Member)

Looking at this again, we clearly define class_weights in the glossary so we can assume third party estimators use the same definition.

The underlying issue with passing information from a metaestimator to inner estimators is very common in sklearn. In this case, we are trying to use the classes information from the metaestimator to re-encode the class_weights in the inner estimator.
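To make that re-encoding concrete, here is a small illustration with LabelEncoder (a sketch with made-up data; the variable names are not sklearn internals). The meta-estimator fits its inner estimators on encoded labels, so a class_weight keyed by the original labels has to be re-keyed to match.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# User-facing labels and a class_weight keyed by those labels.
y = np.array(["cat", "dog", "dog", "cat", "bird"])
class_weight = {"bird": 3.0, "cat": 1.0, "dog": 2.0}

# The meta-estimator encodes y; inner estimators only ever see 0..n-1.
le = LabelEncoder().fit(y)
y_encoded = le.transform(y)  # bird -> 0, cat -> 1, dog -> 2

# So the weights must be re-keyed by the encoded labels.
encoded_weight = {
    int(le.transform([label])[0]): w for label, w in class_weight.items()
}
print(encoded_weight)  # prints {0: 3.0, 1: 1.0, 2: 2.0}
```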

I am unable to think of a good alternative to this PR without adding more API. I'm thinking broadly about whether we want to apply the same approach to other meta-estimators that have the same issue (e.g. StackingClassifier).

@glemaitre glemaitre self-requested a review July 23, 2021 18:49
glemaitre (Member)

I would propose to go a little further here. We would need a common test to solve the problem for all ensemble methods where the problem exists. I propose the following test that could be added in sklearn/ensemble/test_common.py:

@pytest.mark.parametrize(
    "classifier",
    [
        StackingClassifier(
            estimators=[
                ("lr", LogisticRegression()),
                ("svm", LinearSVC()),
                ("rf", RandomForestClassifier()),
            ]
        ),
        VotingClassifier(
            estimators=[
                ("lr", LogisticRegression()),
                ("svm", LinearSVC()),
                ("rf", RandomForestClassifier()),
            ]
        ),
        BaggingClassifier(base_estimator=LogisticRegression()),
        AdaBoostClassifier(base_estimator=LogisticRegression()),
    ]
)
def test_ensemble_support_class_weight(classifier):
    """Check that nested `class_weight` are encoded by the meta-estimators."""
    iris = load_iris()
    X, y = iris.data, iris.target_names[iris.target].astype(object)
    class_weight = {
        target_name: weight + 1
        for weight, target_name in enumerate(iris.target_names)
    }

    classifier = clone(classifier)
    if classifier._required_parameters:
        for nested_estimator in getattr(classifier, "estimators"):
            nested_estimator[1].set_params(class_weight=class_weight)
    else:
        classifier.set_params(
            base_estimator=classifier.base_estimator.set_params(
                class_weight=class_weight
            )
        )

    classifier.fit(X, y)

Thus, we should apply the same type of class_weight re-encoding in all these estimators. I am not 100% sure, but I think it should be possible to add the re-encoding in BaseEnsemble (which has a base_estimator attribute to be cloned) and _BaseHeterogeneousEnsemble (which has an estimators attribute to be cloned). The inheritance should do the rest, I assume.

@thomasjpfan WDYT?

thomasjpfan (Member)

> Thus, we should make the same type of class_weight re-encoding for all these estimators.

I am okay with something like this. With this, any meta-estimator that encodes the target must re-encode the class_weights of its base estimators.

glemaitre (Member)

@MaxwellLZH would you be able to carry on this PR?

@glemaitre glemaitre removed their request for review December 16, 2021 08:56
MaxwellLZH (Contributor, Author)

> @MaxwellLZH would you be able to carry on this PR?

Hi @glemaitre, sorry for leaving this PR for so long. I will work on it over the weekend.

MaxwellLZH (Contributor, Author)

I've opened another PR #22039 to work on all the ensemble classifiers, so I'm closing this one.

@MaxwellLZH MaxwellLZH closed this Dec 21, 2021
@thomasjpfan thomasjpfan reopened this Dec 21, 2021