FIX VotingClassifier handles class weights #19753

Open · wants to merge 10 commits into base: main

Conversation

MaxwellLZH (Contributor)

Reference Issues/PRs

This is a fix for issue #18550.

What does this implement/fix? Explain your changes.

We inspect the base estimators and re-encode their class_weights if they are provided.

azihna (Contributor) commented Apr 1, 2021

hey @MaxwellLZH, thanks for the PR. Could you write some unit tests on the changes?

MaxwellLZH (Contributor, Author)

> hey @MaxwellLZH, thanks for the PR. Could you write some unit tests on the changes?

Hi @azihna, I've added a test case to check that VotingClassifier works with string labels.

thomasjpfan (Member) left a comment

Thank you for working on this @MaxwellLZH !

My suggestion in #18550 (comment) would not work for nested meta-estimators. The following would error:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA

log_reg = LogisticRegression(multi_class='multinomial', random_state=1)
rf = RandomForestClassifier(n_estimators=50, random_state=1,
                             class_weight={1: 2, 2: 3})
rf_pipe = make_pipeline(PCA(), rf)

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

eclf1 = VotingClassifier(estimators=[
        ('lr', log_reg), ('rf', rf_pipe)], voting='hard')
eclf1 = eclf1.fit(X, y)

Given the current inheritance structure, we most likely need to

  1. override _validate_estimators
  2. use get_params to look for every key that ends with class_weight
  3. If there is a key that ends with class_weight clone the estimator, and update to the new mapping.

(Review threads on sklearn/ensemble/_voting.py and sklearn/ensemble/tests/test_voting.py marked outdated and resolved.)
thomasjpfan (Member) left a comment

I am trying to think of ways this could lead to unexpected or surprising behavior. There is a weird case where a third-party estimator names a parameter class_weight with a completely different meaning than ours. In that case, the automatic mapping may break their code.

sklearn/ensemble/_voting.py

        continue
    is_clone = False
    for k, v in clf.get_params(deep=True).items():
        if k.endswith('class_weight') and v is not None:
thomasjpfan (Member)

v can be a string. For example, LogisticRegression's class_weight can be 'balanced'. We would need to guard against it. I also prefer the following style so we have fewer indented lines:

if not k.endswith('class_weight') or not isinstance(v, Mapping):
    continue

if not already_cloned:
    clf = clone(clf)
    already_cloned = True
...
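A quick self-contained check of the point above (illustrative only; note that `Mapping` comes from `collections.abc`, an import the snippet leaves implicit): string values such as 'balanced' and None are not mappings, so the `isinstance(v, Mapping)` guard skips them and only dict-valued class_weight params are re-encoded.

```python
from collections.abc import Mapping

# Typical class_weight values: a dict needs re-encoding, while the
# string 'balanced' and None must be passed through unchanged.
values = [{1: 2, 2: 3}, "balanced", None]
needs_reencoding = [isinstance(v, Mapping) for v in values]
print(needs_reencoding)  # prints [True, False, False]
```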

MaxwellLZH (Contributor, Author)

Hi Thomas, I've updated the code to guard against string argument values.

MaxwellLZH (Contributor, Author)

> I am trying to think of ways this could lead to unexpected or surprising behavior. There is a weird case where a third-party estimator names a parameter class_weight with a completely different meaning than ours. In that case, the automatic mapping may break their code.

Hi Thomas, I'm thinking maybe we could check whether the estimator is a third-party one, and if so, raise a warning if it has a class_weight parameter. WDYT?

thomasjpfan (Member)

Looking at this again, we clearly define class_weights in the glossary so we can assume third party estimators use the same definition.

The underlying issue with passing information from a metaestimator to inner estimators is very common in sklearn. In this case, we are trying to use the classes information from the metaestimator to re-encode the class_weights in the inner estimator.
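To make that re-encoding concrete, here is a small illustration with LabelEncoder (a sketch with made-up data; the variable names are not sklearn internals). The meta-estimator fits its inner estimators on encoded labels, so a class_weight keyed by the original labels has to be re-keyed to match.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# User-facing labels and a class_weight keyed by those labels.
y = np.array(["cat", "dog", "dog", "cat", "bird"])
class_weight = {"bird": 3.0, "cat": 1.0, "dog": 2.0}

# The meta-estimator encodes y; inner estimators only ever see 0..n-1.
le = LabelEncoder().fit(y)
y_encoded = le.transform(y)  # bird -> 0, cat -> 1, dog -> 2

# So the weights must be re-keyed by the encoded labels.
encoded_weight = {
    int(le.transform([label])[0]): w for label, w in class_weight.items()
}
print(encoded_weight)  # prints {0: 3.0, 1: 1.0, 2: 2.0}
```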

I am unable to think of a good alternative to this PR without adding more API. I'm thinking broadly about whether we want to apply the same approach to other meta-estimators that have the same issue (e.g. StackingClassifier).

@glemaitre glemaitre self-requested a review July 23, 2021 18:49
glemaitre (Member)

I would propose to go a little further here. We would need a common test to solve the problem for all ensemble methods where the problem exists. I propose the following test that could be added in sklearn/ensemble/test_common.py:

@pytest.mark.parametrize(
    "classifier",
    [
        StackingClassifier(
            estimators=[
                ("lr", LogisticRegression()),
                ("svm", LinearSVC()),
                ("rf", RandomForestClassifier()),
            ]
        ),
        VotingClassifier(
            estimators=[
                ("lr", LogisticRegression()),
                ("svm", LinearSVC()),
                ("rf", RandomForestClassifier()),
            ]
        ),
        BaggingClassifier(base_estimator=LogisticRegression()),
        AdaBoostClassifier(base_estimator=LogisticRegression()),
    ]
)
def test_ensemble_support_class_weight(classifier):
    """Check that nested `class_weight` are encoded by the meta-estimators."""
    iris = load_iris()
    X, y = iris.data, iris.target_names[iris.target].astype(object)
    class_weight = {
        target_name: weight + 1
        for weight, target_name in enumerate(iris.target_names)
    }

    classifier = clone(classifier)
    if classifier._required_parameters:
        for nested_estimator in getattr(classifier, "estimators"):
            nested_estimator[1].set_params(class_weight=class_weight)
    else:
        classifier.set_params(
            base_estimator=classifier.base_estimator.set_params(
                class_weight=class_weight
            )
        )

    classifier.fit(X, y)

Thus, we should apply the same type of class_weight re-encoding in all these estimators. I am not 100% sure, but I think it should be possible to add the re-encoding in BaseEnsemble (which has a base_estimator attribute to be cloned) and _BaseHeterogeneousEnsemble (which has an estimators attribute to be cloned). The inheritance should do the rest, I assume.

@thomasjpfan WDYT?

thomasjpfan (Member)

> Thus, we should make the same type of class_weight re-encoding for all these estimators.

I am okay with something like this. With this, any meta-estimator that encodes the target must re-encode the class_weights of its base estimators.

glemaitre (Member)

@MaxwellLZH would you be able to carry on this PR?

@glemaitre glemaitre removed their request for review December 16, 2021 08:56
MaxwellLZH (Contributor, Author)

> @MaxwellLZH would you be able to carry on this PR?

Hi @glemaitre, sorry for leaving this PR for so long. I will work on it over the weekend.

MaxwellLZH (Contributor, Author)

I've opened another PR #22039 to work on all the ensemble classifiers, so I'm closing this one.

@MaxwellLZH MaxwellLZH closed this Dec 21, 2021
@thomasjpfan thomasjpfan reopened this Dec 21, 2021