
ENH make CalibratedClassifierCV accept on fit_params #18170

Conversation

@BenjaminBossan (Contributor) commented Aug 16, 2020

Reference Issues/PRs

Re-creating PR #15218, the reason being that I probably messed up the rebase
there, leading to 865 files being changed. Here is a clean PR.

Partly addresses #12384

What does this implement/fix? Explain your changes.

This PR makes it possible to pass fit_params to the fit method of the
CalibratedClassifierCV, which are then routed to the underlying base
estimator.

Note: I implemented predict_proba on CheckingClassifier so that the new
unit test works.
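
For illustration, here is a minimal sketch of the usage this enables, assuming the scikit-learn 1.1 API (where the constructor parameter is still named base_estimator); ExtraParamLogReg and extra_weight are made-up names for this example, not part of the PR:

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative base estimator: its fit accepts an extra sample-aligned
# keyword so we can observe that CalibratedClassifierCV forwards it.
class ExtraParamLogReg(LogisticRegression):
    def fit(self, X, y, extra_weight=None):
        # record what was routed through, then fit as usual
        self.received_extra_weight_ = extra_weight
        return super().fit(X, y)

X, y = make_classification(n_samples=120, random_state=0)
extra = np.ones(len(y))  # fit params must be sample-aligned (checked against y)

calib = CalibratedClassifierCV(base_estimator=ExtraParamLogReg(), cv=3)
calib.fit(X, y, extra_weight=extra)  # forwarded to ExtraParamLogReg.fit per CV fold
proba = calib.predict_proba(X)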

BenjaminBossan and others added 13 commits Oct 12, 2019
Partly addresses scikit-learn#12325

This PR makes it possible to pass fit_params to the fit method of the
CalibratedClassifierCV, which are then routed to the underlying base
estimator.
After adding the predict_proba method on CheckingClassifier, the test
coverage decreased. Therefore, a test for predict_proba was
added. More tests for CheckingClassifier will be added in a separate
PR.
* remove unnecessary check for empty dict
* rename variable from T to the canonical X
Use _safe_indexing to index into fit_params in case they are not numpy
arrays.
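
As a rough sketch of the idea (illustrative helper, not the PR's actual code): sample-aligned fit_params are sliced to each training fold, and _safe_indexing handles lists and pandas objects as well as numpy arrays.

from sklearn.utils import _safe_indexing

def slice_fit_params(fit_params, train_idx):
    # restrict every sample-aligned fit_param to the current training fold
    return {key: _safe_indexing(value, train_idx)
            for key, value in fit_params.items()}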
@BenjaminBossan (Contributor, Author) commented Aug 16, 2020

@cmarmo Thank you for reminding me, I forgot about the PR.

As mentioned above, I recreated the PR because of some rebasing issues. The original PR had already been approved by ogrisel.

Apart from addressing the last reviewer comment by jnothman, the main work here was to integrate my changes with a recent PR that introduced parallel fitting in CalibratedClassifierCV via joblib, which necessitated a considerable rewrite of my changes.

@BenjaminBossan (Contributor, Author) commented Aug 16, 2020

An unrelated proposal from my side: We could add a return self after line 244:

            self.calibrated_classifiers_.append(calibrated_classifier)
        else:
            X, y = self._validate_data(

Using this early return, we can remove the else from line 245 and save one level of indentation spanning 50 LOC starting from line 246. If this is in line with the sklearn coding style, I could add that change on top of this PR.
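
A generic before/after sketch of the two styles (illustrative only, not the actual calibration.py code):

def fit_with_else(prefit):
    if prefit:
        result = "append prefitted calibrated classifier"
    else:
        # ~50 lines of cross-validated fitting live here, one level deeper
        result = "cross-validate and fit calibrated classifiers"
    return result

def fit_with_early_return(prefit):
    if prefit:
        return "append prefitted calibrated classifier"
    # the same ~50 lines, one indentation level shallower
    return "cross-validate and fit calibrated classifiers"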

@thomasjpfan (Member) commented Aug 16, 2020

Using this early return, we can remove the else from line 245 and save one level of indentation spanning 50 LOC starting from line 246. If this is in line with the sklearn coding style, I could add that change on top of this PR.

I think doing this style change in another PR would be better, so we do not increase the diff of this PR. On the proposal, I tend to prefer doing early returns for the same reason you stated.

@BenjaminBossan (Contributor, Author) commented Aug 16, 2020

@thomasjpfan I updated the whats_new. While doing that, I also removed the sklearn.calibrator section, which was probably intended to be sklearn.calibration.

            estimator_name = type(base_estimator).__name__
            warnings.warn("Since %s does not support sample_weights, "
                          "sample weights will only be used for the "
                          "calibration itself." % estimator_name)
        else:
            sample_weight_base_estimator = sample_weight

@thomasjpfan (Member) commented Aug 17, 2020

In this specific case, is sample_weight_base_estimator always equal to sample_weight? If this is the case, I do not think we need a sample_weight_base_estimator parameter in _fit_calibrated_classifer.

        for key, val in fit_params.items():
            check_consistent_length(y, val)

@thomasjpfan (Member) commented Aug 17, 2020

Can we do the check_consistent_length all at once in fit? This way the parallel calls do not need to validate anymore.

BenjaminBossan and others added 2 commits Aug 18, 2020
* don't use sample_weight_base_estimator, since it's the same as
  sample_weight
* perform check_consistent_length only once, inside fit

In addition:

* added a test for check_consistent_length
@BenjaminBossan (Contributor, Author) commented Aug 18, 2020

@thomasjpfan Very good comments, I addressed the issues you raised.

In this specific case, is sample_weight_base_estimator always equal to sample_weight?

Indeed. I guess it would be possible to have different values if we allowed

calibrated_classifier_cv.fit(X, y, sample_weight=sample_weight, base_estimator__sample_weight=other_sample_weight)

But that is probably overkill.

@BenjaminBossan requested a review from thomasjpfan Aug 23, 2020
@thomasjpfan (Member) left a comment

Otherwise LGTM

        else:
            base_estimator_uses_sw = True

@thomasjpfan (Member) commented Aug 23, 2020

I think removing this if statement and defining base_estimator_supports_sw above as follows would lower the complexity:

base_estimator_uses_sw = sample_weight is not None and base_estimator_supports_sw

@BenjaminBossan (Contributor, Author) commented Aug 23, 2020

👍

BenjaminBossan added 3 commits Aug 23, 2020
* Clearer way to define base_estimator_uses_sw
Make sure that sample_weight is correctly passed to the base_estimator.
For this, CheckingClassifier had to be extended to include a check for
sample_weight.
@BenjaminBossan (Contributor, Author) commented Aug 23, 2020

@thomasjpfan I addressed your comment.

On top of that, I added two more tests to check whether sample_weight is correctly routed to the base estimator. This in turn required extending CheckingClassifier with an option to check for sample_weight.
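
For reference, a rough sketch of what such a check could look like (RecordingClassifier is an illustrative stand-in, not the actual CheckingClassifier API, and it assumes a scikit-learn version with this PR merged):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.calibration import CalibratedClassifierCV

class RecordingClassifier(ClassifierMixin, BaseEstimator):
    """Records the sample_weight it receives so a test can assert on it."""

    def fit(self, X, y, sample_weight=None):
        self.classes_ = np.unique(y)
        self.received_sample_weight_ = sample_weight
        return self

    def predict_proba(self, X):
        # uninformative but valid probabilities, enough for the calibrator
        n_classes = len(self.classes_)
        return np.full((len(X), n_classes), 1.0 / n_classes)

rng = np.random.RandomState(0)
X = rng.randn(30, 2)
y = np.array([0, 1] * 15)
sample_weight = np.linspace(0.5, 1.5, 30)

calib = CalibratedClassifierCV(RecordingClassifier(), cv=3)
calib.fit(X, y, sample_weight=sample_weight)
# each per-fold clone of the base estimator should have seen a
# train-fold-sized slice of sample_weight rather than None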

@BenjaminBossan (Contributor, Author) commented Aug 23, 2021

Although it would be awesome to be on the contributors list for 1.0, there are too many merge conflicts and open questions for a quick resolution. There are probably more pressing changes than this :)

@glemaitre (Contributor) commented Aug 30, 2021

I will review this today (and probably push something to solve at least the merge conflicts)

@BenjaminBossan (Contributor, Author) commented Aug 30, 2021

Thanks @glemaitre. I can also take a look at the merge conflicts, but that wouldn't be today anymore (more likely sometime before the end of the week).

@adrinjalali removed this from the 1.0 milestone Sep 7, 2021
@adrinjalali added this to the 1.1 milestone Sep 7, 2021
@fingoldo commented Dec 13, 2021

When is this expected to be ready? I am currently having a problem when trying to pass eval_set to a CatBoostClassifier wrapped into CalibratedClassifierCV.
Without CalibratedClassifierCV, pipe.fit(X_train, y_train, est__eval_set=(X_test_val, y_test_val), est__plot=True) works fine.
With it, both est__eval_set and est__base_estimator__eval_set throw "TypeError: fit() got an unexpected keyword argument." I believe this PR will solve my problem... A lot of work has been done on this already; please let's not lose that effort and get it fixed and approved )

@jjerphan (Member) left a comment

A few suggestions; apart from those, the core of this PR LGTM.

doc/whats_new/v1.1.rst (outdated review thread, resolved)
sklearn/calibration.py (outdated review thread, resolved)
        for sample_aligned_params in fit_params.values():
            check_consistent_length(y, sample_aligned_params)

        self.calibrated_classifiers_ = []

@jjerphan (Member) commented Dec 16, 2021

Why is self.calibrated_classifiers_ defined here?

@glemaitre (Contributor) commented Dec 16, 2021

Maybe an error from solving the merge conflict. I will have a look.

sklearn/calibration.py (outdated review thread, resolved)
glemaitre and others added 4 commits Dec 16, 2021
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
…com:BenjaminBossan/scikit-learn into pr/BenjaminBossan/18170
@glemaitre changed the title from "CalibratedClassifierCV passes on fit_params" to "ENH make CalibratedClassifierCV accept on fit_params" Dec 16, 2021
@glemaitre (Contributor) left a comment

I think that I am fine with the current implementation. @jjerphan do you want to have a new look?

@jjerphan (Member) left a comment

LGTM. Thank you for initiating this work, @BenjaminBossan; thank you, @glemaitre, for pursuing it.

sklearn/tests/test_calibration.py (outdated review thread, resolved)
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@jjerphan merged commit a3e0d4d into scikit-learn:main Dec 17, 2021
31 checks passed
@kabartay commented Mar 10, 2022

When is this expected to be ready? I am currently having a problem when trying to pass eval_set to a CatBoostClassifier wrapped into CalibratedClassifierCV. Without CalibratedClassifierCV, pipe.fit(X_train, y_train, est__eval_set=(X_test_val, y_test_val), est__plot=True) works fine. With it, both est__eval_set and est__base_estimator__eval_set throw "TypeError: fit() got an unexpected keyword argument." I believe this PR will solve my problem... A lot of work has been done on this already; please let's not lose that effort and get it fixed and approved )

I have a similar issue; it didn't work for me either. It worked without CalibratedClassifierCV but not with it.

@kabartay commented Mar 10, 2022

@glemaitre @thomasjpfan @jjerphan @BenjaminBossan
I need some help with the proper usage of fit_params in Pipeline and CalibratedClassifierCV (my sklearn version is v0.24.2).

For example, a Pipeline with simple preprocessing and CalibratedClassifierCV with some ANN as the base_estimator:

model_preprocessing = ("preprocessing", ColumnTransformer([
        ('categorical', 'passthrough', categoricals),]), categoricals),  
        ('numerical', Pipeline([("scaler", RobustScaler()),("imputer", SimpleImputer())]), numericals)),
        ('drop', 'drop', drops),
    ], remainder='drop'))
calibrated_classifier = ("calibrated_classifier", CalibratedClassifierCV(
    base_estimator=model_class(**model_parameters), method='isotonic', cv=5))
pipeline = Pipeline([model_preprocessing, calibrated_classifier])

Some grid search with GridSearchCV
model = GridSearchCV(estimator=pipeline,param_grid=param_grid,cv=5, scoring='accuracy',refit=False, verbose=2)

Then I try to add fit_params, for example epochs, patience, etc.

fit_params = {}
fit_params['calibrated_classifier__base_estimator__epochs'] = 100
fit_params['calibrated_classifier__base_estimator__patience'] = 10

And do the fitting, which raises an error:
model.fit(x_train, y_train, **fit_params)

with TypeError: fit() got an unexpected keyword argument 'epochs':
File "/home/utilisateur/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f return f(*args, **kwargs)
File "/home/utilisateur/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py", line 880, in fit self.best_estimator_.fit(X, y, **fit_params)
File "/home/utilisateur/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py", line 346, in fit self._final_estimator.fit(Xt, y, **fit_params_last_step)

I cannot understand why it doesn't properly pass fit_params to the underlying estimator.
Without CalibratedClassifierCV it runs properly, for example if I use this kind of pipeline:

model_preprocessing = ("preprocessing", ColumnTransformer([
        ('categorical', 'passthrough', categoricals),]), categoricals),
        ('numerical', Pipeline([("scaler", RobustScaler()),("imputer", SimpleImputer())]), numericals)),
        ('drop', 'drop', drops),
    ], remainder='drop'))
basic_classifier = ("basic_classifier", model_class(**model_parameters))
pipeline = Pipeline([model_preprocessing,basic_classifier])

Might I be using the wrong sklearn version, or passing a wrong name (not sure the issue is there, I tried various things)?
Thanks for the help.

@glemaitre (Contributor) commented Mar 10, 2022

fit_params will only be supported in scikit-learn 1.1, which has not been released yet and will be released in the coming month.

@kabartay commented Mar 11, 2022

fit_params will only be supported in scikit-learn 1.1, which has not been released yet and will be released in the coming month.

@glemaitre thanks for letting me know. Is there an approximate time, such as end of March, mid-April, or end of April?
I am now trying to test v1.1.dev0 by cloning, building, and trying to use it.

I am getting these errors once I import CalibratedClassifierCV:

Traceback (most recent call last):
  File "main/__main__.py", line 22, in <module>
    from main.train_nn_pipe_dev_cc import train_nn_pipe_dev_cc_model
  File "/home/main/pipe_dev.py", line 45, in <module>
    from scikit.sklearn.calibration import CalibratedClassifierCV
  File "/home/main/scikit/sklearn/calibration.py", line 48, in <module>
    from .svm import LinearSVC
  File "/home/main/scikit/sklearn/svm/__init__.py", line 13, in <module>
    from ._classes import SVC, NuSVC, SVR, NuSVR, OneClassSVM, LinearSVC, LinearSVR
  File "/home/main/scikit/sklearn/svm/_classes.py", line 6, in <module>
    from ..linear_model._base import LinearClassifierMixin, SparseCoefMixin, LinearModel
  File "/home/main/scikit/sklearn/linear_model/__init__.py", line 11, in <module>
    from ._least_angle import (
  File "/home/main/scikit/sklearn/linear_model/_least_angle.py", line 28, in <module>
    from ..model_selection import check_cv
  File "/home/main/scikit/sklearn/model_selection/__init__.py", line 23, in <module>
    from ._validation import cross_val_score
  File "/home/main/scikit/sklearn/model_selection/_validation.py", line 31, in <module>
    from ..metrics import check_scoring
  File "/home/main/scikit/sklearn/metrics/__init__.py", line 41, in <module>
    from . import cluster
  File "/home/main/scikit/sklearn/metrics/cluster/__init__.py", line 22, in <module>
    from ._unsupervised import silhouette_samples
  File "/home/main/scikit/sklearn/metrics/cluster/_unsupervised.py", line 16, in <module>
    from ..pairwise import pairwise_distances_chunked
  File "/home/main/scikit/sklearn/metrics/pairwise.py", line 35, in <module>
    from ._pairwise_distances_reduction import PairwiseDistancesArgKmin
  File "sklearn/metrics/_pairwise_distances_reduction.pyx", line 1, in init sklearn.metrics._pairwise_distances_reduction
ModuleNotFoundError: No module named 'sklearn.utils._heap'

I have no idea why the ModuleNotFoundError: No module named 'sklearn.utils._heap' is raised; the relative paths seem correct in sklearn.metrics._pairwise_distances_reduction.

Any clue?

@kabartay commented Mar 11, 2022

@glemaitre what could be the reason that certain tests fail with TypeError?
For example,

  • test_linear_regression_positive

    • sklearn/linear_model/tests/test_base.py:276:
      TypeError: _nnls.nnls() missing required argument 'n' (pos 3)
    • sklearn/linear_model/_base.py:687: TypeError
  • test_linear_regression_positive_multiple_outcome

    • sklearn/linear_model/tests/test_base.py:301:
      sklearn/linear_model/_base.py:690: in fit
      /home/.local/lib/python3.8/site-packages/joblib/parallel.py:1041: in __call__
      /home/.local/lib/python3.8/site-packages/joblib/parallel.py:859: in dispatch_one_batch
      /home/.local/lib/python3.8/site-packages/joblib/parallel.py:777: in _dispatch
      /home/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py:572: in __init__
      /home/.local/lib/python3.8/site-packages/joblib/parallel.py:262: in __call__
      TypeError: _nnls.nnls() missing required argument 'n' (pos 3) -> finally lead to this
    • sklearn/utils/fixes.py:118: TypeError
  • test_linear_regression_positive_vs_nonpositive

    • sklearn/linear_model/tests/test_base.py:315:
      TypeError: _nnls.nnls() missing required argument 'n' (pos 3)
    • sklearn/linear_model/_base.py:687: TypeError
  • test_linear_regression_positive_vs_nonpositive_when_positive

    • sklearn/linear_model/tests/test_base.py:331:
      TypeError: _nnls.nnls() missing required argument 'n' (pos 3)
    • sklearn/linear_model/_base.py:687: TypeError
  • test_estimatorclasses_positive_constraint

    • sklearn/linear_model/tests/test_least_angle.py:609:
    • sklearn/linear_model/_least_angle.py:2237: in fit
    • sklearn/linear_model/_least_angle.py:2286: in _estimate_noise_variance
      TypeError: _nnls.nnls() missing required argument 'n' (pos 3) -> finally lead to this
    • sklearn/linear_model/_base.py:687: TypeError

In total, 5 tests failed.
