ENH Extend PDP for nominal categorical features #18298

madhuracj · 2020-08-30T02:00:02Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This PR extends PDP to categorical features. Still WIP.

Any other comments?

madhuracj · 2020-08-30T04:19:31Z

sklearn/inspection/tests/test_partial_dependence.py

@@ -644,12 +648,18 @@ def test_partial_dependence_dataframe(estimator, preprocessor, features):

 @pytest.mark.parametrize(
    "features, expected_pd_shape",
-    [(0, (3, 10)),


Even though the API spec of partial_dependence() only allows an array-like of {int, str} for the features parameter (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/inspection/_partial_dependence.py#L248), we seem to be allowing int, str, list of str, list of int, and boolean mask.
This unnecessarily complicates the initialization of the is_categorical parameter when it is None at https://github.com/scikit-learn/scikit-learn/pull/18298/files#diff-0c8623ee957e0256d29ac53830094281R488-R489.
Do we need to keep supporting this undocumented behaviour?

The use case was supported before but the documentation is misleading. Basically we should support a sequence of {int, str} or tuple of {int, str}:

https://github.com/scikit-learn/scikit-learn/pull/12599/files#diff-f6ab89354a7e7254ec59bb2e192b0ec2R211-R218

So the test here creates a scalar or a tuple (actually a list), which is then wrapped in another list within the test.

Maybe changing the documentation with the following would be less ambiguous: features : array-like of {int, str} or tuple of 2 {int, str}
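To make the proposed wording concrete, here is a minimal sketch of how such a `features` specification could be normalized. The helper `normalize_features` is hypothetical and written only for illustration; it is not the scikit-learn implementation.

```python
from numbers import Integral


def normalize_features(features):
    """Turn each entry of `features` into a tuple of one or two items.

    Hypothetical helper, not scikit-learn code: a scalar int/str becomes a
    1-tuple (one-way plot), a pair becomes a 2-tuple (two-way plot).
    """
    normalized = []
    for entry in features:
        if isinstance(entry, (Integral, str)):
            entry = (entry,)
        else:
            entry = tuple(entry)
        if not 1 <= len(entry) <= 2:
            raise ValueError(
                f"Each entry must hold 1 or 2 features, got {len(entry)}."
            )
        if not all(isinstance(f, (Integral, str)) for f in entry):
            raise ValueError("Features must be int or str.")
        normalized.append(entry)
    return normalized


print(normalize_features([0, "age", (0, 1), ["age", "sex"]]))
# [(0,), ('age',), (0, 1), ('age', 'sex')]
```

With this shape, every entry is either a one-way or a two-way request, which is what the "array-like of {int, str} or tuple of 2 {int, str}" wording describes.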

glemaitre · 2020-09-04T09:55:13Z

sklearn/inspection/_plot/partial_dependence.py

@@ -453,6 +501,13 @@ class PartialDependenceDisplay:

        .. versionadded:: 0.24

+    is_categorical : list of (bool,) or list of (bool, bool)


Is it possible to make a 2-way partial dependence plot where only one of the features is categorical, meaning something like: is_categorical=[(True, False)]?

I believe not. That's what's being checked at https://github.com/scikit-learn/scikit-learn/pull/18298/files#diff-51e11eb022da44742c62c8a5d33c00d4R357-R359. Probably best to specify that in docs itself?

So we could only accept list of bool and raise an error if True is given for a tuple.

My idea was to follow the same shape of features (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/inspection/_plot/partial_dependence.py#L405-L408) as discussed in #14969 (comment)

We can support is_categorical=[(True, True)] with a grid and colour of the cell denoting the partial dependency for the category combination. But, if this is not required, I am fine to go with a list of bool and raise an error if True is given for a tuple.

Actually, we should replace it with categorical_features as well.
It might mean that we have to validate the parameter in the Display instead of doing it in the plot function.
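As a rough illustration of the "same shape as features" idea discussed above, the following sketch validates an is_categorical argument against features and rejects two-way entries that mix categorical and continuous features. The function name `validate_is_categorical` and the error messages are assumptions made for this example, not code from the PR.

```python
def validate_is_categorical(features, is_categorical):
    """Check that `is_categorical` mirrors the shape of `features`.

    Hypothetical helper: `features` is a list of 1- or 2-tuples and
    `is_categorical` is a list of tuples of bool with matching lengths.
    """
    if len(features) != len(is_categorical):
        raise ValueError("features and is_categorical must have the same length.")
    for fxs, cats in zip(features, is_categorical):
        if len(fxs) != len(cats):
            raise ValueError(f"Entry {cats} does not match the shape of {fxs}.")
        # A two-way plot needs both features categorical or neither.
        if len(cats) == 2 and cats[0] != cats[1]:
            raise ValueError(
                f"Two-way plot {fxs} mixes categorical and continuous features."
            )
    return is_categorical


# Accepted: a one-way categorical plot and a two-way all-categorical plot.
validate_is_categorical([(0,), (1, 2)], [(True,), (True, True)])
```

The alternative raised in the thread, a flat list of bool that errors whenever True is given for a tuple, would be simpler but would rule out the (True, True) grid plot.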

madhuracj · 2020-09-10T13:06:41Z

@glemaitre I wonder how I should go about updating the PDP/ICE example (https://scikit-learn.org/dev/auto_examples/inspection/plot_partial_dependence.html). The current dataset used there does not have categorical columns and I don't think it's worth introducing a new dataset.

glemaitre · 2021-01-18T10:30:31Z

@madhuracj I added this to the 1.0 milestone. Let's target it.

@glemaitre I wonder how I should go about updating the PDP/ICE example (https://scikit-learn.org/dev/auto_examples/inspection/plot_partial_dependence.html). The current dataset used there does not have categorical columns and I don't think it's worth introducing a new dataset.

Yes, you are right. We might want to add an example specifically for categorical columns. The current example is already quite rich.

madhuracj · 2021-02-03T02:54:48Z

@madhuracj I added this to the 1.0 milestone. Let's target it.

I have merged the upstream main branch which contained some major refactoring introduced since. Hopefully, I can find some time in the next couple of days to work on this.

madhuracj · 2021-02-08T06:01:06Z

@glemaitre With 7c1fe2e I have implemented two-way PDP for categorical features. Please have a look.
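For readers following along, the quantity that a two-way categorical PDP displays as a grid can be sketched with a brute-force computation in plain Python. This is not the code from 7c1fe2e; `partial_dependence_two_way` and `toy_predict` are hypothetical names, and the toy model is made up for the example.

```python
from itertools import product


def partial_dependence_two_way(predict, rows, i, j):
    """Brute-force partial dependence over all category pairs of columns i, j.

    `predict` maps a list of rows to a list of numbers. Returns a dict
    {(cat_i, cat_j): average prediction}.
    """
    cats_i = sorted({row[i] for row in rows})
    cats_j = sorted({row[j] for row in rows})
    grid = {}
    for ci, cj in product(cats_i, cats_j):
        # Overwrite both columns with the category pair in every row.
        modified = []
        for row in rows:
            new_row = list(row)
            new_row[i], new_row[j] = ci, cj
            modified.append(new_row)
        preds = predict(modified)
        grid[(ci, cj)] = sum(preds) / len(preds)
    return grid


def toy_predict(rows):
    # Toy model: a per-colour offset plus a per-size offset plus column 2.
    colour = {"red": 1.0, "blue": 2.0}
    size = {"S": 0.0, "L": 10.0}
    return [colour[r[0]] + size[r[1]] + r[2] for r in rows]


rows = [["red", "S", 0.0], ["blue", "L", 2.0]]
grid = partial_dependence_two_way(toy_predict, rows, 0, 1)
print(grid[("red", "S")])  # 2.0
```

Each cell of the resulting grid corresponds to one category combination, which is what the colour of a cell would encode in the plot.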

jeremiedbb

LGTM. @ogrisel do you have more comments?

ogrisel · 2022-11-25T10:43:22Z

I will give it another quick look.

ogrisel · 2022-11-25T12:09:29Z

I get the following warning raised by pandas 1.5.1 in the example:

/Users/ogrisel/code/scikit-learn/sklearn/utils/__init__.py:386: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  X.iloc[row_indexer, column_indexer] = values

ogrisel · 2022-11-25T12:18:53Z

I also get it with pandas 1.5.2.

glemaitre · 2022-11-25T13:08:15Z

Uhm. And the warning is about something we actually want: making the change in place. Not sure how to silence it without catching it.

glemaitre · 2022-11-25T13:09:40Z

We already ran into this: pandas-dev/pandas#47381

ogrisel · 2022-11-25T13:18:56Z

examples/inspection/plot_partial_dependence.py

    ax=ax,
+    **common_params,
 )


For some reason, this figure has too narrow a ylim, but @jeremiedbb and I do not understand why. This problem does not happen with the other plots above. Changing constrained_layout does not seem to have any impact.

pdp_lim is left to its default (None) value.

jeremiedbb · 2022-11-25T13:44:56Z

The warning comes from trying to do df.iloc[:, i] = value where the ith column has the "category" dtype and value is a string, because pandas is expecting value to be of object dtype, but since it's just a scalar it's a string.

The behavior won't change in the future in this situation though, so we thought that we can safely filter it. I added a comment and a TODO to remind us to check when it's possible to remove the filter.
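The filtering pattern described here can be sketched with the standard library alone. In the sketch below, `set_column` is a hypothetical stand-in for the pandas assignment that emits the FutureWarning, and `set_column_quietly` shows a locally scoped filter; neither name comes from the scikit-learn code base.

```python
import warnings


def set_column(values):
    """Stand-in for the assignment that emits the FutureWarning."""
    warnings.warn("stand-in for the pandas FutureWarning", FutureWarning)
    return list(values)


def set_column_quietly(values):
    """Call `set_column` while ignoring FutureWarning inside this scope only."""
    with warnings.catch_warnings():
        # TODO: remove the filter once the warning no longer applies.
        warnings.simplefilter("ignore", FutureWarning)
        return set_column(values)


result = set_column_quietly(["a", "b"])
```

Because the filter lives inside the `catch_warnings` context manager, the previous warning configuration is restored on exit and other FutureWarnings raised elsewhere remain visible.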

ogrisel · 2022-11-25T14:06:03Z

I pushed a cosmetic commit to:

- pass categorical_features to the HGBRT models (since we already extract the list of feature names for the PDP plot);
- use constrained layout consistently to avoid manual whitespace adjustments.

ogrisel

LGTM! (assuming green CI)

ogrisel · 2022-11-25T15:03:18Z

I forgot to update the tests. I am on it.

jjerphan · 2022-11-25T16:36:17Z

Thank you, @madhuracj and @glemaitre! 🎉

lorentzenchr · 2022-11-26T13:40:33Z

It is a very good improvement to have this PR shipped with the next release. Congrats!

On the other hand, I wonder what happened with @NicolasHug's comments in #18298 (comment). I was not able to follow the discussion and now I fear that we have implemented what @NicolasHug wanted to avoid.

ogrisel · 2022-11-28T10:03:13Z

Indeed, I also missed @NicolasHug's comment. I also think it was a good idea... But we can always implement it on top of the existing categorical_features parameter, that is, when the user does not explicitly pass the categorical features.

glemaitre · 2022-11-28T10:56:14Z

This could be implemented with categorical_features="infer" ("auto" could use the "category" dtype).

In the inferring setup, I am not entirely sure how to deal with the case where we detect one continuous and one discretised feature. We would raise an error, but it could be a case that should be treated as 2 continuous features, one of which has few values. In this setting, without any parameter, a user would never be able to plot anything. Then, one would need to expose n_unique to get around this issue.

I am also not sure about computing unique values on continuous features.

NicolasHug · 2022-11-28T11:23:51Z

we can always implement it on top of the existing categorical_features parameter, that is, when the user does not explicitly pass the categorical features

This should be possible, although ideally we would only need one parameter instead of 2. If there are plans to support something like #18298 (comment), it might be worth changing the current parameter name from categorical_features to something more generic that can also account for the other parametrization.

Regarding computing unique values on continuous features, this is a fairly common practice I think and shouldn't be a deal-breaker for plotting utilities where perf isn't critical. We already do that in PDPs, or in the binner of the hist-GBDT.
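The inference idea discussed above can be illustrated with a small heuristic based on the number of unique values. This is only a sketch under stated assumptions: the function `infer_categorical`, the `max_unique` threshold and its default of 5 are all hypothetical and not part of scikit-learn.

```python
def infer_categorical(columns, max_unique=5):
    """Flag columns with at most `max_unique` distinct values as categorical.

    Hypothetical heuristic: `columns` maps a feature name to its list of
    values; columns containing strings are always flagged.
    """
    flags = {}
    for name, values in columns.items():
        n_unique = len(set(values))
        has_str = any(isinstance(v, str) for v in values)
        flags[name] = has_str or n_unique <= max_unique
    return flags


columns = {
    "season": ["winter", "summer"] * 10,
    "temp": [i * 0.5 for i in range(20)],
    "n_doors": [2, 4] * 10,
}
flags = infer_categorical(columns)
print(flags)
# {'season': True, 'temp': False, 'n_doors': True}
```

The n_doors column shows the ambiguity raised in the thread: a numeric feature with few values gets flagged as categorical, so a user would need a way to adjust the threshold or override the inference.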

madhuracj added 2 commits Aug 30, 2020

Extend PDP for categorical features

2f3b52d

PDP method specification only allows for lists

9f066db

github-actions bot added the module:inspection label Aug 30, 2020

madhuracj added 3 commits Aug 30, 2020

Fix unit tests by adding missing parameter

0eb3053

Improve docs

fafa9d2

Wrap a long line

6a7e87d

madhuracj mentioned this pull request Aug 30, 2020

Enhancement for partial dependence plot #14969

Closed

3 tasks

glemaitre added this to TO REVIEW in Guillaume's pet Aug 31, 2020

glemaitre reviewed Sep 4, 2020

madhuracj added 7 commits Sep 4, 2020

Revert 9f066db

22d66a1

Update docs as suggested and use features_indices which is resolved

eca5ac7

Unit test for categorical support in partial_dependence

6b0d7c2

Remove extra line at the end of the file

284c901

Fix typo

ed254ca

Tests for plot_partial_dependence()

7bca03f

Remove redundant check

ecf2bf8

glemaitre added this to the 1.0 milestone Jan 18, 2021

Base automatically changed from master to main Jan 22, 2021

glemaitre added this to In progress in Interpretability / Plotting / Interactive dev Feb 1, 2021

madhuracj added 2 commits Feb 1, 2021

Merge remote-tracking branch 'upstream/main' into categorical_pdp

b16aa23

Fix linting

fe6580b

madhuracj added 3 commits Feb 3, 2021

Update version introduced

cebadbf

Add an example for PDP on categorical features

708e40c

PDP for two-way categorical features

7c1fe2e

Wrap long lines

dc45e16

glemaitre and others added 4 commits Nov 24, 2022

better tweak

bf3d741

Merge branch 'main' into categorical_pdp

92a6f90

Merge remote-tracking branch 'upstream/main' into pr/madhuracj/18298

8b42d1f

fix what's new

0c96552

jeremiedbb approved these changes Nov 25, 2022

ogrisel reviewed Nov 25, 2022

filter the iloc warning

0bc73ed

Pass categorical_features to HGBDT + avoid calling plt.subplots_adjust

d5f4c48

Fix pdp_lim

1ee652f

ogrisel approved these changes Nov 25, 2022

ogrisel and others added 3 commits Nov 25, 2022

Fix test_partial_dependence_plot_limits_two_way

a530c86

Fix test_partial_dependence_plot_limits_one_way

5606e52

Merge branch 'main' into categorical_pdp

bb608d5

jjerphan merged commit c1cfc4d into scikit-learn:main Nov 25, 2022
27 checks passed

Interpretability / Plotting / Interactive dev automation moved this from Reviewer approved to Done Nov 25, 2022

madhuracj mentioned this pull request Nov 27, 2022

DOC correct link for image in the PDP documentation #25054

Merged

lorentzenchr mentioned this pull request Nov 28, 2022

DOC set categorical PDP as MajorFeature #25056

Closed
