New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Extend PDP for nominal categorical features #18298
Conversation
@@ -644,12 +648,18 @@ def test_partial_dependence_dataframe(estimator, preprocessor, features): | |||
|
|||
@pytest.mark.parametrize( | |||
"features, expected_pd_shape", | |||
[(0, (3, 10)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though the API spec of partial_dependence()
only allows for array-like of {int, str}
for feature
parameter (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/inspection/_partial_dependence.py#L248), we seem to be allowing int
, str
, list of str
, list of int
, and boolean mask
.
This unnecessarily complicated initialization of is_categorical
parameter when it's None
at https://github.com/scikit-learn/scikit-learn/pull/18298/files#diff-0c8623ee957e0256d29ac53830094281R488-R489.
Do we need to keep supporting this undocumented behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use case was supported before but the documentation is misleading. Basically we should support a sequence of {int, str} or tuple of {int, str}:
So the test here create a scalar or a tuple (actually this is a list) which will be added in another list within the test.
Maybe changing the documentation with the following would be less ambiguous: features : array-like of {int, str} or tuple of 2 {int, str}
@@ -453,6 +501,13 @@ class PartialDependenceDisplay: | |||
|
|||
.. versionadded:: 0.24 | |||
|
|||
is_categorical : list of (bool,) or list of (bool, bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is possible to make a 2-way partial dependence plot where 1 of the features only is categorical, meaning something like: is_categorical=[(True, False)]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe not. That's what's being checked at https://github.com/scikit-learn/scikit-learn/pull/18298/files#diff-51e11eb022da44742c62c8a5d33c00d4R357-R359. Probably best to specify that in docs itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we could only accept list of bool
and raise an error if True
is given for a tuple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My idea was to follow the same shape of features (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/inspection/_plot/partial_dependence.py#L405-L408) as discussed in #14969 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can support is_categorical=[(True, True)]
with a grid and colour of the cell denoting the partial dependency for the category combination. But, if this is not required, I am fine to go with a list of bool
and raise an error if True
is given for a tuple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we should replace it with categorical_features
as well.
It might mean that we have to validate the parameter in the Display
instead of the do it in the plot function.
@glemaitre I wonder who I should go about updating the PDP/ICE example (https://scikit-learn.org/dev/auto_examples/inspection/plot_partial_dependence.html). The current dataset used in there does not have categorical columns and I don't think it's worth introducing a new dataset. |
@madhuracj I added this to the 1.0 milestone. Let's target it.
Yes, you are right. We might want to add an example specifically for categorical columns. The current example is already quite rich. |
I have merged the upstream main branch which contained some major refactoring introduced since. Hopefully, I can find some time in the next couple of days to work on this. |
@glemaitre With 7c1fe2e I have implemented two-way PDP for categorical features. Please have a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @ogrisel do you have more comments ?
I will give it another quick look. |
I get the following warning raised by pandas 1.5.1 in the example:
|
I also get it with pandas 1.5.2. |
Uhm. And the warning is something that we want to make the change in place. Not sure how to silent it without catching it. |
We already step in those: pandas-dev/pandas#47381 |
ax=ax, | ||
**common_params, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason, this figure has a too narrow ylim
but @jeremiedbb and I do not understand why. This problem does not happen to the other plots above. Changing constrained_layout
does not seem to impact this.
pdp_lim
is left to its default (None
) value.
The warning comes from trying to do The behavior won't change in the future in this situation though so we thought that can safely filter it. I added a comment and a todo to remind us to check when it's possible remove the filter. |
I pushed a cosmetic commit to:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (assuming green CI)
I forgot to update the tests. I am on it. |
Thank you, @madhuracj and @glemaitre! |
It is a very good improvement to have this PR shipped with the next release. Congrats! On the other hand, I wonder what happened with @NicolasHug comments in #18298 (comment). I was not able to follow the discussion and now I fear that we have implemented what @NicolasHug wanted to avoid. |
Indeed, I also missed @NicolasHug's comment. I also think it was a good idea... But we can always implement it on top of the existing |
This could be implemented with In the inferring setup, I am not entirely sure how to deal with the case we detect a continuous and discretised feature. We will error but it could be a case where it should be 2 continuous features with one feature with few values. In this setting, without any parameter, a user will never be able to plot anything. Then, one would need to expose the I am also not sure about computing |
This should be possible, although ideally we would only need one parameter instead of 2. If there are plans to support something like #18298 (comment), it might be worth changing the current parameter name from Regarding computing unique values on continuous features, this is a fairly common practice I think and shouldn't be a deal-breaker for plotting utilities where perf isn't critical. We already do that in PDPs, or in the binner of the hist-GBDT. |
Reference Issues/PRs
closes #14969
What does this implement/fix? Explain your changes.
This PR extends PDP for categorical features. Still WIP
Any other comments?