BUG: groupby(..., dropna=False).indices with single group key does not include nan group #35646

Closed
3 tasks done
mroeschke opened this issue Aug 10, 2020 · 2 comments · Fixed by #36842

@mroeschke mroeschke commented Aug 10, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

In [9]: data = {'group':['g1', 'g1', 'g1', np.nan, 'g1', 'g1', 'g2', 'g2', 'g2', 'g2', np.nan],
   ...:                     'A':[3, 1, 8, 2, 6, -1, 0, 13, -4, 0, 1],
   ...:                     'B':[5, 2, 3, 7, 11, -1, 4,-1, 1, 0, 2]}
   ...: df = pd.DataFrame(data)
   ...: df.groupby('group',dropna=True).indices
Out[9]: {'g1': array([0, 1, 2, 4, 5]), 'g2': array([6, 7, 8, 9])}

In [10]: df.groupby('group',dropna=False).indices
Out[10]: {'g1': array([0, 1, 2, 4, 5]), 'g2': array([6, 7, 8, 9])}

In [11]: pd.__version__
Out[11]: '1.2.0.dev0+67.gaefae55e1'

Problem description

For a single group-by key, the grouping codes and indices are determined here:

values = Categorical(self.grouper)

Categorical does not support nan as a category label; missing values are only represented by the code -1, so the nan rows never form a group of their own.
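To illustrate the point about codes, here is a minimal sketch using the same key column as the report. It only inspects the public `Categorical.categories` and `Categorical.codes` attributes; it does not reproduce the groupby internals.

```python
import numpy as np
import pandas as pd

# Same key column as in the report above.
keys = ['g1', 'g1', 'g1', np.nan, 'g1', 'g1', 'g2', 'g2', 'g2', 'g2', np.nan]

cat = pd.Categorical(keys)

# nan is not among the categories ...
categories = list(cat.categories)
# ... and the nan rows are encoded with the missing-value code -1.
codes = cat.codes.tolist()
nan_positions = np.flatnonzero(cat.codes == -1).tolist()

print(categories)
print(codes)
print(nan_positions)
```

Any index computation that iterates over the categories alone will therefore skip the rows at positions 3 and 10.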

The result is correct when multiple group keys are passed.

Once this issue is addressed, #35542 will be fixed.

Expected Output

In [10]: df.groupby('group',dropna=False).indices
Out[10]: {'g1': array([0, 1, 2, 4, 5]), 'g2': array([6, 7, 8, 9]), np.nan: array([3, 10])}

@phofl phofl commented Sep 6, 2020

@mroeschke

I looked into this a bit and have a question about the preferred solution: is it OK to replace the nan values with a unique string/integer before calling values = Categorical(self.grouper) and to change them back afterwards? Only when `dropna=False`, of course.
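For concreteness, the placeholder idea could look roughly like the sketch below. The helper name `indices_with_sentinel` and the sentinel string are hypothetical and chosen only for illustration; this is user-level code, not a patch to the groupby internals.

```python
import numpy as np
import pandas as pd

def indices_with_sentinel(keys, sentinel="__nan__"):
    """Hypothetical helper: map each key (including nan) to its row positions."""
    values = pd.Series(keys, dtype=object)
    # The placeholder must not collide with a real key.
    if (values == sentinel).any():
        raise ValueError("sentinel collides with an existing key")
    # Swap nan for the placeholder so Categorical treats it as a label.
    filled = values.where(values.notna(), sentinel)
    cat = pd.Categorical(filled)
    result = {}
    for code, label in enumerate(cat.categories):
        positions = np.flatnonzero(cat.codes == code)
        if len(positions):
            # Change the placeholder back to nan in the output.
            key = np.nan if label == sentinel else label
            result[key] = positions
    return result

keys = ['g1', 'g1', 'g1', np.nan, 'g1', 'g1', 'g2', 'g2', 'g2', 'g2', np.nan]
res = indices_with_sentinel(keys)
print(res)
```

The collision check shows the main weakness of this approach: any fixed placeholder can in principle clash with real data.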


@mroeschke mroeschke commented Sep 6, 2020

I don't think that would be an ideal solution.

I think a better solution would be to refactor this code path to use the logic already used for multiple group keys, since I don't think there are plans to support Categorical(..., dropna=False).
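As a rough illustration of computing indices without asking Categorical to carry a nan label, the sketch below works from integer codes and gives the missing rows a code of their own. It is not the pandas multi-key code path; `indices_from_codes` is a hypothetical helper built only on `pd.factorize` and NumPy.

```python
import numpy as np
import pandas as pd

def indices_from_codes(keys):
    """Hypothetical helper: group row positions by integer code, nan included."""
    # factorize encodes missing values as -1 by default.
    codes, uniques = pd.factorize(np.asarray(keys, dtype=object))
    labels = list(uniques)
    if (codes == -1).any():
        # Give the missing rows their own trailing code and label.
        codes = np.where(codes == -1, len(labels), codes)
        labels.append(np.nan)
    # A stable sort keeps positions ascending within each group.
    order = np.argsort(codes, kind="stable")
    counts = np.bincount(codes, minlength=len(labels))
    chunks = np.split(order, np.cumsum(counts)[:-1])
    return dict(zip(labels, chunks))

keys = ['g1', 'g1', 'g1', np.nan, 'g1', 'g1', 'g2', 'g2', 'g2', 'g2', np.nan]
out = indices_from_codes(keys)
print(out)
```

Because the nan group is handled at the level of codes, no placeholder value is ever mixed into the key data.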
