Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: groupby().rolling(freq) with monotonic dates within groups #46065

Merged
merged 1 commit into from Feb 19, 2022

Conversation

@mroeschke
Copy link
Member

@mroeschke mroeschke commented Feb 18, 2022

@mroeschke mroeschke added this to the 1.4.2 milestone Feb 19, 2022
Copy link
Member

@rhshadrach rhshadrach left a comment

lgtm, does the iteration in _validate_datetimelike_monotonic have a big performance impact?

@mroeschke
Copy link
Member Author

@mroeschke mroeschke commented Feb 19, 2022

There will be a slight performance impact during the validation (that can be a 1 time hit) that scales with the number of groups. Here's validating 3 vs 100 groups

In [1]:
   ...:         dates = date_range(start="2016-01-01 09:30:00", periods=20, freq="s")
   ...:         df = DataFrame(
   ...:             {
   ...:                 "A": [1] * 20 + [2] * 12 + [3] * 8,
   ...:                 "B": np.concatenate((dates, dates)),
   ...:                 "C": np.arange(40),
   ...:             }
   ...:         )

In [2]: %timeit df.groupby("A").rolling("4s", on="B")
764 µs ± 38.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]:
   ...:         dates = date_range(start="2016-01-01 09:30:00", periods=50, freq="s")
   ...:         df = DataFrame(
   ...:             {
   ...:                 "A": np.arange(100),
   ...:                 "B": np.concatenate((dates, dates)),
   ...:                 "C": np.arange(100),
   ...:             }
   ...:         )

In [7]: %timeit df.groupby("A").rolling("4s", on="B")
2.26 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@mroeschke mroeschke merged commit ad0f1bf into pandas-dev:main Feb 19, 2022
32 of 33 checks passed
meeseeksmachine added a commit to meeseeksmachine/pandas that referenced this issue Feb 19, 2022
@mroeschke mroeschke deleted the bug/gb_roll_mon branch Feb 19, 2022
@rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Feb 19, 2022

@mroeschke - It looks like the call to groupby(...).rolling(...) went from O(1) to O(n), is that right? I would describe that as being something significant. Perhaps it can't be avoided, but I'd like to have another look. In any case, I don't think we should be merging PRs (especially our own) when there are outstanding questions.

if self._on.hasnans:
raise ValueError(f"{on} must not have any NaT values.")
for group_indices in self._grouper.indices.values():
group_on = self._on.take(group_indices)
Copy link
Contributor

@jreback jreback Feb 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this is materializing things
prob not a great idea

Copy link
Member Author

@mroeschke mroeschke Feb 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._on is already materialized in __init__. Is that what is meant by materialized?

@jreback
Copy link
Contributor

@jreback jreback commented Feb 19, 2022

@mroeschke there was no hurry on this one

mroeschke added a commit to mroeschke/pandas that referenced this issue Feb 19, 2022
@mroeschke
Copy link
Member Author

@mroeschke mroeschke commented Feb 19, 2022

Sorry I was too eager on the LGTM. #46078 reverts this PR.

This validation is O(number of groups). Happy to have a suggestion regarding a more performant solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

4 participants