New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-44075: Add asyncio.stalled audit hook #25990
base: main
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
This PR is stale because it has been open for 30 days with no activity. |
@@ -62,6 +62,8 @@ When the debug mode is enabled: | |||
minimum execution duration in seconds that is considered "slow". | |||
|
|||
|
|||
.. audit-event:: asyncio.stalled time,handle,formatted_handle,loop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also needs a third argument that is the anchor that should be on the link to get back here (probably asyncio-debug-mode
but you can look at the existing docs on the web site to see which section is best)
else: | ||
try: | ||
self._current_handle = handle | ||
t0 = self.time() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar enough to know whether these are okay to do even when not debugging, particularly the _current_handle
assignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The report should be in debug mode only, sure.
Two timer calls per callback are not free.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better way of detecting a stall? I did a quick bit of bechmarking and a raw, non-lookup'd call to time.time()
or time.monotonic()
is about as fast as you can get (~50ns), on-par with invoking lambda: None
or def foo(): pass
.
A 100ns overhead per callback leaves a lot of breathing room.
else: | ||
try: | ||
self._current_handle = handle | ||
t0 = self.time() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The report should be in debug mode only, sure.
Two timer calls per callback are not free.
dt = self.time() - t0 | ||
if dt >= self.slow_callback_duration: | ||
formatted_handle = _format_handle(handle) | ||
sys.audit("asyncio.stalled", dt, handle, formatted_handle, self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why all three arguments (handle, formatted handle, and loop) are passed?
I suspect the only handle is enough for a low-level tool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. It might also help to have an event at the point where a handle is created, so there's another event to correlate with, but you certainly don't want to be eagerly formatting anything.
Assume the audit hook is merely print
to a log that's going to be interpreted by a human (or a separate machine). It's not an opportunity for an application to change its behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting.
If we add two events (handle started and handle finished) asyncio doesn't need to calculate the handle callback execution time.
The audit hook can do it itself if needed.
Also, it means that both hooks can be called even in non-debug mode.
The calling is cheap, analyzing can be expensive but it is not an asyncio problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calculating the handle callback execution time seems really quite cheap, unless I'm missing something. In fact it seems it would be far cheaper to calculate the time difference and invoke an audit hook once rather than invoke an audit hook twice:
In [2]: sys.addaudithook(lambda *f: None)
In [3]: a = sys.audit
In [5]: %timeit a("foo")
174 ns ± 1.16 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [7]: m = time.monotonic
In [8]: %timeit m()
60.3 ns ± 1.18 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
People who care about maximum performance won't have any audit hooks (or they'll have native ones, which are considerable quicker than pure Python ones), so it's not a fair the right comparison.
Still, I'd rather audit events be used for unusual/unexpected events rather than tracing. If this is a tight loop operation where 0.1us is important enough for performance to override functionality concerns, then we don't want audit events being raised that frequently. If it's fairly uncommon then may as well do the calculation and only raise relevant events to keep the noise down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd argue that an event loop stall should be both unusual and unexpected. But you're right, I'm not sure audit events are a perfect fit here.
The whole point behind this is that it's hard, or even impossible, to get metrics about event loop stalls in production right now without running your entire app in async debug mode. Which is interesting, because people who care about maximum performance would surely want accurate metrics on stalls whilst being totally adverse to running things in debug mode.
Some way to do something in response to a stall would be ideal. We maybe even don't need an exact time, just an idea that a stall took place and a way to consume that to do custom stuff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see different numbers with the fresh main branch:
>>> import timeit
>>> timeit.repeat("sys.audit(1); sys.audit(1)", "import sys")
[0.09353699500206858, 0.07102967600076227, 0.07030134299566271, 0.07094913300534245, 0.06989944599627052]
>>> timeit.repeat("time.monotonic(); time.monotonic()", "import time")
[0.19186055000318447, 0.16878722599358298, 0.16965712599630933, 0.1693656099960208, 0.16873898899939377]
Note: there is no audit hook subscribed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking up the monotonic
attribute dominates the benchmark. You get much better speeds by assigning time.monotonic
to a local variable once and using that.
I'm using 3.10, so maybe things have changed, but sys.audit(1)
throws an exception for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest measuring Python 3.11 and the real scenario.
E.g. two loop.time()
calls plus one sys.audit()
vs two sys.audit()
calls.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
https://bugs.python.org/issue44075