Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-44075: Add asyncio.stalled audit hook #25990

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

bpo-44075: Add asyncio.stalled audit hook #25990

wants to merge 3 commits into from

Conversation

orf
Copy link
Contributor

@orf orf commented May 8, 2021

@orf orf requested review from 1st1 and asvetlov as code owners May 8, 2021 12:46
@the-knights-who-say-ni

This comment has been minimized.

@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Jun 14, 2021
@@ -62,6 +62,8 @@ When the debug mode is enabled:
minimum execution duration in seconds that is considered "slow".


.. audit-event:: asyncio.stalled time,handle,formatted_handle,loop
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also needs a third argument that is the anchor that should be on the link to get back here (probably asyncio-debug-mode but you can look at the existing docs on the web site to see which section is best)

else:
try:
self._current_handle = handle
t0 = self.time()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar enough to know whether these are okay to do even when not debugging, particularly the _current_handle assignment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The report should be in debug mode only, sure.
Two timer calls per callback are not free.

Copy link
Contributor Author

@orf orf Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way of detecting a stall? I did a quick bit of bechmarking and a raw, non-lookup'd call to time.time() or time.monotonic() is about as fast as you can get (~50ns), on-par with invoking lambda: None or def foo(): pass.

A 100ns overhead per callback leaves a lot of breathing room.

else:
try:
self._current_handle = handle
t0 = self.time()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The report should be in debug mode only, sure.
Two timer calls per callback are not free.

dt = self.time() - t0
if dt >= self.slow_callback_duration:
formatted_handle = _format_handle(handle)
sys.audit("asyncio.stalled", dt, handle, formatted_handle, self)
Copy link
Contributor

@asvetlov asvetlov Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why all three arguments (handle, formatted handle, and loop) are passed?
I suspect the only handle is enough for a low-level tool.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It might also help to have an event at the point where a handle is created, so there's another event to correlate with, but you certainly don't want to be eagerly formatting anything.

Assume the audit hook is merely print to a log that's going to be interpreted by a human (or a separate machine). It's not an opportunity for an application to change its behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting.
If we add two events (handle started and handle finished) asyncio doesn't need to calculate the handle callback execution time.
The audit hook can do it itself if needed.
Also, it means that both hooks can be called even in non-debug mode.
The calling is cheap, analyzing can be expensive but it is not an asyncio problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculating the handle callback execution time seems really quite cheap, unless I'm missing something. In fact it seems it would be far cheaper to calculate the time difference and invoke an audit hook once rather than invoke an audit hook twice:

In [2]: sys.addaudithook(lambda *f: None)
In [3]: a = sys.audit
In [5]: %timeit a("foo")
174 ns ± 1.16 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [7]: m = time.monotonic
In [8]: %timeit m()
60.3 ns ± 1.18 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Copy link
Member

@zooba zooba Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People who care about maximum performance won't have any audit hooks (or they'll have native ones, which are considerable quicker than pure Python ones), so it's not a fair the right comparison.

Still, I'd rather audit events be used for unusual/unexpected events rather than tracing. If this is a tight loop operation where 0.1us is important enough for performance to override functionality concerns, then we don't want audit events being raised that frequently. If it's fairly uncommon then may as well do the calculation and only raise relevant events to keep the noise down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue that an event loop stall should be both unusual and unexpected. But you're right, I'm not sure audit events are a perfect fit here.

The whole point behind this is that it's hard, or even impossible, to get metrics about event loop stalls in production right now without running your entire app in async debug mode. Which is interesting, because people who care about maximum performance would surely want accurate metrics on stalls whilst being totally adverse to running things in debug mode.

Some way to do something in response to a stall would be ideal. We maybe even don't need an exact time, just an idea that a stall took place and a way to consume that to do custom stuff ™️.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see different numbers with the fresh main branch:

>>> import timeit
>>> timeit.repeat("sys.audit(1); sys.audit(1)", "import sys")
[0.09353699500206858, 0.07102967600076227, 0.07030134299566271, 0.07094913300534245, 0.06989944599627052]
>>> timeit.repeat("time.monotonic(); time.monotonic()", "import time")
[0.19186055000318447, 0.16878722599358298, 0.16965712599630933, 0.1693656099960208, 0.16873898899939377]

Note: there is no audit hook subscribed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking up the monotonic attribute dominates the benchmark. You get much better speeds by assigning time.monotonic to a local variable once and using that.

I'm using 3.10, so maybe things have changed, but sys.audit(1) throws an exception for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest measuring Python 3.11 and the real scenario.
E.g. two loop.time() calls plus one sys.audit() vs two sys.audit() calls.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@github-actions github-actions bot removed the stale Stale PR or inactive for long period of time. label Aug 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants