Skip to content

gh-88241: Add asyncio.stalled audit hook #25990

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions Doc/library/asyncio-dev.rst
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,8 @@ When the debug mode is enabled:
minimum execution duration in seconds that is considered "slow".


.. audit-event:: asyncio.stalled time,handle,formatted_handle,loop
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also needs a third argument that is the anchor that should be on the link to get back here (probably asyncio-debug-mode but you can look at the existing docs on the web site to see which section is best)


.. _asyncio-multithreading:

Concurrency and Multithreading
Expand Down
24 changes: 12 additions & 12 deletions Lib/asyncio/base_events.py
Original file line number Diff line number Diff line change
Expand Up @@ -1866,19 +1866,19 @@ def _run_once(self):
handle = self._ready.popleft()
if handle._cancelled:
continue
if self._debug:
try:
self._current_handle = handle
t0 = self.time()
handle._run()
dt = self.time() - t0
if dt >= self.slow_callback_duration:
logger.warning('Executing %s took %.3f seconds',
_format_handle(handle), dt)
finally:
self._current_handle = None
else:
try:
self._current_handle = handle
t0 = self.time()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar enough to know whether these are okay to do even when not debugging, particularly the _current_handle assignment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The report should be in debug mode only, sure.
Two timer calls per callback are not free.

Copy link
Contributor Author

@orf orf Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way of detecting a stall? I did a quick bit of bechmarking and a raw, non-lookup'd call to time.time() or time.monotonic() is about as fast as you can get (~50ns), on-par with invoking lambda: None or def foo(): pass.

A 100ns overhead per callback leaves a lot of breathing room.

handle._run()
dt = self.time() - t0
if dt >= self.slow_callback_duration:
formatted_handle = _format_handle(handle)
sys.audit("asyncio.stalled", dt, handle, formatted_handle, self)
Copy link
Contributor

@asvetlov asvetlov Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why all three arguments (handle, formatted handle, and loop) are passed?
I suspect the only handle is enough for a low-level tool.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It might also help to have an event at the point where a handle is created, so there's another event to correlate with, but you certainly don't want to be eagerly formatting anything.

Assume the audit hook is merely print to a log that's going to be interpreted by a human (or a separate machine). It's not an opportunity for an application to change its behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting.
If we add two events (handle started and handle finished) asyncio doesn't need to calculate the handle callback execution time.
The audit hook can do it itself if needed.
Also, it means that both hooks can be called even in non-debug mode.
The calling is cheap, analyzing can be expensive but it is not an asyncio problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculating the handle callback execution time seems really quite cheap, unless I'm missing something. In fact it seems it would be far cheaper to calculate the time difference and invoke an audit hook once rather than invoke an audit hook twice:

In [2]: sys.addaudithook(lambda *f: None)
In [3]: a = sys.audit
In [5]: %timeit a("foo")
174 ns ± 1.16 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [7]: m = time.monotonic
In [8]: %timeit m()
60.3 ns ± 1.18 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Copy link
Member

@zooba zooba Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People who care about maximum performance won't have any audit hooks (or they'll have native ones, which are considerable quicker than pure Python ones), so it's not a fair the right comparison.

Still, I'd rather audit events be used for unusual/unexpected events rather than tracing. If this is a tight loop operation where 0.1us is important enough for performance to override functionality concerns, then we don't want audit events being raised that frequently. If it's fairly uncommon then may as well do the calculation and only raise relevant events to keep the noise down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue that an event loop stall should be both unusual and unexpected. But you're right, I'm not sure audit events are a perfect fit here.

The whole point behind this is that it's hard, or even impossible, to get metrics about event loop stalls in production right now without running your entire app in async debug mode. Which is interesting, because people who care about maximum performance would surely want accurate metrics on stalls whilst being totally adverse to running things in debug mode.

Some way to do something in response to a stall would be ideal. We maybe even don't need an exact time, just an idea that a stall took place and a way to consume that to do custom stuff ™️.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see different numbers with the fresh main branch:

>>> import timeit
>>> timeit.repeat("sys.audit(1); sys.audit(1)", "import sys")
[0.09353699500206858, 0.07102967600076227, 0.07030134299566271, 0.07094913300534245, 0.06989944599627052]
>>> timeit.repeat("time.monotonic(); time.monotonic()", "import time")
[0.19186055000318447, 0.16878722599358298, 0.16965712599630933, 0.1693656099960208, 0.16873898899939377]

Note: there is no audit hook subscribed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking up the monotonic attribute dominates the benchmark. You get much better speeds by assigning time.monotonic to a local variable once and using that.

I'm using 3.10, so maybe things have changed, but sys.audit(1) throws an exception for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest measuring Python 3.11 and the real scenario.
E.g. two loop.time() calls plus one sys.audit() vs two sys.audit() calls.

if self._debug:
logger.warning('Executing %s took %.3f seconds',
formatted_handle, dt)
finally:
self._current_handle = None
handle = None # Needed to break cycles when an exception occurs.

def _set_coroutine_origin_tracking(self, enabled):
Expand Down
19 changes: 19 additions & 0 deletions Lib/test/audit-tests.py
Original file line number Diff line number Diff line change
Expand Up @@ -381,6 +381,25 @@ def hook(event, *args):
raise RuntimeError("Expected sqlite3.load_extension to fail")


def test_asyncio():
import asyncio

def hook(event, args):
if event == "asyncio.stalled":
print(event, *args[1:])

sys.addaudithook(hook)

ev = asyncio.new_event_loop()
ev.slow_callback_duration = 0.0

def stop_loop(loop):
loop.stop()

ev.call_soon(stop_loop, ev)
ev.run_forever()


if __name__ == "__main__":
from test.support import suppress_msvcrt_asserts

Expand Down
8 changes: 8 additions & 0 deletions Lib/test/test_audit.py
Original file line number Diff line number Diff line change
Expand Up @@ -167,6 +167,14 @@ def test_sqlite3(self):
]
self.assertEqual(actual, expected)

def test_asyncio(self):
returncode, events, stderr = self.run_python("test_asyncio")
if returncode:
self.fail(stderr)
if support.verbose:
print(*events, sep='\n')
self.assertEqual(events[0][0], "asyncio.stalled")


if __name__ == "__main__":
unittest.main()
1 change: 1 addition & 0 deletions Misc/NEWS.d/next/Security/2021-05-08.bpo-44075.bS3XJ9.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Add audit events for Asyncio loop stalling. Patch by Tom Forbes.