Add a PEP578 audit hook for Asyncio loop stalls #88241
Detecting and monitoring loop stalls in a production asyncio application is more difficult than it could be. First you must enable debug mode for the entire loop, then you need to look for warnings output via the asyncio logger. This makes it hard to send loop stalls to monitoring systems via something like statsd. Ideally, asyncio callbacks would always be timed and an audit event triggered whenever one exceeds a particular threshold. If debug mode is enabled, a warning is still logged as today.
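For context, the workflow described above looks roughly like this today (standard asyncio APIs; the exact wording of the warning may vary between versions):

```python
import asyncio
import logging
import time

# In debug mode asyncio logs "Executing <Handle ...> took N seconds"
# at WARNING level on the "asyncio" logger when a callback blocks the loop.
logging.basicConfig(level=logging.WARNING)

async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.1  # stall threshold in seconds (default 0.1)
    time.sleep(0.5)                    # blocking call: stalls the event loop
    await asyncio.sleep(0)

# Debug mode must be enabled for the whole loop just to see the warning.
asyncio.run(main(), debug=True)
```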
Are you proposing to use PEP-578 for monitoring the event loop?
I don't see why we shouldn't use PEP-578 for this - the events provide rich monitoring information about what a Python process is "doing", with an easy, central way to register callbacks to receive these events and shovel them off to a monitoring solution. Is there that much of a difference between monitoring the number of files, sockets, emails or even web browsers opened and the number of times an asyncio application has stalled? The alternative would be to make the loop stalling some kind of hookable event, which just seems like reinventing PEP-578.
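A rough sketch of what the consuming side could look like under this proposal; the event name asyncio.loop_stall, its argument tuple, and the send_metric helper are all hypothetical, since nothing like this exists in CPython today:

```python
import sys

def send_metric(name, value):
    # Stand-in for a real statsd/Datadog client call.
    print(f"metric {name}={value:.3f}s")

def monitoring_hook(event, args):
    # Hypothetical audit event and payload proposed in this issue.
    if event == "asyncio.loop_stall":
        loop, handle, duration = args
        send_metric("asyncio.stall", duration)

# PEP 578: one central place to subscribe to everything the process does.
sys.addaudithook(monitoring_hook)
```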
Fundamentally I don't have an issue with the audit hook. My only concern would be whether there's anything an application might do to _respond_ to a stall (e.g. is this valuable for applying backpressure?). If it's purely diagnostic, and there's nothing you'd do in production when it happens, then an audit hook is perfect.
Actually reacting to a stall would require something more and probably should be done at some point, but this is purely about monitoring - in our use case we'd send a metric via statsd that would be used to correlate stalls against other service-level metrics. This seems pretty critical when running a large number of asyncio applications in production, because currently you can only _infer_ that a stall is happening, and it's hard to trace the cause across service boundaries. An event hook that is passed the loop and handle would be ideal for this.
I am still not convinced that audit events should be used here. Maybe support for an explicit pair of callbacks (on_start() + on_finish()) with …
I agree with Andrew that audit events should not be used here. Also, the change will slow down the event loop, as it needs to constantly monitor the time. @gvanrossum What do you think?
I presume this is about this code. (OP @orf, please confirm if you're still listening.) I'm not sure I buy the reason why audit events are more desirable than using the logger -- the latter seems intended for these kinds of things, and it's logging at a pretty high level (warning, just short of error). But maybe we can make it so that you don't have to restart in debug mode, which I agree is annoying. The key question is: can we afford to call time.monotonic() around every callback?
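For reference, the per-callback timing in question is essentially this pattern (a simplified sketch of what debug mode does around each handle, not the actual _run_once code):

```python
import time

def run_handle_timed(handle, slow_callback_duration, on_stall):
    # Two monotonic() reads bracketing every callback -- the overhead
    # being debated above.
    t0 = time.monotonic()
    handle._run()
    dt = time.monotonic() - t0
    if dt >= slow_callback_duration:
        on_stall(handle, dt)
```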
@orf Are you still interested in this? If not then we can close this issue.
Hey! I’m definitely still interested in the general idea of being able to accurately track loop stalls in production - to me this is absolutely critical. We just use a variant of aiodebug, which works by patching ….

I do appreciate the discussion around this, but realistically I think the overhead of fetching the time or triggering an audit event is perfectly acceptable for real-world applications that want to benefit from asyncio. By not allowing this, or by locking it behind a “debug” flag with other overheads, we seem to be targeting either microbenchmarks or the elusive segment of production apps that are perfectly coded and so don’t require any kind of monitoring, but also require maximum throughput.

All this is to say, I don’t see how we can’t afford to pay this overhead in some way. And ideally this overhead should be opt-out, for the rare cases where it’s not needed.
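The detail of what gets patched was cut off in migration; one common approach (a sketch in the spirit of aiodebug, not its actual code) is to wrap asyncio.events.Handle._run so every callback is timed loop-wide:

```python
import asyncio.events
import time

def patch_handle_run(threshold, on_stall):
    """Monkeypatch Handle._run to time every callback; illustrative only."""
    original_run = asyncio.events.Handle._run

    def timed_run(self):
        t0 = time.monotonic()
        try:
            return original_run(self)
        finally:
            dt = time.monotonic() - t0
            if dt >= threshold:
                on_stall(self, dt)

    asyncio.events.Handle._run = timed_run

# Report any callback that blocks the loop for more than 100 ms.
patch_handle_run(0.1, lambda handle, dt: print(f"stall {dt:.3f}s in {handle}"))
```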
Using the existing logger is one approach, but hooking it into something like Datadog/statsd is more effort than it should be, with higher overhead. I don’t know the right approach here, to be honest. All I can phrase it as is “detecting and correlating loop stalls and performance issues in an automated way is more difficult than it could be”.
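To illustrate that effort: bridging the existing warnings to a metrics backend means scraping log messages, roughly like the sketch below (the warning format is approximate and send_metric is a hypothetical helper):

```python
import logging
import re

def send_metric(name, value):
    print(f"metric {name}={value:.3f}s")  # stand-in for a statsd/Datadog client

class StallMetricHandler(logging.Handler):
    """Turn asyncio's 'Executing <Handle ...> took N seconds' warnings into metrics."""
    _pattern = re.compile(r"took ([0-9.]+) seconds")

    def emit(self, record):
        match = self._pattern.search(record.getMessage())
        if match:
            send_metric("asyncio.stall", float(match.group(1)))

logging.getLogger("asyncio").addHandler(StallMetricHandler())
```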
What if we added a new API that allowed you to set a callback when a loop stall is detected? By default it would do nothing and skip the two time() calls (on my 3-year-old Mac one monotonic() call takes around 100 nsec).
I think a new API on the loop for this would be perfect, and it’s a lot more explicit than a logger + an env var.
Cool. Do you want to submit a PR? If you want, we can first bikeshed the design a bit here (mostly the naming of the new method, the signature of the callback, and whether there needs to be an API to get the current callback, etc. -- possibly following asyncio traditions).
Sure! I’m going on holiday next week, but I’ll be able to work on a PR after. One thing I wanted to consider was multiple callbacks. Should we constrain ourselves to a single callback? An integration library from Datadog or Sentry would likely want to register its own callbacks here, which could cause conflicts.
Yeah, that’s an eternal API design dilemma, it seems. We have this problem for signals. So let’s be robust and over-engineer it this time. :-)
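To have something concrete to bikeshed against, here is one hypothetical shape that supports multiple registrants and skips the clock reads when nothing is registered (none of these methods exist on asyncio event loops today):

```python
import time

class StallCallbacksMixin:
    """Hypothetical loop mixin sketching the proposed API."""

    def __init__(self):
        self._stall_callbacks = []          # multiple registrants, e.g. Datadog + Sentry
        self.slow_callback_duration = 0.1   # threshold in seconds

    def add_stall_callback(self, callback):
        """Register callback(loop, handle, duration), invoked after a slow callback."""
        self._stall_callbacks.append(callback)

    def remove_stall_callback(self, callback):
        self._stall_callbacks.remove(callback)

    def _run_handle(self, handle):
        # With no callbacks registered, skip both monotonic() calls entirely.
        if not self._stall_callbacks:
            handle._run()
            return
        t0 = time.monotonic()
        handle._run()
        dt = time.monotonic() - t0
        if dt >= self.slow_callback_duration:
            for callback in self._stall_callbacks:
                callback(self, handle, dt)
```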