Add a PEP578 audit hook for Asyncio loop stalls #88241
Detecting and monitoring loop stalls in a production asyncio application is more difficult than it could be. First you must enable debug mode for the entire loop, then you need to look for warnings output via the asyncio logger. This makes it hard to send loop stalls to monitoring systems via something like statsd. Ideally, asyncio callbacks would always be timed and an audit event triggered whenever one exceeds a particular threshold. If debug mode is enabled, a warning is still logged as today.
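For context, the workflow described above looks roughly like this today (standard asyncio APIs; the exact wording of the warning may vary between versions):

```python
import asyncio
import logging
import time

# In debug mode asyncio logs "Executing <Handle ...> took N seconds"
# at WARNING level on the "asyncio" logger when a callback blocks the loop.
logging.basicConfig(level=logging.WARNING)

async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.1  # stall threshold in seconds (default 0.1)
    time.sleep(0.5)                    # blocking call: stalls the event loop
    await asyncio.sleep(0)

# Debug mode must be enabled for the whole loop just to see the warning.
asyncio.run(main(), debug=True)
```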
Are you proposing to use PEP-578 for monitoring the event loop?
I don't see why we shouldn't use PEP-578 for this - the events provide rich monitoring information about what a Python process is "doing", with an easy, central way to register callbacks to receive these events and shovel them off to a monitoring solution. Is there that much of a difference between monitoring the number of files, sockets, emails or even web browsers opened and the number of times an asyncio application has stalled? The alternative would be to make the loop stalling some kind of hookable event, which just seems like reinventing PEP-578.
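A rough sketch of what the consuming side could look like under this proposal; the event name asyncio.loop_stall, its argument tuple, and the send_metric helper are all hypothetical, since nothing like this exists in CPython today:

```python
import sys

def send_metric(name, value):
    # Stand-in for a real statsd/Datadog client call.
    print(f"metric {name}={value:.3f}s")

def monitoring_hook(event, args):
    # Hypothetical audit event and payload proposed in this issue.
    if event == "asyncio.loop_stall":
        loop, handle, duration = args
        send_metric("asyncio.stall", duration)

# PEP 578: one central place to subscribe to everything the process does.
sys.addaudithook(monitoring_hook)
```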
Fundamentally I don't have an issue with the audit hook. My only concern would be whether there's anything an application might do to _respond_ to a stall (e.g. is this valuable for applying backpressure?). If it's purely diagnostic, and there's nothing you'd do in production when it happens, then an audit hook is perfect.
Actually reacting to a stall would require something more and probably should be done at some point, but this is purely about monitoring - in our use case we'd send a metric via statsd that would be used to correlate stalls against other service-level metrics. This seems pretty critical when running a large number of asyncio applications in production, because currently you can only _infer_ that a stall is happening, and it's hard to trace the cause across service boundaries. An event hook that is passed the loop and handle would be ideal for this.
I am still not convinced that audit events should be used here. Maybe support for an explicit pair of callbacks (on_start() + on_finish()) with …
I agree with Andrew that audit events should not be used here. Also, the change will slow down the event loop, as it needs to constantly monitor the time. @gvanrossum What do you think?
I presume this is about this code. (OP @orf, please confirm if you're still listening.) I'm not sure I buy the reason why audit events are more desirable than using the logger -- the latter seems intended for these kinds of things, and it's logging at a pretty high level (warning, just short of error). But maybe we can make it so that you don't have to restart in debug mode, which I agree is annoying. The key question is: can we afford to call time.monotonic() around every callback?
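For reference, the per-callback timing in question is essentially this pattern (a simplified sketch of what debug mode does around each handle, not the actual _run_once code):

```python
import time

def run_handle_timed(handle, slow_callback_duration, on_stall):
    # Two monotonic() reads bracketing every callback -- the overhead
    # being debated above.
    t0 = time.monotonic()
    handle._run()
    dt = time.monotonic() - t0
    if dt >= slow_callback_duration:
        on_stall(handle, dt)
```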
@orf Are you still interested in this? If not then we can close this issue.
Hey! I’m definitely still interested in the general idea of being able to accurately track loop stalls in production - to me this is absolutely critical. We just use a variant of aiodebug, which works by patching ….

I do appreciate the discussion around this, but realistically I think the overhead of fetching the time or triggering an audit event is perfectly acceptable for real-world applications that want to benefit from asyncio. By not allowing this, or by locking it behind a “debug” flag with other overheads, we seem to be targeting either microbenchmarks or the elusive segment of production apps that are perfectly coded and so don’t require any kind of monitoring, but also require maximum throughput.

All this is to say, I don’t see how we can’t afford to pay this overhead in some way. And ideally this overhead should be opt-out, for the rare cases where it’s not needed.
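The detail of what gets patched was cut off in migration; one common approach (a sketch in the spirit of aiodebug, not its actual code) is to wrap asyncio.events.Handle._run so every callback is timed loop-wide:

```python
import asyncio.events
import time

def patch_handle_run(threshold, on_stall):
    """Monkeypatch Handle._run to time every callback; illustrative only."""
    original_run = asyncio.events.Handle._run

    def timed_run(self):
        t0 = time.monotonic()
        try:
            return original_run(self)
        finally:
            dt = time.monotonic() - t0
            if dt >= threshold:
                on_stall(self, dt)

    asyncio.events.Handle._run = timed_run

# Report any callback that blocks the loop for more than 100 ms.
patch_handle_run(0.1, lambda handle, dt: print(f"stall {dt:.3f}s in {handle}"))
```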
Using the existing logger is one approach, but hooking it into something like Datadog/statsd is more effort than it should be, with higher overhead. I don’t know the right approach here, to be honest. All I can phrase it as is “detecting and correlating loop stalls and performance issues in an automated way is more difficult than it could be”.
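To illustrate that effort: bridging the existing warnings to a metrics backend means scraping log messages, roughly like the sketch below (the warning format is approximate and send_metric is a hypothetical helper):

```python
import logging
import re

def send_metric(name, value):
    print(f"metric {name}={value:.3f}s")  # stand-in for a statsd/Datadog client

class StallMetricHandler(logging.Handler):
    """Turn asyncio's 'Executing <Handle ...> took N seconds' warnings into metrics."""
    _pattern = re.compile(r"took ([0-9.]+) seconds")

    def emit(self, record):
        match = self._pattern.search(record.getMessage())
        if match:
            send_metric("asyncio.stall", float(match.group(1)))

logging.getLogger("asyncio").addHandler(StallMetricHandler())
```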
What if we added a new API that allowed you to set a callback when a loop stall is detected? By default it would do nothing and skip the two time() calls (on my 3-year-old Mac one monotonic() call takes around 100 nsec).
I think a new API on the loop for this would be perfect, and it’s a lot more explicit than a logger + an env var.
Cool. Do you want to submit a PR? If you want, we can first bikeshed the design a bit here (mostly the naming of the new method, the signature of the callback, and whether there needs to be an API to get the current callback, etc. -- possibly following asyncio traditions).
Sure! I’m going on holiday next week, but I’ll be able to work on a PR after. One thing I wanted to consider was multiple callbacks. Should we constrain ourselves to a single callback? An integration library from Datadog or Sentry would likely want to register its own callbacks here, which could cause conflicts.
Yeah, that’s an eternal API design dilemma, it seems. We have this problem for signals. So let’s be robust and over-engineer it this time. :-)
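To have something concrete to bikeshed against, here is one hypothetical shape that supports multiple registrants and skips the clock reads when nothing is registered (none of these methods exist on asyncio event loops today):

```python
import time

class StallCallbacksMixin:
    """Hypothetical loop mixin sketching the proposed API."""

    def __init__(self):
        self._stall_callbacks = []          # multiple registrants, e.g. Datadog + Sentry
        self.slow_callback_duration = 0.1   # threshold in seconds

    def add_stall_callback(self, callback):
        """Register callback(loop, handle, duration), invoked after a slow callback."""
        self._stall_callbacks.append(callback)

    def remove_stall_callback(self, callback):
        self._stall_callbacks.remove(callback)

    def _run_handle(self, handle):
        # With no callbacks registered, skip both monotonic() calls entirely.
        if not self._stall_callbacks:
            handle._run()
            return
        t0 = time.monotonic()
        handle._run()
        dt = time.monotonic() - t0
        if dt >= self.slow_callback_duration:
            for callback in self._stall_callbacks:
                callback(self, handle, dt)
```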