Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a PEP578 audit hook for Asyncio loop stalls #88241

Open
orf mannequin opened this issue May 8, 2021 · 16 comments
Open

Add a PEP578 audit hook for Asyncio loop stalls #88241

orf mannequin opened this issue May 8, 2021 · 16 comments
Labels
3.13 bugs and security fixes topic-asyncio type-feature A feature request or enhancement

Comments

@orf
Copy link
Mannequin

orf mannequin commented May 8, 2021

BPO 44075
Nosy @tiran, @asvetlov, @1st1, @zooba, @orf
PRs
  • gh-88241: Add asyncio.stalled audit hook #25990
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2021-05-08.12:19:46.330>
    labels = ['3.11', 'type-feature', 'expert-asyncio']
    title = 'Add a PEP578 audit hook for Asyncio loop stalls'
    updated_at = <Date 2022-03-19.12:00:57.728>
    user = 'https://github.com/orf'

    bugs.python.org fields:

    activity = <Date 2022-03-19.12:00:57.728>
    actor = 'asvetlov'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['asyncio']
    creation = <Date 2021-05-08.12:19:46.330>
    creator = 'orf'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 44075
    keywords = ['patch']
    message_count = 6.0
    messages = ['393251', '393253', '393255', '393263', '393267', '415542']
    nosy_count = 5.0
    nosy_names = ['christian.heimes', 'asvetlov', 'yselivanov', 'steve.dower', 'orf']
    pr_nums = ['25990']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue44075'
    versions = ['Python 3.11']

    @orf
    Copy link
    Mannequin Author

    orf mannequin commented May 8, 2021

    Detecting and monitoring loop stalls in a production asyncio application is more difficult than it could be.

    Firstly you must enable debug mode for the entire loop then you need to look for warnings outputted via the asyncio logger. This makes it hard to send loop stalls to monitoring systems via something like statsd.

    Ideally asyncio callbacks would always be timed and an auditevent always triggered if it passes a particular threshold. If debug mode is enabled then a warning is logged.

    @orf orf mannequin added 3.11 only security fixes topic-asyncio type-feature A feature request or enhancement labels May 8, 2021
    @tiran
    Copy link
    Member

    tiran commented May 8, 2021

    Are you proposing to use PEP-578 for monitoring the event loop?

    @orf
    Copy link
    Mannequin Author

    orf mannequin commented May 8, 2021

    I don't see why we shouldn't use PEP-578 for this - the events provide rich monitoring information about what a Python process is "doing" with an easy, central way to register callbacks to receive these events and shovel them off to a monitoring solution.

    Is there that much of a difference between monitoring the number of files, sockets, emails or even web browsers opened and the number of times an asyncio application has stalled?

    The alternative would be to make the loop stalling some kind of hookable event, which just seems like reinventing sys.audit().

    @zooba
    Copy link
    Member

    zooba commented May 8, 2021

    Fundamentally I don't have an issue with the audit hook. My only concern would be if there's anything that an application may do to _respond_ to a stall (e.g. is this valuable for applying backpressure? etc.)

    If it's purely diagnostic, and there's nothing you'd do in production when it happens, then an audit hook is perfect.

    @orf
    Copy link
    Mannequin Author

    orf mannequin commented May 8, 2021

    Actually reacting to a stall would require something more and probably should be done at some point.

    But this is purely about monitoring - in our use case we'd send a metric via statsd that would be used to correlate stalls against other service level metrics. This seems pretty critical when running a large number of asyncio applications in production because you can only currently _infer_ that a stall is happening, and it's hard to trace the cause across service boundaries. An event hook that was sent the loop and handle would be ideal for this.

    @asvetlov
    Copy link
    Contributor

    I am still not convinced that audit events should be used.

    Maybe support of explicit callbacks pair (on_start() + on_finish()) with None for fast-and-cheap "do nothing flag" is a better alternative for catching stale coroutines?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @kumaraditya303
    Copy link
    Contributor

    I agree with Andrew that audit events should not be used here. Also the change will slow down the eventloop as it need to constantly monitor the time.

    @gvanrossum What do you think?

    @gvanrossum
    Copy link
    Member

    I presume this is about this code. (OP @orf, please confirm if you're still listening.)

    I'm not sure I buy the reason why audit events are more desirable than using the logger -- the latter seems intended for these kinds of things, and it's logging at a pretty high level (warning, just short of error). But maybe we can make it so that you don't have to restart in debug mode, which I agree is annoying.

    The key question is, can we afford to call self.time() twice for each handle._run() call? That seems excessive outside debug mode. Maybe we can time the entire for-loop even outside debug mode? It might require tweaking the slow_callback_duration variable (currently set to 0.1 second), or having a separate variable?

    @kumaraditya303
    Copy link
    Contributor

    @orf Are you still interested in this? If not then we can close this issue.

    @kumaraditya303 kumaraditya303 added pending The issue will be closed if no feedback is provided and removed 3.11 only security fixes labels Oct 23, 2022
    @orf
    Copy link
    Contributor

    orf commented Oct 23, 2022

    Hey! I’m definitely still interested in the general idea of being able to accurately track loop stalls in production - to me this is absolutely critical.

    We just use a variant of aiodebug, which works by patching Handle._run.

    I do appreciate the discussion around this, but realistically I think the overhead of fetching the time or triggering an audit event is perfectly acceptable for real world applications that want to benefit from asyncio.

    By not allowing this, or by locking it behind a “debug” flag with other overheads, we seem to be either targeting microbenchmarks or the elusive segment of production apps that are perfectly coded and so don’t require any kind of monitoring, but also require maximum throughput.

    All this is to say, I don’t see how we can’t afford to pay this overhead in some way. And, ideally this overhead should be opt out, for the rare cases it’s not needed.

    @orf
    Copy link
    Contributor

    orf commented Oct 23, 2022

    Using the existing logger is one approach, but hooking it into something like datadog/stated is more effort than it should be, with a higher overhead.

    I don’t know the right approach here to be honest. All I can phrase it as is “detecting and correlating loop stalls + performance issues in an automated way is more difficult than it could be”.

    @gvanrossum
    Copy link
    Member

    What if we added a new API that allowed you to set a callback when a loop stall is detected? By default it would do nothing and skip the two time() calls (on my 3 year old Mac one monotonic() call takes around 100 nsec).

    @orf
    Copy link
    Contributor

    orf commented Oct 24, 2022

    I think a new API ion the loop for this would be perfect, and it’s a lot more explicit than a logger + an env var.

    @gvanrossum
    Copy link
    Member

    Cool. Do you want to submit a PR? If you want to you we can first bikeshed the design a bit here (mostly the naming of the new method, and the signature of the callback, and whether there needs to be an API to get the current callback, etc. -- possibly following asyncio traditions).

    @orf
    Copy link
    Contributor

    orf commented Oct 25, 2022

    Sure! I’m going on holiday next week, but I’ll be able to work on a PR after.

    One thing I wanted to consider was multiple callbacks. Should be constrain ourselves to a single callback? An integration library from datadog or Sentry would likely want to register their own callbacks here, which could cause conflicts.

    @gvanrossum
    Copy link
    Member

    Yeah, that’s an eternal API design dilemma, it seems. We have this problem for signals. So let’s be robust and over-engineer it this time. :-)

    @kumaraditya303 kumaraditya303 added 3.12 bugs and security fixes topic-asyncio and removed topic-asyncio pending The issue will be closed if no feedback is provided labels Oct 26, 2022
    @erlend-aasland erlend-aasland removed the 3.12 bugs and security fixes label Jan 5, 2024
    @erlend-aasland erlend-aasland added the 3.13 bugs and security fixes label Jan 5, 2024
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.13 bugs and security fixes topic-asyncio type-feature A feature request or enhancement
    Projects
    Status: Todo
    Development

    No branches or pull requests

    7 participants