Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TimeoutError not raised by 'timeout' context manager when "side errors" occur #96037

Open
gvanrossum opened this issue Aug 17, 2022 · 6 comments
Labels
3.11 3.12 type-feature A feature request or enhancement

Comments

@gvanrossum
Copy link
Member

gvanrossum commented Aug 17, 2022

Consider something like this:

async with asyncio.timeout(1):
    try:
        await asyncio.sleep(2)  # Will be interrupted after 1 second
    finally:
        1/0  # Crash in cleanup

This will report ZeroDivisionError but suppresses the fact that it happened during cleanup. Once #95761 lands the same problem will happen if the try/finally block happens to be in a task created using TaskGroup.

For end users who are looking why their code crashes it may be useful to know that the timeout context manager decided to cancel the thread, and that their crash was induced by the cancellation.

There are various ways we could solve this -- e.g. we might consider raising an ExceptionGroup that combines TimeoutError with the crash exception, but that would lead people to believe that they have to write except* TimeoutError to catch timeouts, and we don't want that (especially not since in 3.11 they don't have to).

I discussed this with @iritkatriel and she proposed that the timeout context manager could add a note to the crash exception (see PEP 678). For maximal usability the note should probably dig up the filename and linenumber of the caller and add those to its error message, so the note would read something like "Timed out at file <file>, line <line>" (details TBD by whoever writes a PR for this).

@gvanrossum gvanrossum added type-feature A feature request or enhancement 3.11 3.12 labels Aug 17, 2022
@serhiy-storchaka
Copy link
Member

serhiy-storchaka commented Aug 18, 2022

I think that __context__ of the new exception should be set to TimeoutError. It should not be different from:

    try:
        raise TimeoutError
    finally:
        1/0  # Crash in cleanup

@iritkatriel
Copy link
Member

iritkatriel commented Aug 18, 2022

I think that __context__ of the new exception should be set to TimeoutError.

That makes sense, but I think not obvious how to implement correctly - we would need to make it equivalent also to things like

try:
   raise TimeoutError
finally:
   try:
       raise ValueError
   except:
       1/0

(The timeout manager would need to chain the TimeoutError to the ValueError, not the ZeroDivisionError).

@iritkatriel
Copy link
Member

iritkatriel commented Aug 18, 2022

Another potential complication: I may be wrong, but I think it may be possible to get a cancellation due to timeout after one of the tasks raised an exception (while waiting for the other tasks to complete) and then this exception is parallel to the TimeoutError, not originating from within its context.

@iritkatriel
Copy link
Member

iritkatriel commented Aug 18, 2022

Actually I think this is a correct way to implement @serhiy-storchaka 's suggestion:

Before running the cleanup code of a cancelled task, set exc_info to be the TimeoutError, so it will be the context of whatever exception happens during cleanup. It won't be the context of the whole ExceptionGroup, but that's actually right because (as per my previous comment) I think there could be exceptions in the group that were not from cleanup code, and should not be in the context of the timeout.

@gvanrossum
Copy link
Member Author

gvanrossum commented Aug 18, 2022

But task cancellation may not be due to TimeoutError -- there are many other reasons to cancel. When a task is cancelled, it is made runnable and once the scheduler runs it, a CancelledError is thrown into the task's coroutine. The exc_timeout is set to this CancelledError. If you try this with master or the head of the 3.11 branch, you can see this. Here's a demo program (tt.py):

import asyncio

async def task():
    try:
        await asyncio.sleep(2)  # Will be interrupted after 1 second
    finally:
        1/0  # Crash in cleanup

async def main():
    async with asyncio.timeout(1):
        async with asyncio.TaskGroup() as tg:
            tg.create_task(task())
            await asyncio.sleep(1.5)

asyncio.run(main())

Here's the traceback when I run it:

  + Exception Group Traceback (most recent call last):
  |   File "/Users/guido/cpython/tt.py", line 15, in <module>
  |     asyncio.run(main())
  |   File "/Users/guido/cpython/Lib/asyncio/runners.py", line 189, in run
  |     return runner.run(main)
  |            ^^^^^^^^^^^^^^^^
  |   File "/Users/guido/cpython/Lib/asyncio/runners.py", line 117, in run
  |     return self._loop.run_until_complete(task)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 650, in run_until_complete
  |     return future.result()
  |            ^^^^^^^^^^^^^^^
  |   File "/Users/guido/cpython/tt.py", line 11, in main
  |     async with asyncio.TaskGroup() as tg:
  |   File "/Users/guido/cpython/Lib/asyncio/taskgroups.py", line 135, in __aexit__
  |     raise me from None
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/Users/guido/cpython/tt.py", line 5, in task
    |     await asyncio.sleep(2)  # Will be interrupted after 1 second
    |     ^^^^^^^^^^^^^^^^^^^^^^
    |   File "/Users/guido/cpython/Lib/asyncio/tasks.py", line 644, in sleep
    |     return await future
    |            ^^^^^^^^^^^^
    | asyncio.exceptions.CancelledError
    | 
    | During handling of the above exception, another exception occurred:
    | 
    | Traceback (most recent call last):
    |   File "/Users/guido/cpython/tt.py", line 7, in task
    |     1/0  # Crash in cleanup
    |     ~^~
    | ZeroDivisionError: division by zero
    +------------------------------------

No TimeoutError is ever raised in this case -- the timeout context manager's __aexit__ decides not to raise it, see the code.

If I add a note to the timeout __aexit__ at the point where it doesn't raise TimeoutError, it gets printed right after | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception) in the above traceback.

Instead of adding a note, we could set the __cause__ field of exc_val to a fresh TimeoutError instance, like this:

                exc_val.__cause__ = TimeoutError("This exception was caused by a timeout")

Then in the above example a TimeoutError is shown before the exception group:

TimeoutError: This exception was caused by a timeout

The above exception was the direct cause of the following exception:

  + Exception Group Traceback (most recent call last):
  |   File "/Users/guido/cpython/tt.py", line 15, in <module>
  (etc.)

(If we wanted to go this route we probably should set the cause of the new TimeoutError instance to the original cause.)

Honestly the note looks better -- we can get a TimeoutError as the cause but we cannot make its traceback show something useful. Adding a line+file note is easy:

            if exc_val is not None:
                f = sys._getframe(1)
                exc_val.add_note(f"This exception was the result of a timeout at line {f.f_lineno} in {f.f_code.co_filename}")

Now the output includes

  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  | This exception was the result of a timeout at line 10 in /Users/guido/cpython/tt.py

(Line 10 is async with asyncio.timeout(1):.)

@gvanrossum
Copy link
Member Author

gvanrossum commented Aug 18, 2022

Another potential complication: I may be wrong, but I think it may be possible to get a cancellation due to timeout after one of the tasks raised an exception (while waiting for the other tasks to complete) and then this exception is parallel to the TimeoutError, not originating from within its context.

You are not wrong. One task could crash (i.e. raise), which causes its sibling tasks to be cancelled (this is standard TG behavior), and then some sibling (or several) could crash while handling the CancelledError (e.g. in a finally clause as in my original example). The fact that the sibling crashed during cancellation cleanup is clear from the context of the crash exception: it is preceded by a traceback for a CancelledError and the phrase "During handling of the above exception, another exception occurred:".

And due to the async nature of all this, the timeout's _on_timeout handler could run after the first sibling crashes but before the second sibling gets to run its cleanup code. I haven't written a test for this scenario (it's difficult because it depends on a race condition) but I think in this case the sibling may be cancelled twice. And yet the timeout's __aexit__ will not be in the traceback. That's the thing we'd like to do something about.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.11 3.12 type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants