Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TaskGroup suppresses grouped exceptions due to parent task cancellation #95704

Open
graingert opened this issue Aug 5, 2022 · 11 comments
Open
Labels
3.11 3.12 expert-asyncio type-bug

Comments

@graingert
Copy link
Contributor

graingert commented Aug 5, 2022

TaskGroup suppresses grouped exceptions due to parent task cancellation

The suggestion was to always check for Cancellation in every finally block in async context managers, however sometimes there's no finally block visible and the problem applies to synchronous context managers too

In this demo I'd expect to be able to print: "handled attribute error"

import asyncio
import contextlib

async def child(sock):
    try:
        with contextlib.closing(sock):
            await asyncio.sleep(2)
    except Exception as e:
        print(f"child task got error: {type(e)=} {e=}")
        raise

class Sock:
    async def aclose(self):
        await asyncio.sleep(1)

async def main():
    try:
        sock = Sock()
        async with asyncio.timeout(1):
            async with asyncio.TaskGroup() as tg:
                # Make two concurrent calls to child()
                tg.create_task(child(Sock()))
                tg.create_task(child(Sock()))
                async with contextlib.aclosing(sock):
                    await asyncio.sleep(2)
    except* AttributeError:
        print("handled attribute error")


asyncio.run(main())
@graingert graingert added type-bug expert-asyncio 3.11 3.12 labels Aug 5, 2022
@gvanrossum
Copy link
Member

gvanrossum commented Aug 6, 2022

I walked through this one with a debugger and you're right that we can do better. It seems the essence of the reproducer is that we enter TaskGroup__aexit__ with a CancellationError but without having cancelled any tasks yet, and then one or more tasks catch the CancellationError and raise something else (in your case, AttributeError because contextlib.closing() calls sock.close() during cleanup, which doesn't exist). This code seems to repro the same issue but without a fake socket class or context managers, and with only one child:

import asyncio

async def child():
    try:
        await asyncio.sleep(2)
    except asyncio.CancelledError:
        print("** Cancelled **")
        raise RuntimeError

async def main():
    async with asyncio.TaskGroup() as tg:
        tg.create_task(child())
        await asyncio.sleep(0.5)
        raise asyncio.CancelledError

asyncio.run(main())

It prints ** Cancelled ** and then raises CancelledError, whereas I agree that it would be more useful to report the RuntimeError

Alas, the obvious fix (don't honor propagate_cancellation_error if self._errors is non-empty) breaks two of the asyncio tests.

@gvanrossum
Copy link
Member

gvanrossum commented Aug 7, 2022

Looks like this is the correct fix:

--- a/Lib/asyncio/taskgroups.py
+++ b/Lib/asyncio/taskgroups.py
@@ -115,7 +115,8 @@ async def __aexit__(self, et, exc, tb):
         if self._base_error is not None:
             raise self._base_error

-        if propagate_cancellation_error is not None:
+        if (propagate_cancellation_error is not None and
+                (self._parent_cancel_requested or not self._errors)):
             # The wrapping task was cancelled; since we're done with
             # closing all child tasks, just propagate the cancellation
             # request now.

@grangert Do you agree? If so, do you want to make a PR with the new test added?

@gvanrossum
Copy link
Member

gvanrossum commented Aug 7, 2022

Spoke too soon, now test_cancellation_in_body fails. :-(

@gvanrossum
Copy link
Member

gvanrossum commented Aug 7, 2022

And so does test_taskgroup_11. It's a game of whack-a-mole.

Oh wait, sometimes I get no failures, sometimes I get test_taskgroup_12 failing. FWIW I am testing on Windows.

@gvanrossum
Copy link
Member

gvanrossum commented Aug 7, 2022

Interestingly I don't get any failures on Mac UPDATE: it crashes much less frequently on Mac. What do you get on Linux?

@gvanrossum
Copy link
Member

gvanrossum commented Aug 7, 2022

So test_taskgroup_11 has a race condition. Each of the 5 tasks created sleeps 0.1 second and then crashes, raising ZeroDivisionError. The main task sleeps 0.1 second and then cancels the runner task, which cancels the crashing tasks. The test is written to always expect a CancelledError, but if any of the 5 crashing tasks finishes its sleep before being cancelled, it will raise ZeroDivisionError and the change to TaskGroup.__aexit__ makes it so that in that cases it raises an ExceptionGroup wrapping the ZeroDivisionError(s).

IIUC test_taskgroup_12 has the same race, the difference with test_taskgroup_11 is that it uses two nested TaskGroups.

It looks like test_cancellation_in_body has the same race condition again. (This looks like it used to be called test_taskgroup_08 in EdgeDb; @Tinche renamed it when he added a cancel message to it in #31559 -- which was then removed again by @asvetlov when he deprecated cancel messages. :-)

Now I'd love to ask @1st1 what his intention was with these tests -- did he want them to be "racy"? (That would have been annoying, because racy tests just destabilize the test suite.) Assuming not, is the scenario the tests attempts to test the one where the crashers raise ZeroDivisionError, or the one where they are cancelled in their sleep?

Discuss.

@Tinche
Copy link
Contributor

Tinche commented Aug 7, 2022

Going through the thread. Going by the 'errors should never pass silently' I also agree we should make it so the example from #95704 (comment) raises an EG with RuntimeError.

Assuming not, is the scenario the tests attempts to test the one where the crashers raise ZeroDivisionError, or the one where they are cancelled in their sleep?

I copy/pasted those tests into quattro, and at that time for test_taskgroup_08 I added the following docstring: """Cancelling a TG does not propagate children exceptions.""" (but I wrote it as my interpretation of the test). So I'm guessing the children were supposed to raise, and those exceptions were supposed to get swallowed.

@Tinche
Copy link
Contributor

Tinche commented Aug 7, 2022

Oh, and the quattro test isn't flaky, at least I couldn't make it fail. But it is racy, sometimes the children get cancelled and sometimes they don't, but the test passes in both cases, so it's still a bad test.

@gvanrossum
Copy link
Member

gvanrossum commented Aug 7, 2022

Thanks, @Tinche. What would quattro do with the example from #95704 (comment) (assuming that's a meaningful question)? Does it raise CancelledError or (an EG wrapping a) RuntimeError? If it raises CancelledError, that explains why the test isn't flaky for quattro -- it's not flaky in asyncio on main either, but it would be if we were to commit my fix to taskgroups.py without fixing those three racy tests.

Your docstring might well express the intent of the test, but @graingert's complaint here is that cancelling a TG should propagate the children's exceptions.

Come to think of it, maybe this intent explains the raciness of the test: the idea might have been to cancel the test before any of the children have crashed, but to have at least one child crash before the cancellation reaches TaskGroup.__aexit__. (Hence creating 5 children -- to increase the chance of at least one of them actually crashing.) @graingert's demo (or my reduction of it) would seem to be a much more reliable way to elicit this effect: the child task crashes during cleanup, after it has caught the CancelledError.

So I'm guessing the children were supposed to raise, and those exceptions were supposed to get swallowed.

But they weren't supposed to raise too soon -- note the line (after the main task has slept 0.1 seconds, and just before it cancels the runner task)

        self.assertFalse(r.done())

which indicates that we expect the runner to still be running -- IOW the TG hasn't exited yet, which looks like a proxy for the assertion that the TG hasn't processed any crashing child tasks yet, and that in turn is attempting to assert that no child task has crashed yet at this point.

So shortening the sleep in the crasher task would defeat the purpose of the test, and make it fail -- the TG would self-cancel when the first child crashes before the main task can cancel the runner task. But lengthening the sleep in the crasher would also defeat the purpose of the test, while making it pass -- the crasher tasks would never crash.

In conclusion I think the three affected tests might have to be modified to use the following crasher:

        async def foo():
            try:
                await asyncio.sleep(1)
            finally:
                1 / 0

and they would have to be changed to expect an EG wrapping ZeroDivisionError instead of CancellationError (not wrapped by an EG).

@Tinche
Copy link
Contributor

Tinche commented Aug 7, 2022

What would quattro do with the example from #95704 (comment)

It raises CancelledError. But you have to remember the quattro TG is the EdgeDB TG with minimal changes (type hints and tweaks).

Your docstring might well express the intent of the test, but @graingert's complaint here is that cancelling a TG should propagate the children's exceptions.

Yeah, I think it makes sense to change it in asyncio, and quattro will follow whatever asyncio does here.

Come to think of it, maybe this intent explains the raciness of the test

I wager you're right. I personally wouldn't write the test this way though, I would write several tests for each scenario, so I vote for rewriting these tests.

In conclusion I think the three affected tests might have to be modified

Can we modify the tests to avoid sleeps with these hardcoded timeouts?

What about rewriting test_cancellation_in_body like this:

async def test_cancellation_in_body_child_propagation(self):
        """
        If the task running the TaskGroup is blocked on something and then
        cancelled, the TG children are cancelled too.

        If the children raise an exception different than `CancelledError`
        while being stopped, those exceptions are raised in an ExceptionGroup.
        """
        LONG_SLEEP = 10  # The awaiter is expected to be cancelled waiting on this.

        exc = RuntimeError()

        async def foo():
            try:
                await asyncio.sleep(LONG_SLEEP)
            except CancelledError:
                # We raise, and expect it to be propagated.
                raise exc  # Something other than a CancelledError.

        async def runner():
            async with taskgroups.TaskGroup() as g:
                g.create_task(foo())

                await asyncio.sleep(LONG_SLEEP)

        r = asyncio.create_task(runner())
        await asyncio.sleep(0.1)

        self.assertFalse(r.done())
        r.cancel()
        with self.assertRaises(ExceptionGroup) as cm:
            await r

        self.assertEqual(cm.exception == ExceptionGroup([exc]))

(I didn't actually run this because I don't have your changes, but you get the idea.)

@gvanrossum
Copy link
Member

gvanrossum commented Aug 7, 2022

Just about, although that's not the way to test for an EG.

gvanrossum added a commit to gvanrossum/cpython that referenced this issue Aug 7, 2022
When a task catches CancelledError and raises some other error,
the other error should not silently be suppressed.

Exception: when uncancel() != 0, always propagate CancelledError.

NOTE: This represents a change in behavior (hence the need to
change three tests).  But it is only an edge case.

We need to discuss with the RM whether to backport to 3.11RC2,
or 3.11.1, or not at all.
gvanrossum added a commit that referenced this issue Aug 17, 2022
When a task catches CancelledError and raises some other error,
the other error should not silently be suppressed.

Any scenario where a task crashes in cleanup upon cancellation
will now result in an ExceptionGroup wrapping the crash(es)
instead of propagating CancelledError and ignoring the side errors.

NOTE: This represents a change in behavior (hence the need to
change several tests).  But it is only an edge case.

Co-authored-by: Thomas Grainger <tagrain@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Aug 17, 2022
…pythonGH-95761)

When a task catches CancelledError and raises some other error,
the other error should not silently be suppressed.

Any scenario where a task crashes in cleanup upon cancellation
will now result in an ExceptionGroup wrapping the crash(es)
instead of propagating CancelledError and ignoring the side errors.

NOTE: This represents a change in behavior (hence the need to
change several tests).  But it is only an edge case.

Co-authored-by: Thomas Grainger <tagrain@gmail.com>
(cherry picked from commit f51f54f)

Co-authored-by: Guido van Rossum <guido@python.org>
miss-islington added a commit that referenced this issue Aug 17, 2022
…5761)

When a task catches CancelledError and raises some other error,
the other error should not silently be suppressed.

Any scenario where a task crashes in cleanup upon cancellation
will now result in an ExceptionGroup wrapping the crash(es)
instead of propagating CancelledError and ignoring the side errors.

NOTE: This represents a change in behavior (hence the need to
change several tests).  But it is only an edge case.

Co-authored-by: Thomas Grainger <tagrain@gmail.com>
(cherry picked from commit f51f54f)

Co-authored-by: Guido van Rossum <guido@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.11 3.12 expert-asyncio type-bug
Projects
None yet
Development

No branches or pull requests

3 participants