bpo-46771: Implement asyncio context managers for handling timeouts #31394

asvetlov · 2022-02-17T21:51:25Z

https://bugs.python.org/issue46771

asvetlov

@Tinche this is the first sketch.
It has no tests yet, sorry.

The idea is: let's use CancelScope class.
The class has a state.
If timeout occurs, the state is switched to cancelled and task.cancel() is called.
The fact of timeout occurrence is detected by the state check (self._state == _State.CANCELLED).
The explicit state machine is very useful, it simplifies both designing and debugging.

move_on_* family is renamed with cancel_scope + cancel_spoce_at.

fail_* family is renamed with timeout + timeout_at

Sync context managers are used to visually guarantee no suspension points.

I didn't implement CancelScope.cancel() because it is barely equal to task.cancel(). Can be added if we decide to support this.

Ironically, the cancellation message is not needed for implementation because the cancellation flag is set by call_at() callback. If something swallows raised CancelledError -- the flag is still set. We can add two flags actually: cancellation was raised and caught; I'm not sure if it is really needed. If we decide 'yes' .cancelling() flag can be added along with '.cancelled()'.

I'll try to add tests later.
Sorry, tomorrow I'll be traveling and inactive most likely.

@gvanrossum you may be interested in this preliminary design. It is a mix of trio-styled cancellation scopes and asyncio specific.

The code is pretty simple and understandable, at least from my point of view.
Sure, need to write a comprehensive test suite.
I'll gather tests from both async-timeout and quattro, plus add tests for mentioned edge cases.

Lib/asyncio/timeouts.py

Tinche · 2022-02-18T16:17:17Z

@asvetlov Thanks for starting this, I will dedicate some time over the weekend. How can I contribute tests to this though? You invited me to asvetlov/cpython, but this PR is from python/cpython. Sorry, I have not done this before :)

asvetlov · 2022-02-18T16:37:36Z

Clone my repo locally git clone git@github.com:asvetlov/cpython.git
Use issue-46771 branch to work on.
When you push into this remote branch, the upstream python/cpython reflects all changes.

BTW, I use GitHub CLI (https://cli.github.com/), it allows to checkout forks very easily.
Look on the top of this PR page for line asvetlov wants to merge 9 commits into main from issue-46771.

Click 'copy-to-clipboard' button right to the branch name.
Run gh pr checkout <paste-from-clipboard>.

The tool setups remote, checkouts branch, and switches to it.
git commit and git push updates the PR even if it is from a fork if you have access (admins and maintainers can edit contributors' PRs). Super easy.

asvetlov · 2022-02-18T16:50:13Z

Sorry, I was over-optimistic :(
A cancellation message is still required if CancelScope wants to swallow self-initiated cancellations (as Quatro does).

Tinche · 2022-02-18T17:03:58Z

Thanks for the explanation. The asvetlov repo has no issue-46771 branch but gh pr checkout did the trick.

Can CancelScope be a dataclass?

Tinche · 2022-02-18T17:14:59Z

@asvetlov Ok, for the discussion about more sophisticated nonce handling. Would you like me to do a writeup here, or open an issue on the asvetlov repo? I will try to explain the issue and a proposed solution.

You were right saying on the mailing list that asyncio doesn't have priorities, but I think with a little logic we can do better and maybe save our users from a potential footgun.

1st1 · 2022-02-18T17:26:05Z

@asvetlov

A cancellation message is still required if CancelScope wants to swallow self-initiated cancellations (as Quatro does).

Can you add a test that can only be fixed with a nonce? I'd like to take a look, because I'm still -1 on the nonce idea. We should figure out another way.

gvanrossum

Looking at the API only.

Lib/asyncio/timeouts.py

gvanrossum · 2022-02-18T17:33:11Z

I'm not sure if the discussion about cancel edge cases should move here, but my comments on that so far are:

Yes, there really is something that we need to solve.
Yes, the cancel-message solution works (I have a version for TaskGroup in a branch).
Solutions introducing a new "nonce" field seem to just add complications (e.g. the CancelledError exception would have to be modified to accept the nonce arg and store it, or if we make it positional, we'd end up with the awkward CancelledError(None, nonce) if there's a nonce but no message -- and usually there isn't a message.
If you make the nonce a float and order them, it's no longer a nonce. And despite using time.monotonic(), using timestamps to deal with races feels wrong.

Tinche · 2022-02-18T18:51:05Z

@1st1 Here's a test that fails without nonces:

    async def test_nested_timeouts_concurrent(self):
        with self.assertRaises(TimeoutError):
            with asyncio.timeout(0.002):
                try:
                    with asyncio.timeout(0.003):
                        # Pretend we crunch some numbers.
                        time.sleep(0.005)
                        await asyncio.sleep(1)
                except asyncio.TimeoutError:
                    pass

Both timeouts are marked as cancelled, but the inner timeout swallows the cancellation for the outer timeout. So the outer timeout never triggers.

asvetlov · 2022-02-18T18:55:23Z

@Tinche up to you.
I wanted to share access to work on timeouts/cancellation scopes work with you.
As the cpython core, I can write to any of your pull requests if you don't clear the corresponding check box.
You are not core dev (yet), so I granted you the write access to my clone. You cannot write to the main repo but can apply changes without PR-over-PR, apply, and merge workflow.
I believe it reduces friction.
If you want to demonstrate an alternative approach and a separate branch and PR works better for you -- please go ahead.
If you have a feeling that direct commit to this PR is more comfortable -- I'm fine with it.
For example, adding a new test is a perfect example for the second case.

gvanrossum · 2022-02-18T20:50:54Z

@1st1 Here's a test that fails without nonces:

    async def test_nested_timeouts_concurrent(self):
        with self.assertRaises(TimeoutError):
            with asyncio.timeout(0.002):
                try:
                    with asyncio.timeout(0.003):
                        # Pretend we crunch some numbers.
                        time.sleep(0.005)
                        await asyncio.sleep(1)
                except asyncio.TimeoutError:
                    pass

Both timeouts are marked as cancelled, but the inner timeout swallows the cancellation for the outer timeout. So the outer timeout never triggers.

Thinking about what happens here: When we reach the await asyncio.sleep(1), both callbacks run, both try to cancel the task, the inner __exit__() receives CancelledError, uncancels the task, and raises TimeoutError, which is called by the user code's except clause. Then the outer cancel scope's __exit__() is entered without an exception, and overall the task completes successfully.

The expectation is that the outer cancel scope should also raise TimeoutError, because its deadline is also exceeded. So perhaps the outer __exit__() should check whether its deadline is exceeded, and raise TimeoutError even though it did not see a CancelledError?

But it looks like there's a variant of the example that isn't fixed by such a check -- for example, we could add

        await asyncio.sleep(2)

after the try/except block (or even in the except clause), inside the outer cancel scope. Since the CancelledError exception has been wholly swallowed by the inner __exit__() this second await is not interrupted.

I'm guessing the fix for that would be to use a nonce (either using an explicit nonce API or via the cancel message). However, that still depends on the order in which the callbacks run. With the new cancel semantics (where extra cancels are ignored) whoever runs first wins, while with the old cancel semantics (where the most recent cancel gets to set the cancel message / nonce) whoever runs last wins -- but we don't want to rely on the order in which the callbacks run (since I have no idea in which order the current implementation runs them, given that they both become "ready" simultaneously, and we shouldn't depend on that). So that's why Tin is proposing to use timestamps (or a cancel stack, or some other mechanism that lets us arbitrate between the two cancel calls).

Note that this case is slightly different from the "web server cancels" problem, which was solvable without resorting to timestamps (the cancel scope callback would have to check whether the task was already being cancelled, using t.cancelling()).

I'm not sure yet what the right solution would be, but I'm glad that we're thingking about this scenario carefully.

Tinche · 2022-02-18T22:43:17Z

@gvanrossum thanks for the great write up, I believe you hit the nail on the head. I think to get this 100% correct, we need a way to throw the highest priority cancellation into the task on its iteration. The highest priority is the cancel scope that was entered first. It doesn't have to be a timestamp, we could have a global (or thread-local, but that's overthinking it) counter that the cancel scope takes from on __enter__.

The cancellation data could also be in a context var instead of on the CancelledError, I suppose? Would that be cleaner?

gvanrossum · 2022-02-19T15:36:41Z

I have a new proposal. Thinking about @asvetlov's diagrams some more, I think we can make everything work if the cancel state of a task kept a count of cancellations. This would replace the t._cancel_requested flag -- we could just make that flag an int instead of a bool. t.cancel() increments the flag. t.uncancel() decrements it (asserting that it stays >= 0). t.cancelling() returns the value of the flag, i.e. the number of pending cancellations.

Now any context manager cleanup handler, like CancelScope.__exit__() or TaskGroup.__aexit__(), when it has to decide whether to let a CancellationError bubble out or swallow it, can just look at the number of pending cancellations to decide. Such context managers should still keep track of whether they cancelled the task themselves or not (at least TaskGroup needs this to avoid cancelling multiple times), and if they did, they should always call t.uncancel(). If after that's done there's still a nonzero number of pending cancellations (as reported by t.cancelling()), they should let the cancellation bubble out, other wise they can swallow or replace it.

The only thing that's painful with this scheme is how to handle cancel messages. Hoever, (this is for @cjerdonek to disagree with :-) I think cancel messages were a bad idea to begin with -- if I read the bpo where it started, it came from a user who misunderstood how to print exceptions. It seems they are not always preserved regardless. I believe that any scheme that allows for multiple pending cancellations will have a hard time to keep track of which cancel message should be allowed to bubbled out. I'm guessing it could be solved by making t._cancel_requested a list, and giving t.uncancel() an optional argument that indicates which cancel message to remove from that list (None by default), but that feels like a complication I'd much rather avoid.

Tinche · 2022-02-19T16:29:13Z

@gvanrossum what happens to the count when the exception is thrown? Does it reset to 0 (I guess this would have to happen after the entire iteration is done) or does it stay, and only t.uncancel() can decrement it?

If it stays, I think essentially you have level-triggered cancellation then. Which I like, but it would be a compatibility break, right?

gvanrossum · 2022-02-19T17:00:28Z

It stays, but this is not the flag that causes the exception to be thrown -- for that we have _must_cancel, which remains a bool and is reset when the exception is actually thrown (this reset logic is here, the actual throw here.

So I don't think it's quite the same as level-triggered, IIUC (that would keep interrupting every await governed by the cancel scope until its __exit__() is reached, right?).

Tinche · 2022-02-19T21:26:39Z

Cool, I'm on board with this proposal then. (I got a little confused on _must_cancel vs _cancel_requested.) Want me to tinker around and try implementing this?

Your understanding of level-based cancellation is correct. In practice it helps remove a few footguns, especially around finally clauses. I feel like this counter approach might open the door to enable it (on an opt-in basis) in a future version of Python, but that's not a concern for the matter at hand.

gvanrossum · 2022-02-19T21:28:46Z

Sure, go ahead and send a draft PR!

Tinche · 2022-02-20T01:15:43Z

Got a working version here: #31434.

gvanrossum · 2022-02-22T01:42:17Z

Now that I've reverted the cancel counts some of the tests fail. If the plan is to merge the cancel count PR into this branch I will leave them in here, but if this branch is getting merged into main before that I will take them out too.

My intention is to first get the cancel counts PR agreed upon and landed, and then timeouts -- so the latter can depend on the former. (I haven't found your new cancel count PR yet -- can you link to it when you have it?)

agronholm · 2022-02-22T10:12:46Z

I'm still not fully convinced that the cancel counter system works in every imaginable scenario. I will attempt to poke holes in it by constructing an extremely pathological scenario, either tonight or tomorrow. If I can't make it fail, I'll give the thumbs up.

Tinche · 2022-02-22T19:11:54Z

@gvanrossum @asvetlov #31508 here we go, let me know if we want a different bpo in the PR title.

Tinche · 2022-02-22T23:30:07Z

Looks like there had been a misunderstanding, the new PR is here: #31513.

gvanrossum · 2022-02-26T16:35:55Z

Hey @asvetlov, did you just git rebase instead of git merge?

agronholm · 2022-02-27T12:46:47Z

It's taken me a while but I've been looking at cancel counters and I was not able to make them fail when all the involved context manager support uncancellation properly. There are other considerations, however, which make me still favor semantic changes to the current main branch. To understand everything perfectly I would need to know the exact motivations for adding uncancellation to tasks in the first place – what do task groups use it for (asking sincerely)? I looked at the code but I wasn't really sure.

gvanrossum · 2022-02-27T16:35:52Z

Suppose the task group user does something like this:

async def parent():
    async with asyncio.TaskGroup() as g:
        g.create_task(some_coro())
        await asyncio.sleep(1)
    print("Done")

This will normally print "Done" after the task some_coro() completes, but no sooner than 1 second after the start.

Several special cases exist:

Maybe the sleep() call is interrupted from "outside", e.g. perhaps the parent is wrapped in async with timeout(0.5); see comment in the code.
Maybe some_coro() fails before sleep() completes. In this case we need to interrupt sleep() so that we can raise (an exception group with) the exception from some_coro(); see comment in the code.
Both of these things may happen simultaneously, i.e. we may decide to cancel the parent task when it was already cancelled.

All these events will cause __aexit__() to be entered with a CancelledError.

In the first case, we need to abort the still-running tasks and than propagate the CancelledError out -- we detect this here and raise it here.
In the second case, we need to suppress the cancellation. That happens here.
When both happen at once, self._parent_task.uncancel() will return 1 or higher and we will let the CancelledError bubble out.

All in all these lines are crucial -- here we call parent_task.uncancel() and depending on the outcome decide to swallow or propagate the CancelledError. Note that normally when .uncancel() is called there will be only one pending cancellation (ours) so that it will return 0, so we swallow it. But if both events happened simultaneously, it will return 1 and we don't swallow.

Now, instead of calling .cancel() and then later .uncancel(), we could also decide to check .cancelling() (which returns the pending cancellation count) and not call .cancel() at all. Then when a CancelledError comes in to __aexit__(), we will know we didn't initiate it, and we'll propagate it out. But this falls afoul of the scenario where an outside cancellation happened after we have already decided to cancel (but before anything acted on it). How could that happen? It wouldn't if everyone stuck to the protocol of checking .cancelling() before calling .cancel(), so there would only ever be one pending cancellation. But it could if there was some pre-3.11 user code that called .cancel() without checking .cancelling(). By using matched .cancel()/.uncancel() pairs, task groups and timeouts (cancel scopes) are impervious to such race conditions -- whenever the cancel count is unbalanced they will let the cancellation bubble out.

The loser here is cancel messages -- but the winner is simplicity of the scheme, and I think deprecating cancel messages will improve simplicity again over the long run.

agronholm · 2022-02-27T17:08:04Z

What you just said echoes the need for a cancel counter/scope scheme, but not for the cancellation state changes. All of this cancel stuff could be done on a per-iteration basis. We just need to move the counter/scope logic to CancelledError and reset the cancellation state after it's been delivered to the task, as before 3.11. I don't think there is any way to make existing cancel scope-like context managers to play nice with each other (unless they explicitly cooperate, like AnyIO CancelScopes). But at least with this change, such context managers would continue to work as well as they did before 3.11. Imagine having an async operation in a try...finally block, and the finally block then performs async finalization within a context manager that provides a timeout. This will stop working with the current main branch code because the context manager never uncancels the task and thus the timeout can never trigger a new cancellation on the task.

Now, as for the differences between cancel counters and tokens/nonces, I didn't find anything that would favor one solution over the other, so I'll be content with counters so long as we move the logic to be per-iteration.

gvanrossum · 2022-02-27T17:25:16Z

I am just guessing that when you're pleading for a "per-iteration" scheme you're talking about the two lines added to Task.cancel() that make it a no-op (other than bumping the counter) if the counter is already nonzero, right?

And the counter-example you use is something like

try:
    async with timeout(A):
        try:
            asyncio.sleep(1)
        finally:
            async with timeout(B):
                asyncio.sleep(2)

Where if timeout(A) cancels sleep(1), timeout(B) is powerless to interrupt sleep(2).

Right?

agronholm · 2022-02-27T17:36:05Z

I am just guessing that when you're pleading for a "per-iteration" scheme you're talking about the two lines added to Task.cancel() that make it a no-op (other than bumping the counter) if the counter is already nonzero, right?

As I recall, the backwards incompatible change that was made in main was about a task staying in cancelled state until task.uncancel() was explicitly called as many times as task.cancel(). This behavior would have to change so that the timeout context manager in finally: would at least be able to schedule a new cancellation via loop.call_soon(task.cancel).

And the counter-example you use is something like
try:
    async with timeout(A):
        try:
            asyncio.sleep(1)
        finally:
            async with timeout(B):
                asyncio.sleep(2)
Where if timeout(A) cancels sleep(1), timeout(B) is powerless to interrupt sleep(2).

Right?

Exactly.

gvanrossum · 2022-02-27T21:08:43Z

When I remove the two offending lines from Task.cancel() the only test that fails is test_cancelling(), which tests specifically for this behavior. And in particular, everything in test_taskgroups.py passes.

So that suggests that task groups don't actually depend on this behavior.

I would like to wait for @asvetlov to respond to the counter-example. IIUC he's got more important stuff to worry about for now.

@asvetlov

Also from the _asyncio C accelerator module, and adjust one test that the change caused to fail. For more discussion see the discussion starting here: #31394 (comment) (Basically, @asvetlov proposed to return False from cancel() when there is already a pending cancellation, and I went along, even though it wasn't necessary for the task group implementation, and @agronholm has come up with a counterexample that fails because of this change. So now I'm changing it back to the old semantics (but still bumping the counter) until we can have a proper discussion about this.)

gvanrossum · 2022-02-28T20:26:21Z

@agronholm, @Tinche, @asvetlov, @1st1 -- I've submitted a PR (#31623) that restores the old semantics of Task.cancel(), but I would like to have a longer discussion about this, since I suspect that the old semantics also have some downside (probably for task groups, maybe also for cancel scopes).

Since @asvetlov is in a war zone we should be patient. (Andrew, I hope you and your family are safe!)

@asvetlov

Also from the _asyncio C accelerator module, and adjust one test that the change caused to fail. For more discussion see the discussion starting here: #31394 (comment) (Basically, @asvetlov proposed to return False from cancel() when there is already a pending cancellation, and I went along, even though it wasn't necessary for the task group implementation, and @agronholm has come up with a counterexample that fails because of this change. So now I'm changing it back to the old semantics (but still bumping the counter) until we can have a proper discussion about this.)

asvetlov · 2022-03-02T15:01:37Z

The PR is ready.
It has tests but there is no documentation yet.

P.S. My family and I are good, thanks.

kumaraditya303

This looks good. I have left some comments, will do a full review by tomorrow. Thanks

Lib/asyncio/timeouts.py

gvanrossum

A few nits. I'm getting some local test failures on Windows, will investigate later.

gvanrossum · 2022-03-02T21:38:55Z

Lib/asyncio/timeouts.py

+import enum
+
+from types import TracebackType
+from typing import final, Any, Dict, Optional, Type


Suggested change

from typing import final, Any, Dict, Optional, Type

from typing import final, Optional, Type

gvanrossum · 2022-03-02T21:38:55Z

Lib/asyncio/timeouts.py

+        self._state = _State.CREATED
+
+        self._timeout_handler: Optional[events.TimerHandle] = None
+        self._task: Optional[tasks.Task[Any]] = None


Suggested change

self._task: Optional[tasks.Task[Any]] = None

self._task: Optional[tasks.Task] = None

I got a complaint from pyright (VS Code + pylance + pyright in strict mode) because Task is generic in typeshed but not in the stdlib, and this being part of the stdlib it resolves to the stdlib Task. We can make it slightly more happy by leaving out the [Any] -- that's the default anyways.

gvanrossum · 2022-03-02T21:38:55Z

Lib/asyncio/timeouts.py

+    def __repr__(self) -> str:
+        info = ['']
+        if self._state is _State.ENTERED:
+            info.append(f"when={self._when:.3f}")


This will crash if when is None. Test:

async with asyncio.timeout(None) as t: print(repr(t))

gvanrossum · 2022-03-02T21:38:55Z

Lib/asyncio/timeouts.py

+def timeout_at(when: Optional[float]) -> Timeout:
+    """Schedule the timeout at absolute time.
+
+    when argument points on the time in the same clock system


Suggested change

when argument points on the time in the same clock system

Like `timeout() but argument gives absolute time in the same clock system

gvanrossum · 2022-03-02T21:38:55Z

Lib/asyncio/timeouts.py

+    Useful in cases when you want to apply timeout logic around block
+    of code or in cases when asyncio.wait_for is not suitable. For example:
+
+    >>> with timeout(10):  # 10 seconds timeout
+    ...     await long_running_task()


Maybe explain that long_running_task() will receive CancelledError, but the context manager will raise TimeoutError.

gvanrossum · 2022-03-02T21:38:55Z

Lib/asyncio/timeouts.py

+    Useful in cases when you want to apply timeout logic around block
+    of code or in cases when asyncio.wait_for is not suitable. For example:
+
+    >>> with timeout(10):  # 10 seconds timeout


Suggested change

>>> with timeout(10): # 10 seconds timeout

>>> async with timeout(10): # 10 seconds timeout

gvanrossum · 2022-03-02T21:38:55Z

Lib/asyncio/timeouts.py

+
+
+def timeout(delay: Optional[float]) -> Timeout:
+    """timeout context manager.


Suggested change

"""timeout context manager.

"""timeout async context manager.

gvanrossum

Never mind about the Windows errors, they were because I hadn't recompiled the _asyncio extension. Sorry for the false alarm.

Here are a lot of nits about the tests. Let me know if you want me to just fix everything and I will.

gvanrossum · 2022-03-03T06:04:13Z