Cannot cleanly shut down an asyncio based server #113538
This might be a symptom of a more general issue: there needs to be a way to cancel everything that depends on the event loop we are shutting down. Currently it is possible to cancel tasks, async generators, and executors. The issue here is (sort of) about shutting down pending file descriptors. But it might also be necessary to be able to cancel pending futures, and possibly more things?
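For reference, the existing hooks mentioned above (tasks, async generators, the default executor) can be driven roughly like this; a sketch of what `asyncio.run()` itself does once the main coroutine finishes, not code from this thread:

```python
import asyncio

def shutdown_loop(loop: asyncio.AbstractEventLoop) -> None:
    # Cancel remaining tasks, then shut down async generators and the
    # default executor -- the clean-up asyncio already knows how to do.
    tasks = asyncio.all_tasks(loop)
    for task in tasks:
        task.cancel()
    if tasks:
        loop.run_until_complete(asyncio.gather(*tasks, return_exceptions=True))
    loop.run_until_complete(loop.shutdown_asyncgens())
    loop.run_until_complete(loop.shutdown_default_executor())
```

What this has no handle on is exactly the subject of this issue: open transports and file descriptors that are not owned by any task.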
I've always assumed that it is up to the application to keep track of connections (etc.) in a form that is suitable for shutting down. After all, in a typical application there's more to clean up than the client connections -- there may be database connections, and who knows what else to handle.

I would definitely not mess with pending futures -- those are sufficiently low-level that you cannot know who is waiting for them and why, and in which order they should be cancelled.

The key point of control you have is probably cancelling tasks -- tasks can do cleanup in response to a cancellation request, and if your tasks form a tree (or even a DAG), they will be cancelled recursively (if task t1 is waiting for task t2, cancelling t1 first cancels t2). So perhaps your application should keep track of the main task per connection?
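A sketch of what "keep track of the main task per connection" could look like with the streams API; the registry, handler, and function names are illustrative, not part of asyncio:

```python
import asyncio

connection_tasks: set[asyncio.Task] = set()

async def handle_connection(reader: asyncio.StreamReader,
                            writer: asyncio.StreamWriter) -> None:
    try:
        while data := await reader.read(1024):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

def on_connection(reader, writer):
    # One "main" task per connection, registered so that shutdown code
    # can find and cancel it later.
    task = asyncio.create_task(handle_connection(reader, writer))
    connection_tasks.add(task)
    task.add_done_callback(connection_tasks.discard)

async def cancel_connections() -> None:
    tasks = list(connection_tasks)
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
```

Passing on_connection (a plain function) to asyncio.start_server() gives the application a handle on every connection's task, which cancel_connections() can then use at shutdown.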
Another thought is that this might be something that a higher-level framework or library could offer. E.g. gunicorn or aiohttp? (Note that I've never used either, so I may be way off base here.)
asyncio unfortunately makes this difficult, as it is the one in control of creating the connections; that happens inside `loop.create_server()`. Clean-up should hopefully be possible using the normal signals from the transports, i.e. the protocol's `connection_lost()` callback.
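The transport-level signal here is presumably the protocol's connection_lost() callback, which asyncio invokes once a transport is closed or aborted; a minimal sketch of per-connection clean-up hooked into it:

```python
import asyncio

class TrackedProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport

    def connection_lost(self, exc):
        # Invoked when the transport is closed or aborted, whether by the
        # peer, by the application, or during server shutdown.  Any
        # per-connection clean-up (caches, database handles, ...) goes here.
        self.transport = None
```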
You are probably right. But it does make me uneasy that there isn't a way to centrally force a clean-up on shutdown. If you have to build your own everywhere, then it's easy to make a mistake and overlook something.
Unfortunately, there isn't always a task involved. Sometimes there is just a protocol, or a construct using futures and callbacks.
Maybe. Those two specific projects are not sufficient, though, as they are just for HTTP.
Okay, let's dive into the original problem a bit more. I think I would like to see a "scaled-down" example of a realistic server that we might be able to use to demonstrate the problem, reason about it, and evaluate potential solutions. It should be small enough that I can understand the code in 5-10 minutes (my typical attention span :-). Would you be willing to code up a first attempt at such an example? We can iterate on it together.

Regarding Futures, these are used all over the place (internally in asyncio) with specific semantics, sometimes involving cancellation. If we were to start cancelling Futures automatically when a loop is stopped, that could have all sorts of unexpected consequences -- probably worse than leaving them alone. So let's declare this out of scope (or if it really bugs you, please open a separate issue, as it has nothing to do with closing servers).
This should be a rough mockup of the situation:

```python
#!/usr/bin/python3

import asyncio


class EchoServerProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        peername = transport.get_extra_info('peername')
        print('Connection from {}'.format(peername))
        self.transport = transport

    def data_received(self, data):
        message = data.decode()
        print('Data received: {!r}'.format(message))
        print('Send: {!r}'.format(message))
        self.transport.write(data)
        print('Close the client socket')
        self.transport.close()


def stop():
    loop = asyncio.get_running_loop()
    loop.stop()


def server_up(fut):
    global server
    server = fut.result()


loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

fut = asyncio.ensure_future(loop.create_server(
    lambda: EchoServerProtocol(),
    '127.0.0.1', 8888))
fut.add_done_callback(server_up)

# Simulate termination
loop.call_later(2.5, stop)

try:
    loop.run_forever()
finally:
    server.close()
    # Can't run this as it will hang:
    #loop.run_until_complete(server.wait_closed())
    loop.close()
```

If you run this in dev mode and connect a client, you see the issue:

```
$ python3.12 -X dev echosrv.py
Connection from ('127.0.0.1', 56290)
sys:1: ResourceWarning: unclosed <socket.socket fd=7, family=2, type=1, proto=6, laddr=('127.0.0.1', 8888), raddr=('127.0.0.1', 56290)>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/usr/lib64/python3.12/asyncio/selector_events.py:875: ResourceWarning: unclosed transport <_SelectorSocketTransport fd=7>
_warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
```
This wasn't a problem until 3.11.5. Here's a simpler example of what I think is the same root issue:

```python
#!/usr/bin/env python3

import asyncio


async def main():
    async def on_connection(reader, writer):
        while True:
            buf = await reader.read(1024)
            print("got bytes", buf)

    server = await asyncio.start_unix_server(on_connection, path="/tmp/sock")
    cr, cw = await asyncio.open_unix_connection(path="/tmp/sock")
    print("sending hi")
    cw.write(b"hi")
    await cw.drain()
    await asyncio.sleep(0.2)
    server.close()
    await server.wait_closed()


asyncio.run(main())
```

Then, using pyenv to set the Python version and run it, we get the following:
I have yet to look at the underlying code but I'm guessing the server implementation in <=3.11.4 is cancelling the reader.read coroutine and ignoring the cancellation error as I'd expect.
Skipping 3.11.6; it's identical to 3.11.5.

Then 3.12.0 and 3.12.1 hang indefinitely on `await server.wait_closed()`.
looks like this is related to #109538
With @CendioOssman's example (connecting a client that doesn't send anything, just hangs) I get different results -- at first I thought I saw what they saw, but then I couldn't repro it again. Weird, let's say that's my fault. With @jsundahl's example I get exactly what they got; main is the same as 3.12. Let me do some spelunking.
A bit more...
Alas, the fix waits until all handlers complete, which in this example is never. Hence the hang. I guess the correct way to write the handler is to put a timeout on the read.

I have to think about this more, but I'm out of time for today. Sorry.
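For illustration, a handler along those lines might look like this (a sketch; the 60-second limit and the EOF handling are arbitrary choices, not from the comment above):

```python
import asyncio

async def on_connection(reader: asyncio.StreamReader,
                        writer: asyncio.StreamWriter) -> None:
    try:
        while True:
            # Bound the read so an idle client cannot keep this handler
            # (and hence server.wait_closed()) alive forever.
            async with asyncio.timeout(60):
                buf = await reader.read(1024)
            if not buf:  # EOF
                break
            print("got bytes", buf)
    except TimeoutError:
        pass
    finally:
        writer.close()
        await writer.wait_closed()
```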
I think the use case for such an API would be, e.g., a clean version of:

```python
for fd, transport in loop._transports.items():
    transport.abort()

loop.run_until_complete(server.wait_closed())
```
One workaround: if you're happy with the old (pre-3.11) behavior, just delete the `await server.wait_closed()` call. If you don't trust your clients, you can just exit the process without prejudice, right? Perhaps after a timeout. Any warnings or errors getting logged at that point will help you track down the unreliable bits.

I am reluctant to backport any more fixes to 3.12 or 3.11, given how sketchy this has been already. I don't think we can afford to keep track of all transports -- those are owned by the application, not by the loop. However, you could (after a delay) enumerate all tasks and cancel them. That should clean things up.
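A sketch of that workaround -- close the server, wait with a grace period, then cancel whatever tasks are still lingering (the function name and the five-second grace period are illustrative):

```python
import asyncio

async def shutdown(server: asyncio.Server, grace_period: float = 5.0) -> None:
    server.close()
    try:
        # Give well-behaved clients a chance to finish...
        await asyncio.wait_for(server.wait_closed(), grace_period)
    except TimeoutError:
        # ...then cancel the handler tasks that are still hanging around.
        for task in asyncio.all_tasks():
            if task is not asyncio.current_task():
                task.cancel()
```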
@gvanrossum that's only kind of a solution for 3.12. For 3.11 there's still that "event loop is closed" issue. It seems to me like it should be feasible for the server to own its "on_connection" tasks and cancel them on close.
That is, some way to forcibly cancel this task, defined here: cpython/Lib/asyncio/streams.py, line 258 in 7d01fb4.
Okay, now I see one improvement: where the handler task's completion is handled, we can add

```python
if task.cancelled():
    transport.close()
    return
```

and this can be backported to 3.12 and 3.11. I think you will find that with this change, even when that task lingers because the connection is never closed, it doesn't cause any problems. Shall I proceed with that?

@kumaraditya303 I suppose this was an oversight in your original PR #111601?
Yeah, that's definitely an improvement that could be made. I don't think that fixes the hang in Python 3.12 though, but maybe that's not something to take on for 3.12. I'm going to try to find some time in the next couple of days to dig in a bit deeper and get a dev environment set up to play around with this issue.
I'm actually getting cold feet about that fix. The task exists because the client_connected_cb passed to start_server() may be a coroutine function. Not much is said in the docs about what happens to that task afterwards: if it's a coroutine (i.e., a coroutine function), it is wrapped in a Task and scheduled. But what should happen if the task is cancelled? The current state of affairs is that the transport is simply left open (hence the ResourceWarnings).
FWIW the hang is because the callback never returns and the (toy) application has neither a timeout on its reads nor another way to make it exit its infinite loop. One way for the application to have more control would be to use a plain function as the client_connected_cb.

The most likely reason the task is cancelled would seem to be when ^C is hit. In this case, if we don't close the transport, we get these messages (in addition to the traceback):

If we do close the transport, that's reduced to this:

Those remaining messages are about the client side of the example. So, concluding, I think that closing the transport when the task is cancelled is the right thing to do. If you agree, please approve the PR.
…task is cancelled (pythonGH-113690) (cherry picked from commit 4681a52) Co-authored-by: Guido van Rossum <guido@python.org>
…task is cancelled (pythonGH-113690) (cherry picked from commit 4681a52) Co-authored-by: Guido van Rossum <guido@python.org>
@jsundahl @CendioOssman Is there anything else you think needs to be done here? You have my view on the hang in 3.12.
nope, thanks for taking action on this @gvanrossum!
You're welcome! I hope I didn't break yet another workflow this time. :-) |
In case that wasn't clear,
Unfortunately, that only discusses tasks. In my example, there are no tasks, and hence no mechanism to cancel/close anything. So I'm afraid I still think there is something missing in the API.
But I'd like to avoid cluttering up logs with those warnings and errors. There is nothing unreliable in the given scenario. External actors simply cannot generally be trusted to be well-behaved, and a robust server should be able to deal with that. There might also be other clean-ups that would trigger on a transport close that currently aren't being executed.

The ideal scenario might look something like the staged shutdown sketched below (ideally steps 2, 4, and 6 are not fixed sleeps, but can finish early if everything terminates cleanly).
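For illustration, a staged shutdown along those lines might look roughly like this. It is only a sketch: it assumes the Server.close_clients()/Server.abort_clients() helpers available in newer Python versions (3.13+), and uses fixed grace periods where the steps would ideally finish early:

```python
import asyncio

async def staged_shutdown(server: asyncio.Server, grace: float = 5.0) -> None:
    # 1. Stop accepting new connections.
    server.close()
    try:
        # 2. Give existing clients a chance to finish on their own.
        await asyncio.wait_for(server.wait_closed(), grace)
        return
    except TimeoutError:
        pass
    # 3. Close remaining client connections (buffered data is flushed first).
    server.close_clients()
    try:
        # 4. Wait for those closes to complete.
        await asyncio.wait_for(server.wait_closed(), grace)
        return
    except TimeoutError:
        pass
    # 5. Forcibly drop whatever is still connected.
    server.abort_clients()
    # 6. Final wait; if even this times out, just give up and exit.
    try:
        await asyncio.wait_for(server.wait_closed(), grace)
    except TimeoutError:
        pass
```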
The quick fix would probably be to promote `loop._transports` to a public API.
I'm very reluctant to make a public API for `loop._transports`.

What I also learned from that issue is that transports are supposed to clean up when they are GC'ed. I am guessing that in your scenario this isn't enough, because the protocol never exits so it never releases the transport. (I forget what keeps protocols alive -- it could be just a reference cycle between transport and protocol.)

Finally, there's a whole other world on Windows, where instead of selectors we (by default) use proactors. A proactor event loop does not keep track of its transports, because there is no need (proactors don't support add_reader() and friends).
Aha! While @jsundahl's example used the streams API (start_unix_server()), the original example here uses a plain Protocol. So the question is how to set a timeout on a protocol. I think the answer is that your protocol should define a method that aborts its transport, and schedule a call to that method (e.g. with loop.call_later()) when a timeout is wanted. If that looks like a lot of boilerplate, maybe it is, but I'm sure you could define a helper class to make it simpler.
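A sketch of such a protocol-level timeout (the class name, timeout value, and echo behaviour are all illustrative):

```python
import asyncio

class TimeoutEchoProtocol(asyncio.Protocol):
    IDLE_TIMEOUT = 60.0  # arbitrary choice for the sketch

    def connection_made(self, transport):
        self.transport = transport
        loop = asyncio.get_running_loop()
        # Schedule a forced shutdown of this connection.
        self._timeout_handle = loop.call_later(self.IDLE_TIMEOUT, self._on_timeout)

    def data_received(self, data):
        self.transport.write(data)

    def _on_timeout(self):
        # Abort rather than close: don't wait for buffered data to drain.
        self.transport.abort()

    def connection_lost(self, exc):
        # Make sure the timer doesn't fire after the connection is gone.
        self._timeout_handle.cancel()
```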
By this you are referring to clients, right? I completely agree. But I think it is the job of the application to deal with these -- the server infrastructure cannot know when a client is assumed to be misbehaving.

If you get warnings in the logs due to misbehaving clients, your first reaction ought to be to add something to your application code (e.g. a timeout like I suggested above). That's why the warnings are being logged in the first place.

So, I am still firm in my insistence that applications already have everything they need to handle misbehaving clients (in particular those that never close their connection), and a universal mechanism to track down and close all transports is not needed.
I'm going to assume silence is assent. Later this week I'd like to close this issue as completed, unless there is pushback. (If you need more time or feel I am too pushy, just add a brief note and I'll keep it open.)
That is the current use case, but I don't see why that would prevent us from adding more things if we think they are useful and worthwhile? :) We can always lock down correct behaviour using unit tests.
Abandoning things doesn't seem very Pythonic. And the current code complains with a ResourceWarning, so I am not convinced that's the intended way to do things. I read that comment as "it's a common antipattern to abandon transports, so we cannot rely on GC to clean them up".

And transports are kept alive by the main loop anyway, so they cannot be abandoned. Which is a good thing!

If you're using a protocol-based system, the protocol will often not be able to start tasks until it has had some initial communication over the transport (e.g. reading an HTTP request).
That's just a small matter of implementation. :) The key issue is what API we want/need. I'm confident there is a way to implement tracking of transports on Windows as well.
We actually prefer the low-level API for our core stuff. It makes the asynchronous nature more visible, which is very important when trying to write robust code. In a coroutine, it's much easier to miss the points where it executes concurrently with other things, and to create races.
I think there might be some misunderstanding of the scenario here. Our use case is when the service is shutting down. Until then, the clients can have all the time in the world. Which means we cannot start the timeout from the protocol. It has no idea when the service is shutting down.
I fully agree! The application is the only party that can decide this. So that's not what we're asking for. What we're asking for is better tools to find the misbehaving clients. Which in this scenario is every client¹, so the definition is pretty straightforward.

¹ So "misbehaving" might not be a fully appropriate description in this case.
I'm afraid I still don't agree. Yes, it is possible to solve the issue. But it is messy, which to me indicates something is missing in the core infrastructure. We need a list of all active transports so we can call close() or abort() on them.

Options that aren't really available:

a) Do it in the protocol factory. This is more tightly coupled to the protocol implementation than I would like.

b) Subclassing the event loop or the transport classes.
I'm sorry, but I still think you are misunderstanding the architecture around transports and protocols. What you want to do is access all transports associated with a particular Server. I would have less of a problem with a proposal that added a weak set of transports associated with a particular Server.
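For illustration, such tracking can be approximated today from the application side by wrapping the protocol factory; everything here (class and method names included) is hypothetical, not an asyncio API:

```python
import asyncio
import weakref

class TransportTracker:
    """Wraps a protocol factory so every accepted transport is recorded."""

    def __init__(self, protocol_factory):
        self._factory = protocol_factory
        self.transports = weakref.WeakSet()

    def __call__(self):
        protocol = self._factory()
        original_connection_made = protocol.connection_made

        def connection_made(transport):
            # Record the transport, then hand off to the real protocol.
            self.transports.add(transport)
            original_connection_made(transport)

        # Shadow the method on this instance only.
        protocol.connection_made = connection_made
        return protocol

    def abort_all(self):
        for transport in list(self.transports):
            transport.abort()
```

Usage would be along the lines of passing TransportTracker(EchoServerProtocol) to loop.create_server() and calling tracker.abort_all() at shutdown; a Server-owned weak set, as suggested above, would make such a wrapper unnecessary.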
Where? The
Bug report
Bug description:
When writing an asyncio-based service, you basically have this sequence:

```python
loop.run_forever()
loop.stop()
```

If there are any connections active at this point, then they don't get discarded until interpreter shutdown, with the result that you get a bunch of ResourceWarnings (and cleanup code might not run).

It would be very useful if there was a `Server.close_clients()` or something like that. Even a `Server.all_transports()` would be useful, as then you could do something similar to calling `Task.cancel()` on what you get from `loop.all_tasks()`.

We could poke at `Server._transports`, but that is something internal that might change in the future.

There is `Server.wait_closed()`, but that hangs until all clients have gracefully disconnected. It doesn't help us when we want to shut down the service now.

CPython versions tested on:
3.12
Operating systems tested on:
Linux
Linked PRs