multiprocessing.Pool gets stuck indefinitely when the child process is killed manually #96062
Comments
python3.11rc1:
Manually killed 29291, then pressed Ctrl+C after waiting a long time.
Ctrl+C closes the main process, but that does not solve the problem: the task that should run is never executed.
Process crashes are unpredictable, and the example above shows that if a child process crashes while I am using multiprocessing, the problem propagates to the main process, which is probably unacceptable.
I am not sure if this is related, but I ran into something similar in the past with GitHub Actions and multiprocessing. Sometimes the workflow got stuck and I had to cancel it manually. This only happened on specific errors where a worker process failed, and it did not happen locally on Windows, only on the Linux server running the workflow. In other words, when you had:

with multiprocessing.Pool(processes=os.cpu_count()) as pool:
    pool.map(...)  # do stuff here

to fix it, I had to change it to:

with multiprocessing.get_context('spawn').Pool(processes=os.cpu_count()) as pool:
    pool.map(...)  # do stuff here

As far as I understood, the reason is that Python uses a different default start method on Linux and on Windows. I don't personally understand the details of that start method, but it makes this issue appear only on Linux, and it can be avoided with this small addition. Maybe this helps you or someone else who finds this open issue via Google like I did.
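For reference, a minimal sketch of checking the platform's default start method and setting "spawn" process-wide instead of per pool (the work function is just a placeholder):

import multiprocessing
import os

def work(x):
    return x * x

if __name__ == "__main__":
    # On Linux the default start method is "fork"; Windows and macOS
    # default to "spawn", which is why the hang only showed up on Linux.
    print(multiprocessing.get_start_method())

    # Setting it once for the whole program is equivalent to using
    # get_context("spawn") for every Pool.
    multiprocessing.set_start_method("spawn", force=True)
    with multiprocessing.Pool(processes=os.cpu_count()) as pool:
        print(pool.map(work, range(4)))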
same here:

import multiprocessing
import time

# foo must live at module level so the "spawn" worker can import it
def foo(i):
    time.sleep(1)
    return i

if __name__ == '__main__':
    # create a pool with a single worker
    pool = multiprocessing.get_context("spawn").Pool(1)
    # wait a moment
    time.sleep(3)
    print(pool._pool)
    # kill the single worker process
    pool._pool[0].kill()
    time.sleep(3)
    print(pool._pool)
    # as reported above, the task never completes and get(3) times out
    print(pool.apply_async(foo, (1,)).get(3))
As mentioned here: https://docs.python.org/3/library/multiprocessing.html
I guess it is because, before the process gets killed, it is waiting for a task from the queue, which means it has acquired the lock. Once it is killed, it never releases the lock, so the other workers cannot get tasks from the queue and the queue is corrupted.
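To illustrate the mechanism being described, a minimal standalone sketch (not the pool's actual internals): a process killed while holding a shared lock never releases it, so every later acquirer blocks.

import multiprocessing
import time

def hold_lock(lock):
    # Stands in for a pool worker that has acquired the shared queue lock
    # and is then killed before releasing it.
    lock.acquire()
    time.sleep(60)

if __name__ == "__main__":
    lock = multiprocessing.Lock()
    p = multiprocessing.Process(target=hold_lock, args=(lock,))
    p.start()
    time.sleep(1)   # give the child time to acquire the lock
    p.kill()        # SIGKILL: the child never releases the lock
    p.join()
    # Anyone else trying to take the lock now blocks, just like the
    # remaining pool workers blocking on the corrupted task queue.
    print("acquired:", lock.acquire(timeout=3))  # prints "acquired: False"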
just add this:
My thoughts are the same as yours. If the kill -9 happens to hit the worker in the pool that currently holds the lock, killing it will not release the lock, and the entire process pool stops working. It can also happen when the number of workers is increased to 2; I have tested it.
So I think a good practice is: do not kill a worker that is waiting for a task. It is safe to kill a worker that is currently processing a job.
Using process pools for services is risky, at least in my opinion. In the end, since I only needed to start a new process, I imitated the process pool's code but removed the lock logic.
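A rough sketch of that kind of alternative, assuming one short-lived Process per job and a plain Queue for results (names are illustrative): there is no shared pool lock, so a crashed worker only loses its own task.

import multiprocessing

def job(task, results):
    # Each task runs in its own short-lived process; there is no shared
    # task-queue lock to corrupt if this process dies.
    results.put((task, task * task))

if __name__ == "__main__":
    results = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=job, args=(t, results)) for t in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
        if p.exitcode != 0:
            print(f"worker {p.pid} died with exit code {p.exitcode}")
    # Collect one result per worker that exited cleanly.
    for _ in [p for p in procs if p.exitcode == 0]:
        print(results.get())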
We experienced similar problems using multiprocessing.Pool. If anyone is interested, we provide our solution, which overcomes this issue by monitoring the pool for dead processes. It is definitely not ideal, but should work for many use cases:

import multiprocessing
import time
from datetime import datetime

nprocs = 2
timeout = 36000

# NOTE: the wrapper name and its mapper/args parameters are illustrative,
# added only to make the fragment self-contained (the original clearly
# came from inside a function, since it uses `return`).
def map_with_monitoring(mapper, args):
    pool = multiprocessing.Pool(nprocs)
    procs = set(pool._pool)
    try:
        start = datetime.now()
        promise = pool.map_async(mapper, args)
        while True:
            procs.update(pool._pool)
            if promise.ready():
                results = promise.get()
                pool.close()
                pool.join()
                return results
            if any(map(lambda p: not p.is_alive(), procs)):
                raise RuntimeError("Some worker process has exited!")
            if (datetime.now() - start).total_seconds() >= timeout:
                # Timeout exceeded: force the expected TimeoutError by
                # calling get() with a short timeout.
                promise.get(1)
            time.sleep(5)
    finally:
        pool.terminate()
        pool.join()
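Assuming the fragment above is wrapped in a function as reconstructed (map_with_monitoring is an illustrative name, not part of the original comment), usage would look roughly like:

def square(x):
    return x * x

if __name__ == "__main__":
    # Raises RuntimeError instead of hanging if a worker dies mid-run.
    print(map_with_monitoring(square, range(10)))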
Bug report
When I use multiprocessing.Pool with processes=1 to execute a task, if I manually kill the child process in the background, the task is never executed, and the new child process seems to wait indefinitely and cannot be terminated.
Here is the example I tested:
and the output is (I manually kill process 30995):
From the output, when I kill the child process, multiprocessing.Pool does start a new process, but the task cannot continue, and terminate() seems to be stuck somewhere, because my main process never finishes and keeps waiting.
While a service is running, process crashes are unpredictable, so I ran this test: when using multiple processes, what effect does a child process crash have on the program? That is how I found this problem.
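Since the original snippet and output are not shown above, here is only a rough sketch of the scenario described (the function name and timings are illustrative, and per the discussion in the comments, whether it hangs can depend on the exact moment the worker is killed):

import multiprocessing
import os
import time

def work(x):
    print(f"worker {os.getpid()} handling {x}")
    time.sleep(5)
    return x

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=1)
    async_result = pool.map_async(work, range(10))
    # While this runs, kill the single worker from another shell with
    # `kill -9 <pid>`. The pool starts a replacement worker, but the
    # remaining tasks are never executed and get() blocks forever, so
    # terminate()/join() are never reached.
    print(async_result.get())
    pool.terminate()
    pool.join()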
Your environment