
multiprocessing.Pool gets stuck indefinitely when the child process is killed manually #96062

Open
FourthC opened this issue Aug 18, 2022 · 11 comments
Labels: topic-multiprocessing, type-bug (an unexpected behavior, bug, or error)

Comments

@FourthC

FourthC commented Aug 18, 2022

Bug report

When I use multiprocessing.Pool with processes=1 to execute tasks and I manually kill the child process in the background, the pending task is never executed: the pool does start a new child process, but it appears to wait indefinitely and the pool cannot be terminated.
Here is the example I tested:

import logging
import multiprocessing
import platform
import time
from multiprocessing import Pool

multiprocessing.log_to_stderr().setLevel(logging.DEBUG)


def print_some(i):
    print("Current process name is %s" % multiprocessing.current_process())
    print("--------------"+str(i)+"--------------")
    return "return "+str(i)


def callback_func(n):
    print(n)


if __name__ == "__main__":
    print(platform.python_version())
    multiprocessing.set_start_method('fork')
    p = Pool(1)
    i = 0
    print(p._pool[0].pid)
    while i < 6:
        p.apply_async(print_some, (i, ), callback=callback_func)
        time.sleep(3)
        i = i+1
    print("end")
    print(p._pool[0].pid)
    p.terminate()
    print("close")

and the output is (I manually killed process 30995):

3.8.2
[DEBUG/MainProcess] created semlock with handle 6
[DEBUG/MainProcess] created semlock with handle 7
[DEBUG/MainProcess] created semlock with handle 10
[DEBUG/MainProcess] created semlock with handle 11
[DEBUG/MainProcess] created semlock with handle 14
[DEBUG/MainProcess] created semlock with handle 15
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-1] child process calling self.run()
30995
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=30994 started daemon>
--------------0--------------
return 0
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=30994 started daemon>
--------------1--------------
return 1
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=30994 started daemon>
--------------2--------------
return 2
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=30994 started daemon>
--------------3--------------
return 3
[DEBUG/MainProcess] cleaning up worker 0
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-2] child process calling self.run()
[DEBUG/MainProcess] terminating pool
[DEBUG/MainProcess] finalizing pool
[DEBUG/MainProcess] helping task handler/workers to finish
[DEBUG/MainProcess] removing tasks from inqueue until task handler finished
[DEBUG/MainProcess] worker handler exiting
[DEBUG/MainProcess] task handler got sentinel
[DEBUG/MainProcess] task handler sending sentinel to result handler
[DEBUG/MainProcess] task handler sending sentinel to workers
[DEBUG/MainProcess] task handler exiting
[DEBUG/MainProcess] result handler got sentinel
end
31000

From the output, when I kill the child process, multiprocessing.Pool does start a new process, but the task does not continue, and terminate() seems to be stuck somewhere, because my main process never finishes and just keeps waiting.
While a service is running, a child process can crash at any unpredictable moment, so I ran this test to see what effect a crashed child process has on a program that uses multiprocessing, and I found this problem.
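A quick way to see where the main process is blocked, without having to press Ctrl+C, is to arm faulthandler at the top of the script (a minimal sketch; the 30-second delay is arbitrary):

import faulthandler

# Dump the traceback of every thread after 30 seconds if the program is still
# running, without killing it, so we can see where terminate() is stuck.
faulthandler.dump_traceback_later(30, exit=False)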

Your environment

  • CPython versions tested on: Python 3.8.2
  • Operating system and architecture: macOS 10.15.7 or Ubuntu 16.04
@sharewax

python3.11rc1:

# python3.11 test.py
3.11.0rc1
[DEBUG/MainProcess] created semlock with handle 140268024721408
[DEBUG/MainProcess] created semlock with handle 140268024717312
[DEBUG/MainProcess] created semlock with handle 140268024713216
[DEBUG/MainProcess] created semlock with handle 140267892850688
[DEBUG/MainProcess] created semlock with handle 140267892846592
[DEBUG/MainProcess] created semlock with handle 140267892842496
[DEBUG/MainProcess] added worker
29291
[INFO/ForkPoolWorker-1] child process calling self.run()
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=29290 started daemon>
--------------0--------------
return 0
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=29290 started daemon>
--------------1--------------
return 1
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=29290 started daemon>
--------------2--------------
return 2
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=29290 started daemon>
--------------3--------------
return 3
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=29290 started daemon>
--------------4--------------
return 4
[DEBUG/MainProcess] cleaning up worker 0
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-2] child process calling self.run()
end
29978
[DEBUG/MainProcess] terminating pool
[DEBUG/MainProcess] finalizing pool
[DEBUG/MainProcess] helping task handler/workers to finish
[DEBUG/MainProcess] removing tasks from inqueue until task handler finished
[DEBUG/MainProcess] worker handler exiting
[DEBUG/MainProcess] task handler got sentinel
[DEBUG/MainProcess] task handler sending sentinel to result handler
[DEBUG/MainProcess] task handler sending sentinel to workers
[DEBUG/MainProcess] result handler got sentinel
[DEBUG/MainProcess] task handler exiting


^CTraceback (most recent call last):
[INFO/ForkPoolWorker-2] process shutting down
  File "/root/test.py", line 32, in <module>
[DEBUG/ForkPoolWorker-2] running all "atexit" finalizers with priority >= 0
[DEBUG/ForkPoolWorker-2] running the remaining "atexit" finalizers
Process ForkPoolWorker-2:
    p.terminate()
  File "/opt/rh/rh-python311/root/usr/lib64/python3.11/multiprocessing/pool.py", line 657, in terminate
    self._terminate()
  File "/opt/rh/rh-python311/root/usr/lib64/python3.11/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rh/rh-python311/root/usr/lib64/python3.11/multiprocessing/pool.py", line 695, in _terminate_pool
    cls._help_stuff_finish(inqueue, task_handler, len(pool))
  File "/opt/rh/rh-python311/root/usr/lib64/python3.11/multiprocessing/pool.py", line 675, in _help_stuff_finish
    inqueue._rlock.acquire()
KeyboardInterrupt
[INFO/MainProcess] process shutting down
[DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[INFO/MainProcess] calling terminate() for daemon ForkPoolWorker-2
[INFO/MainProcess] calling join() for process ForkPoolWorker-2
[DEBUG/MainProcess] running the remaining "atexit" finalizers

Manually killed 29291, then pressed Ctrl+C after a long wait.

@FourthC

FourthC commented Aug 18, 2022

(quoting the python3.11rc1 output from the comment above)

Ctrl+C does close the main process, but that doesn't solve the problem: the tasks that need to run are still never executed.

@FourthC

FourthC commented Aug 18, 2022

Process crashes are unpredictable, and the example above shows that if a child process crashes while I am using multiprocessing, the problem affects the main process, which is probably unacceptable.

@Raichuu41

I am not sure if this is related, but I had something similar in the past with GitHub Actions and multiprocessing. Sometimes the workflow got stuck and I had to cancel it manually. It only happened on specific errors where a worker process failed, and it did not happen locally on Windows, only on the Linux server running the workflow.
I was able to resolve the issue by specifying get_context('spawn') before creating the Pool instance.

So in other words, when you had:

with multiprocessing.Pool(processes=os.cpu_count()) as pool:
    pool.map(...)  # do stuff here

To fix it, I had to change it to:

with multiprocessing.get_context('spawn').Pool(processes=os.cpu_count()) as pool:
    pool.map(...)  # do stuff here

As far as I understand, the reason is that Python uses a different default start method on Linux (fork) than on Windows (spawn). The fork start method has some behavior I don't fully understand personally, but it makes this issue happen only on Linux, and it can be avoided by specifying spawn with this small addition. Maybe this can help you or someone else who finds this open issue via Google like I did.
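You can check which start method is in effect before creating the pool (a tiny sketch; 'fork' is the Linux default, while Windows and macOS on Python 3.8+ default to 'spawn'):

import multiprocessing

if __name__ == "__main__":
    # prints 'fork' on Linux by default, 'spawn' on Windows and recent macOS
    print(multiprocessing.get_start_method())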

@Ethan-yt

same here:

import multiprocessing
import time


def foo(i):
    time.sleep(1)
    return i


if __name__ == '__main__':
    # create a pool with a single worker
    pool = multiprocessing.get_context("spawn").Pool(1)
    # wait a moment for the worker to start
    time.sleep(3)
    print(pool._pool)
    # kill the idle worker
    pool._pool[0].kill()
    time.sleep(3)
    print(pool._pool)
    # the pool has replaced the worker, but the task never runs and get() times out
    print(pool.apply_async(foo, (1,)).get(3))

and the output is:
[<SpawnProcess name='SpawnPoolWorker-1' pid=59245 parent=59243 started daemon>] 
[<SpawnProcess name='SpawnPoolWorker-2' pid=59248 parent=59243 started daemon>]
Traceback (most recent call last):
  File "tests/test_pool.py", line 62, in <module>
    print(pool.apply_async(foo, (1,)).get(3))
  File "/Users/ethan/miniconda/lib/python3.8/multiprocessing/pool.py", line 767, in get
    raise TimeoutError
multiprocessing.context.TimeoutError

@Ethan-yt

Ethan-yt commented Jan 12, 2023

As mentioned here https://docs.python.org/3/library/multiprocessing.html

Warning: If a process is killed using Process.terminate() or os.kill() while it is trying to use a Queue, then the data in the queue is likely to become corrupted. This may cause any other process to get an exception when it tries to use the queue later on.

I guess it is because, before the process gets killed, it is waiting for a task in the queue, which means it has acquired the lock. Once it gets killed, it never releases the lock, so the other processes can't get tasks from the queue: the queue is corrupted.
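One way to check this guess (it pokes at Pool's private _inqueue/_rlock attributes, so treat it purely as a debugging sketch): after killing the idle worker, a non-blocking acquire of the task queue's read lock from the main process fails, which means the dead worker still owns it.

import multiprocessing
import time


if __name__ == "__main__":
    pool = multiprocessing.get_context("spawn").Pool(1)
    time.sleep(3)                 # let the worker start and block waiting for a task
    pool._pool[0].kill()          # SIGKILL the idle worker
    time.sleep(3)                 # the pool spawns a replacement worker in the meantime
    # The dead worker never released the inqueue read lock, so this should print False.
    got_it = pool._inqueue._rlock.acquire(block=False)
    print("main process could acquire the inqueue read lock:", got_it)
    # Unstick the lock so terminate() can finish (same trick as the workaround below);
    # release() works even though another process "owns" the lock, because
    # multiprocessing.Lock does not enforce ownership.
    pool._inqueue._rlock.release()
    pool.terminate()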

@Ethan-yt

import multiprocessing
import time


def foo(i):
    time.sleep(1)
    return i


if __name__ == '__main__':
    pool = multiprocessing.get_context("spawn").Pool(1)
    print(pool.apply_async(foo, (1,)).get())
    print(pool._pool)
    # when the worker receives SIGKILL it dies immediately and never releases the lock
    pool._pool[0].kill()
    time.sleep(3)
    # release the lock that is still held by the dead process
    pool._inqueue._rlock.release()
    print(pool._pool)
    print(pool.apply_async(foo, (1,)).get())

just add this: pool._inqueue._rlock.release()

Note that this only helps when the worker was killed while waiting for a task; if it was killed in the middle of reading from the queue, the queue data itself may be corrupted, as the warning above says.

@FourthC

FourthC commented Feb 3, 2023

(quoting the docs warning and explanation from the comment above)

If I increase the worker count above from 1 to 2, the error seems to disappear.

My thoughts are the same as yours. If the kill -9 happens to hit the worker in the pool that currently holds the lock, killing it does not release the lock, and the whole pool stops working. This can also happen when the worker count is increased to 2; I have tested it.

@Ethan-yt

Ethan-yt commented Feb 3, 2023

(quoting the exchange above)

So I think a good practice is to not kill a worker that is waiting for a task; it's safe to kill a worker that is busy processing a job.
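A rough way to see the difference (very timing dependent, so only a sketch): if the worker is killed while it is inside the task function, the queue lock is free, so the pool keeps working for later jobs, although the result of the killed job is simply never delivered.

import multiprocessing
import time


def slow(i):
    time.sleep(30)  # long enough that we can kill the worker mid-task
    return i


def fast(i):
    return i


if __name__ == "__main__":
    pool = multiprocessing.get_context("spawn").Pool(1)
    r1 = pool.apply_async(slow, (1,))
    time.sleep(5)                      # let the worker dequeue the task and start sleeping
    pool._pool[0].kill()               # kill the *busy* worker
    time.sleep(3)                      # the pool spawns a replacement worker
    r2 = pool.apply_async(fast, (2,))
    print(r2.get(timeout=10))          # the new job runs fine
    try:
        r1.get(timeout=1)              # but the killed job's result never arrives
    except multiprocessing.TimeoutError:
        print("result of the killed job is lost")
    pool.terminate()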

@FourthC

FourthC commented Feb 3, 2023

(quoting the exchange above)

Using process pools inside long-running services is risky, at least in my opinion. In the end, because I only needed to spawn a new process, I imitated the process pool code and removed the locking logic.

@wormsik

wormsik commented Feb 10, 2023

We experienced similar problems with .map_async when one of the processes was killed by the Linux OOM killer. Unfortunately, Pool does not propagate this failure to the AsyncResult (the way it propagates exceptions), so .get() stays frozen either forever or until the timeout expires.

If anyone is interested, here is our solution, which works around the issue by monitoring the pool for dead processes. It is definitely not ideal, but it should work for many use cases (wrapped in a helper function here so the return statements work):

import multiprocessing
import time
from datetime import datetime


def run_map(mapper, args, nprocs=2, timeout=36000):
    # `mapper` and `args` are whatever callable and iterable you want to map over
    pool = multiprocessing.Pool(nprocs)
    procs = set(pool._pool)
    try:
        start = datetime.now()
        promise = pool.map_async(mapper, args)

        while True:
            procs.update(pool._pool)
            if promise.ready():
                results = promise.get()
                pool.close()
                pool.join()
                return results

            if any(map(lambda p: not p.is_alive(), procs)):
                raise RuntimeError("Some worker process has exited!")

            if (datetime.now() - start).total_seconds() >= timeout:
                # when the timeout is over, force throwing the expected TimeoutError
                promise.get(1)

            time.sleep(5)
    finally:
        pool.terminate()
        pool.join()
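Usage would then look roughly like this (square is just a placeholder mapper):

def square(x):
    return x * x


if __name__ == "__main__":
    print(run_map(square, range(10)))  # [0, 1, 4, 9, ...]
    # If a worker is OOM-killed (or killed manually) while the map is running,
    # run_map raises RuntimeError instead of hanging forever.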
