
multiprocessing.Pool gets stuck indefinitely when the child process is killed manually #96062

Open
FourthC opened this issue Aug 18, 2022 · 11 comments
Labels: topic-multiprocessing, type-bug (an unexpected behavior, bug, or error)

Comments

@FourthC

FourthC commented Aug 18, 2022

Bug report

When I use multiprocessing.Pool with processes=1 to execute tasks and I manually kill the child process in the background, the pending task is never executed: the pool does start a new child process, but it appears to wait indefinitely and the pool cannot be terminated.
Here is the example I tested:

import logging
import multiprocessing
import platform
import time
from multiprocessing import Pool

multiprocessing.log_to_stderr().setLevel(logging.DEBUG)


def print_some(i):
    print("Current process name is %s" % multiprocessing.current_process())
    print("--------------"+str(i)+"--------------")
    return "return "+str(i)


def callback_func(n):
    print(n)


if __name__ == "__main__":
    print(platform.python_version())
    multiprocessing.set_start_method('fork')
    p = Pool(1)
    i = 0
    print(p._pool[0].pid)
    while i < 6:
        p.apply_async(print_some, (i, ), callback=callback_func)
        time.sleep(3)
        i = i+1
    print("end")
    print(p._pool[0].pid)
    p.terminate()
    print("close")

and the output is (I manually killed process 30995):

3.8.2
[DEBUG/MainProcess] created semlock with handle 6
[DEBUG/MainProcess] created semlock with handle 7
[DEBUG/MainProcess] created semlock with handle 10
[DEBUG/MainProcess] created semlock with handle 11
[DEBUG/MainProcess] created semlock with handle 14
[DEBUG/MainProcess] created semlock with handle 15
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-1] child process calling self.run()
30995
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=30994 started daemon>
--------------0--------------
return 0
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=30994 started daemon>
--------------1--------------
return 1
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=30994 started daemon>
--------------2--------------
return 2
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=30994 started daemon>
--------------3--------------
return 3
[DEBUG/MainProcess] cleaning up worker 0
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-2] child process calling self.run()
[DEBUG/MainProcess] terminating pool
[DEBUG/MainProcess] finalizing pool
[DEBUG/MainProcess] helping task handler/workers to finish
[DEBUG/MainProcess] removing tasks from inqueue until task handler finished
[DEBUG/MainProcess] worker handler exiting
[DEBUG/MainProcess] task handler got sentinel
[DEBUG/MainProcess] task handler sending sentinel to result handler
[DEBUG/MainProcess] task handler sending sentinel to workers
[DEBUG/MainProcess] task handler exiting
[DEBUG/MainProcess] result handler got sentinel
end
31000

From the output, when I kill the child process, multiprocessing.Pool does start a new process, but the task does not continue, and terminate() seems to be stuck somewhere, because my main process never finishes and just keeps waiting.
While a service is running, a child process can crash at any unpredictable moment, so I ran this test to see what effect a crashed child process has on a program that uses multiprocessing, and I found this problem.
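A quick way to see where the main process is blocked, without having to press Ctrl+C, is to arm faulthandler at the top of the script (a minimal sketch; the 30-second delay is arbitrary):

import faulthandler

# Dump the traceback of every thread after 30 seconds if the program is still
# running, without killing it, so we can see where terminate() is stuck.
faulthandler.dump_traceback_later(30, exit=False)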

Your environment

  • CPython versions tested on: Python 3.8.2
  • Operating system and architecture: macOS 10.15.7 or Ubuntu 16.04
@sharewax

python3.11rc1:

# python3.11 test.py
3.11.0rc1
[DEBUG/MainProcess] created semlock with handle 140268024721408
[DEBUG/MainProcess] created semlock with handle 140268024717312
[DEBUG/MainProcess] created semlock with handle 140268024713216
[DEBUG/MainProcess] created semlock with handle 140267892850688
[DEBUG/MainProcess] created semlock with handle 140267892846592
[DEBUG/MainProcess] created semlock with handle 140267892842496
[DEBUG/MainProcess] added worker
29291
[INFO/ForkPoolWorker-1] child process calling self.run()
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=29290 started daemon>
--------------0--------------
return 0
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=29290 started daemon>
--------------1--------------
return 1
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=29290 started daemon>
--------------2--------------
return 2
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=29290 started daemon>
--------------3--------------
return 3
Current process name is <ForkProcess name='ForkPoolWorker-1' parent=29290 started daemon>
--------------4--------------
return 4
[DEBUG/MainProcess] cleaning up worker 0
[DEBUG/MainProcess] added worker
[INFO/ForkPoolWorker-2] child process calling self.run()
end
29978
[DEBUG/MainProcess] terminating pool
[DEBUG/MainProcess] finalizing pool
[DEBUG/MainProcess] helping task handler/workers to finish
[DEBUG/MainProcess] removing tasks from inqueue until task handler finished
[DEBUG/MainProcess] worker handler exiting
[DEBUG/MainProcess] task handler got sentinel
[DEBUG/MainProcess] task handler sending sentinel to result handler
[DEBUG/MainProcess] task handler sending sentinel to workers
[DEBUG/MainProcess] result handler got sentinel
[DEBUG/MainProcess] task handler exiting


^CTraceback (most recent call last):
[INFO/ForkPoolWorker-2] process shutting down
  File "/root/test.py", line 32, in <module>
[DEBUG/ForkPoolWorker-2] running all "atexit" finalizers with priority >= 0
[DEBUG/ForkPoolWorker-2] running the remaining "atexit" finalizers
Process ForkPoolWorker-2:
    p.terminate()
  File "/opt/rh/rh-python311/root/usr/lib64/python3.11/multiprocessing/pool.py", line 657, in terminate
    self._terminate()
  File "/opt/rh/rh-python311/root/usr/lib64/python3.11/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rh/rh-python311/root/usr/lib64/python3.11/multiprocessing/pool.py", line 695, in _terminate_pool
    cls._help_stuff_finish(inqueue, task_handler, len(pool))
  File "/opt/rh/rh-python311/root/usr/lib64/python3.11/multiprocessing/pool.py", line 675, in _help_stuff_finish
    inqueue._rlock.acquire()
KeyboardInterrupt
[INFO/MainProcess] process shutting down
[DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[INFO/MainProcess] calling terminate() for daemon ForkPoolWorker-2
[INFO/MainProcess] calling join() for process ForkPoolWorker-2
[DEBUG/MainProcess] running the remaining "atexit" finalizers

Manually killed 29291, then pressed Ctrl+C after a long wait.

@FourthC

FourthC commented Aug 18, 2022

(quoting the python3.11rc1 output from the comment above)

Ctrl+C does close the main process, but that doesn't solve the problem: the tasks that need to run are still never executed.

@FourthC

FourthC commented Aug 18, 2022

Process crashes are unpredictable, and the example above shows that if a child process crashes while I am using multiprocessing, the problem affects the main process, which is probably unacceptable.

@Raichuu41

I am not sure if this is related, but I had something similar in the past with GitHub Actions and multiprocessing. Sometimes the workflow got stuck and I had to cancel it manually. It only happened on specific errors where a worker process failed, and it did not happen locally on Windows, only on the Linux server running the workflow.
I was able to resolve the issue by specifying get_context('spawn') before creating the Pool instance.

So in other words, when you had:

with multiprocessing.Pool(processes=os.cpu_count()) as pool:
    pool.map(...)  # do stuff here

To fix it, I had to change it to:

with multiprocessing.get_context('spawn').Pool(processes=os.cpu_count()) as pool:
    pool.map(...)  # do stuff here

As far as I understand, the reason is that Python uses a different default start method on Linux (fork) than on Windows (spawn). The fork start method has some behavior I don't fully understand personally, but it makes this issue happen only on Linux, and it can be avoided by specifying spawn with this small addition. Maybe this can help you or someone else who finds this open issue via Google like I did.
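You can check which start method is in effect before creating the pool (a tiny sketch; 'fork' is the Linux default, while Windows and macOS on Python 3.8+ default to 'spawn'):

import multiprocessing

if __name__ == "__main__":
    # prints 'fork' on Linux by default, 'spawn' on Windows and recent macOS
    print(multiprocessing.get_start_method())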

@Ethan-yt

same here:

import multiprocessing
import time


def foo(i):
    time.sleep(1)
    return i


if __name__ == '__main__':
    # create a pool with a single worker
    pool = multiprocessing.get_context("spawn").Pool(1)
    # wait a moment for the worker to start
    time.sleep(3)
    print(pool._pool)
    # kill the idle worker
    pool._pool[0].kill()
    time.sleep(3)
    print(pool._pool)
    # the pool has replaced the worker, but the task never runs and get() times out
    print(pool.apply_async(foo, (1,)).get(3))

and the output is:
[<SpawnProcess name='SpawnPoolWorker-1' pid=59245 parent=59243 started daemon>] 
[<SpawnProcess name='SpawnPoolWorker-2' pid=59248 parent=59243 started daemon>]
Traceback (most recent call last):
  File "tests/test_pool.py", line 62, in <module>
    print(pool.apply_async(foo, (1,)).get(3))
  File "/Users/ethan/miniconda/lib/python3.8/multiprocessing/pool.py", line 767, in get
    raise TimeoutError
multiprocessing.context.TimeoutError

@Ethan-yt

Ethan-yt commented Jan 12, 2023

As mentioned here https://docs.python.org/3/library/multiprocessing.html

Warning: If a process is killed using Process.terminate() or os.kill() while it is trying to use a Queue, then the data in the queue is likely to become corrupted. This may cause any other process to get an exception when it tries to use the queue later on.

I guess it is because, before the process gets killed, it is waiting for a task in the queue, which means it has acquired the lock. Once it gets killed, it never releases the lock, so the other processes can't get tasks from the queue: the queue is corrupted.
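One way to check this guess (it pokes at Pool's private _inqueue/_rlock attributes, so treat it purely as a debugging sketch): after killing the idle worker, a non-blocking acquire of the task queue's read lock from the main process fails, which means the dead worker still owns it.

import multiprocessing
import time


if __name__ == "__main__":
    pool = multiprocessing.get_context("spawn").Pool(1)
    time.sleep(3)                 # let the worker start and block waiting for a task
    pool._pool[0].kill()          # SIGKILL the idle worker
    time.sleep(3)                 # the pool spawns a replacement worker in the meantime
    # The dead worker never released the inqueue read lock, so this should print False.
    got_it = pool._inqueue._rlock.acquire(block=False)
    print("main process could acquire the inqueue read lock:", got_it)
    # Unstick the lock so terminate() can finish (same trick as the workaround below);
    # release() works even though another process "owns" the lock, because
    # multiprocessing.Lock does not enforce ownership.
    pool._inqueue._rlock.release()
    pool.terminate()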

@Ethan-yt

import multiprocessing
import time


def foo(i):
    time.sleep(1)
    return i


if __name__ == '__main__':
    pool = multiprocessing.get_context("spawn").Pool(1)
    print(pool.apply_async(foo, (1,)).get())
    print(pool._pool)
    # when the worker receives SIGKILL it dies immediately and never releases the lock
    pool._pool[0].kill()
    time.sleep(3)
    # release the lock that is still held by the dead process
    pool._inqueue._rlock.release()
    print(pool._pool)
    print(pool.apply_async(foo, (1,)).get())

just add this: pool._inqueue._rlock.release()

Note that this only helps when the worker was killed while waiting for a task; if it was killed in the middle of reading from the queue, the queue data itself may be corrupted, as the warning above says.

@FourthC

FourthC commented Feb 3, 2023

(quoting the docs warning and explanation from the comment above)

If I increase the worker count above from 1 to 2, the error seems to disappear.

My thoughts are the same as yours. If the kill -9 happens to hit the worker in the pool that currently holds the lock, killing it does not release the lock, and the whole pool stops working. This can also happen when the worker count is increased to 2; I have tested it.

@Ethan-yt

Ethan-yt commented Feb 3, 2023

(quoting the exchange above)

So I think a good practice is to not kill a worker that is waiting for a task; it's safe to kill a worker that is busy processing a job.
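A rough way to see the difference (very timing dependent, so only a sketch): if the worker is killed while it is inside the task function, the queue lock is free, so the pool keeps working for later jobs, although the result of the killed job is simply never delivered.

import multiprocessing
import time


def slow(i):
    time.sleep(30)  # long enough that we can kill the worker mid-task
    return i


def fast(i):
    return i


if __name__ == "__main__":
    pool = multiprocessing.get_context("spawn").Pool(1)
    r1 = pool.apply_async(slow, (1,))
    time.sleep(5)                      # let the worker dequeue the task and start sleeping
    pool._pool[0].kill()               # kill the *busy* worker
    time.sleep(3)                      # the pool spawns a replacement worker
    r2 = pool.apply_async(fast, (2,))
    print(r2.get(timeout=10))          # the new job runs fine
    try:
        r1.get(timeout=1)              # but the killed job's result never arrives
    except multiprocessing.TimeoutError:
        print("result of the killed job is lost")
    pool.terminate()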

@FourthC

FourthC commented Feb 3, 2023

(quoting the exchange above)

Using process pools inside long-running services is risky, at least in my opinion. In the end, because I only needed to spawn a new process, I imitated the process pool code and removed the locking logic.

@wormsik

wormsik commented Feb 10, 2023

We experienced similar problems with .map_async when one of the processes was killed by the Linux OOM killer. Unfortunately, Pool does not propagate this failure to the AsyncResult (the way it propagates exceptions), so .get() stays frozen either forever or until the timeout expires.

If anyone is interested, here is our solution, which works around the issue by monitoring the pool for dead processes. It is definitely not ideal, but it should work for many use cases (wrapped in a helper function here so the return statements work):

import multiprocessing
import time
from datetime import datetime


def run_map(mapper, args, nprocs=2, timeout=36000):
    # `mapper` and `args` are whatever callable and iterable you want to map over
    pool = multiprocessing.Pool(nprocs)
    procs = set(pool._pool)
    try:
        start = datetime.now()
        promise = pool.map_async(mapper, args)

        while True:
            procs.update(pool._pool)
            if promise.ready():
                results = promise.get()
                pool.close()
                pool.join()
                return results

            if any(map(lambda p: not p.is_alive(), procs)):
                raise RuntimeError("Some worker process has exited!")

            if (datetime.now() - start).total_seconds() >= timeout:
                # when the timeout is over, force throwing the expected TimeoutError
                promise.get(1)

            time.sleep(5)
    finally:
        pool.terminate()
        pool.join()
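Usage would then look roughly like this (square is just a placeholder mapper):

def square(x):
    return x * x


if __name__ == "__main__":
    print(run_map(square, range(10)))  # [0, 1, 4, 9, ...]
    # If a worker is OOM-killed (or killed manually) while the map is running,
    # run_map raises RuntimeError instead of hanging forever.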
