Repro:
This was found in an LLVM test; the easiest way to repro (not the most minimal, though) is:
git clone git@github.com:llvm/llvm-project.git
while true ; do llvm/utils/lit/lit.py -j1 --order=lexical --max-failures=1 llvm/utils/lit/tests/Inputs/max-failures ; done
This sets up a 1-worker pool (-j1). The debug output looks like this:
-- Testing: 3 tests, 1 workers --
created semlock with handle 140702642597888
created semlock with handle 140702642593792
created semlock with handle 140702628683776
created semlock with handle 140702628679680
created semlock with handle 140702628675584
created semlock with handle 140702628671488
added worker
closing pool
FAIL: max-failures :: fail1.txt (1 of 3)
terminating pool
finalizing pool
helping task handler/workers to finish
removing tasks from inqueue until task handler finished
result handler found thread._state=TERMINATE
ensuring that outqueue is not full
joining worker handler
worker handler exiting
result handler exiting: len(cache)=%s, thread._state=%s 2 TERMINATE
terminating workers
task handler got sentinel
joining task handler
task handler sending sentinel to result handler
...and now it hangs
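(Side note: those debug lines come from multiprocessing's internal util logger. I don't know exactly how lit enables them, but in a standalone script roughly the same output can be surfaced with the standard logging helper - a sketch, independent of the lit repro:)
# Sketch: surface multiprocessing's internal debug messages
# ("created semlock with handle ...", "terminating pool", ...) on stderr.
import logging
import multiprocessing

multiprocessing.log_to_stderr(logging.DEBUG)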
After attaching with gdb:
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 376, in put
    with self._wlock:
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 562, in _handle_tasks
    outqueue.put(None)
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
Upon looking further, it looks like multiprocessing.Pool's terminate attempts to terminate (as in Popen.terminate) the worker processes too early. If a worker was inside outqueue's critical section (i.e. it had taken _wlock) when it was terminated, that would explain the hang.
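To illustrate the suspected hazard in isolation (a minimal sketch, not the pool code itself; hold_lock and the 60s sleep are just stand-ins for a worker caught inside outqueue.put): terminating a process while it holds a multiprocessing lock leaves that lock acquired forever, so any later acquire blocks or times out.
# Minimal sketch of the hazard: terminating a process that holds a
# multiprocessing lock leaves the lock permanently acquired.
import multiprocessing as mp
import time

def hold_lock(lock, acquired):
    with lock:
        acquired.set()    # tell the parent the lock is now held
        time.sleep(60)    # stand-in for being mid-way through outqueue.put()

if __name__ == "__main__":
    lock = mp.Lock()
    acquired = mp.Event()
    p = mp.Process(target=hold_lock, args=(lock, acquired))
    p.start()
    acquired.wait()       # child is inside the critical section
    p.terminate()         # terminate while the lock is still held
    p.join()
    print("acquired:", lock.acquire(timeout=5))   # False: the lock died locked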
I tried calling terminate on the pool processes only after joining the task_handler and result_handler, i.e.:
711,716d710
<         # Terminate workers which haven't already finished.
<         if pool and hasattr(pool[0], 'terminate'):
<             util.debug('terminating workers')
<             for p in pool:
<                 if p.exitcode is None:
<                     p.terminate()
724a719,725
>
>         # Terminate workers which haven't already finished.
>         if pool and hasattr(pool[0], 'terminate'):
>             util.debug('terminating workers')
>             for p in pool:
>                 if p.exitcode is None:
>                     p.terminate()
That appears to address the issue - the repro loop now runs indefinitely without hanging.
The code in Pool.terminate seems to have been around for a while, so maybe I am missing something? If not, happy to submit a patch!
Environment:
Debian 5.10.113-1, Python 3.7 (can repro on Python 3.10, too)