Repro:
This was found in an LLVM test; the easiest way to repro (not the most minimal, though) is:
git clone git@github.com:llvm/llvm-project.git
while true ; do llvm/utils/lit/lit.py -j1 --order=lexical --max-failures=1 llvm/utils/lit/tests/Inputs/max-failures ; done
This sets up a 1-worker pool (-j1). The debug output looks like this:
-- Testing: 3 tests, 1 workers --
created semlock with handle 140702642597888
created semlock with handle 140702642593792
created semlock with handle 140702628683776
created semlock with handle 140702628679680
created semlock with handle 140702628675584
created semlock with handle 140702628671488
added worker
closing pool
FAIL: max-failures :: fail1.txt (1 of 3)
terminating pool
finalizing pool
helping task handler/workers to finish
removing tasks from inqueue until task handler finished
result handler found thread._state=TERMINATE
ensuring that outqueue is not full
joining worker handler
worker handler exiting
result handler exiting: len(cache)=%s, thread._state=%s 2 TERMINATE
terminating workers
task handler got sentinel
joining task handler
task handler sending sentinel to result handler
...and now it hangs
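(Side note: those debug lines come from multiprocessing's internal util logger. I don't know exactly how lit enables them, but in a standalone script roughly the same output can be surfaced with the standard logging helper - a sketch, independent of the lit repro:)
# Sketch: surface multiprocessing's internal debug messages
# ("created semlock with handle ...", "terminating pool", ...) on stderr.
import logging
import multiprocessing

multiprocessing.log_to_stderr(logging.DEBUG)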
After attaching with gdb:
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 376, in put
    with self._wlock:
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 562, in _handle_tasks
    outqueue.put(None)
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
Upon looking further, it looks like multiprocessing.Pool's terminate attempts to terminate (as in Popen.terminate) the worker processes too early. If a worker was inside outqueue's critical section (i.e. it had taken _wlock) when it was terminated, that would explain the hang.
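To illustrate the suspected hazard in isolation (a minimal sketch, not the pool code itself; hold_lock and the 60s sleep are just stand-ins for a worker caught inside outqueue.put): terminating a process while it holds a multiprocessing lock leaves that lock acquired forever, so any later acquire blocks or times out.
# Minimal sketch of the hazard: terminating a process that holds a
# multiprocessing lock leaves the lock permanently acquired.
import multiprocessing as mp
import time

def hold_lock(lock, acquired):
    with lock:
        acquired.set()    # tell the parent the lock is now held
        time.sleep(60)    # stand-in for being mid-way through outqueue.put()

if __name__ == "__main__":
    lock = mp.Lock()
    acquired = mp.Event()
    p = mp.Process(target=hold_lock, args=(lock, acquired))
    p.start()
    acquired.wait()       # child is inside the critical section
    p.terminate()         # terminate while the lock is still held
    p.join()
    print("acquired:", lock.acquire(timeout=5))   # False: the lock died locked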
I tried calling terminate on the pool processes only after joining the task_handler and result_handler, i.e.:
711,716d710
<         # Terminate workers which haven't already finished.
<         if pool and hasattr(pool[0], 'terminate'):
<             util.debug('terminating workers')
<             for p in pool:
<                 if p.exitcode is None:
<                     p.terminate()
724a719,725
>
>         # Terminate workers which haven't already finished.
>         if pool and hasattr(pool[0], 'terminate'):
>             util.debug('terminating workers')
>             for p in pool:
>                 if p.exitcode is None:
>                     p.terminate()
That appears to address the issue - the repro loop now runs indefinitely without hanging.
The code in Pool.terminate seems to have been around for a while, so maybe I am missing something? If not, happy to submit a patch!
Environment:
Debian 5.10.113-1, Python 3.7 (can repro on Python 3.10, too)