multiprocessing.Pool terminate shuts down workers too early #96995

Open
mtrofin opened this issue Sep 21, 2022 · 0 comments
Labels: expert-multiprocessing, type-bug

mtrofin commented Sep 21, 2022

Environment:
Debian 5.10.113-1, Python 3.7 (also reproduces on Python 3.10)

Repro:
This was found in an LLVM test; the easiest way to reproduce (though not the most minimal) is:

git clone git@github.com:llvm/llvm-project.git

while true ; do llvm/utils/lit/lit.py -j1 --order=lexical --max-failures=1 llvm/utils/lit/tests/Inputs/max-failures ; done

This sets up a 1-worker pool (-j1). The debug output looks like this:

-- Testing: 3 tests, 1 workers --
created semlock with handle 140702642597888
created semlock with handle 140702642593792
created semlock with handle 140702628683776
created semlock with handle 140702628679680
created semlock with handle 140702628675584
created semlock with handle 140702628671488
added worker
closing pool
FAIL: max-failures :: fail1.txt (1 of 3)
terminating pool
finalizing pool
helping task handler/workers to finish
removing tasks from inqueue until task handler finished
result handler found thread._state=TERMINATE
ensuring that outqueue is not full
joining worker handler
worker handler exiting
result handler exiting: len(cache)=%s, thread._state=%s 2 TERMINATE
terminating workers
task handler got sentinel
joining task handler
task handler sending sentinel to result handler

...and now it hangs

After attaching with gdb:

(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 376, in put
    with self._wlock:
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 562, in _handle_tasks
    outqueue.put(None)
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()

Upon looking further: it appears that multiprocessing.Pool's terminate terminates (as in Popen.terminate) the worker processes too early. If a worker is killed while inside the outqueue's critical section (i.e. while it holds _wlock), the lock is never released, and the task handler then blocks forever in outqueue.put(None), which matches the traceback above.
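
This is hard to reproduce deterministically outside of lit, but for illustration, here is a minimal sketch of the same pattern: a 1-worker pool that is terminated while results are still in flight. (The work function, counts, and sleep here are hypothetical, and it may take many iterations to hit the race window.)

import multiprocessing
import time

def work(i):
    # Each result is sent back over the pool's outqueue; on POSIX the
    # worker takes the queue's write lock (_wlock) while sending it.
    return i

if __name__ == "__main__":
    while True:
        pool = multiprocessing.Pool(1)    # 1-worker pool, like lit -j1
        pool.map_async(work, range(1000))
        time.sleep(0.001)                 # let the worker start returning results
        pool.terminate()                  # can hang in _terminate_pool if the worker
                                          # was killed while holding _wlock
        pool.join()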

I tried calling terminate on the pool processes only after joining the task_handler and result_handler, i.e.:

711,716d710
<         # Terminate workers which haven't already finished.
<         if pool and hasattr(pool[0], 'terminate'):
<             util.debug('terminating workers')
<             for p in pool:
<                 if p.exitcode is None:
<                     p.terminate()
724a719,725
>
>         # Terminate workers which haven't already finished.
>         if pool and hasattr(pool[0], 'terminate'):
>             util.debug('terminating workers')
>             for p in pool:
>                 if p.exitcode is None:
>                     p.terminate()

That appears to address the issue: the script now runs indefinitely.
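
For reference, with that change the tail of Pool._terminate_pool would read roughly as below (paraphrased from Python 3.10's Lib/multiprocessing/pool.py, not quoted verbatim): the worker-termination loop simply moves below the handler joins.

util.debug('joining worker handler')
if threading.current_thread() is not worker_handler:
    worker_handler.join()

util.debug('joining task handler')
if threading.current_thread() is not task_handler:
    task_handler.join()

util.debug('joining result handler')
if threading.current_thread() is not result_handler:
    result_handler.join()

# Only terminate workers once the handlers are done, so a worker cannot be
# killed while it still holds the outqueue's write lock.
if pool and hasattr(pool[0], 'terminate'):
    util.debug('terminating workers')
    for p in pool:
        if p.exitcode is None:
            p.terminate()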

The code in Pool.terminate seems to have been around for a while, so maybe I am missing something. If not, I'm happy to submit a patch!
