[2.7] bpo-38106: Fix race in PyThread_release_lock that can lead to memory corruption and deadlock #16047
Conversation
the-knights-who-say-ni commented Sep 12, 2019
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA: For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. If you have recently signed the CLA, please wait at least one business day. You can check yourself to see if the CLA has been received. Thanks again for the contribution, we look forward to reviewing it!
I have signed the PSF CLA, but only 30 minutes ago.
I've amended the patch to make bedevere/news happy with the following:

```diff
--- /dev/null
+++ b/Misc/NEWS.d/next/Core and Builtins/2019-09-12-16-36-16.bpo-38106.4pApn7.rst
@@ -0,0 +1,2 @@
+Fix race in PyThread_release_lock that was leading to memory corruption and
+deadlocks.
```
I reported the bug (see 69db91b "libgolang: Add internal semaphores") to the CPython and PyPy upstreams. PyPy people have already fixed it and the fix, if I understood correctly, should be available as part of PyPy 7.2. Let's hope that CPython 2.7 will also be fixed.

- https://bugs.python.org/issue38106 -> python/cpython#16047
- https://bitbucket.org/pypy/pypy/issues/3072

The added Python-level test triggers the bug with high probability. A dedicated C-level Semaphore-only test is pending.
```diff
@@ -0,0 +1,2 @@
+Fix race in PyThread_release_lock that was leading to memory corruption and
+deadlocks.
```
vstinner (Member) commented Oct 2, 2019
Oh! I just created a very similar local change to prepare a pull request... and then I saw that you created a PR! Oops.
I wrote a different description for the change:
    bpo-38106: Fix a race condition in pthread locks

    Fix a race condition in locks of the pthread implementation using
    mutex with condition variable. Signal condition variables with the
    mutex held. Destroy condition variables before their mutexes.

Please mention which exact implementation is changed. We do have 2 implementations for pthread :-)
@vstinner thanks for the feedback and for noticing my PR. I agree it makes sense to clarify in NEWS which systems are affected. I've amended the patch with the following interdiff:

```diff
--- a/Misc/NEWS.d/next/Core and Builtins/2019-09-12-16-36-16.bpo-38106.4pApn7.rst
+++ b/Misc/NEWS.d/next/Core and Builtins/2019-09-12-16-36-16.bpo-38106.4pApn7.rst
@@ -1,2 +1,6 @@
 Fix race in PyThread_release_lock that was leading to memory corruption and
 deadlocks.
+
+The fix applies to POSIX systems where Python locks are implemented with mutex
+and condition variable because POSIX semaphores are either not provided, or are
+known to be broken. One particular example of such system is macOS.
```

Is it ok?
LGTM. I would prefer to have a review from a second core dev. @pablogsal, @serhiy-storchaka, @methane, @benjaminp: would you mind having a look at this fix for a lock race condition?
```diff
@@ -0,0 +1,6 @@
+Fix race in PyThread_release_lock that was leading to memory corruption and
+deadlocks.
```
vstinner (Member) commented Oct 2, 2019
I'm not sure how it is rendered. I suggest merging the sentences into a single paragraph.
navytux (Author) commented Oct 2, 2019
Ok, I've amended the patch with the following interdiff:

```diff
--- a/Misc/NEWS.d/next/Core and Builtins/2019-09-12-16-36-16.bpo-38106.4pApn7.rst
+++ b/Misc/NEWS.d/next/Core and Builtins/2019-09-12-16-36-16.bpo-38106.4pApn7.rst
@@ -1,6 +1,5 @@
 Fix race in PyThread_release_lock that was leading to memory corruption and
-deadlocks.
-
-The fix applies to POSIX systems where Python locks are implemented with mutex
-and condition variable because POSIX semaphores are either not provided, or are
-known to be broken. One particular example of such system is macOS.
+deadlocks. The fix applies to POSIX systems where Python locks are implemented
+with mutex and condition variable because POSIX semaphores are either not
+provided, or are known to be broken. One particular example of such system is
+macOS.
```
For reference, changelog entries are sometimes added as several paragraphs, and that renders e.g. like this (search for "bpo-33185: Fixed regression when running pydoc with the -m switch" there).
bpo-38106: Fix race in PyThread_release_lock that can lead to memory corruption and deadlock

On Darwin, even though this is considered as POSIX, Python uses
mutex+condition variable to implement its lock, and, as of 20190828, the Py2.7
implementation, even though a similar issue was fixed for Py3 in 2012, contains
a synchronization bug: the condition is signalled after mutex unlock while the
correct protocol is to signal the condition from under the mutex:

https://github.com/python/cpython/blob/v2.7.16-127-g0229b56d8c0/Python/thread_pthread.h#L486-L506
187aa54 (py3 fix)

PyPy has the same bug for both pypy2 and pypy3:

https://bitbucket.org/pypy/pypy/src/578667b3fef9/rpython/translator/c/src/thread_pthread.c#lines-443:465
https://bitbucket.org/pypy/pypy/src/5b42890d48c3/rpython/translator/c/src/thread_pthread.c#lines-443:465

Signalling a condition outside of the corresponding mutex is considered OK by
POSIX, but in the Python context it can lead to at least memory corruption if we
consider the whole lifetime of a python-level lock. For example the following
logical scenario:

    T1                                          T2

    sema = Lock()
    sema.acquire()

                                                sema.release()

    sema.acquire()
    free(sema)

can translate to the next C-level calls:

    T1                                          T2

    # sema = Lock()
    sema = malloc(...)
    sema.locked = 0
    pthread_mutex_init(&sema.mut)
    pthread_cond_init (&sema.lock_released)

    # sema.acquire()
    pthread_mutex_lock(&sema.mut)
    # sees sema.locked == 0
    sema.locked = 1
    pthread_mutex_unlock(&sema.mut)

                                                # sema.release()
                                                pthread_mutex_lock(&sema.mut)
                                                sema.locked = 0
                                                pthread_mutex_unlock(&sema.mut)

                                                # OS scheduler gets in and relinquishes control
                                                # from T2 to another process
                                                ...

    # second sema.acquire()
    pthread_mutex_lock(&sema.mut)
    # sees sema.locked == 0
    sema.locked = 1
    pthread_mutex_unlock(&sema.mut)

    # free(sema)
    pthread_mutex_destroy(&sema.mut)
    pthread_cond_destroy (&sema.lock_released)
    free(sema)

    # ... e.g. malloc() which returns memory where sema was
    ...
                                                # OS scheduler returns control to T2
                                                # sema.release() continues
                                                #
                                                # BUT sema was already freed and writing to anywhere
                                                # inside sema block CORRUPTS MEMORY. In particular if
                                                # _another_ python-level lock was allocated where sema
                                                # block was, writing into the memory can have effect on
                                                # further synchronization correctness and in particular
                                                # lead to deadlock on lock that was next allocated.
                                                pthread_cond_signal(&sema.lock_released)

Note that T2.pthread_cond_signal(&sema.lock_released) CORRUPTS MEMORY as it
is called when sema memory was already freed and is potentially
reallocated for another object.

The fix is to move pthread_cond_signal to be done under the corresponding mutex:

    # sema.release()
    pthread_mutex_lock(&sema.mut)
    sema.locked = 0
    pthread_cond_signal(&sema.lock_released)
    pthread_mutex_unlock(&sema.mut)

To do so this patch cherry-picks the thread_pthread.h part of the following 3.2 commit:

    commit 187aa54
    Author: Kristján Valur Jónsson <kristjan@ccpgames.com>
    Date:   Tue Jun 5 22:17:42 2012 +0000

        Signal condition variables with the mutex held. Destroy condition variables
        before their mutexes.

     Python/ceval_gil.h      |  9 +++++----
     Python/thread_pthread.h | 15 +++++++++------
     2 files changed, 14 insertions(+), 10 deletions(-)

(ceval_gil.h is Python3-specific and does not apply to Python2.7)

Bug history

The bug was there since 1994 - since at least [1]. It was discussed in 2001
with the original code author[2], but the code was still considered to be
race-free. In 2010 the place where pthread_cond_signal should be - before or
after pthread_mutex_unlock - was discussed with the rationale to avoid
threads bouncing[3,4,5], and in 2012 pthread_cond_signal was moved to be
called from under the mutex, but only for CPython3[6,7].

In 2019 the bug was (re-)discovered while testing Pygolang[8] on macOS with
CPython2 and PyPy2 and PyPy3.

[1] 2c8cb9f
[2] https://bugs.python.org/issue433625
[3] https://bugs.python.org/issue8299#msg103224
[4] https://bugs.python.org/issue8410#msg103313
[5] https://bugs.python.org/issue8411#msg113301
[6] https://bugs.python.org/issue15038#msg163187
[7] 187aa54
[8] https://pypi.org/project/pygolang
Thanks for the review.
Until now libgolang had semaphore and mutex functionality, but implemented only internally and not exposed to users. Since for its pinner thread wendelin.core v2 needs not only nogil channels, but also nogil mutexes, timers and so on, it is now time to start providing that. We start with mutexes:

- Expose Mutex from the insides of libgolang.cpp to the public API in libgolang.h. We actually expose both Mutex and Sema because semaphores are sometimes also useful for low-level synchronization, and because it is easier to export Mutex and Sema both at the same time.
- Provide pyx/nogil API to those as sync.Mutex and sync.Sema.
- Provide corresponding python wrappers.
- Add a Pyx-level test for semaphore wakeup when the wakeup is done not by the same thread which did the original acquire. This is the test that was promised in 5142460 (libgolang/thread: Add links to upstream PyThread_release_lock bug), and it used to corrupt memory and deadlock on macOS due to CPython & PyPy runtime bugs:
  https://bugs.python.org/issue38106 -> python/cpython#16047
  https://bitbucket.org/pypy/pypy/issues/3072

Note about python wrappers: at present, due to cython/cython#3165, a C-level panic is not properly translated into a Py-level exception in (Py)Sema/Mutex constructors. This should not pose a real problem in practice though.
For reference: a test for this issue is included in pygolang: https://lab.nexedi.com/kirr/pygolang/commit/34b7a1f4.
navytux commented Sep 12, 2019 (edited by bedevere-bot)
On Darwin, even though this is considered as POSIX, Python uses
mutex+condition variable to implement its lock, and, as of 20190828, the Py2.7
implementation, even though a similar issue was fixed for Py3 in 2012, contains
a synchronization bug: the condition is signalled after mutex unlock while the
correct protocol is to signal the condition from under the mutex:
https://github.com/python/cpython/blob/v2.7.16-127-g0229b56d8c0/Python/thread_pthread.h#L486-L506
187aa54 (py3 fix)
PyPy has the same bug for both pypy2 and pypy3:
https://bitbucket.org/pypy/pypy/src/578667b3fef9/rpython/translator/c/src/thread_pthread.c#lines-443:465
https://bitbucket.org/pypy/pypy/src/5b42890d48c3/rpython/translator/c/src/thread_pthread.c#lines-443:465
Signalling a condition outside of the corresponding mutex is considered OK by
POSIX, but in the Python context it can lead to at least memory corruption if we
consider the whole lifetime of a python-level lock. For example the following
logical scenario:

    T1                                          T2

    sema = Lock()
    sema.acquire()

                                                sema.release()

    sema.acquire()
    free(sema)

can translate to the next C-level calls:

    T1                                          T2

    # sema = Lock()
    sema = malloc(...)
    sema.locked = 0
    pthread_mutex_init(&sema.mut)
    pthread_cond_init (&sema.lock_released)

    # sema.acquire()
    pthread_mutex_lock(&sema.mut)
    # sees sema.locked == 0
    sema.locked = 1
    pthread_mutex_unlock(&sema.mut)

                                                # sema.release()
                                                pthread_mutex_lock(&sema.mut)
                                                sema.locked = 0
                                                pthread_mutex_unlock(&sema.mut)

                                                # OS scheduler gets in and relinquishes control
                                                # from T2 to another process
                                                ...

    # second sema.acquire()
    pthread_mutex_lock(&sema.mut)
    # sees sema.locked == 0
    sema.locked = 1
    pthread_mutex_unlock(&sema.mut)

    # free(sema)
    pthread_mutex_destroy(&sema.mut)
    pthread_cond_destroy (&sema.lock_released)
    free(sema)

    # ... e.g. malloc() which returns memory where sema was
    ...
                                                # OS scheduler returns control to T2
                                                # sema.release() continues
                                                #
                                                # BUT sema was already freed and writing to anywhere
                                                # inside sema block CORRUPTS MEMORY. In particular if
                                                # _another_ python-level lock was allocated where sema
                                                # block was, writing into the memory can have effect on
                                                # further synchronization correctness and in particular
                                                # lead to deadlock on lock that was next allocated.
                                                pthread_cond_signal(&sema.lock_released)
Note that T2.pthread_cond_signal(&sema.lock_released) CORRUPTS MEMORY as it
is called when sema memory was already freed and is potentially
reallocated for another object.
The fix is to move pthread_cond_signal to be done under the corresponding mutex:

    # sema.release()
    pthread_mutex_lock(&sema.mut)
    sema.locked = 0
    pthread_cond_signal(&sema.lock_released)
    pthread_mutex_unlock(&sema.mut)
To do so this patch cherry-picks the thread_pthread.h part of the following 3.2 commit:

    commit 187aa54
    Author: Kristján Valur Jónsson <kristjan@ccpgames.com>
    Date:   Tue Jun 5 22:17:42 2012 +0000

        Signal condition variables with the mutex held. Destroy condition variables
        before their mutexes.

     Python/ceval_gil.h      |  9 +++++----
     Python/thread_pthread.h | 15 +++++++++------
     2 files changed, 14 insertions(+), 10 deletions(-)

(ceval_gil.h is Python3-specific and does not apply to Python2.7)
Bug history
The bug was there since 1994 - since at least [1]. It was discussed in 2001
with the original code author[2], but the code was still considered to be
race-free. In 2010 the place where pthread_cond_signal should be - before or
after pthread_mutex_unlock - was discussed with the rationale to avoid
threads bouncing[3,4,5], and in 2012 pthread_cond_signal was moved to be
called from under the mutex, but only for CPython3[6,7].
In 2019 the bug was (re-)discovered while testing Pygolang[8] on macOS with
CPython2 and PyPy2 and PyPy3.
[1] 2c8cb9f
[2] https://bugs.python.org/issue433625
[3] https://bugs.python.org/issue8299#msg103224
[4] https://bugs.python.org/issue8410#msg103313
[5] https://bugs.python.org/issue8411#msg113301
[6] https://bugs.python.org/issue15038#msg163187
[7] 187aa54
[8] https://pypi.org/project/pygolang
https://bugs.python.org/issue38106