
[2.7] bpo-38106: Fix race in PyThread_release_lock that can lead to memory corruption and deadlock #16047

Merged
merged 1 commit into from Oct 3, 2019

Conversation


navytux commented Sep 12, 2019

On Darwin, even though it is considered a POSIX system, Python uses a
mutex+condition variable pair to implement its lock, and, as of 20190828, the
Py2.7 implementation, even though a similar issue was fixed for Py3 in 2012,
contains a synchronization bug: the condition is signalled after the mutex is
unlocked, while the correct protocol is to signal the condition while holding
the mutex:

https://github.com/python/cpython/blob/v2.7.16-127-g0229b56d8c0/Python/thread_pthread.h#L486-L506
187aa54 (py3 fix)

PyPy has the same bug for both pypy2 and pypy3:

https://bitbucket.org/pypy/pypy/src/578667b3fef9/rpython/translator/c/src/thread_pthread.c#lines-443:465
https://bitbucket.org/pypy/pypy/src/5b42890d48c3/rpython/translator/c/src/thread_pthread.c#lines-443:465

Signalling the condition outside of the corresponding mutex is considered OK by
POSIX, but in the Python context it can lead to at least memory corruption if we
consider the whole lifetime of a Python-level lock. For example, the following
logical scenario:

      T1                                          T2

  sema = Lock()
  sema.acquire()

                                              sema.release()

  sema.acquire()
  free(sema)

  ...

can translate into the following C-level calls:

      T1                                          T2

  # sema = Lock()
  sema = malloc(...)
  sema.locked = 0
  pthread_mutex_init(&sema.mut)
  pthread_cond_init (&sema.lock_released)

  # sema.acquire()
  pthread_mutex_lock(&sema.mut)
  # sees sema.locked == 0
  sema.locked = 1
  pthread_mutex_unlock(&sema.mut)

                                              # sema.release()
                                              pthread_mutex_lock(&sema.mut)
                                              sema.locked = 0
                                              pthread_mutex_unlock(&sema.mut)

                      # OS scheduler gets in and relinquishes control from T2
                      # to another process
                                              ...

  # second sema.acquire()
  pthread_mutex_lock(&sema.mut)
  # sees sema.locked == 0
  sema.locked = 1
  pthread_mutex_unlock(&sema.mut)

  # free(sema)
  pthread_mutex_destroy(&sema.mut)
  pthread_cond_destroy (&sema.lock_released)
  free(sema)

  # ...
  e.g. malloc() which returns memory where sema was

                                              ...
                      # OS scheduler returns control to T2
                      # sema.release() continues
                      #
                      # BUT sema was already freed, and writing anywhere
                      # inside the sema block CORRUPTS MEMORY. In particular, if
                      # _another_ python-level lock was allocated where the sema
                      # block was, writing into that memory can affect
                      # further synchronization correctness and in particular
                      # lead to deadlock on the lock that was allocated next.
                                              pthread_cond_signal(&sema.lock_released)

Note that T2's pthread_cond_signal(&sema.lock_released) CORRUPTS MEMORY, as it
is called after sema's memory has already been freed and potentially
reallocated for another object.

The fix is to move pthread_cond_signal so that it is issued under the corresponding mutex:

  # sema.release()
  pthread_mutex_lock(&sema.mut)
  sema.locked = 0
  pthread_cond_signal(&sema.lock_released)
  pthread_mutex_unlock(&sema.mut)
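For illustration, the scheme above (allocation, a blocking acquire with a wait loop, and the fixed release ordering) can be written as a self-contained C sketch. py_lock_t and the function names here are invented for this sketch; the real code in Python/thread_pthread.h differs in detail:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Minimal model of the mutex+condvar lock with the fixed release ordering.
 * py_lock_t and the function names are invented; the actual implementation
 * lives in Python/thread_pthread.h. */
typedef struct {
    char            locked;          /* 0 = unlocked, 1 = locked */
    pthread_mutex_t mut;
    pthread_cond_t  lock_released;
} py_lock_t;

py_lock_t *py_lock_new(void) {
    py_lock_t *lock = malloc(sizeof(*lock));
    lock->locked = 0;
    pthread_mutex_init(&lock->mut, NULL);
    pthread_cond_init(&lock->lock_released, NULL);
    return lock;
}

void py_lock_acquire(py_lock_t *lock) {
    pthread_mutex_lock(&lock->mut);
    while (lock->locked)             /* wait until the holder releases */
        pthread_cond_wait(&lock->lock_released, &lock->mut);
    lock->locked = 1;
    pthread_mutex_unlock(&lock->mut);
}

void py_lock_release(py_lock_t *lock) {
    pthread_mutex_lock(&lock->mut);
    lock->locked = 0;
    /* The fix: signal while still holding the mutex.  After we unlock, we
     * never touch the condvar again, so a waiter that wakes up, re-acquires
     * the lock and then frees it cannot race with a late signal. */
    pthread_cond_signal(&lock->lock_released);
    pthread_mutex_unlock(&lock->mut);
}

void py_lock_free(py_lock_t *lock) {
    /* Destroy the condition variable before its mutex, per the same fix. */
    pthread_cond_destroy(&lock->lock_released);
    pthread_mutex_destroy(&lock->mut);
    free(lock);
}
```

With this ordering, the T1/T2 scenario above becomes safe: by the time T1's second acquire can observe locked == 0 and proceed to free the lock, T2's release has already finished its last access to the condvar.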

To do so, this patch cherry-picks the thread_pthread.h part of the following 3.2 commit:

commit 187aa545165d8d5eac222ecce29c8a77e0282dd4
Author: Kristján Valur Jónsson <kristjan@ccpgames.com>
Date:   Tue Jun 5 22:17:42 2012 +0000

    Signal condition variables with the mutex held.  Destroy condition variables
    before their mutexes.

 Python/ceval_gil.h      |  9 +++++----
 Python/thread_pthread.h | 15 +++++++++------
 2 files changed, 14 insertions(+), 10 deletions(-)

(ceval_gil.h is Python3-specific and does not apply to Python 2.7.)


Bug history

The bug has been there since 1994, since at least [1]. It was discussed in 2001
with the original code author [2], but the code was still considered to be
race-free. In 2010 the question of where pthread_cond_signal should go, before
or after pthread_mutex_unlock, was discussed with the rationale of avoiding
thread bouncing [3,4,5], and in 2012 pthread_cond_signal was moved to be
called under the mutex, but only for CPython3 [6,7].

In 2019 the bug was (re-)discovered while testing Pygolang [8] on macOS with
CPython2, PyPy2 and PyPy3.

[1] 2c8cb9f
[2] https://bugs.python.org/issue433625
[3] https://bugs.python.org/issue8299#msg103224
[4] https://bugs.python.org/issue8410#msg103313
[5] https://bugs.python.org/issue8411#msg113301
[6] https://bugs.python.org/issue15038#msg163187
[7] 187aa54
[8] https://pypi.org/project/pygolang

https://bugs.python.org/issue38106

the-knights-who-say-ni commented Sep 12, 2019

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA:

@navytux

For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day
before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!


navytux commented Sep 12, 2019

I have signed the PSF CLA, but only 30 minutes ago.

@navytux navytux force-pushed the navytux:2.7-pylock-release-fix branch from 2445211 to 9b135c0 Sep 12, 2019

navytux commented Sep 12, 2019

I've amended the patch to make bedevere/news happy with the following Misc/NEWS.d/ interdiff:

--- /dev/null
+++ b/Misc/NEWS.d/next/Core and Builtins/2019-09-12-16-36-16.bpo-38106.4pApn7.rst
@@ -0,0 +1,2 @@
+Fix race in PyThread_release_lock that was leading to memory corruption and
+deadlocks.
navytux added a commit to navytux/pygolang that referenced this pull request Sep 17, 2019
I reported the bug (see 69db91b "libgolang: Add internal
semaphores") to the CPython and PyPy upstreams. The PyPy people have already
fixed it, and the fix, if I understood correctly, should be available as part
of PyPy 7.2. Let's hope that CPython 2.7 will also be fixed.

- https://bugs.python.org/issue38106
    -> python/cpython#16047
- https://bitbucket.org/pypy/pypy/issues/3072

The added Python-level test triggers the bug with high probability.
A dedicated C-level Semaphore-only test is pending.
@@ -0,0 +1,2 @@
Fix race in PyThread_release_lock that was leading to memory corruption and
deadlocks.

vstinner commented Oct 2, 2019

Oh! I just created a very similar local change to prepare a pull request... and then I saw that you created a PR! Oops.

I wrote a different description for the change:

    bpo-38106: Fix a race condition in pthread locks
    
    Fix a race condition in locks of the pthread implementation using
    mutex with condition variable. Signal condition variables with the
    mutex held. Destroy condition variables before their mutexes.

Please mention which exact implementation is changed. We do have 2 implementations for pthread :-)
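For context, the second implementation vstinner alludes to is the semaphore-based path: on systems where POSIX semaphores work, a PyThread lock is essentially a sem_t, and release is a single atomic sem_post() with no unlock-then-signal window at all. (Unnamed semaphores are not functional on macOS, which is exactly why Darwin falls back to the mutex+condvar variant.) A hedged sketch with invented names, not the actual thread_pthread.h code:

```c
#include <assert.h>
#include <semaphore.h>
#include <stdlib.h>

/* Sketch of the semaphore-based pthread lock.  Names are invented for
 * this sketch; the real code selects this path via USE_SEMAPHORES. */
typedef sem_t py_sem_lock_t;

py_sem_lock_t *py_sem_lock_new(void) {
    py_sem_lock_t *lock = malloc(sizeof(*lock));
    sem_init(lock, 0, 1);            /* count 1 == unlocked */
    return lock;
}

void py_sem_lock_acquire(py_sem_lock_t *lock) { sem_wait(lock); }

/* Release is one atomic operation; there is no separate "unlock, then
 * signal" step that another thread could interleave with free(). */
void py_sem_lock_release(py_sem_lock_t *lock) { sem_post(lock); }

void py_sem_lock_free(py_sem_lock_t *lock) {
    sem_destroy(lock);
    free(lock);
}
```

Because sem_post touches the semaphore exactly once, the use-after-free window described in the PR description does not exist in this variant.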

@navytux navytux force-pushed the navytux:2.7-pylock-release-fix branch from 9b135c0 to 5164ee7 Oct 2, 2019

navytux commented Oct 2, 2019

@vstinner thanks for the feedback and for noticing my PR. I agree it makes sense to clarify in NEWS which systems are affected. I've amended the patch with the following interdiff:

--- a/Misc/NEWS.d/next/Core and Builtins/2019-09-12-16-36-16.bpo-38106.4pApn7.rst
+++ b/Misc/NEWS.d/next/Core and Builtins/2019-09-12-16-36-16.bpo-38106.4pApn7.rst
@@ -1,2 +1,6 @@
 Fix race in PyThread_release_lock that was leading to memory corruption and
 deadlocks.
+
+The fix applies to POSIX systems where Python locks are implemented with mutex
+and condition variable because POSIX semaphores are either not provided, or are
+known to be broken. One particular example of such system is macOS.

Is it ok?

vstinner left a comment

LGTM. I would prefer to have a review from a second core dev.

@pablogsal, @serhiy-storchaka, @methane, @benjaminp: Would you mind taking a look at this fix for a lock race condition?

@@ -0,0 +1,6 @@
Fix race in PyThread_release_lock that was leading to memory corruption and
deadlocks.

vstinner commented Oct 2, 2019

I'm not sure how it is rendered. I suggest merging the sentences into a single paragraph.

navytux commented Oct 2, 2019

Ok, I've amended the patch with the following interdiff:

--- a/Misc/NEWS.d/next/Core and Builtins/2019-09-12-16-36-16.bpo-38106.4pApn7.rst       
+++ b/Misc/NEWS.d/next/Core and Builtins/2019-09-12-16-36-16.bpo-38106.4pApn7.rst       
@@ -1,6 +1,5 @@
 Fix race in PyThread_release_lock that was leading to memory corruption and
-deadlocks.
-
-The fix applies to POSIX systems where Python locks are implemented with mutex
-and condition variable because POSIX semaphores are either not provided, or are
-known to be broken. One particular example of such system is macOS.
+deadlocks. The fix applies to POSIX systems where Python locks are implemented
+with mutex and condition variable because POSIX semaphores are either not
+provided, or are known to be broken. One particular example of such system is
+macOS.

For reference, changelog entries are sometimes added as several paragraphs, and they render e.g. like this (search for "bpo-33185: Fixed regression when running pydoc with the -m switch" there).

@navytux navytux force-pushed the navytux:2.7-pylock-release-fix branch from 5164ee7 to 7c11834 Oct 2, 2019

navytux commented Oct 2, 2019

LGTM

Thanks for the review.

@methane
methane approved these changes Oct 3, 2019
@vstinner vstinner merged commit c5abd63 into python:2.7 Oct 3, 2019
4 checks passed
bedevere/issue-number Issue number 38106 found
bedevere/maintenance-branch-pr Valid maintenance branch PR title.
bedevere/news News entry found in Misc/NEWS.d
continuous-integration/travis-ci/pr The Travis CI build passed

vstinner commented Oct 3, 2019

Thank you very much for this amazing commit message @navytux :-D I made some minor edits: I manually added the "Co-Authored-By" and "(cherry picked from commit 187aa54)" markups.


navytux commented Oct 3, 2019

@vstinner, @methane, thanks again for the feedback and for merging. The edits also look good; thanks for adding them. Python 2.7 on macOS should now, hopefully, be in better shape.

Thanks again,
Kirill

@navytux navytux deleted the navytux:2.7-pylock-release-fix branch Oct 3, 2019
navytux added a commit to navytux/pygolang that referenced this pull request Oct 4, 2019
Until now libgolang had semaphore and mutex functionality, but
implemented only internally and not exposed to users. Since for its
pinner thread wendelin.core v2 needs not only nogil channels, but also
nogil mutexes, timers and so on, it is now time to start providing that.

We start with mutexes:

- Expose Mutex from the insides of libgolang.cpp to the public API in
  libgolang.h. We actually expose both Mutex and Sema, because semaphores
  are sometimes also useful for low-level synchronization, and because it
  is easier to export both at the same time.

- Provide pyx/nogil API to those as sync.Mutex and sync.Sema.

- Provide corresponding python wrappers.

- Add a Pyx-level test for semaphore wakeup when the wakeup is done by a
  thread other than the one that did the original acquire. This is the test
  that was promised in 5142460 (libgolang/thread: Add links to upstream
  PyThread_release_lock bug), and it used to corrupt memory and deadlock
  on macOS due to CPython & PyPy runtime bugs:

    https://bugs.python.org/issue38106
      -> python/cpython#16047
    https://bitbucket.org/pypy/pypy/issues/3072

Note about the Python wrappers: at present, due to
cython/cython#3165, a C-level panic is not
properly translated into a Py-level exception in the (Py)Sema/Mutex
constructors. This should not pose a real problem in practice, though.

navytux commented Dec 11, 2019

For reference: a test for this issue is included in pygolang: https://lab.nexedi.com/kirr/pygolang/commit/34b7a1f4.
