Creating an ungodly amount of sub interpreters in a short amount of time causes memory debug assertions. #123134
Can you reproduce this behaviour on Linux or macOS? Or what happens if you sleep a bit before spawning the sub-interpreters? I'm not an expert in threads, but I'd say that the interpreters are spawning too fast for the OS to follow up =/ Can you check whether the same occurs with HEAD instead of 3.12.5? (Maybe there are some bugfixes that are not yet available.)
At a glance, this could possibly be related to a memory allocation failure or something similar.

```cpp
auto globals = PyDict_New();
auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
auto result = PyEval_EvalCode(code, globals, globals);
```

These lines need a `NULL` check.
Sleeping might fix it, but might also hide the issue. The goal here was just to run a bunch of interpreters, not sleep for the sake of working around possible synchronization issues.
This is not how threads or operating systems work. There is a synchronization issue or an ABA problem with the debug memory assertions in the C API's internals.
I will also work on giving this a go.
Oh! TIL. I still would suggest adding a
The error is outside of the Python C API as one might usually interact with it, i.e., only on subinterpreter initialization (i.e., creating a
Right, but it could cause memory corruption somewhere, and cause the error somewhere else down the line. Regardless, I wasn't able to reproduce this on Linux, so this might be Windows-specific.
@picnixz I can confirm the behavior still exists on Windows on HEAD.
Pleased to back up @ZeroIntensity and say that this appears to be a Windows-specific issue. I was unable to replicate this memory assertion error on Linux. However, when executing this code under TSan, I got the following data race error, which seems to indicate the same kind of memory corruption could be taking place:

```
WARNING: ThreadSanitizer: data race (pid=11259)
  Write of size 8 at 0x7f24e35e16c0 by thread T10:
    #0 qsort_r <null> (463-interpreters+0x9bc2e) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #1 qsort <null> (463-interpreters+0x9bec7) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #2 setup_confname_table /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13543:5 (libpython3.12.so.1.0+0x3b6946) (BuildId: e2e0fbc52bf8cb7daf99595e16454e44217040f4)
    #3 setup_confname_tables /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13572:9 (libpython3.12.so.1.0+0x3b6946)
    #4 posixmodule_exec /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:16905:9 (libpython3.12.so.1.0+0x3b6946)
    ...
  Previous write of size 8 at 0x7f24e35e16c0 by thread T18:
    #0 qsort_r <null> (463-interpreters+0x9bc2e) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #1 qsort <null> (463-interpreters+0x9bec7) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #2 setup_confname_table /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13543:5 (libpython3.12.so.1.0+0x3b6946) (BuildId: e2e0fbc52bf8cb7daf99595e16454e44217040f4)
    #3 setup_confname_tables /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13572:9 (libpython3.12.so.1.0+0x3b6946)
    #4 posixmodule_exec /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:16905:9 (libpython3.12.so.1.0+0x3b6946)
  Location is global '??' at 0x7f24e2e85000 (libpython3.12.so.1.0+0x75c880)
```

This only happened 3 times, and around the 42-thread generation step on average.
A quick look suggests this TSan failure is a data race local to posixmodule.c: there appears to be no protection from multiple threads importing posixmodule simultaneously, and hence multiple threads trying to simultaneously qsort in-place one of the three static tables. This module is used on Windows, but I'm not sure if any of the macros enabling those tables are defined there. I don't know if overlapping qsorts on the same array will actually corrupt memory, but I doubt there's any guarantee it won't. So probably a separate issue, and a more-focused/reliable MRE could be made for it.
Come to think of it, I'm not even sure subinterpreters are thread-safe in the first place; you might need to acquire the GIL. Though, I'm not all too familiar with them. cc @ericsnowcurrently for insight.
I think the point of PEP 684 (in Python 3.12) is that you can have a per-subinterpreter GIL, as is being done in this test case. posixmodule.c claims compatibility with the per-interpreter GIL, so in that case I think it's a straight-up bug.

Edit: Oops, perhaps I misunderstood your comment, and it wasn't directed at the posixmodule issue. Indeed, calling
So I'm not clear if the given example is doing the right thing here or not, but I suspect the first thing the tasks should do is take the main interpreter's GIL if not already holding it. (And the call to
I'm guessing that's the problem. Judging by the comment, it looks like

Lines 2261 to 2263 in bffed80

I could see this being a bug, though. I'll wait for Eric to clarify.
Well, this definitely brings me some concern then, as the GIL can be made optional in 3.13 😅
Does this occur on the free-threaded build?
Sorry for the delay. Had a busy work week. I'll gather this info this weekend, though the code seems to indicate "yes" from a glance.
@ZeroIntensity I can confirm the issue still happens when the GIL is disabled in 3.13 and 3.14 HEAD |
For now, I think I'm going to write this off as user-error, as
How can those work if the GIL is disabled (and thus can't be engaged) on 3.13 and later, and each new interpreter has its own GIL created? You cannot call
AFAIK, at a glance, the drop-in code would be:

```cpp
void execute () {
    std::vector<std::jthread> tasks { };
    tasks.reserve(MAX_STATES);
    for (auto count = 0zu; count < tasks.capacity(); count++) {
        std::println("Generating thread state {}", count);
        tasks.emplace_back([count] {
            PyGILState_STATE gil = PyGILState_Ensure(); // Make sure we can call the API
            if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status)) {
                std::println("Failed to initialize thread state {}", count);
                PyGILState_Release(gil);
                return;
            }
            auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
            auto globals = PyDict_New();
            auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
            auto result = PyEval_EvalCode(code, globals, globals);
            Py_DecRef(result);
            Py_DecRef(code);
            Py_DecRef(globals);
            Py_EndInterpreter(state);
            state = nullptr;
            PyGILState_Release(gil); // Release the GIL (only on non-free-threaded builds)
        });
    }
}
```

There needs to be a
It shouldn't require any of that. The interpreter variable is local to the thread. The need for

(emphasis mine) So, either the documentation is wrong, or Python's C API is wrong. Either way, there is a bug here.
Yes, but how do you initialize that interpreter variable? I don't think

I think you're misinterpreting what I'm saying: you're right, you don't need to mess with the GIL when using the subinterpreter, but it seems (based on the code from
Yes, it is initialized; that's what

I'm very certain this has to do with memory corruption: memory blocks within the debug assertions are not being locked or cleared out properly, which is leading to this invalid state. Why it only happens on Windows most likely has to do with behavior of the MSVC runtime in some capacity, and some behavioral difference between Windows memory pages and the virtual allocator versus POSIX platforms surfacing. That's the best I've been able to intuit, as the crash is indeterminate in nature, though as I stated in an earlier comment
It would probably segfault instead, FWIW -- C API calls don't have an extra check for the interpreter or thread state. I don't see where in

Speculatively, the race could be caused by here:

Line 2251 in bffed80

Trying to set this across multiple threads could be problematic, since there's no lock.
If any of those failures were happening there, this same behavior would be observable in a release build. This situation only happens when the
The whole issue overall screams data race to me; it's quite odd that this only occurs on Windows debug builds. I would guess that this is present in release builds as well, just maybe not with this reproducer. Regardless, I'm like 70% sure that a lock should be acquired when writing to the runtime state, through
Having a quick poke around (in 3.13 on GitHub): basically, the things that

So I don't think the original code breaks the actual preconditions, at least that I noticed in browsing, and hence the docs that suggest a GIL must exist for calls to

The comment "but mixing multiple interpreters and the

The converse of that last paragraph is that

#59956 (comment) seems to agree with the above, as it notes that

That said, adding
After some debugging, it seems I was wrong!
Looking at the memory dump data from

Instead it looks like we got overwritten with a 32-bit little-endian 0x00000000000003e8 (1000) followed by 16-bit 0x0000a70 (2672). (That first one could also be 16-bit 1000 followed by 16-bit 0, no way to tell. We don't see the preceding bytes either. They don't look like ASCII to me, anyway.) The 1000 is nice and round, but neither one jumps out with any idea of what might be overwriting that data.

So no obvious clue as to whether this only happens in the debug build, or if it's only detected in the debug build. In release builds there's no bookkeeping in live blocks or release-time validity checking, so if the overwrite is happening in non-debug builds we may never see evidence for it. The first thing in a PyObject is the ssize_t ref count, so in release builds an overwrite like this would prevent the overwritten object from being deallocated, but not otherwise mess with anything off-hand. (GIL-disabled builds look like they would be more damaged by overwriting the first

Since TSan isn't flagging this, I suspect these are overlapping objects from the same thread: an object is either being allocated from the wrong pool, or is writing past its end-bounds and hitting the next object in the pool. (Perhaps the damaged object is raw-alloc'd? I don't know if I'd expect it to be 16-bit aligned, but maybe I'm wrong to assume a

Anyway, I expect if you can catch the failure in a debugger, you could walk backwards a bit in memory and find the actual object causing the problem, if that is a

The preceding object probably isn't a

Which leans me harder towards "one of the obmalloc tracking structures on the heap blows out its bounds and corrupts the first-allocated PyObject from the first-created pool, which immediately follows it". That points the finger at

So maybe a dead-end after all. Is it possible to reproduce this failure with
Wait, do you have a repro-case on Linux? Or were you looking at the TSan failure? (If the latter, put a critical section of some kind around the qsort calls and it should be fixed. Not the best fix, but at least proves or disproves the point.)
I reproduced using the MSVC++ 17.11 Community edition compiler (under VS Code) and the clearest failure is

from a

Anyway, what I guess is going on is that the object is being free'd in the wrong thread, so when

The free'd object is generally on-the-way-out; in the list and dict resize operations it was the "oldXXX" pointer. I expect that the reason

I just now had a

Edit: Thinking about this, I realised (and tested) that the repro does not require any calls between

Edit again (last one tonight): I think the following is the most-isolated example of a failure case, marshal.c:1830 (Python 3.12):

```c
if ((rf.refs = PyList_New(0)) == NULL)
    return NULL;
result = read_object(&rf);
Py_DECREF(rf.refs);
```

The file itself is 44kb, in case it turns out to be something about a particular file. Although I hit it again with a smaller buffer (and smaller object being freed), so probably not.

Inside a block of C code, we create a new list, stash it somewhere not visible from anywhere else, pass it into a function, and then free it. During the free, I'm seeing the list had 465 items in it, and the failure came when decrefing item 128 (going from the back). I don't know what type that item had, as its data was already blatted by the raw-free logic after passing

Given the size, it would not have been in the pool allocator but should have fallen through to heap allocation, so I'm now pretty doubtful about my earlier "freeing in wrong obmalloc" theory, and unsure what exactly

That said, since this is an unmarshall, I believe it can reference things that already exist, so maybe it's making itself visible to another thread by this mechanism?
I'm wondering if the Linux and Windows parts are separate issues. The data race occurs on just two threads and seems to happen in |
@ZeroIntensity
I'll create a separate issue later today then, thanks!
Okay, I have a more-isolated example which points to maybe there just being a nasty bug in the debug allocator support somewhere. I just got the

```c
static PyModuleDef *
_extensions_cache_get(PyObject *filename, PyObject *name)
{
    PyModuleDef *def = NULL;
    void *key = NULL;
    extensions_lock_acquire();
    if (EXTENSIONS.hashtable == NULL) {
        goto finally;
    }
    key = hashtable_key_from_2_strings(filename, name, HTSEP);
    if (key == NULL) {
        goto finally;
    }
    _Py_hashtable_entry_t *entry = _Py_hashtable_get_entry(
        EXTENSIONS.hashtable, key);
    if (entry == NULL) {
        goto finally;
    }
    def = (PyModuleDef *)entry->value;

finally:
    extensions_lock_release();
    if (key != NULL) {
        PyMem_RawFree(key);
    }
    return def;
}
```

That
Everything except the
and
i.e., it's global, not per-interpreter. Oh, erk. Looking at the stack trace, it ends with:

but that's wrong. And we can see in the memory dump that the

So in short, I think the underlying issue is that something is causing

```c
#define _PyMem_Raw (_PyRuntime.allocators.standard.raw)
//...
void PyMem_RawFree(void *ptr)
{
    _PyMem_Raw.free(_PyMem_Raw.ctx, ptr);
}
```

I confirmed with a watch in VSCode that

A quick test shows

I wasn't able to catch anything changing allocators during runtime, but I did reproduce the other version of the failure, when

So yeah, I reckon that's the issue, and my best guess is that something in

(This might have been easier to spot earlier if Python and MSVC weren't using the same byte value (0xfd) for their guard value: what I earlier thought was an overwrite of the Python memory tracking data was actually the MSVC tracking data: LE size_t of the allocation length, 16 bits of request number, and four 0xfd's.)

tl;dr: It seems that something called under

So this only affects debug builds (although using a custom allocator would also reproduce this in release builds, I guess), but should also happen on Linux AFAICT, though it would need a debug allocator like MSVC's to actually catch it. (There might be a confounding factor, like the faulty code being Windows-only; I haven't tracked that down.)

Edit: Got it.

Last edit for the night: I had a quick go with 3.13.0rc1, and the same code and config hangs because the import machinery now tries to switch to the main thread in

Making this change in
And now it doesn't seem to reproduce.

Since the problematic code was removed in 3.13, I suspect the 3.13/HEAD repro case seen in #123134 (comment) occurs only when the GIL is disabled, and is either some GIL-disabled-only codepath triggering the same allocator mismatch in other threads, or some other GIL-disabled bug with the same or similar symptom. I tried adding

This is what my repro-attempt for 3.13 looks like now:

```cmake
cmake_minimum_required(VERSION 3.30)
project(463-interpreters LANGUAGES C CXX)
set(Python_FIND_ABI "ANY" "ANY" "ANY" "ON")
find_package(Python 3.13 REQUIRED COMPONENTS Development.Embed)
set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded$<$<CONFIG:Debug>:Debug>")
add_executable(${PROJECT_NAME})
target_sources(${PROJECT_NAME} PRIVATE main.cxx)
target_compile_features(${PROJECT_NAME} PRIVATE cxx_std_23)
target_precompile_headers(${PROJECT_NAME} PRIVATE <Python.h>)
target_link_libraries(${PROJECT_NAME} PRIVATE Python::Python)
# find_package ought to do this when free-threaded support is requested.
target_compile_definitions(${PROJECT_NAME} PRIVATE Py_GIL_DISABLED=1)
```

```cpp
#include <cstdlib>
#include <format>
#include <print>
#include <thread>
#include <vector>

#include <Python.h>

// https://developercommunity.visualstudio.com/t/Please-implement-P0330R8-Literal-Suffix/10410860#T-N10539586
static constexpr size_t operator""_uz(unsigned long long n)
{
    return size_t(n);
}

namespace
{
    static thread_local inline PyThreadState* state = nullptr;
    static inline constexpr auto MAX_STATES = 463;
    static inline constexpr auto config = PyInterpreterConfig {
        .use_main_obmalloc = 0,
        .allow_fork = 0,
        .allow_exec = 0,
        .allow_threads = 0,
        .allow_daemon_threads = 0,
        .check_multi_interp_extensions = 1,
        .gil = PyInterpreterConfig_OWN_GIL,
    };
} /* nameless namespace */

void execute()
{
    std::vector<std::jthread> tasks {};
    tasks.reserve(MAX_STATES);
    for (auto count = 0_uz; count < tasks.capacity(); count++)
    {
        std::println("Generating thread state {}", count);
        tasks.emplace_back(
            [count]
            {
                if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status))
                {
                    std::println("Failed to initialize thread state {}", count);
                    return;
                }
                auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
                auto globals = PyDict_New();
                auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
                auto result = PyEval_EvalCode(code, globals, globals);
                Py_DecRef(result);
                Py_DecRef(code);
                Py_DecRef(globals);
                Py_EndInterpreter(state);
                state = nullptr;
            });
    }
}

int main()
{
    PyConfig config {};
    PyConfig_InitIsolatedConfig(&config);
    if (auto status = Py_InitializeFromConfig(&config); PyStatus_IsError(status))
    {
        std::println("Failed to initialize with isolated config: {}", status.err_msg);
        return EXIT_FAILURE;
    }
    PyConfig_Clear(&config);
    Py_BEGIN_ALLOW_THREADS;
    execute();
    Py_END_ALLOW_THREADS;
    Py_Finalize();
}
```
Unrelated, but I cannot reproduce the data race in
I may actually just have been too impatient last night with the free-threading test. If I wait, it seems to eventually complete. There's a bit of a bottleneck in importing:

This world-stoppage has the effect of serialising all the

There's a note in

so maybe this will be improved in future, since at least during

In the real world, this would be a thread-startup cost, and new interpreters already have to call

So at this point, I think this issue's original bug is actually "resolved-by-accident in 3.13", pending a new repro case. I'm not sure if it's interesting to fix in 3.12; it only affects debug builds, but it is a potential memory-corruption issue. Since the functions were removed in 3.13 and previously deprecated, simply gutting them in 3.12 and complaining loudly if they are called would be viable, perhaps.

The qsort issue should be reproducible in 3.13, but it's quite possible that it has no actual effect despite being detectable by TSan, beyond wasting time with repeated qsort calls on already-sorted data. It depends on how well or badly qsort handles being run on the same array in multiple threads.
@TBBle Small note btw, per your comment and the
Bug report
Bug description:
Hello. While working on a small joke program, I found a possible memory corruption issue (it could also be a threading issue?) when using the Python C API in a debug-only build to quickly create, execute Python code in, and then destroy 463 sub-interpreters. Before I post the code sample and the debug output: I'm using a somewhat unique build environment for a Windows developer.
When running the code sample I've attached at the bottom of this post, I am unable to get the exact same output each time, though the traceback does fire in the same location (due to the size of the traceback I've not attached it, as it's about 10 MB of text for each thread). Additionally, I sometimes have to run the executable several times to get the error to occur. Lastly, release builds do not exhibit any thread crashes or issues, as the debug assertions never fire or execute.
The error output seems to also halt in some cases, either because of stack failure or some other issue I was unable to determine, which seemed to possibly be outside the scope of the issue presented here. I have the entire error output from one execution where I was able to save the output.
The output cut off after this, as the entire program crashed, taking my terminal with it 😅
You'll find the MRE code below. I've also added a minimal version of the CMakeLists.txt file I used so anyone can recreate the build with the code below (any warnings or additional settings I have do not affect whether the error occurs or not). The code appears to break inside of `_PyObject_DebugDumpAddress`, based on what debugging I was able to do with WinDbg.

Important: `std::jthread` calls `.join()` on destruction, so all threads auto-join once the `std::vector` goes out of scope.

Additionally, this code exhibits the same behavior regardless of whether the state pointer is a `thread_local` or declared within the lambda passed to `std::thread`.
main.cxx
CMakeLists.txt
Command to build + run
CPython versions tested on:
3.12
Operating systems tested on:
Windows