Creating an ungodly amount of sub interpreters in a short amount of time causes memory debug assertions. #123134
Can you reproduce this behaviour on Linux or macOS? Or what happens if you sleep a bit before spawning the sub-interpreters? I'm not an expert in threads, but I'd say that the interpreters are spawning too fast for the OS to follow up =/ Can you check whether the same occurs with HEAD instead of 3.12.5? (Maybe there are some bugfixes that are not yet available.)
At a glance, this could possibly be related to a memory allocation failure or something similar.

```cpp
auto globals = PyDict_New();
auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
auto result = PyEval_EvalCode(code, globals, globals);
```

These lines need a `NULL` check.
Sleeping might fix it, but might also hide the issue. The goal here was just to run a bunch of interpreters, not sleep for the sake of working around possible synchronization issues.
This is not how threads or operating systems work. There is a synchronization issue or an ABA problem with the debug memory assertions in the C API's internals.
I will also work on giving this a go.
Oh! TIL. I still would suggest adding a
The error is outside of the Python C API as one might usually interact with it, i.e., only on subinterpreter initialization (i.e., creating a
Right, but it could cause memory corruption somewhere, and cause the error somewhere else down the line. Regardless, I wasn't able to reproduce this on Linux, so this might be Windows-specific.
@picnixz I can confirm the behavior still exists on Windows on HEAD.
Pleased to back up @ZeroIntensity and say that this appears to be a Windows-specific issue. I was unable to replicate this memory assertion error on Linux. However, when executing this code under TSan, I got the following data race error, which seems to indicate the same kind of memory corruption could be taking place:

```
WARNING: ThreadSanitizer: data race (pid=11259)
  Write of size 8 at 0x7f24e35e16c0 by thread T10:
    #0 qsort_r <null> (463-interpreters+0x9bc2e) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #1 qsort <null> (463-interpreters+0x9bec7) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #2 setup_confname_table /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13543:5 (libpython3.12.so.1.0+0x3b6946) (BuildId: e2e0fbc52bf8cb7daf99595e16454e44217040f4)
    #3 setup_confname_tables /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13572:9 (libpython3.12.so.1.0+0x3b6946)
    #4 posixmodule_exec /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:16905:9 (libpython3.12.so.1.0+0x3b6946)
    ...
  Previous write of size 8 at 0x7f24e35e16c0 by thread T18:
    #0 qsort_r <null> (463-interpreters+0x9bc2e) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #1 qsort <null> (463-interpreters+0x9bec7) (BuildId: 4e94833395fedac62b3c4caa6921ff713e25cff1)
    #2 setup_confname_table /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13543:5 (libpython3.12.so.1.0+0x3b6946) (BuildId: e2e0fbc52bf8cb7daf99595e16454e44217040f4)
    #3 setup_confname_tables /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:13572:9 (libpython3.12.so.1.0+0x3b6946)
    #4 posixmodule_exec /usr/src/python3.12-3.12.3-1ubuntu0.1/build-shared/../Modules/posixmodule.c:16905:9 (libpython3.12.so.1.0+0x3b6946)
  Location is global '??' at 0x7f24e2e85000 (libpython3.12.so.1.0+0x75c880)
```

This only happened 3 times, and around the 42-thread generation step on average.
A quick look suggests this TSan failure is a data race local to posixmodule.c: there appears to be no protection from multiple threads importing posixmodule simultaneously, and hence multiple threads trying to simultaneously qsort in-place one of the three static tables. This module is used on Windows, but I'm not sure if any of the macros enabling those tables are defined there. I don't know if overlapping qsorts on the same array will actually corrupt memory, but I doubt there's any guarantee it won't. So probably a separate issue, and a more-focused/reliable MRE could be made for it.
Come to think of it, I'm not even sure subinterpreters are thread-safe in the first place; you might need to acquire the GIL. Though, I'm not all too familiar with them. cc @ericsnowcurrently for insight.
I think the point of PEP 684 (in Python 3.12) is that you can have a per-subinterpreter GIL, as is being done in this test case. posixmodule.c claims compatibility with the per-interpreter GIL, so in that case I think it's a straight-up bug.

Edit: Oops, perhaps I misunderstood your comment, and it wasn't directed at the posixmodule issue. Indeed, calling
So I'm not clear if the given example is doing the right thing here or not, but I suspect the first thing the tasks should do is take the main interpreter's GIL if not already holding it. (And the call to
I'm guessing that's the problem. Judging by the comment, it looks like

Lines 2261 to 2263 in bffed80

I could see this being a bug, though. I'll wait for Eric to clarify.
Well, this definitely brings me some concern then, as the GIL can be made optional in 3.13 😅
Does this occur on the free-threaded build?
Sorry for the delay. Had a busy work week. I'll gather this info this weekend, though the code seems to indicate "yes" from a glance.
@ZeroIntensity I can confirm the issue still happens when the GIL is disabled in 3.13 and 3.14 HEAD |
For now, I think I'm going to write this off as user-error, as
How can those work if the GIL is disabled (and thus can't be engaged) on 3.13 and later, and each new interpreter has its own GIL created? You cannot call
AFAIK, at a glance, the drop-in code would be:

```cpp
void execute () {
    std::vector<std::jthread> tasks { };
    tasks.reserve(MAX_STATES);
    for (auto count = 0zu; count < tasks.capacity(); count++) {
        std::println("Generating thread state {}", count);
        tasks.emplace_back([count] {
            PyGILState_STATE gil = PyGILState_Ensure(); // Make sure we can call the API
            if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status)) {
                std::println("Failed to initialize thread state {}", count);
                PyGILState_Release(gil);
                return;
            }
            auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
            auto globals = PyDict_New();
            auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
            auto result = PyEval_EvalCode(code, globals, globals);
            Py_DecRef(result);
            Py_DecRef(code);
            Py_DecRef(globals);
            Py_EndInterpreter(state);
            state = nullptr;
            PyGILState_Release(gil); // Release the GIL (only on non-free-threaded builds)
        });
    }
}
```

There needs to be a
It shouldn't require any of that. The interpreter variable is local to the thread. The need for

(emphasis mine) So, either the documentation is wrong, or Python's C API is wrong. Either way, there is a bug here.
Yes, but how do you initialize that interpreter variable? I don't think

I think you're misinterpreting what I'm saying: you're right, you don't need to mess with the GIL when using the subinterpreter, but it seems (based on the code from
Yes, it is initialized; that's what

I'm very certain this has to do with memory corruption: memory blocks within the debug assertions are not being locked or cleared out properly, which is leading to this invalid state. Why it only happens on Windows most likely has to do with behavior of the MSVC runtime in some capacity, and some behavioral difference between Windows memory pages and the virtual allocator versus POSIX platforms surfacing. That's the best I've been able to intuit, as the crash is indeterminate in nature, though as I stated in an earlier comment
It would probably segfault instead, FWIW -- C API calls don't have an extra check for the interpreter or thread state. I don't see where in

Speculatively, the race could be caused by here:

Line 2251 in bffed80

Trying to set this across multiple threads could be problematic, since there's no lock.
If any of those failures were happening there, this same behavior would be observable in a release build. This situation only happens when the
The whole issue overall screams data race to me; it's quite odd that this only occurs on Windows debug builds. I would guess that this is present in release builds as well, just maybe not with this reproducer. Regardless, I'm like 70% sure that a lock should be acquired when writing to the runtime state, through
Having a quick poke around (in 3.13 on GitHub): basically, the things that

So I don't think the original code breaks the actual preconditions, at least that I noticed in browsing, and hence the docs that suggest a GIL must exist for calls to

The comment "but mixing multiple interpreters and the

The converse of that last paragraph is that

#59956 (comment) seems to agree with the above, as it notes that

That said, adding
After some debugging, it seems I was wrong!
Looking at the memory dump data from

Instead it looks like we got overwritten with a 32-bit little-endian 0x00000000000003e8 (1000) followed by 16-bit 0x0000a70 (2672). (That first one could also be 16-bit 1000 followed by 16-bit 0, no way to tell. We don't see the preceding bytes either. They don't look like ASCII to me, anyway.) The 1000 is nice and round, but neither one jumps out with any idea of what might be overwriting that data.

So no obvious clue as to whether this only happens in the debug build, or if it's only detected in the debug build. In release builds there's no bookkeeping in live blocks or release-time validity checking, so if the overwrite is happening in non-debug builds we may never see evidence for it. The first thing in a PyObject is the ssize_t ref count, so in release builds an overwrite like this would prevent the overwritten object from being deallocated, but not otherwise mess with anything off-hand. (GIL-disabled builds look like they would be more damaged by overwriting the first

Since TSan isn't flagging this, I suspect these are overlapping objects from the same thread: an object is either being allocated from the wrong pool, or is writing past its end-bounds and hitting the next object in the pool. (Perhaps the damaged object is raw-alloc'd? I don't know if I'd expect it to be 16-bit aligned, but maybe I'm wrong to assume a

Anyway, I expect if you can catch the failure in a debugger, you could walk backwards a bit in memory and find the actual object causing the problem, if that is a

The preceding object probably isn't a

Which leans me harder towards "one of the obmalloc tracking structures on the heap blows out its bounds and corrupts the first-allocated PyObject from the first-created pool, which immediately follows it". That points the finger at

So maybe a dead-end after all. Is it possible to reproduce this failure with
Wait, do you have a repro-case on Linux? Or were you looking at the TSan failure? (If the latter, put a critical section of some kind around the qsort calls and it should be fixed. Not the best fix, but at least proves or disproves the point.)
I reproduced using the MSVC++ 17.11 Community edition compiler (under VS Code) and the clearest failure is

from a

Anyway, what I guess is going on is that the object is being free'd in the wrong thread, so when

The free'd object is generally on-the-way-out; in the list and dict resize operations it was the "oldXXX" pointer. I expect that the reason

I just now had a

Edit: Thinking about this, I realised (and tested) that the repro does not require any calls between

Edit again (last one tonight): I think the following is the most-isolated example of a failure case, marshal.c:1830 (Python 3.12):

```c
if ((rf.refs = PyList_New(0)) == NULL)
    return NULL;
result = read_object(&rf);
Py_DECREF(rf.refs);
```

The file itself is 44kb, in case it turns out to be something about a particular file. Although I hit it again with a smaller buffer (and smaller object being freed), so probably not.

Inside a block of C code, we create a new list, stash it somewhere not visible from anywhere else, pass it into a function, and then free it. During the free, I'm seeing the list had 465 items in it, and the failure came when decrefing item 128 (going from the back). I don't know what type that item had, as its data was already blatted by the raw-free logic after passing

Given the size, it would not have been in the pool allocator but should have fallen through to heap allocation, so I'm now pretty doubtful about my earlier "freeing in wrong obmalloc" theory, and unsure what exactly

That said, since this is an unmarshall, I believe it can reference things that already exist, so maybe it's making itself visible to another thread by this mechanism?
I'm wondering if the Linux and Windows parts are separate issues. The data race occurs on just two threads and seems to happen in |
@ZeroIntensity
I'll create a separate issue later today then, thanks!
Okay, I have a more-isolated example which points to maybe there just being a nasty bug in the debug allocator support somewhere. I just got the

```c
static PyModuleDef *
_extensions_cache_get(PyObject *filename, PyObject *name)
{
    PyModuleDef *def = NULL;
    void *key = NULL;
    extensions_lock_acquire();
    if (EXTENSIONS.hashtable == NULL) {
        goto finally;
    }
    key = hashtable_key_from_2_strings(filename, name, HTSEP);
    if (key == NULL) {
        goto finally;
    }
    _Py_hashtable_entry_t *entry = _Py_hashtable_get_entry(
        EXTENSIONS.hashtable, key);
    if (entry == NULL) {
        goto finally;
    }
    def = (PyModuleDef *)entry->value;

finally:
    extensions_lock_release();
    if (key != NULL) {
        PyMem_RawFree(key);
    }
    return def;
}
```

That
Everything except the
and
i.e., it's global, not per-interpreter. Oh, erk. Looking at the stack trace, it ends with:

but that's wrong. And we can see in the memory dump that the

So in short, I think the underlying issue is that something is causing

```c
#define _PyMem_Raw (_PyRuntime.allocators.standard.raw)
//...
void PyMem_RawFree(void *ptr)
{
    _PyMem_Raw.free(_PyMem_Raw.ctx, ptr);
}
```

I confirmed with a watch in VSCode that

A quick test shows

I wasn't able to catch anything changing allocators during runtime, but I did reproduce the other version of the failure, when

So yeah, I reckon that's the issue, and my best guess is that something in

(This might have been easier to spot earlier if Python and MSVC weren't using the same byte value (0xfd) for their guard value: what I earlier thought was an overwrite of the Python memory tracking data was actually the MSVC tracking data: LE size_t of the allocation length, 16 bits of request number, and four 0xfd's.)

tl;dr: It seems that something called under

So this only affects debug builds (although using a custom allocator would also reproduce this in release builds, I guess), but should also happen on Linux AFAICT, though it would need a debug allocator like MSVC's to actually catch it. (There might be a confounding factor, like the faulty code being Windows-only; I haven't tracked that down.)

Edit: Got it.

Last edit for the night: I had a quick go with 3.13.0rc1, and the same code and config hangs because the import machinery now tries to switch to the main thread in

Making this change in
And now it doesn't seem to reproduce.

Since the problematic code was removed in 3.13, I suspect the 3.13/HEAD repro case seen in #123134 (comment) occurs only when the GIL is disabled, and is either some GIL-disabled-only codepath triggering the same allocator mismatch in other threads, or some other GIL-disabled bug with the same or similar symptom. I tried adding

This is what my repro-attempt for 3.13 looks like now:

```cmake
cmake_minimum_required(VERSION 3.30)
project(463-interpreters LANGUAGES C CXX)
set(Python_FIND_ABI "ANY" "ANY" "ANY" "ON")
find_package(Python 3.13 REQUIRED COMPONENTS Development.Embed)
set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded$<$<CONFIG:Debug>:Debug>")
add_executable(${PROJECT_NAME})
target_sources(${PROJECT_NAME} PRIVATE main.cxx)
target_compile_features(${PROJECT_NAME} PRIVATE cxx_std_23)
target_precompile_headers(${PROJECT_NAME} PRIVATE <Python.h>)
target_link_libraries(${PROJECT_NAME} PRIVATE Python::Python)
# find_package ought to do this when free-threaded support is requested.
target_compile_definitions(${PROJECT_NAME} PRIVATE Py_GIL_DISABLED=1)
```

```cpp
#include <cstdlib>
#include <format>
#include <print>
#include <thread>
#include <vector>

#include <Python.h>

// https://developercommunity.visualstudio.com/t/Please-implement-P0330R8-Literal-Suffix/10410860#T-N10539586
static constexpr size_t operator""_uz(unsigned long long n)
{
    return size_t(n);
}

namespace
{
    static thread_local inline PyThreadState* state = nullptr;
    static inline constexpr auto MAX_STATES = 463;
    static inline constexpr auto config = PyInterpreterConfig {
        .use_main_obmalloc = 0,
        .allow_fork = 0,
        .allow_exec = 0,
        .allow_threads = 0,
        .allow_daemon_threads = 0,
        .check_multi_interp_extensions = 1,
        .gil = PyInterpreterConfig_OWN_GIL,
    };
} /* nameless namespace */

void execute()
{
    std::vector<std::jthread> tasks {};
    tasks.reserve(MAX_STATES);
    for (auto count = 0_uz; count < tasks.capacity(); count++)
    {
        std::println("Generating thread state {}", count);
        tasks.emplace_back(
            [count]
            {
                if (auto status = Py_NewInterpreterFromConfig(&state, &config); PyStatus_IsError(status))
                {
                    std::println("Failed to initialize thread state {}", count);
                    return;
                }
                auto text = std::format(R"(print("Hello, world! From Thread {}"))", count);
                auto globals = PyDict_New();
                auto code = Py_CompileString(text.data(), __FILE__, Py_eval_input);
                auto result = PyEval_EvalCode(code, globals, globals);
                Py_DecRef(result);
                Py_DecRef(code);
                Py_DecRef(globals);
                Py_EndInterpreter(state);
                state = nullptr;
            });
    }
}

int main()
{
    PyConfig config {};
    PyConfig_InitIsolatedConfig(&config);
    if (auto status = Py_InitializeFromConfig(&config); PyStatus_IsError(status))
    {
        std::println("Failed to initialize with isolated config: {}", status.err_msg);
        return EXIT_FAILURE;
    }
    PyConfig_Clear(&config);
    Py_BEGIN_ALLOW_THREADS;
    execute();
    Py_END_ALLOW_THREADS;
    Py_Finalize();
}
```
Unrelated, but I cannot reproduce the data race in
I may actually just have been too impatient last night with the free-threading test. If I wait, it seems to eventually complete. There's a bit of a bottleneck in importing:

This world-stoppage has the effect of serialising all the

There's a note in

so maybe this will be improved in future, since at least during

In the real world, this would be a thread-startup cost, and new interpreters already have to call

So at this point, I think this issue's original bug is actually "resolved-by-accident in 3.13", pending a new repro case. I'm not sure if it's interesting to fix in 3.12; it only affects debug builds, but it is a potential memory-corruption issue. Since the functions were removed in 3.13 and previously deprecated, simply gutting them in 3.12 and complaining loudly if they are called would be viable, perhaps.

The qsort issue should be reproducible in 3.13, but it's quite possible that it has no actual effect despite being detectable by TSan, beyond wasting time with repeated qsort calls on already-sorted data. It depends on how well or badly qsort handles being run on the same array in multiple threads.
@TBBle Small note btw, per your comment and the
Bug report
Bug description:
Hello. While working on a small joke program, I found a possible memory corruption issue (it could also be a threading issue?) when using the Python C API in a debug-only build to quickly create, execute Python code in, and then destroy 463 sub-interpreters. Before I post the code sample and the debug output: I'm using a somewhat unique build environment for a Windows developer.
When running the code sample I've attached at the bottom of this post, I am unable to get the exact same output each time, though the traceback does fire in the same location (due to the size of the traceback I've not attached it, as it's about 10 MB of text for each thread). Additionally, I sometimes have to run the executable several times to get the error to occur. Lastly, release builds do not exhibit any thread crashes or issues, as the debug assertions never fire or execute.
The error output seems to also halt in some cases, either because of stack failure or some other issue I was unable to determine, which seemed to possibly be outside the scope of the issue presented here. I have the entire error output from one execution where I was able to save the output.
The output cut off after this, as the entire program crashed, taking my terminal with it 😅
You'll find the MRE code below. I've also added a minimal version of the CMakeLists.txt file I used so anyone can recreate the build with the code below (any warnings or additional settings I have do not affect whether the error occurs or not). The code appears to break inside of `_PyObject_DebugDumpAddress`, based on what debugging I was able to do with WinDbg.

Important: `std::jthread` calls `.join()` on destruction, so all threads auto-join once the `std::vector` goes out of scope.

Additionally, this code exhibits the same behavior regardless of whether the state pointer is a `thread_local` or declared within the lambda passed to `std::thread`.
main.cxx
CMakeLists.txt
Command to build + run
CPython versions tested on:
3.12
Operating systems tested on:
Windows