pymalloc is not aware of Memory Tagging Extension (MTE) and crashes #87759
When the Memory Tagging Extension (MTE) is enabled on aarch64, pymalloc makes programs crash. |
Is there any runtime option (an environment variable or something else) to use glibc malloc instead of pymalloc? Or is the choice made at compile time immutable? |
pymalloc is a compile-time option. The configure flag sets or unsets WITH_PYMALLOC. The define is then used by https://github.com/python/cpython/blob/master/Objects/obmalloc.c to change the internal allocator. The flag may also affect Python's ABI and binary wheels; it might break compatibility with manylinux wheels for aarch64. Victor, please take a look. |
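As an aside (a sketch, not from the thread): whether a given interpreter was built with pymalloc can be checked at runtime, since configure records WITH_PYMALLOC in the build-time variables that `sysconfig` exposes:

```python
import sysconfig

# WITH_PYMALLOC is fixed when the interpreter is configured; sysconfig
# reports the recorded value: 1 when pymalloc was compiled in, 0 when
# built with --without-pymalloc (None on builds that don't record it).
flag = sysconfig.get_config_var("WITH_PYMALLOC")
print(flag)
```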
Would you mind elaborating on what exactly crashes? Can you please test whether #14474 fixes your issue? |
A workaround is to disable pymalloc at runtime using the PYTHONMALLOC=malloc environment variable. But it makes Python 10-20% slower. |
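To illustrate the workaround (a sketch, not part of the original comment): PYTHONMALLOC is consulted once at interpreter startup, so the override has to be in the environment of the process being launched:

```python
import os
import subprocess
import sys

# Start a child interpreter with pymalloc disabled; every object
# allocation in the child falls through to the system malloc.
env = dict(os.environ, PYTHONMALLOC="malloc")
result = subprocess.run(
    [sys.executable, "-c", "print('hello from glibc malloc')"],
    env=env, capture_output=True, text=True,
)
print(result.stdout.strip())  # → hello from glibc malloc
```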
With PYTHONMALLOC=malloc, gdb is not crashing anymore. Thanks for the workaround. pymalloc is not aware of MTE, so a segfault occurs on any access to tagged memory, since the tag check fails. |
Can you please test whether #14474 fixes your issue? |
I've merged PR 14474, so you can just test with an up-to-date "main" branch and see if that fixes the problem. I would expect it to, since with the radix tree arena tracking, none of the unsanitary memory behaviour remains. |
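As a toy sketch of why the radix-tree approach sidesteps MTE (my illustration, not CPython's actual code): ownership of an address is decided purely by integer arithmetic on its high bits, so the pointer handed to `address_in_range()` is never dereferenced and no tag check can fire:

```python
ARENA_BITS = 20  # CPython's current arena granularity (2**20 bytes)

class ArenaMap:
    """Toy arena-ownership map keyed by the high bits of an address."""

    def __init__(self):
        self._tree = {}  # stand-in for the nested radix-tree nodes

    def mark_arena(self, base: int, owned: bool) -> None:
        self._tree[base >> ARENA_BITS] = owned

    def address_in_range(self, addr: int) -> bool:
        # Pure integer lookup: the address is never dereferenced,
        # so no MTE tag-check fault can occur.
        return self._tree.get(addr >> ARENA_BITS, False)

m = ArenaMap()
m.mark_arena(0x10_0000, True)
print(m.address_in_range(0x10_1234))  # → True
print(m.address_in_range(0x20_0000))  # → False
```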
I'm skeptical ;-) If MTE is actually being used, system software assigns "random" values to 4 of the higher-order bits. When obmalloc punts to the system malloc, presumably those bits will be randomized in the addresses returned by malloc. Then it's just not possible that obmalloc's asserts can always succeed: we're insisting there that all the high-order bits are exactly the same. But, of course, that failure would only be seen in a debug build. |
Oh, interesting. Two ideas about handling that. First, we could change our assertion check to be different on ARM platforms that we know have a certain size of physical address space - probably just turn off that high-bits check. Second, we could change the radix tree to not assume high address bits are unused. That's trickier to do without performance or memory usage degradations.

I have a work-in-progress patch that adds a cache on top of the radix tree lookup. It looks like that cache can be made to have a pretty high hit rate: based on a small amount of testing, the radix tree lookup for address_in_range() only happens about 1% of the time. If that approach works, we could add another layer to the tree and handle the full 64-bit address space. Based on my WIP testing, my benchmark was showing about equal performance with the cache and without - so, no benefit to offset the increase in code complexity. Handling the MTE high-bits tricks might be enough to justify the cache addition. |
Can't really know without a system to try it on, but my best guess is that these asserts are the only thing that will fail with tagging enabled. The obvious "fix" is indeed just to skip them on a platform with tagging enabled. They're meant as a sanity check that only 48 bits are really used for addresses. Which remains true even when tagging is enabled - the tag bits are part of the _pointer_ on ARM, but not part of the address.

Taking tagging seriously instead would be a significant project, relying on platform-specific instructions. For a start, obmalloc would have to generate a random 4-bit integer for each object it returns, plug that into 4 specific "high order" bits of the pointer it returns, and tell the OS to associate those 4 bits with each 16-byte chunk of the object's space. mmap()-like calls would also need to be changed, to tell the OS to enable tag checking on the memory slice returned.

While caching may or may not speed things up, I'm not seeing how it _could_ help move to 64-bit addresses. As is, the tree needs 8 bytes of bottom-node space for each arena in use, and that's independent of how many address bits there are (it only depends on the arena granularity). I think that could be cut by a factor of 4 by keeping track of arena pool (instead of byte) boundaries in the bottom nodes, meaning that, with the _current_ settings, we'd only need to keep track of the 64 = 2^6 possible pool addresses in an arena, instead of the 2^20 possible byte addresses. 6 bits fits fine in a signed 8-bit int (whereas we need a 32-bit int now to hold the 2^20 possible byte addresses in an arena).

So the clearest way to ease the space burden of keeping track of truly expansive address spaces is to boost arena size. And, if the tree bottom changed to keep track of pool (instead of byte) addresses, possibly boost pool size too. |
BTW, your cache WIP https://github.com/python/cpython/pull/25130/files partly moves to tracking pool (instead of byte) addresses, but any such attempt faces a subtlety: it's not necessarily the case that a pool is entirely "owned" by obmalloc or by the system. It can be split.

To be concrete, picture a decimal machine with arenas at 10,000-byte boundaries and pools at 100-byte boundaries, with the system malloc returning 20-byte-aligned addresses. If obmalloc gets an arena starting at address 20,020, the pool range(20000, 20100) has its first 20 bytes owned by the system, but its last 80 bytes owned by obmalloc. Pass 20050 to the PR's address_in_range, and it will say it's owned by the system (because its _pool_ address, 20000, is). But it's actually owned by obmalloc. I'm not sure it matters, but it's easy to get a headache trying to out-think it ;-) In that case, obmalloc simply ignores the partial pool at the start, and the first address it can pass out is 20100. So it would never be valid for free() or realloc() to get 20050 as an input.

On the other end, the arena would end at byte 20020 + 10000 - 1 = 30019. This seems worse! If 30040 were passed in, that's a system address, but its _pool_ address is 30000, which obmalloc does control. That reminds me now why the current scheme tracks byte addresses instead ;-)

It appears it would require more tricks to deal correctly with all cases when system-supplied arenas aren't necessarily aligned to pool addresses (which was never a consideration in the old scheme, since a pool was at largest a system page, and all mmap()-like functions (used to get an arena) in real life return an address that is at worst page-aligned). Or we'd need to find ways to force mmap() (across different systems' spellings) to return a pool-aligned address for arenas to begin with. |
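The decimal thought experiment above can be checked in a few lines (a sketch; the constants mirror the comment's hypothetical machine, not real obmalloc settings):

```python
# Toy decimal machine from the comment: 10,000-byte arenas, 100-byte
# pools, and a system malloc returning a 20-byte-aligned arena base.
ARENA_SIZE = 10_000
POOL_SIZE = 100
arena_base = 20_020

def pool_of(addr: int) -> int:
    """Round an address down to the start of its pool."""
    return addr - addr % POOL_SIZE

# First address obmalloc can hand out: start of the first *full* pool.
first_full_pool = pool_of(arena_base + POOL_SIZE - 1)
arena_end = arena_base + ARENA_SIZE - 1

print(first_full_pool)  # 20100
print(arena_end)        # 30019
# 20050 lies inside the arena, but its pool start (20000) does not:
print(pool_of(20_050))  # 20000
# 30040 lies past the arena end, but its pool start (30000) is inside:
print(pool_of(30_040))  # 30000
```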
I think it's time to change what address_in_range() tries to answer. It currently gives a precise answer to "is this byte address in a region obmalloc owns?". But that's stronger than what it needs to do its job: the real question is "is this an address that obmalloc could return from its malloc() or realloc() functions?". Because it's only used by free() and realloc(), and they only care where the address they're _passed_ came from. The difference arises when an arena is not pool-aligned: oddball chunks outside of full pools, at both ends of the arena, are never returned by obmalloc then.

Instead, the tree could be changed to record the starting addresses of the _full_ pools obmalloc controls. Those contain the only addresses obmalloc will ever pass out (from malloc() or realloc()). Like so, where p is the arena base address. This hasn't been tested, and may contain typos or logical errors. I'm far more interested in the latter for now ;-)

```python
ideal1 = p & ~ARENA_SIZE_MASK  # round down to ideal
ideal2 = ideal1 + ARENA_SIZE
offset = p - ideal1
b1 = bottom_node(ideal1)
b2 = bottom_node(ideal2)
if not offset:
    # already ideal
    b1.hi = -1
    assert b2.lo == 0
elif offset & POOL_SIZE_MASK == 0:
    # arena is pool-aligned
    b1.hi = b2.lo = offset
else:
    # Not pool-aligned.
    # obmalloc won't pass out an address higher than in
    # the last full pool.
    # Round offset down to next lower pool address.
    offset &= ~POOL_SIZE_MASK
    b2.lo = offset
    # And won't pass out an address lower than in the
    # first full pool.
    offset += POOL_SIZE  # same as rounding original offset up
    # That's almost right for b1.hi, but one exception: if
    # offset is ARENA_SIZE now, the first full pool is at the
    # start of ideal2, and no feasible address is in ideal1.
    assert offset <= ARENA_SIZE
    b1.hi = offset & ARENA_SIZE_MASK
```

Note that, except for the oddball -1, everything stored in a bottom node is a pool address then, and so there's no need to store their all-zero lowest POOL_BITS bits. .lo and .hi can be shrunk to a signed 8-bit int with the current settings (20 arena bits and 14 pool bits). And caching pool addresses wouldn't have any obscure failing end-cases either: address_in_range() could just as well be passed a pool address to begin with. It would only care about pool addresses, not byte addresses. |
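The space arithmetic in that last point is easy to verify (a sketch using the settings quoted in the comment: 20 arena bits, 14 pool bits):

```python
ARENA_BITS = 20  # 2**20-byte arenas (current setting quoted above)
POOL_BITS = 14   # 2**14-byte pools

# Tracking pool starts instead of byte offsets shrinks the per-node
# range from 2**20 byte offsets to 2**6 = 64 pool offsets per arena.
byte_offsets = 1 << ARENA_BITS
pool_offsets = 1 << (ARENA_BITS - POOL_BITS)
print(byte_offsets)  # 1048576: needs a 32-bit int
print(pool_offsets)  # 64: fits in a signed 8-bit int
```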
Hello everyone, this issue piqued my interest and I decided to investigate what's going on. TLDR; Longer version: with Python 3.9.18 you have this behaviour:
The next thing is to see what's going on with gdb. I know the
Looking at the GDB output, we see that all the addresses that are passed to address_in_range() fail. It always fails after six times. The issue is when we go to check the address against the tag. With the radix tree implementation, we never dereference the pointer passed to address_in_range(), or any pointer derived from it, so no MTE violation can occur.

I strongly suggest backporting the radix tree implementation to 3.9, as it will be supported until 2025. MTE is a security feature (https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/enhanced-security-through-mte) on Arm platforms that protects the user from invalid memory accesses. @ambv is the release manager of 3.9, hence tagging him for visibility. |
…n_range(). (pythonGH-14474) This is a backport of the radix tree implementation for pymalloc. It is not platform specific but it solves a segmentation fault on aarch64 platforms when MTE (Memory Tag Extension) is enabled. The github issue is pythongh-87759. Original commit message: The radix tree approach is a relatively simple and memory sanitary alternative to the old (slightly) unsanitary address_in_range(). To disable the radix tree map, set a preprocessor flag as follows: -DWITH_PYMALLOC_RADIX_TREE=0. (cherry picked from commit 85b6b70) Co-authored-by: Tim Peters <tim.peters@gmail.com> Jira: ENTLLT-7122 Change-Id: Ifbc8c1290077b78c85ac30abe0bcbb7c8ea0b959
Hello, after backporting the implementation of the radix tree to the 3.9 branch, I've received some feedback highlighting that this is not a security fix, and not even a bug fix, because it is actually a new feature (thanks @pablogsal and @nascheme).

In the last couple of days I've given some extra thought to the issue/PR and I've concluded that the best course of action is not to merge this (and subsequent fixes) and to leave things as they are. This is because the benefits of this PR are very small compared to the cost of integrating it: it is not just merging this PR but (as @nascheme said in the PR) all subsequent fixes. Even if we did it, a patch release of 3.9 would need to be created and eventually propagated to distros. By the time the distros pick up the update, they will be old LTS releases where glibc is not MTE-ready anyway, so using MTE would be almost impossible. glibc support for MTE was introduced in 2.32. Ubuntu 22.04 has glibc 2.35, but MTE is disabled by default; we need to move to 23.04 with glibc 2.37 to have MTE enabled by default in glibc. In Ubuntu 23.04 the default Python version is 3.11. It is very unlikely that a fairly recent distribution would be used with an older version of Python on it. At that point, if the user really needs MTE, we can simply suggest using a newer version of Python.

If we really want to do anything here, we could just update the 3.9 documentation to acknowledge that Python 3.9 is not compatible with MTE on aarch64 and that the workaround is to use a more recent version of Python. Thoughts? |
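The version argument above reduces to a simple comparison (my sketch, with the 2.32 threshold taken from the comment):

```python
def glibc_supports_mte(version: str) -> bool:
    """glibc gained MTE heap tagging (the glibc.mem.tagging tunable) in 2.32."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (2, 32)

print(glibc_supports_mte("2.31"))  # → False (typical old LTS distros)
print(glibc_supports_mte("2.35"))  # → True  (Ubuntu 22.04; tagging still off by default)
print(glibc_supports_mte("2.37"))  # → True  (Ubuntu 23.04; tagging on by default)
```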
I suggest closing the issue, as it has been decided not to proceed with the backport to 3.9. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.