Description
Feature or enhancement
Proposal:
Issue #116017 already explains the problem with how the JIT allocates memory.
To provide more data points, I debugged this a little further by adding some debugging output to _PyJIT_Compile
and then running pyperformance.
The debugging output records the memory allocated and the padding used to align it to the page size.
The function was called 1,288,249 times. These are the totals for padding and for actual code/data, with a 16K page size (on macOS):
- Total Padding size: 16,490,764,792
- Total Code/Data size: 6,737,241,608
About 71% of the allocated memory is wasted on padding, while only 29% holds code and data. This suggests that the memory needed for these objects is usually much smaller than the page size.
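For reference, the percentages can be reproduced from the totals above with a few lines of Python. This is only the arithmetic, not the instrumentation I added to the JIT; `padding_for` is a hypothetical helper showing how rounding up to a page produces the padding.

```python
PAGE_SIZE = 16 * 1024  # 16K pages, as reported for macOS in this issue


def padding_for(size: int, page_size: int = PAGE_SIZE) -> int:
    """Bytes wasted when `size` is rounded up to a whole number of pages."""
    return -size % page_size


# Totals taken from the pyperformance run described above.
calls = 1_288_249
total_padding = 16_490_764_792
total_code_data = 6_737_241_608
total_allocated = total_padding + total_code_data

padding_share = total_padding / total_allocated
avg_code_data = total_code_data / calls
avg_padding = total_padding / calls

print(f"padding share:              {padding_share:.1%}")
print(f"average code/data per call: {avg_code_data:,.0f} bytes")
print(f"average padding per call:   {avg_padding:,.0f} bytes")
```

On average each call needs about 5,230 bytes of code/data, roughly a third of a 16K page.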
This is a brain dump from @brandtbucher to help out with the implementation:
for 3.14 we'll probably need to look into some sort of slab allocator that will let us share pages between executors. We can allocate by either batching the compiles or stopping the world to flip the permission bits, and then deallocate by maintaining refcounts of each page or something. [...]
One benefit that could come with an arena allocator is the ability to JIT a bunch of guaranteed-in-range trampolines for long jumps to library/C-API calls, rather than needing to create a ton of redundant in-line trampolines inline in the trace (or using global offset table hacks). That should save us memory and speed things up, I think.
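To make the slab-allocator idea above a bit more concrete, here is a toy model in Python of sharing pages between executors with a per-page refcount. It is a sketch under assumptions, not a proposed implementation: `SlabAllocator`, `Page`, and the 16-byte alignment are hypothetical, nothing is actually mapped, and the permission flipping (batching or stop-the-world) is not modelled at all.

```python
from dataclasses import dataclass

PAGE_SIZE = 16 * 1024  # assumption: 16K pages, as in the macOS measurement
ALIGN = 16             # assumption: illustrative alignment for each block


@dataclass
class Page:
    used: int = 0      # bump pointer: bytes handed out so far
    refcount: int = 0  # executors still alive on this page


class SlabAllocator:
    """Toy model: packs small allocations into shared pages."""

    def __init__(self, page_size=PAGE_SIZE):
        self.page_size = page_size
        self.pages = {}      # page id -> Page
        self.next_id = 0
        self.current = None  # id of the page currently being filled

    def alloc(self, size):
        """Return (page_id, offset) for a block of `size` bytes."""
        size = (size + ALIGN - 1) // ALIGN * ALIGN  # round up to ALIGN
        if size > self.page_size:
            raise ValueError("oversized blocks are out of scope for this sketch")
        page = self.pages.get(self.current)
        if page is None or page.used + size > self.page_size:
            # Current page is missing or full: start a new one.
            self.current = self.next_id
            self.next_id += 1
            page = self.pages[self.current] = Page()
        offset = page.used
        page.used += size
        page.refcount += 1
        return self.current, offset

    def free(self, page_id):
        """Drop one executor's reference; release the page when none remain."""
        page = self.pages[page_id]
        page.refcount -= 1
        if page.refcount == 0:
            del self.pages[page_id]
            if self.current == page_id:
                self.current = None


slab = SlabAllocator()
handles = [slab.alloc(5230) for _ in range(4)]  # four average-sized executors
print(handles)
print(len(slab.pages), "pages in use")
```

With blocks of about the average size measured above, three executors share the first page and the fourth starts a second one, instead of each taking a page of its own. A page is only released once every executor on it has been freed, which is the refcount bookkeeping mentioned in the brain dump.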
Has this already been discussed elsewhere?
I have already discussed this feature proposal on Discourse
Links to previous discussion of this feature:
This has been discussed with Brandt via email and in person at PyCon 2024.