JIT: improve AArch64 code generation #119726
Comments
Thanks for organizing our thoughts on this. Okay if I assign you, since you expressed interest in working on it?
Interesting! Mind elaborating on this a bit more? I get that it saves memory, but I'm curious if it's expected to be faster too.
I'd break this up into a couple of phases:
Also worth mentioning: we'll want to move to short jumps with trampolines on all platforms, not just AArch64 (AArch64 just sort of forces our hand right now, since it only lets us use short jumps). So this work should benefit other platforms too, which is nice.
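To make the "short jump" constraint concrete, here is a minimal sketch of the range check that decides whether a direct `bl` can reach its target or a trampoline is needed. The only facts relied on are the A64 `BL` encoding (opcode `0x94000000`, signed 26-bit word offset, so roughly +/-128 MiB); the function names are hypothetical and are not taken from the JIT sources.

```python
# A64 B/BL carry a signed 26-bit offset counted in 4-byte words,
# so a direct branch reaches [-2**27, 2**27 - 4] bytes from the branch itself.
BRANCH_RANGE = 1 << 27  # bytes

BL_OPCODE = 0x94000000


def fits_in_short_jump(source: int, target: int) -> bool:
    """Return True if a B/BL at `source` can reach `target` directly."""
    offset = target - source
    return offset % 4 == 0 and -BRANCH_RANGE <= offset < BRANCH_RANGE


def encode_bl(source: int, target: int) -> int:
    """Encode `bl target` for an instruction located at `source`.

    Hypothetical helper: raises if the target is out of range, which is
    the case where a trampoline would have to be used instead.
    """
    if not fits_in_short_jump(source, target):
        raise ValueError("target out of range for a short jump; use a trampoline")
    imm26 = ((target - source) >> 2) & 0x3FFFFFF
    return BL_OPCODE | imm26
```

A trampoline placed close to the code keeps every call site within this range, no matter where the real target lives in the address space.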
I've updated the original comment to say that it saves 8 bytes. As for speed, I think we need to measure it somehow, but I would expect it to be the same. The other saving is that we will do only one relocation instead of four. The code will be something like this:
Of course :) |
When emitting AArch64 trampolines at the end of every data stencil, reuse existing ones for the same symbol. Fix the disassembly to reflect the "bl" instruction without the relocation.
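The reuse described in this step amounts to keying trampolines by symbol name. The following is a minimal sketch of that idea, not the code from the PR; the class name, method names, and the example symbols are all assumptions chosen for illustration.

```python
class TrampolineTable:
    """Hands out one trampoline slot per distinct symbol (hypothetical helper)."""

    def __init__(self) -> None:
        # Maps symbol name -> index of its trampoline, in first-use order.
        self._slots: dict[str, int] = {}

    def slot_for(self, symbol: str) -> int:
        """Return the existing slot for `symbol`, or allocate the next one."""
        return self._slots.setdefault(symbol, len(self._slots))

    def symbols(self) -> list[str]:
        """Symbols in slot order, i.e. the order trampolines would be emitted."""
        return list(self._slots)


# Example: three call sites, two distinct symbols, so only two trampolines.
table = TrampolineTable()
first = table.slot_for("_Py_Dealloc")
second = table.slot_for("PyNumber_Add")
again = table.slot_for("_Py_Dealloc")
```

Here `again` equals `first`, so the second call to the same symbol branches to the trampoline that already exists instead of emitting a new copy.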
Replace AArch64 trampolines with an LDR of a PC-relative literal. This saves 8 bytes of code size per trampoline and decreases the number of patch functions from 4 to 1 per stencil. It also decreases the size of the generated stencil header file by 17%.
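As an illustration of the LDR-literal form, the sketch below builds one such trampoline as raw bytes. The instruction encodings are standard A64 (`ldr x8, #8` is `0x58000048`, `br x8` is `0xD61F0100`); the choice of `x8` as the scratch register and the exact layout (two instructions followed by the 64-bit address) are assumptions here, and `encode_trampoline` is a hypothetical name.

```python
import struct

LDR_X8_LITERAL = 0x58000048  # ldr x8, #8  (load the 64-bit literal 8 bytes ahead)
BR_X8 = 0xD61F0100           # br x8       (jump to the loaded address)
TRAMPOLINE_SIZE = 16         # two 4-byte instructions + one 8-byte literal


def encode_trampoline(target: int) -> bytes:
    """Return the little-endian bytes of a trampoline jumping to `target`.

    Only the trailing 8-byte literal depends on `target`, so a single
    64-bit patch is enough to retarget the trampoline.
    """
    return struct.pack("<IIQ", LDR_X8_LITERAL, BR_X8, target)
```

Because the address is stored as data rather than split across several move-immediate instructions, only one location needs patching, which is consistent with the drop from four patches to one mentioned above.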
Emit AArch64 trampolines in the data section (instead of the code) of the stencil. In many cases this allows the branch to the next micro-op at the end of the stencil to be replaced with a fall-through NOP.
The plan to implement the task "Emit all of the trampolines at the end of every trace, so that each opcode doesn't need its own copy of the trampolines it uses. Also write a function to generate the trampoline.": instead of generating trampolines in the data section of the stencil, generate a call to a function that generates trampolines (like patch_aarch64_trampoline).
where:
In
The function will contain the code of the trampoline
Now that the trampolines are not in the stencils anymore, they will be generated and patched at runtime. As a first implementation, we could assume the worst-case scenario and allocate enough memory to store all the possible trampolines. Currently we emit 101 different trampolines across all the stencils, so we could allocate memory to store 150 trampolines. This allows some room for increasing the number of trampolines emitted by the JIT without incurring in
In order to improve the allocation above, we could even emit exactly the correct number of trampolines. We can detect what's needed when building |
For completeness a comment from Brandt:
|
AArch64 trampolines are now generated at runtime at the end of every trace.
The PR to generate trampolines at runtime has been created: #123872. The memory allocation approach is the optimal one: we count how many trampolines each trace needs and allocate exactly that amount of memory for them.
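The counting step can be sketched as follows. This is not the code from the PR: the function name, the per-op symbol table, the op and symbol names, and the 16-byte trampoline size are all assumptions made for illustration.

```python
TRAMPOLINE_SIZE = 16  # assumed: two instructions plus a 64-bit address


def trampoline_bytes_for_trace(trace, symbols_by_op):
    """Return the memory needed for the trampolines of one trace.

    `trace` is a sequence of op names; `symbols_by_op` maps an op name to
    the set of external symbols its stencil calls (hypothetical data).
    Each distinct symbol needs exactly one trampoline per trace.
    """
    needed = set()
    for op in trace:
        needed |= symbols_by_op.get(op, frozenset())
    return len(needed) * TRAMPOLINE_SIZE


# Illustrative table: two ops share a symbol, one op needs no trampoline.
symbols_by_op = {
    "_BINARY_OP": frozenset({"PyNumber_Add", "_Py_Dealloc"}),
    "_POP_TOP": frozenset({"_Py_Dealloc"}),
    "_LOAD_FAST": frozenset(),
}
trace = ["_LOAD_FAST", "_BINARY_OP", "_POP_TOP", "_BINARY_OP"]
size = trampoline_bytes_for_trace(trace, symbols_by_op)
```

In this example the trace uses two distinct symbols, so `size` is 32 bytes, compared with a fixed worst-case reservation that would be paid by every trace regardless of what it calls.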
Feature or enhancement
Proposal:
This is really a follow-up to #115802, focused more on AArch64 improvements to the code generated by the JIT.
This has been discussed with @brandtbucher during PyCon 2024.
There is a series of incremental improvements that we could implement when generating AArch64 code:
Has this already been discussed elsewhere?
I have already discussed this feature proposal on Discourse
Links to previous discussion of this feature:
This has been discussed broadly at PyCon 2024 in person.
Linked PRs