
JIT: improve AArch64 code generation #119726

Open · 4 of 5 tasks
diegorusso opened this issue May 29, 2024 · 5 comments
Labels: 3.14 (new features, bugs and security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), topic-JIT, type-feature (A feature request or enhancement)

Comments

diegorusso (Contributor) commented May 29, 2024

Feature or enhancement

Proposal:

This is a follow-up to #115802, focused specifically on improving the AArch64 code that the JIT generates.
It was discussed with @brandtbucher during PyCon 2024.

There are several incremental improvements we could make when generating AArch64 code:

  • Remove the duplicated trampoline sections (mov/movk sequences) at the end of every micro-op's assembly code:
    // 140: d2800008      mov     x8, #0x0
    // 0000000000000140:  R_AARCH64_MOVW_UABS_G0_NC    PyObject_Free
    // 144: f2a00008      movk    x8, #0x0, lsl #16
    // 0000000000000144:  R_AARCH64_MOVW_UABS_G1_NC    PyObject_Free
    // 148: f2c00008      movk    x8, #0x0, lsl #32
    // 0000000000000148:  R_AARCH64_MOVW_UABS_G2_NC    PyObject_Free
    // 14c: f2e00008      movk    x8, #0x0, lsl #48
    // 000000000000014c:  R_AARCH64_MOVW_UABS_G3       PyObject_Free
    // 150: d61f0100      br      x8
    // 154: 00 00 00 00
    // 158: d2800008      mov     x8, #0x0
    // 0000000000000158:  R_AARCH64_MOVW_UABS_G0_NC    PyObject_Free
    // 15c: f2a00008      movk    x8, #0x0, lsl #16
    // 000000000000015c:  R_AARCH64_MOVW_UABS_G1_NC    PyObject_Free
    // 160: f2c00008      movk    x8, #0x0, lsl #32
    // 0000000000000160:  R_AARCH64_MOVW_UABS_G2_NC    PyObject_Free
    // 164: f2e00008      movk    x8, #0x0, lsl #48
    // 0000000000000164:  R_AARCH64_MOVW_UABS_G3       PyObject_Free
    // 168: d61f0100      br      x8
  • Implement the trampoline with an LDR of a PC-relative literal (instead of a movk sequence). It saves 8 bytes in code size (see the size comparison after this list).
  • Move the trampolines from the "code" section of a micro-op to the "data" section, so it's out-of-line.
  • Emit all of the trampolines at the end of every trace, so that each opcode doesn't need its own copy of the trampolines it uses. Also write a function to generate the trampoline.
  • Once we have a slab allocator from JIT: improve memory allocation #119730, a PR to use one set of trampolines per slab rather than per trace.
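For a rough sense of where the 8 bytes come from, here is a side-by-side comparison in the style of the dump above (the padding and exact layout are inferred from that dump, not measured separately):

// movk form: 5 instructions + 4 bytes of padding = 24 bytes, 4 relocations
//   mov  x8, #0x0             R_AARCH64_MOVW_UABS_G0_NC  <symbol>
//   movk x8, #0x0, lsl #16    R_AARCH64_MOVW_UABS_G1_NC  <symbol>
//   movk x8, #0x0, lsl #32    R_AARCH64_MOVW_UABS_G2_NC  <symbol>
//   movk x8, #0x0, lsl #48    R_AARCH64_MOVW_UABS_G3     <symbol>
//   br   x8
//
// ldr-literal form: 2 instructions + 8-byte literal = 16 bytes, 1 relocation
//   ldr  x8, 8                load the 64-bit literal 8 bytes ahead
//   br   x8
//   .quad <symbol>            R_AARCH64_ABS64            <symbol>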

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

This has been discussed broadly at PyCon 2024 in person.

Linked PRs

diegorusso added the type-feature (A feature request or enhancement) label May 29, 2024
mdboom added the performance (Performance or resource usage) label May 29, 2024
brandtbucher added the interpreter-core (Objects, Python, Grammar, and Parser dirs) and 3.14 (new features, bugs and security fixes) labels May 29, 2024
brandtbucher (Member) commented

Thanks for organizing our thoughts on this. Okay if I assign you, since you expressed interest in working on it?

Implement trampoline with LDR of a PC relative literal (instead of movk). It saves 8 bytes in code size.

Interesting! Mind elaborating on this a bit more? I get that it saves memory, but I'm curious if it's expected to be faster too.

Generate trampoline at the end of the trace instead of at the end of every micro op and write a function to generate the trampoline.

I'd break this up into a couple of phases:

  • A PR to move the trampolines from the "code" section of a micro-op to the "data" section, so it's out-of-line.
  • A PR to emit all of the trampolines at the end of every trace, so that each opcode doesn't need its own copy of the trampolines it uses (this could waste some memory initially, but is a nice intermediate step).
  • Once we have a slab allocator from JIT: improve memory allocation #119730, a PR to use one set of trampolines per slab rather than per trace.

Also worth mentioning: we'll want to move to short jumps with trampolines on all platforms, not just AArch64 (AArch64 just forces our hand right now, since it only lets us use short jumps). So this work should benefit other platforms too, which is nice.

diegorusso (Contributor, Author) commented

Interesting! Mind elaborating on this a bit more? I get that it saves memory, but I'm curious if it's expected to be faster too.

I've updated the original comment to say that it saves 8 bytes. As for speed, we'd need to measure it somehow, but I'd expect it to be about the same. The other saving is that we do only one relocation instead of four.

The code will be something like this:

ldr x8, [PC+8]  ; load the 64-bit literal stored 8 bytes ahead
br  x8          ; jump to the loaded address
&_Py_Dealloc    ; the literal: patched with a single ABS64 relocation

So this work should also benefit other platforms too, which is nice.

Of course :)

diegorusso added a commit to diegorusso/cpython that referenced this issue Jun 7, 2024
When emitting AArch64 trampolines at the end of every data stencil,
reuse existing ones for the same symbol.
Fix the disassembly to reflect the "bl" instruction without the
relocation.
diegorusso added a commit to diegorusso/cpython that referenced this issue Jun 25, 2024
Replace AArch64 trampolines with LDR of a PC relative literal.
It saves 8 bytes in code size per trampoline and decreases the number
of patch functions from 4 to 1 per stencil.
It decreases the size of the generated stencil header file by 17%.
diegorusso added a commit to diegorusso/cpython that referenced this issue Jul 2, 2024
Emit AArch64 trampolines in the data section (instead of the code) of
the stencil. In many cases this allows the branch to the next micro-op
at the end of the stencil to be replaced with a fall-through NOP.
Akasurde pushed a commit to Akasurde/cpython that referenced this issue Jul 3, 2024
noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
diegorusso (Contributor, Author) commented Aug 30, 2024

Here is the plan to implement the task "Emit all of the trampolines at the end of every trace, so that each opcode doesn't need its own copy of the trampolines it uses. Also write a function to generate the trampoline."

Instead of generating trampolines in the data section of the stencil, generate a call to a function that generates them at runtime (like patch_aarch64_trampoline).
The signature of the function would be something like:

void patch_trampoline(unsigned char *location, uint64_t value);

where:

  • location is the pointer to the code to patch
  • value is the target address the trampoline will branch to (i.e. the symbol's address).

In jit.c, implement this function. It will do mainly two things (a sketch follows the listing below):

  • If the trampoline for that symbol has not been emitted yet, emit it (write it into memory) and record it in a table. If a trampoline for that symbol has already been emitted (it is present in the table), skip this step.
  • Patch the uop with the correct address of the trampoline.

The function will emit the code of the trampoline:

 f"{base + 4 * 0:x}: 58000048      ldr     x8, 8",
 f"{base + 4 * 1:x}: d61f0100      br      x8",
 f"{base + 4 * 2:x}: 00000000",
 f"{base + 4 * 2:016x}:  R_AARCH64_ABS64    {hole.symbol}",
 f"{base + 4 * 3:x}: 00000000",

Now that the trampolines are no longer in the stencils, they will be generated and patched at runtime.
We do, however, need space in memory to store them, and we don't know in advance how many we will need to emit.

As a first implementation, we could assume the worst case and allocate enough memory to store all the possible trampolines. Currently we emit 101 distinct trampolines across all the stencils, so we could allocate memory for 150 trampolines. This leaves some room for the number of trampolines emitted by the JIT to grow without overflowing the allocation.
The total size allocated by _PyJIT_Compile would be something like this (proof of concept):

#define TRAMPOLINES_SIZE (4 * 4 * 150)  // 150 trampolines of four 4-byte slots each

...
...

size_t total_size = code_size + data_size + TRAMPOLINES_SIZE + padding;

To improve on the allocation above, we could even emit exactly the right number of trampolines. We can detect what's needed when building jit_stencils.h and give the stencils metadata about which trampolines they need (sort of like how we record their size now).
Then, in the first loop in _PyJIT_Compile that computes the size of the JIT code, we can collect the needed trampolines at the same time, as sketched below.
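A sketch of that counting pass, assuming the stencil generator emits hypothetical per-stencil metadata (trampoline_ids, ntrampoline_ids) and a generated SYMBOL_COUNT constant; none of these names are real CPython API:

#include <stdbool.h>
#include <stddef.h>

#define SYMBOL_COUNT 101  // distinct symbols, counted by the stencil generator

typedef struct {
    size_t ntrampoline_ids;
    const int *trampoline_ids;  // symbol IDs this stencil branches to
} StencilTrampolineInfo;

// Count the distinct trampolines needed by a trace of `length` uops.
static size_t
count_trampolines(const StencilTrampolineInfo *info_by_opcode,
                  const int *opcodes, size_t length)
{
    bool seen[SYMBOL_COUNT] = {0};
    size_t ntrampolines = 0;
    for (size_t i = 0; i < length; i++) {
        const StencilTrampolineInfo *info = &info_by_opcode[opcodes[i]];
        for (size_t j = 0; j < info->ntrampoline_ids; j++) {
            int sym = info->trampoline_ids[j];
            if (!seen[sym]) {
                seen[sym] = true;
                ntrampolines++;
            }
        }
    }
    return ntrampolines;
}

_PyJIT_Compile could then size the allocation as code_size + data_size + ntrampolines * 16 + padding instead of reserving the fixed worst case.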

diegorusso (Contributor, Author) commented
For completeness, a comment from Brandt:

"Btw, I don’t think the trampoline emit func needs to take a string. We can probably just generate an enum (or #define) with integer values for each symbol, and use those. That way we don’t need to parse strings, and we could use bit-flags later when we want to only emit the needed trampolines.
Also, we can count the unique symbols when generating the templates, and use that to define the max extra space needed (instead of just hardcoding 150).
In general I think we can really rely on the stencil generator here to make things easier.
And more precise."
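A sketch of that idea (the enum values and names are invented for illustration; the stencil generator would emit the real list):

// Hypothetical generated header: one ID per external symbol, plus a count.
enum {
    TRAMPOLINE_PyObject_Free,
    TRAMPOLINE__Py_Dealloc,
    /* ... one entry per symbol, emitted by the stencil generator ... */
    TRAMPOLINE_COUNT,
};

// The emit function can then take an integer ID instead of a string, and
// per-stencil bit-flags over these IDs can record which trampolines each
// stencil needs.
void patch_trampoline(unsigned char *location, int symbol_id);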

diegorusso added a commit to diegorusso/cpython that referenced this issue Sep 9, 2024
AArch64 trampolines are now generated at runtime at
the end of every trace.
diegorusso (Contributor, Author) commented

The PR to generate trampolines at runtime has been created: #123872

The memory allocation approach is the optimal one: we count how many trampolines each trace needs and allocate exactly the right amount of memory for them.

diegorusso added commits with the same message ("AArch64 trampolines are now generated at runtime at the end of every trace") to diegorusso/cpython that referenced this issue on Sep 10, Sep 19, Sep 26, Sep 30, and Oct 2, 2024