
JIT: improve AArch64 code generation #119726

Open · 4 of 5 tasks
diegorusso opened this issue May 29, 2024 · 5 comments
Labels: 3.14 (new features, bugs and security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), topic-JIT, type-feature (A feature request or enhancement)

Comments

diegorusso (Contributor) commented May 29, 2024

Feature or enhancement

Proposal:

This is a follow-up to #115802, focused specifically on improving the AArch64 code that the JIT generates.
It was discussed with @brandtbucher during PyCon 2024.

There are several incremental improvements we could make when generating AArch64 code:

  • Remove the duplicated trampoline sections (mov/movk sequences) at the end of every micro-op's assembly code:
    // 140: d2800008      mov     x8, #0x0
    // 0000000000000140:  R_AARCH64_MOVW_UABS_G0_NC    PyObject_Free
    // 144: f2a00008      movk    x8, #0x0, lsl #16
    // 0000000000000144:  R_AARCH64_MOVW_UABS_G1_NC    PyObject_Free
    // 148: f2c00008      movk    x8, #0x0, lsl #32
    // 0000000000000148:  R_AARCH64_MOVW_UABS_G2_NC    PyObject_Free
    // 14c: f2e00008      movk    x8, #0x0, lsl #48
    // 000000000000014c:  R_AARCH64_MOVW_UABS_G3       PyObject_Free
    // 150: d61f0100      br      x8
    // 154: 00 00 00 00
    // 158: d2800008      mov     x8, #0x0
    // 0000000000000158:  R_AARCH64_MOVW_UABS_G0_NC    PyObject_Free
    // 15c: f2a00008      movk    x8, #0x0, lsl #16
    // 000000000000015c:  R_AARCH64_MOVW_UABS_G1_NC    PyObject_Free
    // 160: f2c00008      movk    x8, #0x0, lsl #32
    // 0000000000000160:  R_AARCH64_MOVW_UABS_G2_NC    PyObject_Free
    // 164: f2e00008      movk    x8, #0x0, lsl #48
    // 0000000000000164:  R_AARCH64_MOVW_UABS_G3       PyObject_Free
    // 168: d61f0100      br      x8
  • Implement the trampoline with an LDR of a PC-relative literal (instead of a movk sequence). It saves 8 bytes in code size (see the size comparison after this list).
  • Move the trampolines from the "code" section of a micro-op to the "data" section, so it's out-of-line.
  • Emit all of the trampolines at the end of every trace, so that each opcode doesn't need its own copy of the trampolines it uses. Also write a function to generate the trampoline.
  • Once we have a slab allocator from JIT: improve memory allocation #119730, a PR to use one set of trampolines per slab rather than per trace.
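For a rough sense of where the 8 bytes come from, here is a side-by-side comparison in the style of the dump above (the padding and exact layout are inferred from that dump, not measured separately):

// movk form: 5 instructions + 4 bytes of padding = 24 bytes, 4 relocations
//   mov  x8, #0x0             R_AARCH64_MOVW_UABS_G0_NC  <symbol>
//   movk x8, #0x0, lsl #16    R_AARCH64_MOVW_UABS_G1_NC  <symbol>
//   movk x8, #0x0, lsl #32    R_AARCH64_MOVW_UABS_G2_NC  <symbol>
//   movk x8, #0x0, lsl #48    R_AARCH64_MOVW_UABS_G3     <symbol>
//   br   x8
//
// ldr-literal form: 2 instructions + 8-byte literal = 16 bytes, 1 relocation
//   ldr  x8, 8                load the 64-bit literal 8 bytes ahead
//   br   x8
//   .quad <symbol>            R_AARCH64_ABS64            <symbol>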

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

This has been discussed broadly at PyCon 2024 in person.

Linked PRs

diegorusso added the type-feature (A feature request or enhancement) label May 29, 2024
mdboom added the performance (Performance or resource usage) label May 29, 2024
brandtbucher added the interpreter-core (Objects, Python, Grammar, and Parser dirs) and 3.14 (new features, bugs and security fixes) labels May 29, 2024
brandtbucher (Member) commented

Thanks for organizing our thoughts on this. Okay if I assign you, since you expressed interest in working on it?

Implement trampoline with LDR of a PC relative literal (instead of movk). It saves 8 bytes in code size.

Interesting! Mind elaborating on this a bit more? I get that it saves memory, but I'm curious if it's expected to be faster too.

Generate trampoline at the end of the trace instead of at the end of every micro op and write a function to generate the trampoline.

I'd break this up into a couple of phases:

  • A PR to move the trampolines from the "code" section of a micro-op to the "data" section, so it's out-of-line.
  • A PR to emit all of the trampolines at the end of every trace, so that each opcode doesn't need its own copy of the trampolines it uses (this could waste some memory initially, but is a nice intermediate step).
  • Once we have a slab allocator from JIT: improve memory allocation #119730, a PR to use one set of trampolines per slab rather than per trace.

Also worth mentioning: we'll want to move to short jumps with trampolines on all platforms, not just AArch64 (AArch64 just forces our hand right now, since it only lets us use short jumps). So this work should benefit other platforms too, which is nice.

diegorusso (Contributor, Author) commented

Interesting! Mind elaborating on this a bit more? I get that it saves memory, but I'm curious if it's expected to be faster too.

I've updated the original comment to say that it saves 8 bytes. As for speed, we'd need to measure it somehow, but I'd expect it to be about the same. The other saving is that we do only one relocation instead of four.

The code will be something like this:

ldr x8, [PC+8]  ; load the 64-bit literal stored 8 bytes ahead
br  x8          ; jump to the loaded address
&_Py_Dealloc    ; the literal: patched with a single ABS64 relocation

So this work should also benefit other platforms too, which is nice.

Of course :)

diegorusso added a commit to diegorusso/cpython that referenced this issue Jun 7, 2024
When emitting AArch64 trampolines at the end of every data stencil,
reuse existing ones for the same symbol.
Fix the disassembly to reflect the "bl" instruction without the
relocation.
diegorusso added a commit to diegorusso/cpython that referenced this issue Jun 25, 2024
Replace AArch64 trampolines with LDR of a PC relative literal.
It saves 8 bytes in code size per trampoline and decreases the number
of patch functions from 4 to 1 per stencil.
It decreases the size of the generated stencil header file by 17%.
diegorusso added a commit to diegorusso/cpython that referenced this issue Jul 2, 2024
Emit AArch64 trampolines in the data section (instead of the code) of
the stencil. In many cases this allows the branch to the next micro-op
at the end of the stencil to be replaced with a fall-through NOP.
Akasurde pushed a commit to Akasurde/cpython that referenced this issue Jul 3, 2024
noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
diegorusso (Contributor, Author) commented Aug 30, 2024

Here is the plan to implement the task "Emit all of the trampolines at the end of every trace, so that each opcode doesn't need its own copy of the trampolines it uses. Also write a function to generate the trampoline."

Instead of generating trampolines in the data section of the stencil, generate a call to a function that generates them at runtime (like patch_aarch64_trampoline).
The signature of the function would be something like:

void patch_trampoline(unsigned char *location, uint64_t value);

where:

  • location is the pointer to the code to patch
  • value is the target address the trampoline will branch to (i.e. the symbol's address).

In jit.c, implement this function. It will do mainly two things (a sketch follows the listing below):

  • If the trampoline for that symbol has not been emitted yet, emit it (write it into memory) and record it in a table. If a trampoline for that symbol has already been emitted (it is present in the table), skip this step.
  • Patch the uop with the correct address of the trampoline.

The function will emit the code of the trampoline:

 f"{base + 4 * 0:x}: 58000048      ldr     x8, 8",
 f"{base + 4 * 1:x}: d61f0100      br      x8",
 f"{base + 4 * 2:x}: 00000000",
 f"{base + 4 * 2:016x}:  R_AARCH64_ABS64    {hole.symbol}",
 f"{base + 4 * 3:x}: 00000000",

Now that the trampolines are no longer in the stencils, they will be generated and patched at runtime.
We do, however, need space in memory to store them, and we don't know in advance how many we will need to emit.

As a first implementation, we could assume the worst case and allocate enough memory to store all the possible trampolines. Currently we emit 101 distinct trampolines across all the stencils, so we could allocate memory for 150 trampolines. This leaves some room for the number of trampolines emitted by the JIT to grow without overflowing the allocation.
The total size allocated by _PyJIT_Compile would be something like this (proof of concept):

#define TRAMPOLINES_SIZE (4 * 4 * 150)  // 150 trampolines of four 4-byte slots each

...
...

size_t total_size = code_size + data_size + TRAMPOLINES_SIZE + padding;

To improve on the allocation above, we could even emit exactly the right number of trampolines. We can detect what's needed when building jit_stencils.h and give the stencils metadata about which trampolines they need (sort of like how we record their size now).
Then, in the first loop in _PyJIT_Compile that computes the size of the JIT code, we can collect the needed trampolines at the same time, as sketched below.
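A sketch of that counting pass, assuming the stencil generator emits hypothetical per-stencil metadata (trampoline_ids, ntrampoline_ids) and a generated SYMBOL_COUNT constant; none of these names are real CPython API:

#include <stdbool.h>
#include <stddef.h>

#define SYMBOL_COUNT 101  // distinct symbols, counted by the stencil generator

typedef struct {
    size_t ntrampoline_ids;
    const int *trampoline_ids;  // symbol IDs this stencil branches to
} StencilTrampolineInfo;

// Count the distinct trampolines needed by a trace of `length` uops.
static size_t
count_trampolines(const StencilTrampolineInfo *info_by_opcode,
                  const int *opcodes, size_t length)
{
    bool seen[SYMBOL_COUNT] = {0};
    size_t ntrampolines = 0;
    for (size_t i = 0; i < length; i++) {
        const StencilTrampolineInfo *info = &info_by_opcode[opcodes[i]];
        for (size_t j = 0; j < info->ntrampoline_ids; j++) {
            int sym = info->trampoline_ids[j];
            if (!seen[sym]) {
                seen[sym] = true;
                ntrampolines++;
            }
        }
    }
    return ntrampolines;
}

_PyJIT_Compile could then size the allocation as code_size + data_size + ntrampolines * 16 + padding instead of reserving the fixed worst case.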

diegorusso (Contributor, Author) commented
For completeness, a comment from Brandt:

"Btw, I don’t think the trampoline emit func needs to take a string. We can probably just generate an enum (or #define) with integer values for each symbol, and use those. That way we don’t need to parse strings, and we could use bit-flags later when we want to only emit the needed trampolines.
Also, we can count the unique symbols when generating the templates, and use that to define the max extra space needed (instead of just hardcoding 150).
In general I think we can really rely on the stencil generator here to make things easier.
And more precise."
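A sketch of that idea (the enum values and names are invented for illustration; the stencil generator would emit the real list):

// Hypothetical generated header: one ID per external symbol, plus a count.
enum {
    TRAMPOLINE_PyObject_Free,
    TRAMPOLINE__Py_Dealloc,
    /* ... one entry per symbol, emitted by the stencil generator ... */
    TRAMPOLINE_COUNT,
};

// The emit function can then take an integer ID instead of a string, and
// per-stencil bit-flags over these IDs can record which trampolines each
// stencil needs.
void patch_trampoline(unsigned char *location, int symbol_id);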

diegorusso added a commit to diegorusso/cpython that referenced this issue Sep 9, 2024
AArch64 trampolines are now generated at runtime at
the end of every trace.
diegorusso (Contributor, Author) commented

The PR to generate trampolines at runtime has been created: #123872

The memory allocation approach is the optimal one: we count how many trampolines each trace needs and allocate exactly the right amount of memory for them.

diegorusso added commits with the same message ("AArch64 trampolines are now generated at runtime at the end of every trace") to diegorusso/cpython that referenced this issue on Sep 10, Sep 19, Sep 26, Sep 30, and Oct 2, 2024