
GH-113464: A copy-and-patch JIT compiler #113465

Draft
wants to merge 412 commits into main

Conversation

brandtbucher
Member

@brandtbucher commented Dec 25, 2023

'Twas the night before Christmas, when all through the code
Not a core dev was merging, not even Guido;
The CI was spun on the PRs with care
In hopes that green check-markings soon would be there;
The buildbots were nestled all snug under desks,
Even PPC64 AIX;
Doc-writers, triage team, the Council of Steering,
Had just stashed every change and stopped engineering,

When in the "PRs" tab arose such a clatter,
They opened GitHub to see what was the matter.
Away to CPython they flew like a flash,
Towards sounds of PROT_EXEC and __builtin___clear_cache.
First LLVM was downloaded, unzipped
Then the Actions were running a strange new build script,
When something appeared, they were stopped in their tracks,
jit_stencils.h, generated from hacks,
With their spines all a-shiver, they muttered "Oh, shit...",
They knew in a moment it must be a JIT.

More rapid than interpretation it came
And it copied-and-patched every stencil by name:
"Now, _LOAD_FAST! Now, _STORE_FAST! _BINARY_OP_ADD_INT!
On, _GUARD_DORV_VALUES_INST_ATTR_FROM_DICT!
To the top of the loop! And down into the call!
Now cache away! Cache away! Cache away all!"
But why now? And how so? They needed a hint,
Thankfully, Brandt gave a great talk at the sprint;
So over to YouTube the reviewers flew,
They read the white paper, and the blog post too.

And then, after watching, they saw its appeal
Not writing the code themselves seemed so unreal.
And the platform support was almost too easy,
ARM64 Macs to 32-bit PCs.
There was some runtime C, not too much, just enough,
Basically a loader, relocating stuff;
It ran every test, one by one passed them all,
With not one runtime dependency to install.
Mostly build-time Python! With strict static typing!
For maintenance ease, and also nerd-sniping!

Though dispatch was faster, the JIT wasn't wise,
And the traces it used still should be optimized;
The code it was JIT'ing still needed some thinning,
With code models small, and some register pinning;
Or new calling conventions, shared stubs for paths slow,
Since this JIT was brand new, there was fruit hanging low.
It was awkwardly large, parsed straight out of the ELFs,
And they laughed when they saw it, in spite of themselves;

A configure flag, and no merging this year,
Soon gave them to know they had nothing to fear;
It wasn't much faster, at least it could work,
They knew that'd come later; no one was a jerk,
But they were still smart, and determined, and skilled,
They opened a shell, and configured the build;
--enable-experimental-jit, then made it,
And away the JIT flew as their "+1"s okay'ed it.
But they heard it exclaim, as it traced out of sight,
"Happy JIT-mas to all, and to all a good night!"

Tools/jit/README.md (outdated review thread, resolved)
@@ -0,0 +1,171 @@
# pylint: disable = missing-class-docstring

"""Schema for the JSON produced by llvm-readobj --elf-output-style=JSON."""

Hmm, it seems like depending on llvm-readobj/JSON here is bad form. Why not use libelf directly instead of writing this in Python?

Member Author

We also need to parse COFF (on Windows) and Mach-O (on macOS). llvm-readobj can handle all three, and the JSON output mostly works for all three (despite the command-line switch's name). It would probably make sense to push LLVM to make COFF/Mach-O JSON serialization officially supported, so this is less fragile.

Plus, keeping things in Python is a Good Thing around here. :)

int failed = memory == NULL;
#else
int flags = MAP_ANONYMOUS | MAP_PRIVATE;
char *memory = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
Contributor

Apple Silicon requires an extra flag for mmap (MAP_JIT) and a kernel call (see docs). Or is the mprotect call below abstracting this?

Member Author

I originally added that since I figured it was required. Then, later, I removed it just to see how it failed. Lo and behold, everything works fine. I think my assumption at the time was that the MAP_JIT flag was only useful for code that needed to self-modify during execution (in our case, we write once, then never modify the code again after it begins executing). I can't really tell from the docs.

Could be I'm overlooking some edge case (maybe in the presence of multiple threads?) or configuration that I don't have locally, but I figured I'd wait until it actually broke for somebody before introducing more #ifdefs. :)
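
For reference, a minimal sketch of what an Apple-Silicon-specific path could look like if the plain mmap/mprotect approach ever does break. None of this is in the PR: MAP_JIT, pthread_jit_write_protect_np(), and sys_icache_invalidate() are the standard macOS JIT APIs, and jit_alloc_apple is an invented name.

#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <libkern/OSCacheControl.h>

static void *
jit_alloc_apple(const unsigned char *code, size_t size)
{
    // MAP_JIT wants PROT_EXEC up front; writes are gated per-thread below.
    void *memory = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_ANONYMOUS | MAP_PRIVATE | MAP_JIT, -1, 0);
    if (memory == MAP_FAILED) {
        return NULL;
    }
    pthread_jit_write_protect_np(0);      // make the region writable for this thread
    memcpy(memory, code, size);           // copy (and patch) the stencils
    pthread_jit_write_protect_np(1);      // flip the region back to executable
    sys_icache_invalidate(memory, size);  // flush the instruction cache before running
    return memory;
}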

Contributor

Fair. macOS's security controls are more lax when you're running a binary that you've compiled yourself, so this might only reappear when testing a packaged and signed version of Python.

Member Author

If you can point me towards instructions for building a packaged and signed version of Python, I can try it out. I've just never done it before.

@brandtbucher
Member Author

Note: the two failing i686-pc-windows-msvc/msvc and x86_64-pc-windows-msvc/msvc release JIT builds are due to GH-113655.

Also, the JIT CI matrix doesn't include Apple silicon, and many tests are skipped when emulating AArch64 Linux (since they fail under emulation, JIT or no JIT). Local runs on both platforms are passing, but we probably want something better in the long term than "works on Brandt's machine". :)

Working now through the 13(!) reviews on this draft PR. Apologies if I don't get to everyone today.

Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
Co-authored-by: Nikita Sobolev <mail@sobolevn.me>
Co-authored-by: David Brochart <david.brochart@gmail.com>
@Zheaoli
Contributor

Zheaoli commented Jan 3, 2024

Hi Brandt, thanks for the marvelous work. I have a few questions about this JIT mechanism.

  1. The decision of when to JIT is made by the optimizer, right? Is there any data on how likely normal workloads are to be compiled to native code? If there isn't enough data yet, I would like to run some benchmarks for this PR.
  2. Is there any way to monitor the JIT status?
  3. After gh-96143: Allow Linux perf profiler to see Python calls #96123, we added a trampoline that maps code addresses to symbols, so users can monitor the code with perf or other tools by hooking a user-space address. Is it possible to do the same thing after the JIT? (I would like to help with this feature.)

@alphavector

This comment was marked as off-topic.

@Eclips4
Member

Eclips4 commented Jan 3, 2024

If you want to play around with this using pyenv, here's a patch for pyenv:

Details
Then just run:

cd "$(pyenv root)"
git apply file.diff
pyenv install 3.13-dev

This PR is currently in the draft stage, so it's probably too early to "play around with this".
Also, this PR contains too many off-topic comments (IIRC, some of them have already been deleted); let's keep them to a minimum, please.
I apologize if this message offends you (or someone else), but someone had to say it. This PR attracts a lot of attention (and it's well deserved, Brandt!), so let's keep a "working atmosphere", please. Thanks. 😄

brandtbucher and others added 2 commits January 3, 2024 14:03
Co-authored-by: Nikita Sobolev <mail@sobolevn.me>
@brandtbucher
Member Author

I'm curious, given the perf results reported in your talk, do you have any documented ideas on improving the generated code - either by tinkering with whatever gets generated (though I'm aware messing with it too much manually defeats the idea of having it work "magically"), or by improving the template for LLVM? Some of the things I see in generated code seem really pessimized, like the obligatory jump-to-continue in each op (with a jump-to-register too, probably enforced by mcmodel=large), or 64-bit oparg immediates.

Yep. Things on the roadmap (not in this PR) include:

  • removing simple zero-length jumps at the ends of stencils in a postprocessing step
  • using the small or medium code models for stencils that don't require 64-bit holes
  • using the ghccc calling convention for more efficient tail calls (way less pushing, popping, and register shuffling at the beginning and end of instructions; see the sketch below)
  • using shared stubs for slow paths
  • using shared const data instead of duplicating stuff like static strings every time they're used
  • top-of-stack caching in registers (plays nicely with ghccc, above)
  • compiling different variants of each stencil when the oparg changes control flow
  • compiling super-stencils that combine common sequences of instructions

Each of these increases the complexity a tiny bit, and each probably deserves to be its own project that is reviewed individually. I've roughly prototyped many of them to prove they're viable, though.
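
To make the ghccc bullet concrete, here is a rough sketch of the continuation-passing shape the stencils rely on. It is illustrative only, not code from this PR: jit_state, _JIT_CONTINUE, and binary_op_add_int_stencil are invented names, and the real templates get their calling convention from LLVM rather than from clang's musttail, which is just the closest plain-C analogue of a guaranteed tail call.

typedef struct {
    long *stack_pointer;   // stand-in for the evaluation stack
    long *frame;           // stand-in for the current frame
} jit_state;

// Left unresolved on purpose: the JIT patches in the next stencil's address.
extern jit_state *_JIT_CONTINUE(jit_state *state, long oparg);

jit_state *
binary_op_add_int_stencil(jit_state *state, long oparg)
{
    long rhs = *--state->stack_pointer;
    long lhs = *--state->stack_pointer;
    *state->stack_pointer++ = lhs + rhs;   // the micro-op's actual work
    // A guaranteed tail call: no frame is pushed or popped, so the hot state
    // can stay pinned in registers all the way into the next stencil.
    __attribute__((musttail)) return _JIT_CONTINUE(state, oparg);
}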

@brandtbucher
Member Author

Hi Brandt, thanks for the marvelous work. I have a few questions about this JIT mechanism.

  1. The decision of when to JIT is made by the optimizer, right? Is there any data on how likely normal workloads are to be compiled to native code? If there isn't enough data yet, I would like to run some benchmarks for this PR.

This piggybacks on the existing tier two machinery (activated by passing -X uops on the command line or setting the PYTHON_UOPS environment variable). You can build an instrumented version of main today using --enable-pystats, which dumps tons of internal counters. These include stats on how effective tier two is at finding, optimizing, and executing hot spots in your code.

  2. Is there any way to monitor the JIT status?

Not yet, but there almost certainly will be in the future. I think we need to play with the JIT a bit to see what kind of info/control is most useful.

  3. After gh-96143: Allow Linux perf profiler to see Python calls #96123, we added a trampoline that maps code addresses to symbols, so users can monitor the code with perf or other tools by hooking a user-space address. Is it possible to do the same thing after the JIT? (I would like to help with this feature.)

It should be possible, but I haven't experimented with this at all. This is probably a related problem to making sure that C-level debuggers can work with the jitted code effectively, which I'm also not worrying about for now (contributions welcome once this lands, though)!
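
For anyone who wants to experiment with perf support: the simplest mechanism is the classic map file, where the process appends "start size name" lines (addresses and sizes in hex) to /tmp/perf-<pid>.map and perf uses them to symbolize anonymous executable memory. A minimal sketch, not part of this PR; perf_map_register is an invented helper name.

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static void
perf_map_register(const void *code, size_t size, const char *name)
{
    char path[64];
    snprintf(path, sizeof(path), "/tmp/perf-%d.map", (int)getpid());
    FILE *fp = fopen(path, "a");   // one "start size name" line per jitted trace
    if (fp == NULL) {
        return;                    // best effort: this is only a profiling aid
    }
    fprintf(fp, "%lx %lx %s\n",
            (unsigned long)(uintptr_t)code, (unsigned long)size, name);
    fclose(fp);
}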

@Zheaoli
Contributor

Zheaoli commented Jan 4, 2024

Hi Brandt, thanks for the marvelous work. I have a few questions about this JIT mechanism.

  1. The decision of when to JIT is made by the optimizer, right? Is there any data on how likely normal workloads are to be compiled to native code? If there isn't enough data yet, I would like to run some benchmarks for this PR.

This piggybacks on the existing tier two machinery (activated by passing -X uops on the command line or setting the PYTHON_UOPS environment variable). You can build an instrumented version of main today using --enable-pystats, which dumps tons of internal counters. These include stats on how effective tier two is at finding, optimizing, and executing hot spots in your code.

  2. Is there any way to monitor the JIT status?

Not yet, but there almost certainly will be in the future. I think we need to play with the JIT a bit to see what kind of info/control is most useful.

  3. After gh-96143: Allow Linux perf profiler to see Python calls #96123, we added a trampoline that maps code addresses to symbols, so users can monitor the code with perf or other tools by hooking a user-space address. Is it possible to do the same thing after the JIT? (I would like to help with this feature.)

It should be possible, but I haven't experimented with this at all. This is probably a related problem to making sure that C-level debuggers can work with the jitted code effectively, which I'm also not worrying about for now (contributions welcome once this lands, though)!

Thanks a lot for your patience! I have another question here.

I think the base template code is from the tier 2 executor case. I'm very curious about the performance difference between the tier 2 interpreter and the JITed code.

@brandtbucher
Member Author

brandtbucher commented Jan 4, 2024

I think the base template code is from the tier 2 executor case. I'm very curious about the performance difference between the tier 2 interpreter and the JITed code.

As it stands now, it's somewhere between 2% and 9% faster than the tier two interpreter, depending on platform (individual benchmarks vary widely, from 13% slower to 47% faster). See my comment above for possible improvements to the generated code once this initial implementation is in (all of which are orthogonal to optimizing the trace itself, which is being worked on separately).

@penguin-wwy
Contributor

Hi Brandt, thanks for the amazing work. Allow me to ask a few questions about future optimisation.

  • The current implementation only supports binary templates for a single bytecode node. So how are we going to support supernodes for common bytecode sequences?
  • In addition to supporting supernodes, perhaps we could somehow stitch binary templates together to generate a function or superblock; if we did that, perhaps we'd need to customise the register allocation algorithm to pass parameters and remove the calling overhead between stencils.
  • Finally, can we generate inline-optimised templates for the C API with type mocks, as Pyston does? These would help us make the less common types of bytecode native as well (e.g. _binary_op_add_list).

@Zheaoli
Contributor

Zheaoli commented Jan 4, 2024

I think the base template code is from the tier 2 executor case. I'm very curious about the performance difference between the tier 2 interpreter and the JITed code.

As it stands now, it's somewhere between 2% and 9% faster than the tier two interpreter, depending on platform (individual benchmarks vary widely, from 13% slower to 47% faster). See my comment above for possible improvements to the generated code once this initial implementation is in (all of which are orthogonal to optimizing the trace itself, which is being worked on separately).

Got it. I think we might need a continuous benchmark pipeline to evaluate performance.

As for test cases, we may need to cover some real use cases that are complex enough and run for a long time, just like the Ruby community does for its benchmarks (Shopify runs the JIT from the main branch and reports the profiling results to the community at https://railsatscale.com).

@brandtbucher
Member Author

Hi Brandt, thanks for the amazing work. Allow me to ask a few questions about future optimisation.

  • The current implementation only supports binary templates for a single bytecode node. So how are we going to support supernodes for common bytecode sequences?

One of two ways:

  • the tier two optimizer can combine tier two instructions into superinstructions before the JIT even sees them (then, to the JIT, they are just normal instructions)
  • in addition to individual stencils, we'll also compile stencils for common pairs or triples of instructions (then the JIT can use them if they show up in the trace)
  • In addition to supporting supernodes, perhaps we could somehow stitch binary templates together to generate a function or superblock; if we did that, perhaps we'd need to customise the register allocation algorithm to pass parameters and remove the calling overhead between stencils.

There are lots of things we can do with this, since at its core it's really just a general-purpose backend. :)

But for register allocation, LLVM's ghccc calling convention makes it very easy to pin registers across the tail calls by passing them as arguments to the continuation... so we actually have a surprising amount of control there!

  • Finally, can we generate inline-optimised templates for the C API with type mocks, as Pyston does? These would help us make the less common types of bytecode native as well (e.g. _binary_op_add_list).

Not sure I follow... I don't know what you mean by "type mocks" (and Google isn't helping).

@brandtbucher
Member Author

Got it. I think we might need a continuous benchmark pipeline to evaluate performance.

As for test cases, we may need to cover some real use cases that are complex enough and run for a long time, just like the Ruby community does for its benchmarks (Shopify runs the JIT from the main branch and reports the profiling results to the community at https://railsatscale.com).

We already have automated performance testing of a comprehensive benchmark suite, if that's what you mean: https://github.com/faster-cpython/benchmarking-public

@penguin-wwy
Contributor

Hi Brandt, thanks for the amazing work. Allow me to ask a few questions about future optimisation.

  • The current implementation only supports binary templates for a single bytecode node. So how are we going to support supernodes for common bytecode sequences?

One of two ways:

  • the tier two optimizer can combine tier two instructions into superinstructions before the JIT even sees them (then, to the JIT, they are just normal instructions)

  • in addition to individual stencils, we'll also compile stencils for common pairs or triples of instructions (then the JIT can use them if they show up in the trace)
  • In addition to supporting supernodes, perhaps we could somehow stitch binary templates together to generate a function or superblock; if we did that, perhaps we'd need to customise the register allocation algorithm to pass parameters and remove the calling overhead between stencils.

There are lots of things we can do with this, since at its core it's really just a general-purpose backend. :)

But for register allocation, LLVM's ghccc calling convention makes it very easy to pin registers across the tail calls by passing them as arguments to the continuation... so we actually have a surprising amount of control there!

  • Finally, can we generate inline-optimised templates for the C API with type mocks, as Pyston does? These would help us make the less common types of bytecode native as well (e.g. _binary_op_add_list).

Not sure I follow... I don't know what you mean by "type mocks" (and Google isn't helping).

Sorry, my wording is not very standard. What I mean is generating binary template functions (optimised for inlining) with LLVM for bytecode operations (e.g. binary_op_add, but adding two lists), and then calling them through a generic method:

add_two_lists = load_fast + load_fast + binary_op_add:
    mov xxx
    mov yyy
    call  (X86_64_RELOC_UNSIGNED)   -> redirect to list_extend

which could help make bytecode operations on some of the less common types (as opposed to int and float) native as well.
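
For context on how a template could call out to a type-specialized helper like that: any extern symbol the stencil can't resolve becomes a relocation (with -mcmodel=large on x86-64, a 64-bit hole such as R_X86_64_64 / X86_64_RELOC_UNSIGNED) that the loader patches with the real address at JIT time. A hypothetical sketch, not code from this PR: PyObject is an opaque stand-in here, and _JIT_LIST_CONCAT / add_two_lists_stencil are invented names.

typedef struct _object PyObject;   // opaque stand-in; the real templates include Python.h

// Unresolved here, so it becomes a hole for the loader to patch.
extern PyObject *_JIT_LIST_CONCAT(PyObject *lhs, PyObject *rhs);

PyObject *
add_two_lists_stencil(PyObject *lhs, PyObject *rhs)
{
    // Compiles to a call through a patched 64-bit address, so the same copied
    // template works no matter where the helper ends up living at runtime.
    return _JIT_LIST_CONCAT(lhs, rhs);
}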

@Zheaoli
Contributor

Zheaoli commented Jan 4, 2024

Got it. I think we might need a continuous benchmark pipeline to evaluate performance.
As for test cases, we may need to cover some real use cases that are complex enough and run for a long time, just like the Ruby community does for its benchmarks (Shopify runs the JIT from the main branch and reports the profiling results to the community at https://railsatscale.com).

We already have automated performance testing of a comprehensive benchmark suite, if that's what you mean: https://github.com/faster-cpython/benchmarking-public

I have seen this before, but it's a little bit different. I will try some more complex workloads (like Django with a lot of ORM queries, etc.) to benchmark some extra metrics like TPS improvement, CPU usage, etc.

@tekknolagi
Contributor

tekknolagi commented Jan 4, 2024

Got it. I think we might need a continuous benchmark pipeline to evaluate performance.
As for test cases, we may need to cover some real use cases that are complex enough and run for a long time, just like the Ruby community does for its benchmarks (Shopify runs the JIT from the main branch and reports the profiling results to the community at https://railsatscale.com).

We already have automated performance testing of a comprehensive benchmark suite, if that's what you mean: https://github.com/faster-cpython/benchmarking-public

I have seen this before, but it's a little bit different. I will try some more complex workloads (like Django with a lot of ORM queries, etc.) to benchmark some extra metrics like TPS improvement, CPU usage, etc.

We did something like this with https://github.com/facebookarchive/django-workload some years ago, but I don't know how relevant that exact code is today. Also, I no longer work at FB.

@ericsnowcurrently
Member

Got it. I think we might need a continuous benchmark pipeline to evaluate performance.
As for test cases, we may need to cover some real use cases that are complex enough and run for a long time, just like the Ruby community does for its benchmarks (Shopify runs the JIT from the main branch and reports the profiling results to the community at https://railsatscale.com).

We already have automated performance testing of a comprehensive benchmark suite, if that's what you mean: https://github.com/faster-cpython/benchmarking-public

I have seen this before, but it's a little bit different. I will try some more complex workloads (like Django with a lot of ORM queries, etc.) to benchmark some extra metrics like TPS improvement, CPU usage, etc.

FWIW, the faster-cpython team does also run a number of additional high-level ("workload-oriented") benchmarks that are included in the results: https://github.com/pyston/python-macrobenchmarks/tree/main/benchmarks.

Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)