GH-113464: A copy-and-patch JIT compiler #113465
base: main
Conversation
@@ -0,0 +1,171 @@
# pylint: disable = missing-class-docstring

"""Schema for the JSON produced by llvm-readobj --elf-output-style=JSON."""
Hmm, it seems like depending on llvm-readobj/JSON here is bad form. Why not use libelf directly instead of writing this in Python?
We also need to parse COFF (on Windows) and Mach-O (on macOS). llvm-readobj can handle all three, and the JSON output mostly works for all three (despite the command-line switch's name). It would probably make sense to push LLVM to make COFF/Mach-O JSON serialization officially supported, so this is less fragile.
Plus, keeping things in Python is a Good Thing around here. :)
int failed = memory == NULL;
#else
int flags = MAP_ANONYMOUS | MAP_PRIVATE;
char *memory = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
Apple Silicon requires an extra flag for mmap (MAP_JIT) and a kernel call (see docs), or is the mprotect call below abstracting this?
I originally added that since I figured it was required. Then, later, I removed it just to see how it failed. Lo and behold, everything works fine. I think my assumption at the time was that the MAP_JIT flag was only useful for code that needed to self-modify during execution (in our case, we write once, then never modify the code again after it begins executing). I can't really tell from the docs.

Could be I'm overlooking some edge case (maybe in the presence of multiple threads?) or configuration that I don't have locally, but I figured I'd wait until it actually broke for somebody before introducing more #ifdefs. :)
Fair. macOS's security controls are more lax when you're running a binary that you've compiled yourself, so this might only reappear when testing a packaged and signed version of Python.
If you can point me towards instructions for building a packaged and signed version of Python, I can try it out. I've just never done it before.
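If it does break once there's a packaged and signed build to test against, the Apple-specific fix would probably look something like the sketch below. This is just a guess at the shape of it, not what the PR currently does: the function name is made up, and a hardened runtime would also need the com.apple.security.cs.allow-jit entitlement.

```c
// Sketch of the Apple Silicon path only, assuming MAP_JIT turns out to be
// required: map the region with MAP_JIT, flip the per-thread write protection
// off to copy the code in, then flip it back on and flush the instruction
// cache before anything executes it.
#include <libkern/OSCacheControl.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

static char *
emit_code(const unsigned char *code, size_t size)
{
    int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_JIT;
    char *memory = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                        flags, -1, 0);
    if (memory == MAP_FAILED) {
        return NULL;
    }
    pthread_jit_write_protect_np(0);      // Writable (for this thread only).
    memcpy(memory, code, size);
    pthread_jit_write_protect_np(1);      // Executable again.
    sys_icache_invalidate(memory, size);  // Don't run stale instructions.
    return memory;
}
```

Since the code is written once and never modified after it starts executing, the write-protect toggle would only need to wrap the initial copy-and-patch step.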
Note: the two failing

Also, the JIT CI matrix doesn't include Apple silicon, and many tests are skipped when emulating AArch64 Linux (since they fail under emulation, JIT or no JIT). Local runs on both platforms are passing, but we probably want something better in the long term than "works on Brandt's machine". :)

Working now through the 13(!) reviews on this draft PR. Apologies if I don't get to everyone today.
Co-authored-by: Hugo van Kemenade <hugovk@users.noreply.github.com>
Co-authored-by: Nikita Sobolev <mail@sobolevn.me>
Co-authored-by: David Brochart <david.brochart@gmail.com>
Hi Brandt, thanks for the marvelous work. I've got a small question about this JIT mechanism.
Currently this PR is in the draft stage, so it's probably too early to "play around with this".
Co-authored-by: Nikita Sobolev <mail@sobolevn.me>
Yep. Things on the roadmap (not in this PR) include:
Each of these increases the complexity a tiny bit, and each probably deserves to be its own project that's reviewed individually. I've roughly prototyped many of them to prove they're viable, though.
This piggybacks on the existing tier two machinery (activated by passing
Not yet, but there almost certainly will be in the future. I think we need to play with the JIT a bit to see what kind of info/control is most useful.
It should be possible, but I haven't experimented with this at all. This is probably related to the problem of making sure that C-level debuggers can work with the jitted code effectively, which I'm also not worrying about for now (contributions welcome once this lands, though)!
Thanks a lot for your patience! I've got another question: I think the base template code comes from the tier 2 executor case. I'm very curious about the performance difference between the tier 2 interpreter and the JIT'ed code.
As it stands now, it's somewhere between 2% and 9% faster than the tier two interpreter, depending on platform (individual benchmarks vary widely, from 13% slower to 47% faster). See my comment above for possible improvements to the generated code once this initial implementation is in (all of which are orthogonal to optimizing the trace itself, which is being worked on separately).
Hi Brandt, thanks for the amazing work. Allow me to ask a few small questions about future optimisation.
Got it. I think we might need a continuous benchmark pipeline to evaluate performance. As for the test cases, we may need to cover some real use cases that are complex enough and run for a long time, just like how the Ruby community does its benchmarking (Shopify runs the JIT on their main branch and reports the profiling results to the community: https://railsatscale.com).
One of two ways:
There are lots of things we can do with this, since at its core it's really just a general-purpose backend. :) But for register allocation, LLVM's
Not sure I follow... I don't know what you mean by "type mocks" (and Google isn't helping).
We already have automated performance testing of a comprehensive benchmark suite, if that's what you mean: https://github.com/faster-cpython/benchmarking-public
Sorry, my wording is not very standard. What I want to say is: use LLVM to generate binary template functions (optimised for inlining) for bytecode operations (e.g. binary_op_add, but adding two lists), and then call them through a generic method, which could help make some of the less common bytecode operations (as opposed to int and float) native as well.
I have seen this before, but it's a little bit different. I will try some more complex workloads (like Django with a lot of ORM queries) to benchmark some extra metrics like TPS improvement, CPU usage, etc.
We did something like this with https://github.com/facebookarchive/django-workload some years ago, but I don't know how relevant that exact code is today. Also, I no longer work at FB.
FWIW, the faster-cpython team does also run a number of additional high-level ("workload-oriented") benchmarks that are included in the results: https://github.com/pyston/python-macrobenchmarks/tree/main/benchmarks.
'Twas the night before Christmas, when all through the code
Not a core dev was merging, not even Guido;
The CI was spun on the PRs with care
In hopes that green check-markings soon would be there;
The buildbots were nestled all snug under desks,
Even PPC64 AIX;
Doc-writers, triage team, the Council of Steering,
Had just stashed every change and stopped engineering,
When in the "PRs" tab arose such a clatter,
They opened GitHub to see what was the matter.
Away to CPython they flew like a flash,
Towards sounds of PROT_EXEC and __builtin___clear_cache.
First LLVM was downloaded, unzipped,
Then the Actions were running a strange new build script,
When something appeared, they were stopped in their tracks,
jit_stencils.h, generated from hacks,
With their spines all a-shiver, they muttered "Oh, shit...",
They knew in a moment it must be a JIT.
More rapid than interpretation it came
And it copied-and-patched every stencil by name:
"Now,
_LOAD_FAST
! Now,_STORE_FAST
!_BINARY_OP_ADD_INT
!On,
_GUARD_DORV_VALUES_INST_ATTR_FROM_DICT
!To the top of the loop! And down into the call!
Now cache away! Cache away! Cache away all!"
But why now? And how so? They needed a hint,
Thankfully, Brandt gave a great talk at the sprint;
So over to YouTube the reviewers flew,
They read the white paper, and the blog post too.
And then, after watching, they saw its appeal
Not writing the code themselves seemed so unreal.
And the platform support was almost too easy,
ARM64 Macs to 32-bit PCs.
There was some runtime C, not too much, just enough,
Basically a loader, relocating stuff;
It ran every test, one by one passed them all,
With not one runtime dependency to install.
Mostly build-time Python! With strict static typing!
For maintenance ease, and also nerd-sniping!
Though dispatch was faster, the JIT wasn't wise,
And the traces it used still should be optimized;
The code it was JIT'ing still needed some thinning,
With code models small, and some register pinning;
Or new calling conventions, shared stubs for paths slow,
Since this JIT was brand new, there was fruit hanging low.
It was awkwardly large, parsed straight out of the ELFs,
And they laughed when they saw it, in spite of themselves;
A configure flag, and no merging this year,
Soon gave them to know they had nothing to fear;
It wasn't much faster, at least it could work,
They knew that'd come later; no one was a jerk,
But they were still smart, and determined, and skilled,
They opened a shell, and configured the build;
--enable-experimental-jit, then made it,
And away the JIT flew as their "+1"s okay'ed it.
But they heard it exclaim, as it traced out of sight,
"Happy JIT-mas to all, and to all a good night!"