gh-90536: Add support for the BOLT post-link binary optimizer #95908
Conversation
Using [bolt](https://github.com/llvm/llvm-project/tree/main/bolt) provides a fairly large speedup without any code or functionality changes. It provides roughly a 1% speedup on pyperformance, and a 4% improvement on the Pyston web macrobenchmarks.

It is gated behind an `--enable-bolt` configure arg because not all toolchains and environments are supported. It has been tested on a Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6 sources (their binary distribution of this version did not include bolt).

Compared to [a previous attempt](faster-cpython/ideas#224), this commit uses bolt's preferred "instrumentation" approach, and also adds some non-PIE flags which enable much better optimizations from bolt.

The effects of this change are a bit more dependent on CPU microarchitecture than other changes, since it optimizes i-cache behavior, which seems to be more variable between architectures. The 1%/4% numbers were collected on an Intel Skylake CPU; on an AMD Zen 3 CPU I got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance a slightly lower one (1%/3%).

The low speedup on pyperformance is not entirely unexpected: BOLT improves i-cache behavior, and the benchmarks in the pyperformance suite are small and tend to fit in i-cache.

This change uses the existing PGO profiling task (`python -m test --pgo`), though I was able to measure about a 1% macrobenchmark improvement by using the macrobenchmarks as the training task. I personally think that both the PGO and BOLT tasks should be updated to use macrobenchmarks, but for the sake of splitting up the work this PR uses the existing PGO task.
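The instrument → profile → merge → optimize workflow described above can be sketched as a dry run. The `run=echo` guard only prints each command; the binary name, profile-file paths, and training task are illustrative assumptions mirroring the PR's Makefile additions, not the exact rule:

```shell
# Dry-run sketch of the BOLT pipeline this PR wires into the build.
# "run=echo" only prints the commands; names and paths are illustrative.
run=echo
BUILDPYTHON=./python

# 1. Produce an instrumented binary that dumps .fdata profiles on exit.
$run llvm-bolt "$BUILDPYTHON" -instrument \
  -instrumentation-file-append-pid \
  -instrumentation-file="$PWD/python.bolt" \
  -o "$BUILDPYTHON.bolt_inst"

# 2. Run the training workload (the existing PGO task).
$run "$BUILDPYTHON.bolt_inst" -m test --pgo

# 3. Merge the per-PID profiles, then rewrite the binary with them.
$run merge-fdata python.bolt.*.fdata  # > python.fdata
$run llvm-bolt "$BUILDPYTHON" -o "$BUILDPYTHON.bolt" -data=python.fdata
```

Removing the `run=echo` guard (setting `run=`) would execute the pipeline for real, assuming `llvm-bolt` and `merge-fdata` are on `PATH`.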
Most changes to Python require a NEWS entry. Please add it using the blurb_it web app or the blurb command-line tool.
Thanks! I hope @corona10 can review and merge this, and maybe @pablogsal will be willing to backport it to 3.11.
Unfortunately, changes to the configure script or makefile are too much at this stage, especially for a new feature that has not been tested in the wild (by users checking the pre-releases). Sadly, this must go to 3.12.
Nice work! I will take a look at this PR by this weekend.
Two things need to be checked.
I failed to build the binary with this patch. This could be due to a BOLT bug, but I would like to know which BOLT version you used. -> solved
```
BOLT-INFO: Allocation combiner: 30 empty spaces coalesced (dyn count: 63791805).
#0 0x0000563eb3e8d705 PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
#1 0x0000563eb3e8b2d4 SignalHandler(int) Signals.cpp:0:0
#2 0x00007fc228930520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#3 0x0000563eb4ebd106 llvm::bolt::BinaryFunction::translateInputToOutputAddress(unsigned long) const (/usr/local/bin/llvm-bolt+0x1c52106)
#4 0x0000563eb3f52658 llvm::bolt::DWARFRewriter::updateUnitDebugInfo(llvm::DWARFUnit&, llvm::bolt::DebugInfoBinaryPatcher&, llvm::bolt::DebugAbbrevWriter&, llvm::bolt::DebugLocWriter&, llvm::bolt::DebugRangesSectionWriter&, llvm::Optional<unsigned long>) (/usr/local/bin/llvm-bolt+0xce7658)
#5 0x0000563eb3f5688b llvm::bolt::DWARFRewriter::updateDebugInfo()::'lambda0'(unsigned long, llvm::DWARFUnit*)::operator()(unsigned long, llvm::DWARFUnit*) const DWARFRewriter.cpp:0:0
#6 0x0000563eb3f5c45a llvm::bolt::DWARFRewriter::updateDebugInfo() (/usr/local/bin/llvm-bolt+0xcf145a)
#7 0x0000563eb3f1aef8 llvm::bolt::RewriteInstance::updateMetadata() (/usr/local/bin/llvm-bolt+0xcafef8)
#8 0x0000563eb3f428e6 llvm::bolt::RewriteInstance::run() (/usr/local/bin/llvm-bolt+0xcd78e6)
#9 0x0000563eb355ccf8 main (/usr/local/bin/llvm-bolt+0x2f1cf8)
#10 0x00007fc228917d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#11 0x00007fc228917e40 call_init ./csu/../csu/libc-start.c:128:20
#12 0x00007fc228917e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#13 0x0000563eb35dbd75 _start (/usr/local/bin/llvm-bolt+0x370d75)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: /usr/local/bin/llvm-bolt python -o python.bolt -data=python.fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
make: *** [Makefile:800: bolt-opt] Segmentation fault (core dumped)
```
While profiling, I met a test failure. Would you like to check that the optimized binary passes all the standard Python tests (e.g. `python -m test`)? I met a related issue with my last attempt, and it was solved by profiling through `python -m test`. -> solved
```
./python.bolt_inst -m test --pgo --timeout=1200 || true
0:00:00 load avg: 2.17 Run tests sequentially (timeout: 20 min)
0:00:00 load avg: 2.17 [ 1/44] test_array
0:00:01 load avg: 2.17 [ 2/44] test_base64
0:00:02 load avg: 2.07 [ 3/44] test_binascii
0:00:02 load avg: 2.07 [ 4/44] test_binop
0:00:02 load avg: 2.07 [ 5/44] test_bisect
0:00:02 load avg: 2.07 [ 6/44] test_bytes
0:00:06 load avg: 2.07 [ 7/44] test_bz2
0:00:06 load avg: 2.07 [ 8/44] test_cmath
0:00:07 load avg: 2.07 [ 9/44] test_codecs
0:00:08 load avg: 1.99 [10/44] test_collections
0:00:09 load avg: 1.99 [11/44] test_complex
0:00:10 load avg: 1.99 [12/44] test_dataclasses
0:00:10 load avg: 1.99 [13/44] test_datetime
0:00:14 load avg: 1.83 [14/44] test_decimal
0:00:18 load avg: 1.76 [15/44] test_difflib
0:00:19 load avg: 1.76 [16/44] test_embed
0:00:21 load avg: 1.76 [17/44] test_float
0:00:22 load avg: 1.76 [18/44] test_fstring
0:00:23 load avg: 1.70 [19/44] test_functools
0:00:23 load avg: 1.70 [20/44] test_generators
0:00:24 load avg: 1.70 [21/44] test_hashlib
0:00:25 load avg: 1.70 [22/44] test_heapq
0:00:26 load avg: 1.70 [23/44] test_int
0:00:26 load avg: 1.70 [24/44] test_itertools
0:00:32 load avg: 1.64 [25/44] test_json
0:00:36 load avg: 1.59 [26/44] test_long
0:00:39 load avg: 1.54 [27/44] test_lzma
0:00:39 load avg: 1.54 [28/44] test_math
0:00:42 load avg: 1.50 [29/44] test_memoryview
0:00:43 load avg: 1.50 [30/44] test_operator
0:00:44 load avg: 1.50 [31/44] test_ordered_dict
0:00:46 load avg: 1.50 [32/44] test_patma
0:00:46 load avg: 1.50 [33/44] test_pickle
0:00:52 load avg: 1.46 [34/44] test_pprint
0:00:52 load avg: 1.42 [35/44] test_re
0:00:53 load avg: 1.42 [36/44] test_set
0:01:00 load avg: 1.39 [37/44] test_sqlite3
0:01:05 load avg: 1.36 [38/44] test_statistics
0:01:10 load avg: 1.33 [39/44] test_struct
0:01:11 load avg: 1.33 [40/44] test_tabnanny
0:01:12 load avg: 1.30 [41/44] test_time
0:01:15 load avg: 1.30 [42/44] test_unicode
test test_unicode failed
0:01:17 load avg: 1.28 [43/44] test_xml_etree -- test_unicode failed (1 failure)
0:01:19 load avg: 1.28 [44/44] test_xml_etree_c
Total duration: 1 min 21 sec
Tests result: FAILURE
```
I will share further investigation into this patch.
FYI, this is my environment.
- OS: Ubuntu 22.04 LTS
- BOLT revision e9b213131ae9c57f4f151d3206916676135b31b0
- gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
Hmm, I will try to build BOLT from LLVM 14.0.6
I found out why BOLT failed; I will downgrade gcc to version 10.
Thanks for your work! The whole pipeline works correctly.
Please update https://github.com/python/cpython/blob/main/Doc/using/configure.rst too.
(If possible, please update https://github.com/python/cpython/blob/main/Doc/whatsnew/3.12.rst too; I will update the What's New entry if you are too busy.)
But please emphasize that this feature is experimental optimization support.
I am going to measure the performance enhancement soon through pyperformance, and also the L1 i-cache miss ratio.
Looks like https://github.com/pyston/python-macrobenchmarks does not support Python 3.11/3.12 yet, right? Please let me know if I am wrong.
Plus, please add your name to https://github.com/python/cpython/blob/main/Misc/ACKS too :)
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase
@gvanrossum @kmod cc @markshannon Interesting result!
Benchmark hidden because not significant (3): pickle, scimark_sparse_mat_mult, unpack_sequence
Another benchmark from an Azure VM (Ubuntu 20.04.4 LTS, gcc 9.4.0). But let's measure the benchmark on the Faster CPython machine after the PR is merged.
Makefile.pre.in (outdated)

```diff
@@ -640,6 +640,15 @@ profile-opt: profile-run-stamp
 	-rm -f profile-clean-stamp
 	$(MAKE) @DEF_MAKE_RULE@ CFLAGS_NODIST="$(CFLAGS_NODIST) $(PGO_PROF_USE_FLAG)" LDFLAGS_NODIST="$(LDFLAGS_NODIST)"

+bolt-opt: @PREBOLT_RULE@
+	rm -f *.fdata
+	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
```
Suggested change:

```diff
-	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
+	@LLVM_BOLT@ ./$(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
```
Makefile.pre.in (outdated)

```makefile
	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
	./$(BUILDPYTHON).bolt_inst $(PROFILE_TASK) || true
	@MERGE_FDATA@ $(BUILDPYTHON).*.fdata > $(BUILDPYTHON).fdata
	@LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
```
Suggested change:

```diff
-	@LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
+	@LLVM_BOLT@ ./$(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
```
I succeeded in getting cache-miss-related metadata, and I also got a pyperformance result similar to my previous attempts and Kevin's report.

Environment, binary size, i-cache miss, and benchmark (1.01x faster) details: https://gist.github.com/corona10/5726d1528176677d4c694265edfc4bf5
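For reference, a hedged sketch of how an L1 i-cache measurement like this might be assembled with Linux perf. The event name varies by CPU and kernel (check `perf list`), and the binary name and workload are assumptions; the command is only printed here, not executed:

```shell
# Illustrative only: the event name, binary name, and workload are
# assumptions, not taken from this PR.
workload="-m test --pgo"
perf_cmd="perf stat -e L1-icache-load-misses,instructions ./python.bolt $workload"
echo "$perf_cmd"   # drop the echo/quoting to actually run the measurement
```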
Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>
Thanks for taking a look! Yes, many of the Pyston macrobenchmarks broke in 3.11, but it looks like @mdboom is currently working on updating the dependencies to versions that are compatible with 3.11. I have made the requested changes; please review again.
Thanks for making the requested changes! @corona10: please review the changes made to this pull request.
Doc/using/configure.rst (outdated)

```rst
Configuring Python using ``--enable-optimizations --with-lto --enable-bolt``
(PGO + LTO + BOLT) is recommended for best performance.
```
Last one: let's be conservative. I would like to introduce the BOLT option as experimental for now; I hope we can change this sentence in a future version.

Suggested change:

```diff
-Configuring Python using ``--enable-optimizations --with-lto --enable-bolt``
-(PGO + LTO + BOLT) is recommended for best performance.
+Configuring Python using ``--enable-optimizations --with-lto``
+(PGO + LTO) is recommended for optimal performance.
```
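Under this conservative wording, the two configurations would look roughly like the following; the flags come from this PR, with the BOLT line treated as the experimental, opt-in path:

```shell
# Recommended: optimizations that only depend on the compiler toolchain.
./configure --enable-optimizations --with-lto
make

# Experimental: additionally requires llvm-bolt and merge-fdata on PATH.
./configure --enable-optimizations --with-lto --enable-bolt
make
```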
I am okay to introduce (PGO + LTO + BOLT) as an experimental combination. So it is okay to introduce both of them.
I agree; also, having BOLT installed is an extra requirement that most users won't have, so I would not recommend advertising it alongside the other two options, which just depend on the compiler toolchain.
Makes sense! What do you think of this new wording?
I think is good! Maybe add a link to some more detailed instructions?
> BOLT is part of the LLVM project but is not always included in their binary distributions. This flag requires that `llvm-bolt` and `merge-fdata` are available.

I think that this sentence is enough. In the near future, BOLT will be included in LLVM binary distributions by default. A detailed installation guide will change depending on their situation, so the link could break at any time.
WDYT?
Makes sense. What do you think about adding a link to the BOLT webpage or repo?
> Makes sense. What do you think about adding a link to the BOLT webpage or repo?
@kmod
I think we can add the link to https://github.com/llvm/llvm-project/tree/main/bolt, since there is no official page for BOLT under llvm.org. Would you like to add it to the `cmdoption:: --enable-bolt` section? (Or you can link a better page such as https://github.com/facebookincubator/BOLT; I am not sure which page is better.)
Thanks for your hard work!
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase And if you don't make the requested changes, you will be put in the comfy chair!
LGTM
I will merge this PR after listening to @pablogsal 's opinion with #95908 (comment).