gh-90536: Add support for the BOLT post-link binary optimizer #95908
Conversation
Using [bolt](https://github.com/llvm/llvm-project/tree/main/bolt) provides a fairly large speedup without any code or functionality changes. It provides roughly a 1% speedup on pyperformance, and a 4% improvement on the Pyston web macrobenchmarks.

It is gated behind an `--enable-bolt` configure arg because not all toolchains and environments are supported. It has been tested on a Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6 sources (their binary distribution of this version did not include bolt).

Compared to [a previous attempt](faster-cpython/ideas#224), this commit uses bolt's preferred "instrumentation" approach, and also adds some non-PIE flags which enable much better optimizations from bolt.

The effects of this change are a bit more dependent on CPU microarchitecture than other changes, since it optimizes i-cache behavior, which seems to be more variable between architectures. The 1%/4% numbers were collected on an Intel Skylake CPU; on an AMD Zen 3 CPU I got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance a slightly lower one (1%/3%).

The low speedup on pyperformance is not entirely unexpected: BOLT improves i-cache behavior, and the benchmarks in the pyperformance suite are small and tend to fit in i-cache.

This change uses the existing PGO profiling task (`python -m test --pgo`), though I was able to measure about a 1% macrobenchmark improvement by using the macrobenchmarks as the training task. I personally think that both the PGO and BOLT tasks should be updated to use macrobenchmarks, but for the sake of splitting up the work this PR uses the existing PGO task.
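The instrument → profile → merge → optimize workflow described above can be sketched as a dry run. The `run=echo` guard only prints each command; the binary name, profile-file paths, and training task are illustrative assumptions mirroring the PR's Makefile additions, not the exact rule:

```shell
# Dry-run sketch of the BOLT pipeline this PR wires into the build.
# "run=echo" only prints the commands; names and paths are illustrative.
run=echo
BUILDPYTHON=./python

# 1. Produce an instrumented binary that dumps .fdata profiles on exit.
$run llvm-bolt "$BUILDPYTHON" -instrument \
  -instrumentation-file-append-pid \
  -instrumentation-file="$PWD/python.bolt" \
  -o "$BUILDPYTHON.bolt_inst"

# 2. Run the training workload (the existing PGO task).
$run "$BUILDPYTHON.bolt_inst" -m test --pgo

# 3. Merge the per-PID profiles, then rewrite the binary with them.
$run merge-fdata python.bolt.*.fdata  # > python.fdata
$run llvm-bolt "$BUILDPYTHON" -o "$BUILDPYTHON.bolt" -data=python.fdata
```

Removing the `run=echo` guard (setting `run=`) would execute the pipeline for real, assuming `llvm-bolt` and `merge-fdata` are on `PATH`.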
Most changes to Python require a NEWS entry. Please add it using the blurb_it web app or the blurb command-line tool.
Thanks! I hope @corona10 can review and merge this, and maybe @pablogsal will be willing to backport it to 3.11.
Unfortunately, changes to the configure script or makefile are too much at this stage, especially for a new feature that has not been tested in the wild (by users checking the pre-releases). Sadly, this must go to 3.12.
Nice work! I will take a look at this PR by this weekend.
Two things need to be checked.
I failed to build the binary with this patch. This could be due to a BOLT bug, but I would like to know which BOLT version you used. -> solved
```
BOLT-INFO: Allocation combiner: 30 empty spaces coalesced (dyn count: 63791805).
#0 0x0000563eb3e8d705 PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
#1 0x0000563eb3e8b2d4 SignalHandler(int) Signals.cpp:0:0
#2 0x00007fc228930520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#3 0x0000563eb4ebd106 llvm::bolt::BinaryFunction::translateInputToOutputAddress(unsigned long) const (/usr/local/bin/llvm-bolt+0x1c52106)
#4 0x0000563eb3f52658 llvm::bolt::DWARFRewriter::updateUnitDebugInfo(llvm::DWARFUnit&, llvm::bolt::DebugInfoBinaryPatcher&, llvm::bolt::DebugAbbrevWriter&, llvm::bolt::DebugLocWriter&, llvm::bolt::DebugRangesSectionWriter&, llvm::Optional<unsigned long>) (/usr/local/bin/llvm-bolt+0xce7658)
#5 0x0000563eb3f5688b llvm::bolt::DWARFRewriter::updateDebugInfo()::'lambda0'(unsigned long, llvm::DWARFUnit*)::operator()(unsigned long, llvm::DWARFUnit*) const DWARFRewriter.cpp:0:0
#6 0x0000563eb3f5c45a llvm::bolt::DWARFRewriter::updateDebugInfo() (/usr/local/bin/llvm-bolt+0xcf145a)
#7 0x0000563eb3f1aef8 llvm::bolt::RewriteInstance::updateMetadata() (/usr/local/bin/llvm-bolt+0xcafef8)
#8 0x0000563eb3f428e6 llvm::bolt::RewriteInstance::run() (/usr/local/bin/llvm-bolt+0xcd78e6)
#9 0x0000563eb355ccf8 main (/usr/local/bin/llvm-bolt+0x2f1cf8)
#10 0x00007fc228917d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#11 0x00007fc228917e40 call_init ./csu/../csu/libc-start.c:128:20
#12 0x00007fc228917e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#13 0x0000563eb35dbd75 _start (/usr/local/bin/llvm-bolt+0x370d75)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: /usr/local/bin/llvm-bolt python -o python.bolt -data=python.fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
make: *** [Makefile:800: bolt-opt] Segmentation fault (core dumped)
```
While profiling, I met a test failure. Would you like to check that the optimized binary passes all the standard Python tests (e.g. `python -m test`)? I met a related issue with my last attempt, and it was solved by profiling through `python -m test`. -> solved
```
./python.bolt_inst -m test --pgo --timeout=1200 || true
0:00:00 load avg: 2.17 Run tests sequentially (timeout: 20 min)
0:00:00 load avg: 2.17 [ 1/44] test_array
0:00:01 load avg: 2.17 [ 2/44] test_base64
0:00:02 load avg: 2.07 [ 3/44] test_binascii
0:00:02 load avg: 2.07 [ 4/44] test_binop
0:00:02 load avg: 2.07 [ 5/44] test_bisect
0:00:02 load avg: 2.07 [ 6/44] test_bytes
0:00:06 load avg: 2.07 [ 7/44] test_bz2
0:00:06 load avg: 2.07 [ 8/44] test_cmath
0:00:07 load avg: 2.07 [ 9/44] test_codecs
0:00:08 load avg: 1.99 [10/44] test_collections
0:00:09 load avg: 1.99 [11/44] test_complex
0:00:10 load avg: 1.99 [12/44] test_dataclasses
0:00:10 load avg: 1.99 [13/44] test_datetime
0:00:14 load avg: 1.83 [14/44] test_decimal
0:00:18 load avg: 1.76 [15/44] test_difflib
0:00:19 load avg: 1.76 [16/44] test_embed
0:00:21 load avg: 1.76 [17/44] test_float
0:00:22 load avg: 1.76 [18/44] test_fstring
0:00:23 load avg: 1.70 [19/44] test_functools
0:00:23 load avg: 1.70 [20/44] test_generators
0:00:24 load avg: 1.70 [21/44] test_hashlib
0:00:25 load avg: 1.70 [22/44] test_heapq
0:00:26 load avg: 1.70 [23/44] test_int
0:00:26 load avg: 1.70 [24/44] test_itertools
0:00:32 load avg: 1.64 [25/44] test_json
0:00:36 load avg: 1.59 [26/44] test_long
0:00:39 load avg: 1.54 [27/44] test_lzma
0:00:39 load avg: 1.54 [28/44] test_math
0:00:42 load avg: 1.50 [29/44] test_memoryview
0:00:43 load avg: 1.50 [30/44] test_operator
0:00:44 load avg: 1.50 [31/44] test_ordered_dict
0:00:46 load avg: 1.50 [32/44] test_patma
0:00:46 load avg: 1.50 [33/44] test_pickle
0:00:52 load avg: 1.46 [34/44] test_pprint
0:00:52 load avg: 1.42 [35/44] test_re
0:00:53 load avg: 1.42 [36/44] test_set
0:01:00 load avg: 1.39 [37/44] test_sqlite3
0:01:05 load avg: 1.36 [38/44] test_statistics
0:01:10 load avg: 1.33 [39/44] test_struct
0:01:11 load avg: 1.33 [40/44] test_tabnanny
0:01:12 load avg: 1.30 [41/44] test_time
0:01:15 load avg: 1.30 [42/44] test_unicode
test test_unicode failed
0:01:17 load avg: 1.28 [43/44] test_xml_etree -- test_unicode failed (1 failure)
0:01:19 load avg: 1.28 [44/44] test_xml_etree_c
Total duration: 1 min 21 sec
Tests result: FAILURE
```
I will share further investigation into this patch.
FYI, this is my environment.
- OS: Ubuntu 22.04 LTS
- BOLT revision e9b213131ae9c57f4f151d3206916676135b31b0
- gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
Hmm, I will try to build BOLT from LLVM 14.0.6
I found out why BOLT failed; I will downgrade gcc to version 10.
Thanks for your work! The whole pipeline works correctly.
Please update https://github.com/python/cpython/blob/main/Doc/using/configure.rst too.
(If possible, please update https://github.com/python/cpython/blob/main/Doc/whatsnew/3.12.rst too; I will update the What's New entry if you are too busy.)
But please emphasize that this feature is experimental optimization support.
I am going to measure the performance enhancement soon through pyperformance, and also the L1 i-cache miss ratio.
Looks like https://github.com/pyston/python-macrobenchmarks does not support Python 3.11/3.12 yet, right? Please let me know if I am wrong.
Plus, please add your name to https://github.com/python/cpython/blob/main/Misc/ACKS too :)
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase
@gvanrossum @kmod cc @markshannon Interesting result!
Benchmark hidden because not significant (3): pickle, scimark_sparse_mat_mult, unpack_sequence
Another benchmark from an Azure VM (Ubuntu 20.04.4 LTS, gcc 9.4.0). But let's measure the benchmark on the Faster CPython machine after the PR is merged.
Makefile.pre.in (outdated)

```diff
@@ -640,6 +640,15 @@ profile-opt: profile-run-stamp
 	-rm -f profile-clean-stamp
 	$(MAKE) @DEF_MAKE_RULE@ CFLAGS_NODIST="$(CFLAGS_NODIST) $(PGO_PROF_USE_FLAG)" LDFLAGS_NODIST="$(LDFLAGS_NODIST)"

+bolt-opt: @PREBOLT_RULE@
+	rm -f *.fdata
+	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
```
Suggested change:

```diff
-	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
+	@LLVM_BOLT@ ./$(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
```
Makefile.pre.in (outdated)

```makefile
	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
	./$(BUILDPYTHON).bolt_inst $(PROFILE_TASK) || true
	@MERGE_FDATA@ $(BUILDPYTHON).*.fdata > $(BUILDPYTHON).fdata
	@LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
```
Suggested change:

```diff
-	@LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
+	@LLVM_BOLT@ ./$(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
```
I succeeded in getting cache-miss-related metadata, and I also got a pyperformance result similar to my previous attempts and Kevin's report.

Environment, binary size, i-cache miss, and benchmark (1.01x faster) details: https://gist.github.com/corona10/5726d1528176677d4c694265edfc4bf5
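For reference, a hedged sketch of how an L1 i-cache measurement like this might be assembled with Linux perf. The event name varies by CPU and kernel (check `perf list`), and the binary name and workload are assumptions; the command is only printed here, not executed:

```shell
# Illustrative only: the event name, binary name, and workload are
# assumptions, not taken from this PR.
workload="-m test --pgo"
perf_cmd="perf stat -e L1-icache-load-misses,instructions ./python.bolt $workload"
echo "$perf_cmd"   # drop the echo/quoting to actually run the measurement
```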
Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>
Thanks for taking a look! Yes, many of the Pyston macrobenchmarks broke in 3.11, but it looks like @mdboom is currently working on updating the dependencies to versions that are compatible with 3.11. I have made the requested changes; please review again.
Thanks for making the requested changes! @corona10: please review the changes made to this pull request.
Doc/using/configure.rst (outdated)

```rst
Configuring Python using ``--enable-optimizations --with-lto --enable-bolt``
(PGO + LTO + BOLT) is recommended for best performance.
```
Last one: let's be conservative. I would like to introduce the BOLT option as experimental for now; I hope we can change this sentence in a future version.

Suggested change:

```diff
-Configuring Python using ``--enable-optimizations --with-lto --enable-bolt``
-(PGO + LTO + BOLT) is recommended for best performance.
+Configuring Python using ``--enable-optimizations --with-lto``
+(PGO + LTO) is recommended for optimal performance.
```
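Under this conservative wording, the two configurations would look roughly like the following; the flags come from this PR, with the BOLT line treated as the experimental, opt-in path:

```shell
# Recommended: optimizations that only depend on the compiler toolchain.
./configure --enable-optimizations --with-lto
make

# Experimental: additionally requires llvm-bolt and merge-fdata on PATH.
./configure --enable-optimizations --with-lto --enable-bolt
make
```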
I am okay to introduce (PGO + LTO + BOLT) as an experimental combination. So it is okay to introduce both of them.
I agree; also, having BOLT installed is an extra requirement that most users won't have, so I would not recommend advertising it alongside the other two options, which just depend on the compiler toolchain.
Makes sense! What do you think of this new wording?
I think is good! Maybe add a link to some more detailed instructions?
> BOLT is part of the LLVM project but is not always included in their binary distributions. This flag requires that `llvm-bolt` and `merge-fdata` are available.

I think that this sentence is enough. In the near future, BOLT will be included in LLVM binary distributions by default. A detailed installation guide will change depending on their situation, so the link could break at any time.
WDYT?
Makes sense. What do you think about adding a link to the BOLT webpage or repo?
> Makes sense. What do you think about adding a link to the BOLT webpage or repo?
@kmod
I think we can add the link to https://github.com/llvm/llvm-project/tree/main/bolt, since there is no official page for BOLT under llvm.org. Would you like to add it to the `cmdoption:: --enable-bolt` section? (Or you can link a better page such as https://github.com/facebookincubator/BOLT; I am not sure which page is better.)
Thanks for your hard work!
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase And if you don't make the requested changes, you will be put in the comfy chair!
LGTM
I will merge this PR after listening to @pablogsal 's opinion with #95908 (comment).