Enable arm64 optimizations that exist for power/x86 #3393
64-bit Arm platforms support unaligned accesses. Running the string benchmarks, this change improves performance by an average of 1.04x (min 0.96x, max 1.21x, median 1.01x).
Similar to x86 and powerpc optimizations.

|       |compare-ruby|built-ruby|
|:------|-----------:|---------:|
|hash1  |       0.225|     0.237|
|       |           -|     1.05x|
|hash2  |       0.110|     0.110|
|       |       1.00x|         -|
|                               |compare-ruby|built-ruby|
|:------------------------------|-----------:|---------:|
|vm_array                       |     26.501M|   27.959M|
|                               |           -|     1.06x|
|vm_attr_ivar                   |     21.606M|   31.429M|
|                               |           -|     1.45x|
|vm_attr_ivar_set               |     21.178M|   26.113M|
|                               |           -|     1.23x|
|vm_backtrace                   |       6.621|     6.668|
|                               |           -|     1.01x|
|vm_bigarray                    |     26.205M|   29.958M|
|                               |           -|     1.14x|
|vm_bighash                     |    504.155k|  479.306k|
|                               |       1.05x|         -|
|vm_block                       |     16.692M|   21.315M|
|                               |           -|     1.28x|
|block_handler_type_iseq        |       5.083|     7.004|
|                               |           -|     1.38x|
#elif defined(__GNUC__) && defined(__aarch64__)
DECL_SC_REG(const VALUE *, pc, "19");
DECL_SC_REG(rb_control_frame_t *, cfp, "20");
#define USE_MACHINE_REGS 1
shyouhei
Aug 6, 2020
Member
Does this really help? We know that recent compilers are smarter than they were when we wrote the sibling code. Read more: https://bugs.ruby-lang.org/issues/12225
cc @nurse
AGSaidi
Aug 6, 2020
Author
Contributor
@shyouhei the only changes between compare-ruby and built-ruby in the numbers in the commit message above are the two hunks in vm_exec.c. I'm happy to run other benchmarks if you'd like, but it appears to improve performance substantially. I double-checked my result by removing all diffs and comparing against the ruby I built prior to my patches; the results were within ±2%. I then reapplied these two hunks, re-ran the benchmarks, and observed the improvements reported here (up to 1.38x).
nurse
Aug 6, 2020
Member
As far as I remember, there is another example with clang showing that it's still effective.
And the commit message reports 1.2x, which seems to make this change worth introducing.
shyouhei
Aug 7, 2020
Member
OK then, we need to investigate what is going on but this pull request can be a separate thing.
Enable a set of optimizations that exist already for power and x86 for aarch64/arm64 systems.
Passes make check after these changes.
Running the string benchmarks, the unaligned access change improves performance
by an average of 1.04x (min 0.96x, max 1.21x, median 1.01x).
The gc optimization improves benchmark/gc/hash1 by 5%
The vm_exec changes make a massive difference on some benchmarks (e.g. 1.38x).
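
To illustrate why permitting unaligned word access can help string-heavy code, here is a minimal sketch in C. It is not the code from this patch: the macro `SKETCH_UNALIGNED_WORD_ACCESS` and the functions `load_u64` and `common_prefix` are hypothetical names chosen for the example. The idea is only that, on platforms where unaligned loads are cheap, a byte-wise loop can be replaced by a word-at-a-time loop without first aligning the pointers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical feature macro for this sketch; not the name used in Ruby. */
#if defined(__x86_64__) || defined(__i386__) || \
    defined(__powerpc64__) || defined(__aarch64__)
# define SKETCH_UNALIGNED_WORD_ACCESS 1
#else
# define SKETCH_UNALIGNED_WORD_ACCESS 0
#endif

/* Load 8 bytes from an arbitrary (possibly unaligned) address.
 * memcpy keeps this well-defined C; on platforms that support
 * unaligned loads, compilers typically emit a single load. */
static uint64_t load_u64(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Return the length of the common prefix of a[0..n) and b[0..n). */
static size_t common_prefix(const unsigned char *a,
                            const unsigned char *b, size_t n)
{
    size_t i = 0;
#if SKETCH_UNALIGNED_WORD_ACCESS
    /* Fast path: compare 8 bytes per iteration, regardless of alignment. */
    while (i + sizeof(uint64_t) <= n &&
           load_u64(a + i) == load_u64(b + i)) {
        i += sizeof(uint64_t);
    }
#endif
    /* Tail (or the whole input on strict-alignment platforms): byte-wise. */
    while (i < n && a[i] == b[i]) {
        i++;
    }
    return i;
}
```

Both paths return the same result; the fast path only reduces the number of loop iterations, which is consistent with the modest average gain and the occasional larger gain reported for the string benchmarks.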