Optimizing FOR_ITER bytecode #91432

Open
sweeneyde opened this issue Apr 10, 2022 · 13 comments
Labels
3.11 interpreter-core performance

Comments

sweeneyde (Member) commented Apr 10, 2022

We can execute one fewer opcode on every iteration (except the first) of every for-loop by changing from

    GET_ITER
top:
    FOR_ITER
    [body]
    JUMP_BACKWARD(top)
cleanup:
end:

to

    GET_ITER
    JUMP_FORWARD(bottom)
body:
    [body]
bottom:
    FOR_END(body)
cleanup:
end:

This was suggested by @markshannon here, but it appears to be similar in spirit to Loop Inversion.

There seems to be a small (on the order of 1%) benefit, but I imagine the benefit will be magnified after any specialization.
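For concreteness, the current shape of the loop bytecode can be inspected with the `dis` module. This is only an illustration; exact opcode names and operands vary across CPython versions:

```python
import dis

def f():
    for i in range(3):
        pass

# On current CPython the loop compiles to GET_ITER, then FOR_ITER at
# the top of the loop, with a jump back to FOR_ITER at the bottom.
# The change proposed above moves the iterator test to the loop bottom
# instead, saving one dispatch per iteration after the first.
ops = [ins.opname for ins in dis.get_instructions(f)]
print(ops)
```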

@sweeneyde sweeneyde added interpreter-core 3.11 performance labels Apr 11, 2022
markshannon (Member) commented Apr 11, 2022

What about this layout?

top:
    FOR_ITER
body:
    [body]
    END_ITER body
cleanup:
end:

markshannon (Member) commented Apr 11, 2022

Also, we would like to decouple the sense (jumps on exhaustion like FOR_ITER, or jumps when not exhausted like END_FOR) and the direction of the jump.

markshannon (Member) commented Apr 11, 2022

If you only want one sense of jump and want to avoid the extra backward branch, this should work (provided we have both a forward and a backward version):

top:
    END_FOR body
    JUMP cleanup
body:
    [body]
    END_FOR body
cleanup:
end:

sweeneyde (Member, Author) commented Apr 11, 2022

What is the advantage of having a forward version or a jump-if-exhausted version when the vast majority of executions of the opcode will be (backward, jump-if-not-exhausted)? Is that symmetry worth the extra code in the ceval loop?

sweeneyde (Member, Author) commented Apr 12, 2022

Version with a JUMP_FORWARD to FOR_END, no FOR_ITER: #70016

Version with both a FOR_ITER and a FOR_END: #91472

serhiy-storchaka (Member) commented Apr 12, 2022

I proposed a similar idea (a FOR_NEXT opcode) in the discussion for #71314. At the time it was rejected because it makes the relation between Python code and the generated bytecode less straightforward. It seems the bar is lower now, so I have considered returning to this.

I have not implemented it because:

  1. It would require rewriting the code for the frame.f_lineno setter, which was already too complicated and brittle.
  2. It would negatively affect the idiom `for x in a: break`.
  3. The benefit was unclear.
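As a rough way to see the cost mentioned in point 2, the single-iteration idiom can be microbenchmarked with `timeit`. This is only an illustrative sketch, not a measurement from this thread:

```python
import timeit

# The idiom from point 2: a loop that exits on its first iteration.
# Under the inverted layout, GET_ITER would be followed by a forward
# jump to the loop bottom, adding a dispatch before the body runs once,
# so this pattern is the one most likely to regress.
t = timeit.timeit("for x in a: break", globals={"a": [1, 2, 3]}, number=100_000)
print(f"{t:.6f}s for 100k single-iteration loops")
```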

sweeneyde (Member, Author) commented Apr 12, 2022

What if we specialized JUMP_BACKWARD to FOR_END_ADAPTIVE when it targets a FOR_ITER?

Then we would keep the same old readable unquickened bytecode but reduce the number of dispatches per for-loop iteration in quickened bytecode. It would add cache entries to JUMP_BACKWARD, but perhaps we could eventually limit that to only the targeted-at-FOR_ITER cases.
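A toy sketch of that rewrite rule over a symbolic instruction list. FOR_END_ADAPTIVE is the hypothetical opcode proposed here, and the real quickening pass operates on code objects, not tuples like these:

```python
# Toy sketch of the proposed quickening rule: when a JUMP_BACKWARD
# targets a FOR_ITER, rewrite it to a combined backward-jumping opcode.
# FOR_END_ADAPTIVE is hypothetical; this is illustrative only.
def quicken(code):
    """code: list of (offset, opname, arg); a jump's arg is its target offset."""
    opname_at = {off: op for off, op, _ in code}
    out = []
    for off, op, arg in code:
        if op == "JUMP_BACKWARD" and opname_at.get(arg) == "FOR_ITER":
            out.append((off, "FOR_END_ADAPTIVE", arg))
        else:
            out.append((off, op, arg))
    return out

demo = [
    (0, "GET_ITER", None),
    (2, "FOR_ITER", 12),       # jumps past the loop when exhausted
    (4, "LOAD_FAST", 0),       # [body]
    (10, "JUMP_BACKWARD", 2),  # targets the FOR_ITER -> gets rewritten
    (12, "END_FOR", None),
]
print(quicken(demo))
```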

(A comment from sweeneyde was marked as outdated.)

sweeneyde (Member, Author) commented Apr 21, 2022

After merging up with main and re-running everything, I get these results:

Microbenchmarks:

| Benchmark | main_micro | for_iter_micro | jump_backward_micro |
|---|---|---|---|
| for_range 20 | 8.05 ns | 6.54 ns: 1.23x faster | 5.62 ns: 1.43x faster |
| for_range 200 | 6.33 ns | 4.28 ns: 1.48x faster | 4.00 ns: 1.58x faster |
| for_range 2_000 | 10.7 ns | 3.97 ns: 2.68x faster | 3.83 ns: 2.79x faster |
| for_range 20_000 | 11.3 ns | 3.90 ns: 2.89x faster | 3.78 ns: 2.97x faster |
| for_range 200_000 | 11.4 ns | 3.90 ns: 2.93x faster | 3.79 ns: 3.02x faster |
| for_list 20 | 6.71 ns | 5.97 ns: 1.12x faster | 5.64 ns: 1.19x faster |
| for_list 200 | 5.41 ns | 4.63 ns: 1.17x faster | 4.31 ns: 1.25x faster |
| for_list 2_000 | 5.16 ns | 4.54 ns: 1.14x faster | 4.07 ns: 1.27x faster |
| for_list 20_000 | 5.23 ns | 4.57 ns: 1.14x faster | 4.12 ns: 1.27x faster |
| for_list 200_000 | 5.86 ns | 5.09 ns: 1.15x faster | 4.70 ns: 1.25x faster |
| Geometric mean | (ref) | 1.55x faster | 1.67x faster |

Pyperformance

All together in one table:

| Benchmark | main_pyperformance | for_iter_pyperformance | for_end_pyperformance |
|---|---|---|---|
| chameleon | 6.29 ms | 6.39 ms: 1.02x slower | 6.17 ms: 1.02x faster |
| chaos | 71.1 ms | 70.3 ms: 1.01x faster | 72.8 ms: 1.02x slower |
| deltablue | 3.69 ms | not significant | 3.74 ms: 1.01x slower |
| django_template | 36.7 ms | not significant | 37.9 ms: 1.03x slower |
| dulwich_log | 76.9 ms | not significant | 76.1 ms: 1.01x faster |
| fannkuch | 391 ms | 369 ms: 1.06x faster | 387 ms: 1.01x faster |
| float | 74.6 ms | 70.2 ms: 1.06x faster | 71.2 ms: 1.05x faster |
| go | 141 ms | 137 ms: 1.03x faster | 137 ms: 1.03x faster |
| hexiom | 6.40 ms | 6.26 ms: 1.02x faster | 6.24 ms: 1.03x faster |
| html5lib | 65.3 ms | 64.2 ms: 1.02x faster | 63.5 ms: 1.03x faster |
| json_loads | 27.6 us | not significant | 27.1 us: 1.02x faster |
| logging_format | 7.15 us | 7.25 us: 1.01x slower | 7.24 us: 1.01x slower |
| logging_simple | 6.39 us | 6.52 us: 1.02x slower | not significant |
| mako | 9.78 ms | 9.51 ms: 1.03x faster | 9.33 ms: 1.05x faster |
| meteor_contest | 105 ms | not significant | 107 ms: 1.02x slower |
| nbody | 90.3 ms | 83.4 ms: 1.08x faster | not significant |
| nqueens | 81.0 ms | 79.2 ms: 1.02x faster | 83.1 ms: 1.03x slower |
| pathlib | 19.3 ms | 19.1 ms: 1.01x faster | 18.9 ms: 1.02x faster |
| pickle_dict | 28.9 us | 28.3 us: 1.02x faster | 28.4 us: 1.02x faster |
| pidigits | 184 ms | not significant | 182 ms: 1.01x faster |
| pyflate | 409 ms | 417 ms: 1.02x slower | not significant |
| python_startup | 8.84 ms | 8.81 ms: 1.00x faster | not significant |
| python_startup_no_site | 6.63 ms | 6.62 ms: 1.00x faster | 6.61 ms: 1.00x faster |
| raytrace | 297 ms | 303 ms: 1.02x slower | 305 ms: 1.03x slower |
| regex_compile | 139 ms | 136 ms: 1.02x faster | 137 ms: 1.01x faster |
| regex_dna | 196 ms | 200 ms: 1.02x slower | not significant |
| regex_effbot | 2.72 ms | 2.89 ms: 1.06x slower | 2.79 ms: 1.03x slower |
| regex_v8 | 22.1 ms | 22.5 ms: 1.02x slower | 21.5 ms: 1.03x faster |
| richards | 50.2 ms | 52.1 ms: 1.04x slower | 51.8 ms: 1.03x slower |
| scimark_fft | 335 ms | 326 ms: 1.03x faster | 328 ms: 1.02x faster |
| scimark_lu | 98.9 ms | not significant | 101 ms: 1.02x slower |
| scimark_monte_carlo | 66.4 ms | 64.2 ms: 1.03x faster | 63.9 ms: 1.04x faster |
| scimark_sor | 109 ms | 107 ms: 1.02x faster | 107 ms: 1.02x faster |
| scimark_sparse_mat_mult | 5.09 ms | 4.62 ms: 1.10x faster | 4.29 ms: 1.19x faster |
| spectral_norm | 93.6 ms | 90.4 ms: 1.04x faster | not significant |
| sqlalchemy_declarative | 147 ms | 144 ms: 1.02x faster | 145 ms: 1.01x faster |
| telco | 6.34 ms | 6.14 ms: 1.03x faster | 6.41 ms: 1.01x slower |
| tornado_http | 178 ms | 173 ms: 1.03x faster | 175 ms: 1.02x faster |
| unpack_sequence | 41.9 ns | 43.5 ns: 1.04x slower | 44.9 ns: 1.07x slower |
| unpickle_pure_python | 230 us | 234 us: 1.01x slower | not significant |
| xml_etree_parse | 157 ms | 160 ms: 1.02x slower | 159 ms: 1.02x slower |
| xml_etree_generate | 74.1 ms | not significant | 73.4 ms: 1.01x faster |
| Geometric mean | (ref) | 1.01x faster | 1.00x faster |

Just looking at the FOR_ITER specialization:

Slower (12):
- regex_effbot: 2.72 ms +- 0.07 ms -> 2.89 ms +- 0.09 ms: 1.06x slower
- unpack_sequence: 41.9 ns +- 1.5 ns -> 43.5 ns +- 1.9 ns: 1.04x slower
- richards: 50.2 ms +- 2.4 ms -> 52.1 ms +- 2.8 ms: 1.04x slower
- xml_etree_parse: 157 ms +- 6 ms -> 160 ms +- 3 ms: 1.02x slower
- logging_simple: 6.39 us +- 0.12 us -> 6.52 us +- 0.33 us: 1.02x slower
- pyflate: 409 ms +- 7 ms -> 417 ms +- 5 ms: 1.02x slower
- raytrace: 297 ms +- 8 ms -> 303 ms +- 8 ms: 1.02x slower
- regex_v8: 22.1 ms +- 0.3 ms -> 22.5 ms +- 0.8 ms: 1.02x slower
- regex_dna: 196 ms +- 2 ms -> 200 ms +- 6 ms: 1.02x slower
- chameleon: 6.29 ms +- 0.18 ms -> 6.39 ms +- 0.17 ms: 1.02x slower
- logging_format: 7.15 us +- 0.17 us -> 7.25 us +- 0.16 us: 1.01x slower
- unpickle_pure_python: 230 us +- 4 us -> 234 us +- 4 us: 1.01x slower

Faster (22):
- scimark_sparse_mat_mult: 5.09 ms +- 0.25 ms -> 4.62 ms +- 0.18 ms: 1.10x faster
- nbody: 90.3 ms +- 1.8 ms -> 83.4 ms +- 2.9 ms: 1.08x faster
- float: 74.6 ms +- 1.3 ms -> 70.2 ms +- 1.4 ms: 1.06x faster
- fannkuch: 391 ms +- 5 ms -> 369 ms +- 5 ms: 1.06x faster
- spectral_norm: 93.6 ms +- 1.1 ms -> 90.4 ms +- 1.7 ms: 1.04x faster
- scimark_monte_carlo: 66.4 ms +- 2.0 ms -> 64.2 ms +- 1.0 ms: 1.03x faster
- telco: 6.34 ms +- 0.16 ms -> 6.14 ms +- 0.13 ms: 1.03x faster
- scimark_fft: 335 ms +- 13 ms -> 326 ms +- 9 ms: 1.03x faster
- mako: 9.78 ms +- 0.12 ms -> 9.51 ms +- 0.13 ms: 1.03x faster
- go: 141 ms +- 12 ms -> 137 ms +- 3 ms: 1.03x faster
- tornado_http: 178 ms +- 7 ms -> 173 ms +- 7 ms: 1.03x faster
- regex_compile: 139 ms +- 3 ms -> 136 ms +- 2 ms: 1.02x faster
- nqueens: 81.0 ms +- 1.9 ms -> 79.2 ms +- 2.5 ms: 1.02x faster
- hexiom: 6.40 ms +- 0.10 ms -> 6.26 ms +- 0.16 ms: 1.02x faster
- pickle_dict: 28.9 us +- 0.7 us -> 28.3 us +- 0.8 us: 1.02x faster
- sqlalchemy_declarative: 147 ms +- 6 ms -> 144 ms +- 4 ms: 1.02x faster
- scimark_sor: 109 ms +- 1 ms -> 107 ms +- 1 ms: 1.02x faster
- html5lib: 65.3 ms +- 2.6 ms -> 64.2 ms +- 2.6 ms: 1.02x faster
- chaos: 71.1 ms +- 2.4 ms -> 70.3 ms +- 2.2 ms: 1.01x faster
- pathlib: 19.3 ms +- 0.4 ms -> 19.1 ms +- 0.4 ms: 1.01x faster
- python_startup: 8.84 ms +- 0.07 ms -> 8.81 ms +- 0.07 ms: 1.00x faster
- python_startup_no_site: 6.63 ms +- 0.05 ms -> 6.62 ms +- 0.07 ms: 1.00x faster

Benchmark hidden because not significant (25): 2to3, crypto_pyaes, deltablue, django_template, dulwich_log, json_dumps, json_loads, logging_silent, meteor_contest, pickle, pickle_list, pickle_pure_python, pidigits, scimark_lu, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, unpickle, unpickle_list, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.01x faster

Just looking at the JUMP_BACKWARD specialization:

Slower (13):
- unpack_sequence: 41.9 ns +- 1.5 ns -> 44.9 ns +- 1.1 ns: 1.07x slower
- django_template: 36.7 ms +- 2.4 ms -> 37.9 ms +- 0.9 ms: 1.03x slower
- richards: 50.2 ms +- 2.4 ms -> 51.8 ms +- 3.2 ms: 1.03x slower
- regex_effbot: 2.72 ms +- 0.07 ms -> 2.79 ms +- 0.04 ms: 1.03x slower
- raytrace: 297 ms +- 8 ms -> 305 ms +- 6 ms: 1.03x slower
- nqueens: 81.0 ms +- 1.9 ms -> 83.1 ms +- 1.3 ms: 1.03x slower
- chaos: 71.1 ms +- 2.4 ms -> 72.8 ms +- 2.0 ms: 1.02x slower
- meteor_contest: 105 ms +- 3 ms -> 107 ms +- 2 ms: 1.02x slower
- scimark_lu: 98.9 ms +- 2.1 ms -> 101 ms +- 2 ms: 1.02x slower
- xml_etree_parse: 157 ms +- 6 ms -> 159 ms +- 3 ms: 1.02x slower
- logging_format: 7.15 us +- 0.17 us -> 7.24 us +- 0.29 us: 1.01x slower
- deltablue: 3.69 ms +- 0.11 ms -> 3.74 ms +- 0.09 ms: 1.01x slower
- telco: 6.34 ms +- 0.16 ms -> 6.41 ms +- 0.18 ms: 1.01x slower

Faster (22):
- scimark_sparse_mat_mult: 5.09 ms +- 0.25 ms -> 4.29 ms +- 0.18 ms: 1.19x faster
- mako: 9.78 ms +- 0.12 ms -> 9.33 ms +- 0.11 ms: 1.05x faster
- float: 74.6 ms +- 1.3 ms -> 71.2 ms +- 1.3 ms: 1.05x faster
- scimark_monte_carlo: 66.4 ms +- 2.0 ms -> 63.9 ms +- 1.0 ms: 1.04x faster
- go: 141 ms +- 12 ms -> 137 ms +- 3 ms: 1.03x faster
- regex_v8: 22.1 ms +- 0.3 ms -> 21.5 ms +- 0.2 ms: 1.03x faster
- html5lib: 65.3 ms +- 2.6 ms -> 63.5 ms +- 2.7 ms: 1.03x faster
- hexiom: 6.40 ms +- 0.10 ms -> 6.24 ms +- 0.11 ms: 1.03x faster
- scimark_fft: 335 ms +- 13 ms -> 328 ms +- 11 ms: 1.02x faster
- pathlib: 19.3 ms +- 0.4 ms -> 18.9 ms +- 0.4 ms: 1.02x faster
- chameleon: 6.29 ms +- 0.18 ms -> 6.17 ms +- 0.14 ms: 1.02x faster
- pickle_dict: 28.9 us +- 0.7 us -> 28.4 us +- 0.6 us: 1.02x faster
- json_loads: 27.6 us +- 1.2 us -> 27.1 us +- 1.3 us: 1.02x faster
- scimark_sor: 109 ms +- 1 ms -> 107 ms +- 2 ms: 1.02x faster
- tornado_http: 178 ms +- 7 ms -> 175 ms +- 6 ms: 1.02x faster
- sqlalchemy_declarative: 147 ms +- 6 ms -> 145 ms +- 4 ms: 1.01x faster
- pidigits: 184 ms +- 1 ms -> 182 ms +- 1 ms: 1.01x faster
- regex_compile: 139 ms +- 3 ms -> 137 ms +- 2 ms: 1.01x faster
- dulwich_log: 76.9 ms +- 1.2 ms -> 76.1 ms +- 1.6 ms: 1.01x faster
- xml_etree_generate: 74.1 ms +- 0.8 ms -> 73.4 ms +- 1.0 ms: 1.01x faster
- fannkuch: 391 ms +- 5 ms -> 387 ms +- 7 ms: 1.01x faster
- python_startup_no_site: 6.63 ms +- 0.05 ms -> 6.61 ms +- 0.06 ms: 1.00x faster

Benchmark hidden because not significant (24): 2to3, crypto_pyaes, json_dumps, logging_silent, logging_simple, nbody, pickle, pickle_list, pickle_pure_python, pyflate, python_startup, regex_dna, spectral_norm, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_sum, sympy_str, unpickle, unpickle_list, unpickle_pure_python, xml_etree_iterparse, xml_etree_process

Geometric mean: 1.00x faster

Comparing the two (specializing FOR_ITER → specializing JUMP_BACKWARD):

Slower (15):
- nbody: 83.4 ms +- 2.9 ms -> 90.4 ms +- 2.1 ms: 1.08x slower
- fannkuch: 369 ms +- 5 ms -> 387 ms +- 7 ms: 1.05x slower
- nqueens: 79.2 ms +- 2.5 ms -> 83.1 ms +- 1.3 ms: 1.05x slower
- spectral_norm: 90.4 ms +- 1.7 ms -> 94.5 ms +- 8.0 ms: 1.05x slower
- telco: 6.14 ms +- 0.13 ms -> 6.41 ms +- 0.18 ms: 1.04x slower
- chaos: 70.3 ms +- 2.2 ms -> 72.8 ms +- 2.0 ms: 1.04x slower
- unpack_sequence: 43.5 ns +- 1.9 ns -> 44.9 ns +- 1.1 ns: 1.03x slower
- scimark_lu: 98.5 ms +- 1.7 ms -> 101 ms +- 2 ms: 1.02x slower
- meteor_contest: 105 ms +- 2 ms -> 107 ms +- 2 ms: 1.02x slower
- django_template: 37.3 ms +- 1.5 ms -> 37.9 ms +- 0.9 ms: 1.02x slower
- float: 70.2 ms +- 1.4 ms -> 71.2 ms +- 1.3 ms: 1.01x slower
- sympy_sum: 170 ms +- 3 ms -> 172 ms +- 6 ms: 1.01x slower
- regex_compile: 136 ms +- 2 ms -> 137 ms +- 2 ms: 1.01x slower
- raytrace: 303 ms +- 8 ms -> 305 ms +- 6 ms: 1.01x slower
- python_startup: 8.81 ms +- 0.07 ms -> 8.84 ms +- 0.07 ms: 1.00x slower

Faster (14):
- scimark_sparse_mat_mult: 4.62 ms +- 0.18 ms -> 4.29 ms +- 0.18 ms: 1.08x faster
- regex_v8: 22.5 ms +- 0.8 ms -> 21.5 ms +- 0.2 ms: 1.05x faster
- chameleon: 6.39 ms +- 0.17 ms -> 6.17 ms +- 0.14 ms: 1.04x faster
- regex_effbot: 2.89 ms +- 0.09 ms -> 2.79 ms +- 0.04 ms: 1.03x faster
- unpickle: 16.1 us +- 0.7 us -> 15.5 us +- 1.2 us: 1.03x faster
- pyflate: 417 ms +- 5 ms -> 407 ms +- 6 ms: 1.02x faster
- regex_dna: 200 ms +- 6 ms -> 195 ms +- 3 ms: 1.02x faster
- logging_simple: 6.52 us +- 0.33 us -> 6.38 us +- 0.13 us: 1.02x faster
- mako: 9.51 ms +- 0.13 ms -> 9.33 ms +- 0.11 ms: 1.02x faster
- json_dumps: 12.8 ms +- 0.4 ms -> 12.6 ms +- 0.3 ms: 1.01x faster
- xml_etree_generate: 74.3 ms +- 0.9 ms -> 73.4 ms +- 1.0 ms: 1.01x faster
- pathlib: 19.1 ms +- 0.4 ms -> 18.9 ms +- 0.4 ms: 1.01x faster
- dulwich_log: 77.0 ms +- 1.6 ms -> 76.1 ms +- 1.6 ms: 1.01x faster
- unpickle_pure_python: 234 us +- 4 us -> 231 us +- 6 us: 1.01x faster

Benchmark hidden because not significant (30): 2to3, crypto_pyaes, deltablue, go, hexiom, html5lib, json_loads, logging_format, logging_silent, pickle, pickle_dict, pickle_list, pickle_pure_python, pidigits, python_startup_no_site, richards, scimark_fft, scimark_monte_carlo, scimark_sor, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_str, tornado_http, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_process

Geometric mean: 1.00x slower

sweeneyde changed the title from "Replace JUMP+FOR_ITER with FOR_END" to "Optimizing FOR_ITER" on Apr 24, 2022
sweeneyde (Member, Author) commented Apr 24, 2022

Changing the issue title to be more general and include specializations without any FOR_END.

gahabana commented Jun 2, 2022

The 3.11 beta has a much slower for loop with range than 3.10 or 3.9.

this piece of code:

i: int
for i in range (0,100_000_000):
    ...

executes 20% slower with 3.11-dev, 3.11.0b1, and 3.11.0b3.

I am not sure whether this is related to the above discussion.

sweeneyde (Member, Author) commented Jun 2, 2022

@gahabana No changes have been made relating to this issue, so please open a new issue about that particular performance regression. Make sure to describe your operating system and how you installed Python.

sweeneyde changed the title from "Optimizing FOR_ITER" to "Optimizing FOR_ITER bytecode" on Jun 2, 2022
sweeneyde (Member, Author) commented Jun 12, 2022

Some more microbenchmark (empty for-loop) results for other FOR_ITER specializations:

  • Lists can get about 1.1x faster.
  • Ranges can get between 1.5x and 3x faster by combining the STORE_FAST.
  • Dict items can get about 1.5x faster by combining the UNPACK_SEQUENCE(2).
  • Enumerate can get about 1.25x faster by combining the UNPACK_SEQUENCE(2).
  • Map can get about 1.2x faster by focusing on the one-iterable case.
  • I couldn't manage to get looping over sets any faster.
  • I got loops over tuples about 1.08x faster.
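For illustration, the dict-items case can be seen in the disassembly, where FOR_ITER is followed by the UNPACK_SEQUENCE that a combined opcode would absorb (exact opcode sequences vary by CPython version):

```python
import dis

def g(d):
    for k, v in d.items():
        pass

# On recent CPython versions the loop header compiles to FOR_ITER
# followed by UNPACK_SEQUENCE(2) and two STORE_FASTs; those are the
# instruction pairs the specializations above would fuse.
ops = [ins.opname for ins in dis.get_instructions(g)]
print(ops)
```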

I collected opcode stats here a couple months ago, and I can't imagine much has changed since then.

If we manage to merge some version of #91713 (specializing list and range), I would think to do dict items and enumerate next.

markshannon pushed a commit that referenced this issue Jun 21, 2022
* Adds FOR_ITER_LIST and FOR_ITER_RANGE specializations.

* Adds _PyLong_AssignValue() internal function to avoid temporary boxing of ints.