Skip to content

gh-100726: optimize construction of range object for medium sized integers #100727

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

eendebakpt
Copy link
Contributor

@eendebakpt eendebakpt commented Jan 3, 2023

The python range object contains code to compute the length from the start, stop and step values. This calculation is performed with PyLong objects which is expensive. This PR adds a fast path for the common case where start, stop and step all fit into a long value.

Benchmark results:

range(0, 5): Mean +- std dev: [main_range] 110 ns +- 1 ns -> [pr_range] 56.8 ns +- 1.3 ns: 1.93x faster
list(range(0, 5)): Mean +- std dev: [main_range] 243 ns +- 1 ns -> [pr_range] 181 ns +- 3 ns: 1.34x faster
range(-2, 3): Mean +- std dev: [main_range] 110 ns +- 2 ns -> [pr_range] 56.6 ns +- 1.0 ns: 1.94x faster
list(range(-2, 3)): Mean +- std dev: [main_range] 244 ns +- 3 ns -> [pr_range] 182 ns +- 4 ns: 1.34x faster
range(2, 60): Mean +- std dev: [main_range] 112 ns +- 1 ns -> [pr_range] 57.1 ns +- 1.4 ns: 1.96x faster
list(range(2, 60)): Mean +- std dev: [main_range] 535 ns +- 28 ns -> [pr_range] 472 ns +- 27 ns: 1.13x faster
range(2, 60, 2): Mean +- std dev: [main_range] 115 ns +- 1 ns -> [pr_range] 59.4 ns +- 1.4 ns: 1.94x faster
list(range(2, 60, 2)): Mean +- std dev: [main_range] 382 ns +- 11 ns -> [pr_range] 317 ns +- 5 ns: 1.21x faster
range(73786976294838206464, 73786976294838206469, 2): Mean +- std dev: [main_range] 136 ns +- 1 ns -> [pr_range] 137 ns +- 3 ns: 1.01x slower
list(range(73786976294838206464, 73786976294838206469, 2)): Mean +- std dev: [main_range] 444 ns +- 3 ns -> [pr_range] 440 ns +- 6 ns: 1.01x faster
for loop with range): Mean +- std dev: [main_range] 878 ns +- 12 ns -> [pr_range] 815 ns +- 21 ns: 1.08x faster

Geometric mean: 1.39x faster
Benchmark script
import pyperf
runner = pyperf.Runner()

cases = [(0,5), (-2, 3), (2, 60)]
for s,t in cases:
    time = runner.timeit(name=f'range({s}, {t})', stmt=f"range({s}, {t})")
    time = runner.timeit(name=f'list(range({s}, {t}))', stmt=f"list(range({s}, {t}))")

time = runner.timeit(name=f'range({s}, {t}, 2)', stmt=f"range({s}, {t}, 2)")
time = runner.timeit(name=f'list(range({s}, {t}, 2))', stmt=f"list(range({s}, {t}, 2))")

big = 2**66
s,t=big, big+5
time = runner.timeit(name=f'range({s}, {t}, 2)', stmt=f"range({s}, {t}, 2)")
time = runner.timeit(name=f'list(range({s}, {t}, 2))', stmt=f"list(range({s}, {t}, 2))")
    
stmt="""
v=0
for ii in range(20):
    v += ii*ii
"""
time = runner.timeit(name='for loop with range)', stmt=stmt)

Notes

  • The macro and inline method IS_MEDIUM_VALUE and medium_value have been moved from longobject.c to the internal header files. The names have been kept the same to avoid too many changes. But since they are now visible outside longobject.c we can rename them to _PyLong_IsMediumValue and _PyLong_MediumValue.

@Fidget-Spinner Fidget-Spinner added the performance Performance or resource usage label Jan 4, 2023
@mdickinson
Copy link
Member

Could you try this using PyLong_AsLong instead of MEDIUM_VALUE? This would cover more cases, be more consistent with the existing long-handling in rangeobject.c, and avoid the internal details of the long object representation leaking out too much. I'm not too comfortable with making the range implementation dependent on the longobject internals.

@eendebakpt
Copy link
Contributor Author

Could you try this using PyLong_AsLong instead of MEDIUM_VALUE? This would cover more cases, be more consistent with the existing long-handling in rangeobject.c, and avoid the internal details of the long object representation leaking out too much. I'm not too comfortable with making the range implementation dependent on the longobject internals.

@mdickinson The PyLong_AsLong indeed has many advantages, but it internally sets error values and needs PyErr_Occurred() to disambiguate which impacts performance.

The PyLong_AsLongAndOverflow is also public and perform a bit less work, so I tried that method. The branch with PyLong_AsLongAndOverflow is also faster than main for the common case (but not as fast as this PR). For the case of large integers PyLong_AsLongAndOverflow is slower than main (but not a lot).

Benchmark results

Main:

range(0, 5): Mean +- std dev: 110 ns +- 1 ns
list(range(0, 5)): Mean +- std dev: 243 ns +- 1 ns
range(-2, 3): Mean +- std dev: 110 ns +- 2 ns
list(range(-2, 3)): Mean +- std dev: 244 ns +- 3 ns
range(2, 60): Mean +- std dev: 112 ns +- 1 ns
list(range(2, 60)): Mean +- std dev: 535 ns +- 28 ns
range(2, 60, 2): Mean +- std dev: 115 ns +- 1 ns
list(range(2, 60, 2)): Mean +- std dev: 382 ns +- 11 ns
range(73786976294838206464, 73786976294838206469, 2): Mean +- std dev: 136 ns +- 1 ns
list(range(73786976294838206464, 73786976294838206469, 2)): Mean +- std dev: 444 ns +- 3 ns
for loop with range(20): Mean +- std dev: 878 ns +- 12 ns

This PR:

range(0, 5): Mean +- std dev: 56.8 ns +- 1.3 ns
list(range(0, 5)): Mean +- std dev: 181 ns +- 3 ns
range(-2, 3): Mean +- std dev: 56.6 ns +- 1.0 ns
list(range(-2, 3)): Mean +- std dev: 182 ns +- 4 ns
range(2, 60): Mean +- std dev: 57.1 ns +- 1.4 ns
list(range(2, 60)): Mean +- std dev: 472 ns +- 27 ns
range(2, 60, 2): Mean +- std dev: 59.4 ns +- 1.4 ns
list(range(2, 60, 2)): Mean +- std dev: 317 ns +- 5 ns
range(73786976294838206464, 73786976294838206469, 2): Mean +- std dev: 137 ns +- 3 ns
list(range(73786976294838206464, 73786976294838206469, 2)): Mean +- std dev: 440 ns +- 6 ns
for loop with range(20): Mean +- std dev: 815 ns +- 21 ns

Branch using PyLong_AsLongAndOverflow: main...eendebakpt:cpython:range_fast_path_v2

range(0, 5): Mean +- std dev: 63.3 ns +- 0.8 ns
list(range(0, 5)): Mean +- std dev: 193 ns +- 3 ns
range(-2, 3): Mean +- std dev: 63.9 ns +- 1.3 ns
list(range(-2, 3)): Mean +- std dev: 194 ns +- 3 ns
range(2, 60): Mean +- std dev: 64.0 ns +- 1.7 ns
list(range(2, 60)): Mean +- std dev: 478 ns +- 10 ns
range(2, 60, 2): Mean +- std dev: 66.8 ns +- 3.5 ns
list(range(2, 60, 2)): Mean +- std dev: 333 ns +- 13 ns
range(73786976294838206464, 73786976294838206469, 2): Mean +- std dev: 149 ns +- 6 ns
list(range(73786976294838206464, 73786976294838206469, 2)): Mean +- std dev: 455 ns +- 8 ns
for loop with range(20): Mean +- std dev: 831 ns +- 17 ns

The benchmark of list(range(73786976294838206464, 73786976294838206469, 2)) slows down from main to the PyLong_AsLongOverflow from 444 to 455 ns. (range(73786976294838206464, 73786976294838206469, 2) slows down as well, but a standalone range is hardly used).

I am fine with accepting a minor slowdown for the uncommon case, although perhaps there are more cases to consider than I included in the benchmark.

Which cases would PyLong_AsLong or PyLong_AsLongOverflow cover more than the IS_MEDIUM_VALUE? According to the documentation the input arguments to compute_range_length should satisfy PyLong_Check(arg)=1. So values of like numpy.int64(3) or Decimal(4,1) are converted at an earlier stage (e.g. in range_from_array).

@mdickinson
Copy link
Member

mdickinson commented Jan 4, 2023

Which cases would PyLong_AsLong or PyLong_AsLongOverflow cover more than the IS_MEDIUM_VALUE?

Using a PyLong_AsLong (or a variant) would allow anything that fits in a C long to use the fast path, rather than just those values that fit in a single digit. So on a typical 64-bit machine, that's all values between -2**63 and 2**63-1, rather than just values in the range [-2**30, 2**30] (the "medium" values).

IOW, thinking of the range implementation as a client user of the PyLong implementation, what that client needs is a way to say, efficiently, "tell me whether this object (which I already know is a PyLongObject) fits in a C long, and if so, give me its value as a C long". IS_MEDIUM_VALUE is doing something quite different, that's dependent on the internal representation of a Python long - the range implementation shouldn't know or care about "medium" values.

It's that information about digits, twodigits, medium values, number of bits per digit, etc., that I think we should keep encapsulated within the long implementation - it shouldn't be relevant to users of that implementation. If it starts leaking into other parts of the codebase, that's going to turn into a maintenance headache whenever the long implementation needs tweaking.

If none of the existing public or private long API functions does what we need efficiently (e.g., because it's doing index checks and the like), I'd be open to adding a new function that does answer the "tell me whether this PyLong fits in a C long" question efficiently.

@eendebakpt
Copy link
Contributor Author

@mdickinson I refactored the PyLong_AsLongAndOverflow into a part with error checking and a new (private) _PyLong_AsLongAndOverflow that performs the actual conversion to long and overflow check.

Benchmark of the updated PR against main:

range(0, 5): Mean +- std dev: [main] 110 ns +- 1 ns -> [pr_v4] 61.2 ns +- 1.5 ns: 1.79x faster
list(range(0, 5)): Mean +- std dev: [main] 244 ns +- 2 ns -> [pr_v4] 186 ns +- 1 ns: 1.31x faster
range(-2, 3): Mean +- std dev: [main] 111 ns +- 4 ns -> [pr_v4] 61.6 ns +- 1.5 ns: 1.80x faster
list(range(-2, 3)): Mean +- std dev: [main] 244 ns +- 1 ns -> [pr_v4] 187 ns +- 1 ns: 1.30x faster
range(2, 60): Mean +- std dev: [main] 112 ns +- 1 ns -> [pr_v4] 62.3 ns +- 3.7 ns: 1.79x faster
list(range(2, 60)): Mean +- std dev: [main] 527 ns +- 7 ns -> [pr_v4] 466 ns +- 9 ns: 1.13x faster
range(2, 60, 2): Mean +- std dev: [main] 116 ns +- 1 ns -> [pr_v4] 64.2 ns +- 1.0 ns: 1.80x faster
list(range(2, 60, 2)): Mean +- std dev: [main] 379 ns +- 3 ns -> [pr_v4] 321 ns +- 8 ns: 1.18x faster
range(4398046511104, 4398046511109, 2): Mean +- std dev: [main] 135 ns +- 1 ns -> [pr_v4] 68.1 ns +- 1.9 ns: 1.98x faster
list(range(4398046511104, 4398046511109, 2)): Mean +- std dev: [main] 311 ns +- 4 ns -> [pr_v4] 235 ns +- 4 ns: 1.32x faster
range(73786976294838206464, 73786976294838206469, 2): Mean +- std dev: [main] 137 ns +- 6 ns -> [pr_v4] 148 ns +- 1 ns: 1.08x slower
list(range(73786976294838206464, 73786976294838206469, 2)): Mean +- std dev: [main] 444 ns +- 2 ns -> [pr_v4] 447 ns +- 12 ns: 1.01x slower
for loop with range(10): Mean +- std dev: [main] 879 ns +- 17 ns -> [pr_v4] 826 ns +- 16 ns: 1.06x faster
for loop with range(20): Mean +- std dev: [main] 881 ns +- 22 ns -> [pr_v4] 824 ns +- 19 ns: 1.07x faster

Geometric mean: 1.35x faster

Most test cases are faster, now also the cases with long values that are not medium. The new check _PyLong_AsLongAndOverflow is a bit slower than IS_MEDIUM_INT, because it has to do a bit more work.

@eendebakpt eendebakpt changed the title gh-100726: optimize construction of range object for medium sized integers Draft: gh-100726: optimize construction of range object for medium sized integers Jan 4, 2023
@eendebakpt eendebakpt changed the title Draft: gh-100726: optimize construction of range object for medium sized integers gh-100726: optimize construction of range object for medium sized integers Jan 4, 2023
@mdickinson mdickinson self-requested a review January 6, 2023 14:18
@mdickinson
Copy link
Member

Thanks for the updates! I should have time to take a proper look this weekend.

@eendebakpt
Copy link
Contributor Author

@mdickinson Thanks for reviewing. The refactoring of PyLong_AsLongAndOverflow works well for this PR, but I am a little concerned than if the _PyLong_AsLongAndOverflow is not inlined in PyLong_AsLongAndOverflow there could be a minor slowdown for the other method calling PyLong_AsLongAndOverflow. I tried the pyperformance benchmark, but my machine is too unstable to give definite results.

I created #100810 as a simplified version of this PR. Benchmark:

range(0, 5): Mean +- std dev: 65.1 ns +- 1.1 ns
list(range(0, 5)): Mean +- std dev: 194 ns +- 2 ns
range(-2, 3): Mean +- std dev: 65.5 ns +- 0.9 ns
list(range(-2, 3)): Mean +- std dev: 195 ns +- 2 ns
range(2, 60): Mean +- std dev: 65.5 ns +- 0.9 ns
list(range(2, 60)): Mean +- std dev: 477 ns +- 16 ns
range(2, 60, 2): Mean +- std dev: 70.6 ns +- 1.0 ns
list(range(2, 60, 2)): Mean +- std dev: 328 ns +- 5 ns
range(4398046511104, 4398046511109, 2): Mean +- std dev: 76.0 ns +- 0.6 ns
list(range(4398046511104, 4398046511109, 2)): Mean +- std dev: 242 ns +- 1 ns
range(73786976294838206464, 73786976294838206469, 2): Mean +- std dev: 145 ns +- 3 ns
list(range(73786976294838206464, 73786976294838206469, 2)): Mean +- std dev: 450 ns +- 5 ns
range(-1, -1, -1): Mean +- std dev: 67.4 ns +- 1.2 ns
range(-1, -1, 2**66): Mean +- std dev: 248 ns +- 3 ns
for loop with range(10): Mean +- std dev: 826 ns +- 19 ns
for loop with range(20): Mean +- std dev: 827 ns +- 26 ns

The simplified version is not as fast as this PR, but still a solid improvement over current main. I have no strong preference for either one of the two.

The cases that slowdown compared to main (both this PR and the simplified one) involve very large integers that do not fit into a C long, e.g. range(73786976294838206464, 73786976294838206469, 2). I suspect these cases are rare, and if used the computation time will not be determined by the range object itself.

@mdickinson
Copy link
Member

Thanks for the discussion. I confess that I'm much more comfortable with #100810; I'll continue review there.

@eendebakpt eendebakpt closed this Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting review performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants