
gh-91432: Specialize FOR_ITER #91713

Merged

merged 44 commits into python:main from special_for_iter2 on Jun 21, 2022

Conversation


@sweeneyde sweeneyde commented Apr 19, 2022

Faster-Cpython thread: faster-cpython/ideas#67
GH issue: #91432


@sweeneyde sweeneyde commented Apr 19, 2022

Microbenchmarks so far:

benchmark:

from pyperf import Runner, perf_counter
from itertools import repeat

def for_range(loops, length):
    repetitions = repeat(None, loops)
    R = range(length)

    t0 = perf_counter()
    for _ in repetitions:
        for x in R:
            pass
    t1 = perf_counter()

    return t1 - t0

def for_list(loops, length):
    repetitions = repeat(None, loops)
    L = list(map(float, range(length)))

    t0 = perf_counter()
    for _ in repetitions:
        for x in L:
            pass
    t1 = perf_counter()

    return t1 - t0

bench = Runner().bench_time_func
for n in [20, 200, 2_000, 20_000, 200_000]:
    bench(f"for_range {n:_}", for_range, n, inner_loops=n)
    bench(f"for_list {n:_}", for_list, n, inner_loops=n)

Result:

Faster (10):
- for_range 200_000: 11.6 ns +- 0.4 ns -> 3.90 ns +- 0.02 ns: 2.97x faster
- for_range 20_000: 11.3 ns +- 0.1 ns -> 3.91 ns +- 0.03 ns: 2.90x faster
- for_range 2_000: 10.7 ns +- 0.2 ns -> 3.97 ns +- 0.03 ns: 2.69x faster
- for_range 200: 6.41 ns +- 0.16 ns -> 4.27 ns +- 0.06 ns: 1.50x faster
- for_range 20: 8.82 ns +- 0.48 ns -> 7.12 ns +- 0.42 ns: 1.24x faster
- for_list 20: 6.99 ns +- 0.50 ns -> 6.11 ns +- 0.52 ns: 1.14x faster
- for_list 20_000: 5.12 ns +- 0.13 ns -> 4.55 ns +- 0.05 ns: 1.12x faster
- for_list 200: 5.19 ns +- 0.12 ns -> 4.62 ns +- 0.11 ns: 1.12x faster
- for_list 2_000: 5.02 ns +- 0.05 ns -> 4.53 ns +- 0.09 ns: 1.11x faster
- for_list 200_000: 5.57 ns +- 0.08 ns -> 5.06 ns +- 0.06 ns: 1.10x faster

Geometric mean: 1.54x faster
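For reference, pyperf's summary line is the geometric mean of the individual speedup factors. A quick sketch of that calculation over the numbers above:

```python
import math

# Speedup factors taken from the microbenchmark results above.
factors = [2.97, 2.90, 2.69, 1.50, 1.24, 1.14, 1.12, 1.12, 1.11, 1.10]

# The geometric mean (n-th root of the product) averages multiplicative
# speedups fairly; computed via logs for numerical stability.
geomean = math.exp(sum(map(math.log, factors)) / len(factors))
print(f"{geomean:.2f}x faster")  # matches the 1.54x reported above
```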


@sweeneyde sweeneyde commented Apr 19, 2022

The one unfortunate thing about this implementation so far is that it restricts the fast range iterator (which currently uses arbitrary C longs) to values that fit in a 1-digit int, i.e. abs(x) < 2**30 or abs(x) < 2**15, depending on the platform.
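For readers unfamiliar with the internals: CPython stores an int's magnitude in 15- or 30-bit digits, and sys.int_info exposes the digit width of the current build. A small sketch of the restriction described above (the helper name fits_one_digit is made up for illustration):

```python
import sys

# CPython builds use either 15- or 30-bit digits for int storage.
bits = sys.int_info.bits_per_digit
limit = 2 ** bits  # 2**30 on most 64-bit builds, 2**15 otherwise

def fits_one_digit(x):
    """True if abs(x) fits in a single internal digit."""
    return -limit < x < limit

print(fits_one_digit(1000))   # small values always qualify
print(fits_one_digit(2**40))  # never fits in one digit
```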

@sweeneyde sweeneyde added performance interpreter-core 3.11 labels Apr 19, 2022

@sweeneyde sweeneyde commented Apr 20, 2022

Pyperformance: Geometric mean: 1.01x faster:

Results:

Slower (9):
- pyflate: 414 ms +- 8 ms -> 424 ms +- 5 ms: 1.02x slower
- pidigits: 179 ms +- 1 ms -> 183 ms +- 2 ms: 1.02x slower
- dulwich_log: 75.4 ms +- 1.2 ms -> 76.9 ms +- 1.8 ms: 1.02x slower
- chameleon: 6.20 ms +- 0.25 ms -> 6.32 ms +- 0.24 ms: 1.02x slower
- pickle_list: 4.43 us +- 0.13 us -> 4.51 us +- 0.09 us: 1.02x slower
- unpickle_list: 4.77 us +- 0.07 us -> 4.86 us +- 0.08 us: 1.02x slower
- pathlib: 19.0 ms +- 0.5 ms -> 19.3 ms +- 0.3 ms: 1.01x slower
- scimark_lu: 98.1 ms +- 1.2 ms -> 99.3 ms +- 1.7 ms: 1.01x slower
- go: 139 ms +- 2 ms -> 140 ms +- 3 ms: 1.01x slower

Faster (26):
- scimark_sparse_mat_mult: 5.02 ms +- 0.18 ms -> 4.46 ms +- 0.30 ms: 1.12x faster
- nbody: 89.9 ms +- 2.1 ms -> 83.5 ms +- 5.4 ms: 1.08x faster
- mako: 10.1 ms +- 0.9 ms -> 9.42 ms +- 0.11 ms: 1.08x faster
- pickle: 12.4 us +- 0.7 us -> 11.6 us +- 1.1 us: 1.07x faster
- float: 75.7 ms +- 1.5 ms -> 71.5 ms +- 1.1 ms: 1.06x faster
- unpack_sequence: 44.6 ns +- 5.0 ns -> 42.5 ns +- 0.8 ns: 1.05x faster
- meteor_contest: 109 ms +- 3 ms -> 105 ms +- 2 ms: 1.04x faster
- regex_v8: 22.1 ms +- 0.4 ms -> 21.3 ms +- 0.3 ms: 1.04x faster
- crypto_pyaes: 85.4 ms +- 2.4 ms -> 82.3 ms +- 1.9 ms: 1.04x faster
- telco: 6.36 ms +- 0.15 ms -> 6.13 ms +- 0.16 ms: 1.04x faster
- chaos: 72.9 ms +- 1.7 ms -> 70.4 ms +- 1.9 ms: 1.04x faster
- unpickle: 16.0 us +- 1.1 us -> 15.5 us +- 1.2 us: 1.03x faster
- django_template: 37.7 ms +- 0.8 ms -> 36.6 ms +- 1.8 ms: 1.03x faster
- html5lib: 66.7 ms +- 2.6 ms -> 64.8 ms +- 2.6 ms: 1.03x faster
- regex_effbot: 2.87 ms +- 0.04 ms -> 2.80 ms +- 0.03 ms: 1.02x faster
- sqlalchemy_imperative: 23.7 ms +- 0.6 ms -> 23.2 ms +- 0.9 ms: 1.02x faster
- regex_dna: 197 ms +- 3 ms -> 193 ms +- 4 ms: 1.02x faster
- unpickle_pure_python: 231 us +- 7 us -> 227 us +- 5 us: 1.02x faster
- fannkuch: 384 ms +- 6 ms -> 376 ms +- 6 ms: 1.02x faster
- regex_compile: 138 ms +- 2 ms -> 136 ms +- 3 ms: 1.02x faster
- hexiom: 6.30 ms +- 0.12 ms -> 6.20 ms +- 0.14 ms: 1.02x faster
- sqlite_synth: 2.35 us +- 0.04 us -> 2.32 us +- 0.03 us: 1.01x faster
- logging_silent: 92.5 ns +- 2.0 ns -> 91.6 ns +- 1.7 ns: 1.01x faster
- sympy_integrate: 20.9 ms +- 0.3 ms -> 20.7 ms +- 0.3 ms: 1.01x faster
- xml_etree_process: 52.5 ms +- 1.0 ms -> 52.2 ms +- 0.8 ms: 1.01x faster
- scimark_sor: 107 ms +- 2 ms -> 106 ms +- 2 ms: 1.01x faster

Benchmark hidden because not significant (24): 2to3, deltablue, json_dumps, json_loads, logging_format, logging_simple, nqueens, pickle_dict, pickle_pure_python, python_startup, python_startup_no_site, raytrace, richards, scimark_fft, scimark_monte_carlo, spectral_norm, sqlalchemy_declarative, sympy_expand, sympy_sum, sympy_str, tornado_http, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

Geometric mean: 1.01x faster


@markshannon markshannon commented Apr 20, 2022

I'm not seeing as much of a speedup, but I think this is worthwhile even if it produces no speedup, as it enables specialization for other types, specifically generators.

@sweeneyde

This comment has been hidden.


@markshannon markshannon left a comment

This adds quite a lot of complexity to the specialization for range, handling the following store as well.

What sort of performance do you get just specializing the FOR_ITER and not the following STORE_FAST?

# check first 100 entries
self.assert_iterators_equal(iter1, iter2, test_id, limit=100)
# check first 10 entries
self.assert_iterators_equal(iter1, iter2, test_id, limit=10)
@markshannon markshannon Apr 20, 2022

Why reduce from 100 to 10?

@sweeneyde sweeneyde Apr 20, 2022

Doubling the number of M values made the test about 4x slower, and it started to feel a bit sluggish, so I reduced the limit to offset that.

it->start = start;
it->step = step;
it->len = len;
it->start = Py_SAFE_DOWNCAST(start, long, sdigit);
@markshannon markshannon Apr 20, 2022

Would it be cleaner to change the signature of fast_range_iter to take sdigits, rather than longs?

@sweeneyde sweeneyde Apr 24, 2022

I'm not so sure -- downcasts need to happen somewhere, and fast_range_iter is called in two places. The PyLong_AsLong calls could theoretically be replaced with Py_SIZE checks and ob_digit[0] accesses, but I think the existing code is fine using the public API.

Python/ceval.c (outdated review thread, resolved)

@sweeneyde sweeneyde commented Apr 20, 2022

What sort of performance do you get just specializing the FOR_ITER and not the following STORE_FAST?

I measured no speedup at all, even on microbenchmarks, when the range opcode was just something like

            if (r->index < r->len) {
                PyObject *res = PyLong_FromLong(
                    (long)(r->start + (unsigned long)(r->index++) * r->step));
                if (res == NULL) {
                    goto error;
                }
                PUSH(res);
                PREDICT(STORE_FAST);
                NOTRACE_DISPATCH();
            }
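The pairing being specialized is visible directly in the bytecode: a plain for loop compiles to FOR_ITER immediately followed by STORE_FAST, which is what motivates folding the store into the specialized instruction. A quick check with dis:

```python
import dis

def loop(r):
    for x in r:
        pass

# Each iteration executes FOR_ITER (fetch the next item) and then
# STORE_FAST (bind it to the local `x`); the specialization handles
# both steps in one instruction for ranges.
ops = [ins.opname for ins in dis.get_instructions(loop)]
print("FOR_ITER" in ops and "STORE_FAST" in ops)
```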

@sweeneyde sweeneyde changed the title Specialize FOR_ITER gh-91432: Specialize FOR_ITER Apr 24, 2022

@markshannon markshannon commented Jun 10, 2022

@sweeneyde
Thanks for persisting with this.
How do you think this compares to faster-cpython/ideas#392?


@sweeneyde sweeneyde commented Jun 10, 2022

How do you think this compares to faster-cpython/ideas#392?

I'd think that this is conceptually simpler with the existing specialization framework and lack of tagged stack values, but the "Virtual iterators" idea seems like it could be worth exploring more. IIUC, putting an integer on the stack rather than off in some ListIterObject saves a level of indirection and helps cache locality, but I have no idea how to predict how significant that would be.

Python/ceval.c Outdated
}
JUMPBY(INLINE_CACHE_ENTRIES_FOR_ITER + 1);
sdigit value = r->start + (digit)(r->index++) * r->step;
if (value < _PY_NSMALLPOSINTS && value >= -_PY_NSMALLNEGINTS) {
@markshannon markshannon Jun 17, 2022

I think there are too many branches here.
Ideally a specialized instruction should have a few DEOPT_IF tests, then the rest of the code should be (mostly) branchless.

This also includes far too much knowledge of the internals of the int object, IMO.

Would performance be measurably worse if start, stop and step were Py_ssize_t and this code were simplified to PyLong_FromSsize_t?
This feels like an elaborate workaround for the oddities of our int implementation.
We should be fixing that, not working around it.

@sweeneyde sweeneyde Jun 18, 2022

I think there may be a middle ground here: add a function like this to longobject.c:

int
_PyLong_AssignValueLong(PyLongObject **location, long value)
{
    if (/* value is in the small-int cache */) {
        /* point *location at the cached small int, decref the old long */
    }
    else if (Py_REFCNT(*location) == 1 /* && *location has the right size */) {
        /* mutate the existing long in place */
    }
    else {
        Py_SETREF(*location, (PyLongObject *)PyLong_FromLong(value));
    }
    return 0;
}

Sort of similar in spirit to PyUnicode_Append()

That way, the PyLongObject implementation details stay out of the eval loop, but we still get the benefit of avoiding the allocation, which I believe is the main source of the speedup.
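The small-int cache the sketch relies on is observable from Python: in CPython, ints in a small range (currently -5 through 256) are preallocated singletons, so "creating" one allocates nothing.

```python
# CPython preallocates small ints (-5..256), so producing one of them
# reuses a cached object instead of allocating. int("100") forces the
# value to be built at runtime rather than shared as a literal constant.
a = int("100")
b = int("100")
print(a is b)  # True on CPython: both refer to the cached singleton
```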

@sweeneyde sweeneyde Jun 19, 2022

This could theoretically be re-used for enumerate() as well.

And even if the implementation of integers changes (e.g. new freelists, no more small ints), the operation still makes sense.


@markshannon markshannon commented Jun 17, 2022

Ultimately we want to have virtual iterators, and specialization of FOR_ITER for lists, ranges and generators (and maybe others?).

If I were doing the whole thing myself, from scratch, I would do the following:

  1. Virtual iterators
  2. Specialize for lists and ranges.
  3. Specialize for generators.

Since we already have this PR, it probably makes sense to do 2 first, but I think we need to simplify FOR_ITER_RANGE first.


@markshannon markshannon commented Jun 20, 2022

Pyperformance shows a 1% speedup with 35 faster, and 15 slower.

+-------------------------+------------------------------------+-------------------------------------------------+
| Benchmark | 2022-06-18_20-47-main-6066f450b91f | 2022-06-19_10-01-special_for_iter2-d55868f9a354 |
+=========================+====================================+=================================================+
| scimark_sparse_mat_mult | 7.32 ms | 6.40 ms: 1.14x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| telco | 11.3 ms | 10.7 ms: 1.06x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| html5lib | 106 ms | 99.9 ms: 1.06x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| scimark_fft | 550 ms | 526 ms: 1.05x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| regex_compile | 217 ms | 209 ms: 1.04x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| logging_silent | 160 ns | 156 ns: 1.03x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| scimark_monte_carlo | 111 ms | 108 ms: 1.03x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| mako | 17.3 ms | 16.9 ms: 1.03x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| hexiom | 10.3 ms | 10.1 ms: 1.02x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| regex_v8 | 36.1 ms | 35.5 ms: 1.02x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| regex_effbot | 5.92 ms | 5.82 ms: 1.02x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| pickle_pure_python | 476 us | 470 us: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| pickle | 17.6 us | 17.3 us: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| crypto_pyaes | 123 ms | 122 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| json_dumps | 21.1 ms | 20.9 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| regex_dna | 355 ms | 351 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| scimark_lu | 183 ms | 181 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| raytrace | 475 ms | 470 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| spectral_norm | 164 ms | 162 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| unpickle_pure_python | 343 us | 340 us: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| logging_format | 11.0 us | 10.9 us: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| nqueens | 142 ms | 141 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| sympy_integrate | 33.6 ms | 33.4 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| xml_etree_iterparse | 165 ms | 164 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| pyflate | 674 ms | 669 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| xml_etree_generate | 131 ms | 130 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| float | 123 ms | 122 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| scimark_sor | 192 ms | 191 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| deltablue | 5.52 ms | 5.48 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| meteor_contest | 173 ms | 172 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| logging_simple | 10.1 us | 10.1 us: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| sympy_str | 462 ms | 459 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| python_startup | 14.0 ms | 14.0 ms: 1.00x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| python_startup_no_site | 10.3 ms | 10.3 ms: 1.00x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| pidigits | 324 ms | 325 ms: 1.00x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| 2to3 | 407 ms | 408 ms: 1.00x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| pickle_list | 6.90 us | 6.92 us: 1.00x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| django_template | 53.6 ms | 53.9 ms: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| genshi_text | 35.8 ms | 36.0 ms: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| sqlite_synth | 4.04 us | 4.07 us: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| pickle_dict | 50.0 us | 50.4 us: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| unpickle | 21.5 us | 21.7 us: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| chameleon | 11.2 ms | 11.3 ms: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| unpack_sequence | 75.5 ns | 76.3 ns: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| unpickle_list | 7.92 us | 8.02 us: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| dulwich_log | 105 ms | 106 ms: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| fannkuch | 620 ms | 631 ms: 1.02x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| chaos | 112 ms | 114 ms: 1.02x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| go | 219 ms | 224 ms: 1.02x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| Geometric mean | (ref) | 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+

Benchmark hidden because not significant (12): tornado_http, sqlalchemy_imperative, sqlalchemy_declarative, sympy_sum, json_loads, sympy_expand, nbody, pathlib, xml_etree_process, genshi_xml, xml_etree_parse, richards

@markshannon markshannon left a comment

A couple of minor issues, otherwise looks good.

@@ -262,6 +262,35 @@ _PyLong_FromSTwoDigits(stwodigits x)
return _PyLong_FromLarge(x);
}

int
_PyLong_AssignValue(PyObject **target, long value)
@markshannon markshannon Jun 20, 2022

Could you make this Py_ssize_t rather than long?
On 64 bit windows, sizeof(long) != sizeof(void *) which makes reasoning about code with long harder.

@sweeneyde sweeneyde Jun 20, 2022

Should we change PyRangeIterator to use Py_ssize_t as well?

@markshannon markshannon Jun 20, 2022

Ultimately yes. But maybe not in this PR, since that isn't new code.

if (_PyLong_AssignValue(&GETLOCAL(_Py_OPARG(next)), value) < 0) {
goto error;
}
JUMPBY(INLINE_CACHE_ENTRIES_FOR_ITER + 1);
@markshannon markshannon Jun 20, 2022

Could you add a comment explaining why this is INLINE_CACHE_ENTRIES_FOR_ITER + 1 rather than just INLINE_CACHE_ENTRIES_FOR_ITER.


@bedevere-bot bedevere-bot commented Jun 20, 2022

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

Py_XDECREF(old);
return 0;
}
else if (old != NULL && PyLong_CheckExact(old) &&
@markshannon markshannon Jun 20, 2022

This only works for positive integers.
Given that it is only used for range iteration, that's fine for now.
But could you add a comment, so it doesn't get overlooked when we implement faster-cpython/ideas#147


@markshannon markshannon commented Jun 20, 2022

Thanks. Looks good.

@markshannon markshannon added the 🔨 test-with-buildbots label Jun 20, 2022

@markshannon markshannon commented Jun 20, 2022

I'll merge once the build bots are happy.


@bedevere-bot bedevere-bot commented Jun 20, 2022

🤖 New build scheduled with the buildbot fleet by @markshannon for commit db21da1 🤖

If you want to schedule another build, you need to add the "🔨 test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots label Jun 20, 2022
@markshannon markshannon merged commit 5fcfdd8 into python:main Jun 21, 2022
85 of 86 checks passed
@sweeneyde sweeneyde deleted the special_for_iter2 branch Jun 21, 2022
Labels: 3.12, interpreter-core, performance