gh-91432: Specialize FOR_ITER #91713
Conversation
Microbenchmarks so far:

```python
from pyperf import Runner, perf_counter
from itertools import repeat

def for_range(loops, length):
    repetitions = repeat(None, loops)
    R = range(length)
    t0 = perf_counter()
    for _ in repetitions:
        for x in R:
            pass
    t1 = perf_counter()
    return t1 - t0

def for_list(loops, length):
    repetitions = repeat(None, loops)
    L = list(map(float, range(length)))
    t0 = perf_counter()
    for _ in repetitions:
        for x in L:
            pass
    t1 = perf_counter()
    return t1 - t0

bench = Runner().bench_time_func
for n in [20, 200, 2_000, 20_000, 200_000]:
    bench(f"for_range {n:_}", for_range, n, inner_loops=n)
    bench(f"for_list {n:_}", for_list, n, inner_loops=n)
```
The one unfortunate thing about this implementation so far is that it restricts the fast range iterator (which currently uses arbitrary C `long`s) to values that fit in a 1-digit int.
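With CPython's default 30-bit digits, "fits in a 1-digit int" means a magnitude of at most `PyLong_MASK` (2**30 - 1). A hypothetical sketch of such a guard, not code from this PR:

```c
/* Hypothetical helper: a range can use the fast iterator only if all
   of its parameters fit in a single 30-bit digit. */
static int
fits_in_one_digit(long v)
{
    return -(long)PyLong_MASK <= v && v <= (long)PyLong_MASK;
}
```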
Pyperformance: geometric mean 1.01x faster.
I'm not seeing as much of a speedup, but I think this is worthwhile even if it produces no speedup, as it enables specialization for other types, specifically generators.
This adds quite a lot of complexity to the specialization for `range`, handling the following store as well. What sort of performance do you get just specializing the `FOR_ITER` and not the following `STORE_FAST`?
Lib/test/test_range.py
Outdated
```diff
-# check first 100 entries
-self.assert_iterators_equal(iter1, iter2, test_id, limit=100)
+# check first 10 entries
+self.assert_iterators_equal(iter1, iter2, test_id, limit=10)
```
Why reduce from 100 to 10?
Doubling the number of `M` values otherwise made the test about 4x slower, and it started to feel a bit sluggish, so I wanted to offset that.
Objects/rangeobject.c
Outdated
it->start = start; | ||
it->step = step; | ||
it->len = len; | ||
it->start = Py_SAFE_DOWNCAST(start, long, sdigit); |
Would it be cleaner to change the signature of `fast_range_iter` to take `sdigit`s rather than `long`s?
I'm not so sure -- downcasts need to happen somewhere, and `fast_range_iter` is called in two places. The `PyLong_AsLong` calls could theoretically be replaced with `Py_SIZE` checks and `ob_digit[0]` accesses, but I think the existing code is fine using the public API.
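For illustration, the direct-access variant mentioned above might look roughly like this (a hypothetical sketch; `one_digit_long_value` is not a real helper, and it assumes the caller has already verified the object is an exact int with at most one digit):

```c
/* Read a known-small int's value straight from its internals instead
   of going through PyLong_AsLong(). */
static long
one_digit_long_value(PyObject *obj)
{
    assert(PyLong_CheckExact(obj));
    Py_ssize_t size = Py_SIZE(obj);   /* encodes sign and digit count */
    assert(-1 <= size && size <= 1);
    long v = (size == 0) ? 0 : (long)((PyLongObject *)obj)->ob_digit[0];
    return (size < 0) ? -v : v;
}
```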
I measured no speedup at all, even on microbenchmarks, when the range opcode was just something like:

```c
if (r->index < r->len) {
    PyObject *res = PyLong_FromLong(
        (long)(r->start + (unsigned long)(r->index++) * r->step));
    if (res == NULL) {
        goto error;
    }
    PUSH(res);
    PREDICT(STORE_FAST);
    NOTRACE_DISPATCH();
}
```
@sweeneyde
I'd think that this is conceptually simpler with the existing specialization framework and lack of tagged stack values, but the "Virtual iterators" idea seems like it could be worth exploring more. IIUC, putting an integer on the stack rather than off in some […]
Python/ceval.c
Outdated
```c
}
JUMPBY(INLINE_CACHE_ENTRIES_FOR_ITER + 1);
sdigit value = r->start + (digit)(r->index++) * r->step;
if (value < _PY_NSMALLPOSINTS && value >= -_PY_NSMALLNEGINTS) {
```
I think there are too many branches here. Ideally a specialized instruction should have a few `DEOPT_IF` tests, then the rest of the code should be (mostly) branchless.

This also includes far too much knowledge of the internals of the `int` object, IMO. Would performance be measurably worse if `start`, `end` and `stop` were `Py_ssize_t` and this code were simplified to `PyLong_FromSsize_t`?

This feels like an elaborate workaround for the oddities of our `int` implementation. We should be fixing that, not working around it.
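A sketch of the suggested simplification, assuming the iterator fields were widened to `Py_ssize_t` (hypothetical, not the PR's code):

```c
/* With start/step/index as Py_ssize_t, the specialized body needs no
   small-int special-casing: compute, convert, push. */
Py_ssize_t value = r->start + r->index++ * r->step;
PyObject *res = PyLong_FromSsize_t(value);
if (res == NULL) {
    goto error;
}
PUSH(res);
```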
I think there may be a middle ground here: add a function like this to longobject.c:

```c
/* sketch -- error handling elided */
int
_PyLong_AssignValueLong(PyLongObject **location, long value)
{
    if (IS_SMALL_INT(value)) {
        /* point at the cached small int, decref the old long */
        Py_SETREF(*location, (PyLongObject *)get_small_int((sdigit)value));
    }
    else if (Py_REFCNT(*location) == 1 && Py_SIZE(*location) == 1
             && 0 < value && value <= (long)PyLong_MASK) {
        /* unshared and already the right size: mutate it in place */
        (*location)->ob_digit[0] = (digit)value;
    }
    else {
        Py_SETREF(*location, (PyLongObject *)PyLong_FromLong(value));
    }
    return 0;
}
```
Sort of similar in spirit to `PyUnicode_Append()`.
That way, the PyLongObject implementation details stay out of the eval loop, but we still get the benefit of avoiding the allocation, which I believe is the main source of the speedup.
This could theoretically be re-used for `enumerate()` as well. And even if the implementation of integers changes (e.g. new freelists, no more small ints), the operation still makes sense.
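For example, a specialized `enumerate()` could refresh its stored index object without allocating a fresh int each iteration, along these lines (hypothetical sketch; the `en_index`/`en_longindex` field names are assumptions here):

```c
/* Hypothetical reuse in enumerate(): overwrite the cached index
   object in place where possible. */
if (_PyLong_AssignValue(&en->en_longindex, en->en_index++) < 0) {
    return NULL;
}
```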
Ultimately we want to have virtual iterators, and specialization of […]. If I were doing the whole thing myself, from scratch, I would do the following: […]

Since we already have this PR, it probably makes sense to do 2 first, but I think we need to simplify […]
Pyperformance shows a 1% speedup, with 35 benchmarks faster and 15 slower. Benchmarks hidden because not significant (12): tornado_http, sqlalchemy_imperative, sqlalchemy_declarative, sympy_sum, json_loads, sympy_expand, nbody, pathlib, xml_etree_process, genshi_xml, xml_etree_parse, richards
Objects/longobject.c
Outdated
```diff
@@ -262,6 +262,35 @@ _PyLong_FromSTwoDigits(stwodigits x)
     return _PyLong_FromLarge(x);
 }
 
+int
+_PyLong_AssignValue(PyObject **target, long value)
```
Could you make this `Py_ssize_t` rather than `long`? On 64-bit Windows, `sizeof(long) != sizeof(void *)`, which makes reasoning about code with `long` harder.
Should we change `PyRangeIterator` to use `Py_ssize_t` as well?
Ultimately yes. But maybe not in this PR, since that isn't new code.
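The eventual change might look something like this (a sketch of the follow-up, not part of this PR):

```c
/* rangeiterobject with Py_ssize_t fields instead of long, so the
   field width matches pointers on every platform. */
typedef struct {
    PyObject_HEAD
    Py_ssize_t index;
    Py_ssize_t start;
    Py_ssize_t step;
    Py_ssize_t len;
} rangeiterobject;
```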
```c
if (_PyLong_AssignValue(&GETLOCAL(_Py_OPARG(next)), value) < 0) {
    goto error;
}
JUMPBY(INLINE_CACHE_ENTRIES_FOR_ITER + 1);
```
Could you add a comment explaining why this is `INLINE_CACHE_ENTRIES_FOR_ITER + 1` rather than just `INLINE_CACHE_ENTRIES_FOR_ITER`?
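Something along these lines would do (assumed wording, based on the discussion above):

```c
/* Jump over the FOR_ITER inline cache entries plus one extra code
   unit for the following STORE_FAST, whose store was already done
   above by _PyLong_AssignValue(&GETLOCAL(...), value). */
JUMPBY(INLINE_CACHE_ENTRIES_FOR_ITER + 1);
```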
```c
    Py_XDECREF(old);
    return 0;
}
else if (old != NULL && PyLong_CheckExact(old) &&
```
This only works for positive integers. Given that it is only used for `range` iteration, that's fine for now. But could you add a comment, so it doesn't get overlooked when we implement faster-cpython/ideas#147?
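The requested comment might read something like this (assumed wording):

```c
/* NOTE: this in-place fast path only handles positive integers.
   That's fine while range iteration is the only user, but revisit
   when implementing faster-cpython/ideas#147. */
```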
Thanks. Looks good.

I'll merge once the build bots are happy.
Faster-Cpython thread: faster-cpython/ideas#67
GH issue: #91432