Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-47009: Streamline list.append for the common case #31864

Merged
merged 4 commits into from Apr 1, 2022

Conversation

sweeneyde
Copy link
Member

@sweeneyde sweeneyde commented Mar 14, 2022

@sweeneyde
Copy link
Member Author

sweeneyde commented Mar 14, 2022

Benchmarks are good:

from pyperf import Runner, perf_counter

def bench_listcomp(loops, length):
    src = list(map(float, range(length)))
    t0 = perf_counter()
    for i in range(loops):
        [x for x in src]
    return perf_counter() - t0

def bench_append(loops, length):
    src = list(map(float, range(length)))
    t0 = perf_counter()
    for i in range(loops):
        arr = []
        for x in src:
            arr.append(x)
    return perf_counter() - t0

runner = Runner()
for n in [100, 1_000, 10_000, 100_000]:
    runner.bench_time_func(f"listcomp {n}", bench_listcomp, n)
    runner.bench_time_func(f"append {n}", bench_append, n)

Results from GCC on WSL with --enable-optimizations --with-lto

Faster (8):
- listcomp 10000: 118 us +- 2 us -> 92.6 us +- 1.5 us: 1.28x faster
- listcomp 100000: 1.16 ms +- 0.02 ms -> 916 us +- 26 us: 1.27x faster
- listcomp 1000: 12.3 us +- 0.2 us -> 9.89 us +- 0.41 us: 1.25x faster
- listcomp 100: 1.59 us +- 0.03 us -> 1.32 us +- 0.05 us: 1.21x faster
- append 100000: 1.69 ms +- 0.05 ms -> 1.45 ms +- 0.04 ms: 1.17x faster
- append 10000: 168 us +- 4 us -> 145 us +- 3 us: 1.16x faster
- append 1000: 17.4 us +- 0.3 us -> 15.2 us +- 0.6 us: 1.14x faster
- append 100: 2.03 us +- 0.06 us -> 1.81 us +- 0.08 us: 1.12x faster

Geometric mean: 1.20x faster

@sweeneyde sweeneyde requested a review from methane Mar 14, 2022
@sweeneyde sweeneyde added the performance Performance or resource usage label Mar 14, 2022
Copy link
Member

@markshannon markshannon left a comment

I think this would be simpler, and just as fast, by renaming app1 to _PyList_AppendTakeRef (with appropriate refcount adjustments).
No need for the inline functions.

PyAPI_FUNC(int)
_PyList_AppendTakeRefListResize(PyListObject *self, PyObject *newitem);

static inline int
Copy link
Member

@markshannon markshannon Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for an inline function.
Let the LTO pass decide if it wants to inline it.

app1(PyListObject *self, PyObject *v)
/* internal, used by _PyList_AppendTakeRef */
int
_PyList_AppendTakeRefListResize(PyListObject *self, PyObject *newitem)
Copy link
Member

@markshannon markshannon Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the extra function?
Converting app1 to _PyList_AppendTakeRef would seem to be enough.

Copy link
Member Author

@sweeneyde sweeneyde Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried something like that and the results are on the bpo issue, but it seems there is a significant benefit to having the separate as-small-as-possible function that always gets inlined as opposed to a slightly longer version that we let the compiler figure out.

Copy link
Member

@markshannon markshannon Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably the compiler isn't inlining the function because it isn't hot enough.
PRECALL_NO_KW_LIST_APPEND and LIST_APPEND are only 0.3% of all instructions executed, so maybe it is best not to inline.

Do you have benchmark numbers for the standard suite?

Copy link
Member Author

@sweeneyde sweeneyde Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@sweeneyde sweeneyde Mar 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, IMO there is a relative lack of list comprehensions in the pyperformance suite.

@sweeneyde
Copy link
Member Author

sweeneyde commented Mar 31, 2022

@markshannon Since pyperformance results were negligible and this has a significant speedup on operations that live in hot loops (albeit not in pyperformance), is it okay if I merge this?

@markshannon
Copy link
Member

markshannon commented Apr 1, 2022

Sorry, this dropped off my radar.

@markshannon markshannon merged commit a0ea7a1 into python:main Apr 1, 2022
12 checks passed
@sweeneyde sweeneyde deleted the listappend branch Apr 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants