Enable informing callee it's awaited via vector call flag #91121

Status: Open
Assignee: DinoV
Labels: 3.11, performance

DinoV opened this issue Mar 9, 2022 · 5 comments

DinoV (Contributor) commented Mar 9, 2022

BPO 46965
Nosy @gvanrossum, @carljm, @DinoV, @markshannon, @corona10, @vladima, @brandtbucher, @itamaro

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/DinoV'
closed_at = None
created_at = <Date 2022-03-09.00:37:24.565>
labels = ['3.11', 'performance']
title = "Enable informing callee it's awaited via vector call flag"
updated_at = <Date 2022-03-22.16:50:02.812>
user = 'https://github.com/DinoV'

bugs.python.org fields:

activity = <Date 2022-03-22.16:50:02.812>
actor = 'v2m'
assignee = 'dino.viehland'
closed = False
closed_date = None
closer = None
components = []
creation = <Date 2022-03-09.00:37:24.565>
creator = 'dino.viehland'
dependencies = []
files = []
hgrepos = []
issue_num = 46965
keywords = []
message_count = 5.0
messages = ['414782', '415301', '415433', '415640', '415795']
nosy_count = 8.0
nosy_names = ['gvanrossum', 'carljm', 'dino.viehland', 'Mark.Shannon', 'corona10', 'v2m', 'brandtbucher', 'itamaro']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue46965'
versions = ['Python 3.11']

DinoV (Contributor, Author) commented Mar 9, 2022

The idea here is to add a new flag to the vectorcall nargsf that indicates the call is being awaited: _Py_AWAITED_CALL_MARKER. This flag lets the callee know that it is immediately awaited and can therefore be eagerly evaluated. When the call is eagerly evaluated the callee can potentially avoid varying amounts of overhead. For a coroutine, the function can avoid creating the coroutine object and instead return a singleton instance of a wait handle indicating that eager execution has occurred:
https://github.com/facebookincubator/cinder/blob/cinder/3.8/Python/ceval.c#L6617

This gives a small win by reducing the overhead of allocating the coroutine object.
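As a rough pure-Python model of that fast path (all names here are illustrative, not the CPython or Cinder API; the real mechanism is a bit in the C-level vectorcall nargsf, and Cinder avoids creating the coroutine object altogether, which Python code cannot do):

import types

class _WaitHandle:
    # Stand-in for Cinder's singleton wait handle: carries a value that
    # was produced synchronously, so nothing needs to be scheduled.
    def __init__(self, value):
        self.value = value

def call_awaited(coro_func, *args):
    # A callee that knows it is immediately awaited can run the coroutine
    # eagerly; if it finishes without suspending, it hands back a completed
    # wait handle instead of a pending coroutine.
    coro = coro_func(*args)
    try:
        first = coro.send(None)
    except StopIteration as stop:
        return _WaitHandle(stop.value)   # completed synchronously
    return _resume(coro, first)          # suspended: await as usual

@types.coroutine
def _resume(coro, first):
    # Re-yield the value the eager start consumed, then keep delegating to
    # the original coroutine (exception injection elided for brevity).
    value = yield first
    try:
        while True:
            value = yield coro.send(value)
    except StopIteration as stop:
        return stop.value

The caller-side half of the optimization is the check for _WaitHandle at the await site, which is roughly what the interpreter's awaited-call handling would do when the marker is set.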

For something like gather much more significant wins can be achieved. If all of the inputs have already been computed the creation of tasks and scheduling of them to the event loop can be elided. An example implementation of this is available in Cinder: https://github.com/facebookincubator/cinder/blob/cinder/3.8/Modules/_asynciomodule.c#L7103

Again the gather implementation uses the singleton wait handle object to return the value indicating the computation completed synchronously.
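A rough pure-Python sketch of that gather fast path (illustrative only; Cinder's version is in C and signals eager completion via the singleton wait handle rather than a plain list):

import asyncio

async def gather_fast_path(*aws):
    # If every input is an already-completed future, collect the results
    # directly: no Task is created and nothing is queued to the event loop.
    if aws and all(asyncio.isfuture(aw) and aw.done() for aw in aws):
        return [aw.result() for aw in aws]
    # Otherwise fall back to the normal path.
    return await asyncio.gather(*aws)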

We've used this elsewhere in Cinder as well - for example, an "AsyncLazyValue" lazily performs a one-time computation of a value and caches it. The common case then becomes that the value is already available, and the await can be performed without allocating any intermediate values.
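A minimal sketch of the idea in pure Python (illustrative; Cinder's AsyncLazyValue is its own C-level type that integrates with the wait-handle mechanism):

import asyncio

_UNSET = object()

class AsyncLazyValue:
    # Compute a value once, cache it, and serve later awaits from the
    # cache; the common case then completes without creating tasks or
    # any intermediate objects beyond the coroutine itself.
    def __init__(self, coro_func, *args):
        self._coro_func = coro_func
        self._args = args
        self._value = _UNSET
        self._lock = asyncio.Lock()

    async def get(self):
        if self._value is not _UNSET:   # common case: already computed
            return self._value
        async with self._lock:          # first awaiter computes, others wait
            if self._value is _UNSET:
                self._value = await self._coro_func(*self._args)
        return self._value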

DinoV added the 3.11 and performance labels Mar 9, 2022
DinoV self-assigned this Mar 9, 2022
gvanrossum (Member) commented Mar 15, 2022

The link https://github.com/facebookincubator/cinder/blob/cinder/3.8/Python/ceval.c#L6617 points to something that I wouldn't associate with the subject. @dino, could you provide a new link (preferably a permalink)?

FWIW rather than dynamically checking what the next opcode is, maybe we could use a super-instruction for CALL + GET_AWAITABLE? (Understanding that there are a bunch of different CALL opcodes.)
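For reference, the opcode pair in question is easy to inspect (a minimal example; exact opcode names vary across versions):

import dis

async def demo(f):
    return await f()

# In CPython 3.11 the disassembly of `await f()` shows the CALL for f()
# immediately followed by GET_AWAITABLE -- the pair a super-instruction
# would fuse.
dis.dis(demo)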

The gather code you link to is all in C. Is rewriting gather in C the only way to benefit from this speedup? (I guess you could just add your gather impl to the existing _asynciomodule.c, in the same or a separate PR.)

DinoV (Contributor, Author) commented Mar 17, 2022

Doh, sorry about that link, this one goes to a specific commit: https://github.com/facebookincubator/cinder/blob/6863212ada4b569c15cd95c4e7a838f254c8ccfb/Python/ceval.c#L6642

I do think a new opcode is a good way to go, and it could just be emitted by the compiler when it recognizes the pattern. I think we mainly avoided that because we had some issues around performance testing whenever we updated the bytecode version, and the cost of peeking at the next opcode was negligible; with the improved call performance in 3.11 that may no longer be the case.

It's probably possible to keep most of gather in Python if necessary; there'd still need to be a C wrapper that could flow the awaited flag in, and wait handle creation would need to be possible from Python (which slightly scares me). There's probably also a perf win from the C implementation - I'll see if @v2m has any data on that.

gvanrossum (Member) commented Mar 20, 2022

I will wait until there is a draft PR to review, or until you ping me.

--Guido (mobile)

vladima (Mannequin) commented Mar 22, 2022

  • Introducing dedicated opcodes for each kind of awaited call is definitely an option. In fact, the first implementation used them; however, as Dino mentioned, it was more of a logistical issue (there were several spots that produced .pyc files, so the compiler needed to be up to date across all of them).
  • There was some perf win from rewriting gather in C; however, the main reason for us to do it was the ability to be await-aware, which we made available only in C (returning a completed wait handle is not exposed to Python either) to reduce the scope. Providing the ability to follow the await-aware protocol (read the indicator that a call is awaited + return a completed wait handle for eagerly completed calls) in pure Python is definitely possible.
  • To provide some context on why it was beneficial: a typical implementation of an endpoint in IG is an async function that in turn calls into numerous other async functions to generate an output.
    • gather is used all over the place when there are no sequential dependencies between calls.
    • The number of unique pieces of data ultimately fetched by async calls is not very big; i.e., the same fragment of information can be requested by different async calls, which makes memoization a very attractive strategy to reduce I/O and heavyweight computations.
    • Memoized pieces of data are effectively represented by completed futures, and it is very common for gather to accept either a memoized value or a coroutine object that will complete synchronously by awaiting a memoized value (see the sketch after this list).
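An illustrative sketch of that memoization pattern (hypothetical names; real_fetch stands in for whatever I/O actually produces the fragment):

import asyncio

_cache = {}  # key -> asyncio.Future

async def real_fetch(key):
    # Stand-in for the actual I/O (hypothetical).
    await asyncio.sleep(0)
    return f"data for {key}"

async def fetch_fragment(key):
    # Repeated requests for the same fragment share one future; once it has
    # completed, awaiting it resolves without touching I/O again (error
    # handling elided for brevity).
    if key in _cache:
        return await _cache[key]
    fut = asyncio.get_running_loop().create_future()
    _cache[key] = fut
    data = await real_fetch(key)
    fut.set_result(data)
    return data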

Before making gather await-aware, it always had to follow the standard process and convert awaitables into tasks that are queued to the event loop for execution. In our workload, task creation/queueing added noticeable overhead. With await-aware gather we can execute coroutine objects eagerly and, if they were not suspended, bypass task creation entirely:

import asyncio
import time

async def step(i):
    if i == 0:
        return
    await asyncio.gather(*[step(i - 1) for _ in range(6)])

async def main():
    t0 = time.perf_counter()
    await step(6)
    t1 = time.perf_counter()
    print(f"{t1 - t0} s")

N = 0
def create_task(loop, coro):
    global N
    N += 1
    return asyncio.Task(coro, loop=loop)

loop = asyncio.get_event_loop()
loop.set_task_factory(create_task)
loop.run_until_complete(main())
print(f"{N} tasks created")

# Cinder
# 0.028410961851477623 s
# 1 tasks created

# Python 3.8
# 1.2157012447714806 s
# 55987 tasks created

ezio-melotti transferred this issue from another repository Apr 10, 2022
ericsnowcurrently pushed a commit to python/pyperformance that referenced this issue May 16, 2022
Add a benchmark for testing async workloads, specifically an async tree workload that simulates simpler versions of a typical Instagram endpoint.  (See python/cpython#91121.)