Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-97933: inline list/dict/set comprehensions in functions #101441

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

carljm
Copy link
Contributor

@carljm carljm commented Jan 30, 2023

Closes #97933.

In builds with --enable-optimizations:

➜ ./python -m pyperf timeit -s 'l = [1, 2, 3, 4, 5]' '[x for x in l]' --compare-to ../mainopt/python
/home/carljm/build/mainopt/python: ..................... 200 ns +- 3 ns
/home/carljm/build/ic2opt/python: ..................... 120 ns +- 2 ns

Mean +- std dev: [/home/carljm/build/mainopt/python] 200 ns +- 3 ns -> [/home/carljm/build/ic2opt/python] 120 ns +- 2 ns: 1.67x faster

Credit to @vladima for authoring the original version of this code in Cinder 3.8. The compiler approach is different in this version (so as to support inlining all comprehensions, regardless of name clashes); some of the symbol table changes remain the same.

A couple implementation notes:

  • We need a new opcode LOAD_FAST_AND_CLEAR because we need to push/pop the value of possibly-undefined variables on the stack while clearing them. To do this with existing opcodes would require eliminating the invariant/assert that LOAD_FAST never loads NULL, and would add inefficient refcount operations and interpreter loop cycles.
  • If a comprehension iteration variable is a cell inside the comprehension (i.e. the comprehension builds closures) we end up including that variable in both co_varnames and co_cellvars for the inlined-into outer function and using the non-offset (varnames) storage location for it. This is how function arguments that are also cells are currently handled, so we just piggyback on that handling. This makes the needed push/pop of the outer cell simpler as we can just use LOAD_FAST_AND_CLEAR/STORE_FAST as-is for it, and it will push/pop the cell itself. Otherwise we would need some new handling in the compiler for allowing push/pop of a cell in offset storage.

@carljm
Copy link
Contributor Author

carljm commented Jan 31, 2023

@markshannon @iritkatriel would be grateful for a pyperformance run on this one (presuming signal all looks good.)

@carljm carljm changed the title gh-97933: inline sync list/dict/set comprehensions gh-97933: inline list/dict/set comprehensions in functions Jan 31, 2023
@carljm carljm added the performance Performance or resource usage label Jan 31, 2023
@markshannon
Copy link
Member

I'll run the benchmarks now

@markshannon
Copy link
Member

Benchmarking is delayed a bit, but should be back tomorrow at the latest.

@markshannon
Copy link
Member

markshannon commented Jan 31, 2023

You don't seem to be clearing the inner variable before use, so that

def f():
    l = [ None ]
    [ (l[0], l) in [[1,2]] ]
    print(l)

will, I think, print [1] instead of raising an UnboundLocalError

You can actually make the code more efficient by clearing the local when stashing it on the stack:

inst(LOAD_FAST_AND_CLEAR, ( -- val)) {
    val = LOCALS[oparg];
    LOCALS[oparg] = NULL;
}

as there is no need for a NULL check or INCREF.

Python/compile.c Outdated
// could be cleverer about minimizing swaps but it doesn't matter
// because `apply_static_swaps` will eliminate all of these anyway
ADDOP_I(c, loc, SWAP, 2);
ADDOP_NAME(c, loc, STORE_FAST, k, varnames);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could be storing NULL here. That is OK for the interpreter, but the dataflow analysis in add_checks_for_loads_of_uninitialized_variables in the compiler assumes that STORE_FAST leaves the variable initialized.

What you'll need to do is add a pseudo instruction, and convert that to STORE_FAST in the assembler. That way the dataflow analysis won't be fooled.

You might want to update add_checks_for_loads_of_uninitialized_variables to understand the LOAD_FAST_AND_CLEAR, STORE_FAST_MAYBE_NULL pairs, but that should wait for another PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a test case:

def f():
     if False:
          x = 0
     [x for _ in [1]]
     return x

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, thanks!

The test case needs to use [x for x in [1]], not [x for _ in [1]] -- in the latter case, x is not bound in the comprehension and is just a free var (converted to local by inlining) so we don't push/pop it at all.

@markshannon
Copy link
Member

markshannon commented Jan 31, 2023

A couple of obscure issues, but this looks like a very nice improvement.

I think the gain in performance is well worth the cost in code size, given how common comprehensions are.

* main:
  pythongh-101440: fix json snippet error in logging-cookbook.rst (python#101439)
  pythongh-99276 - Updated Doc/faq/general.rst (python#101396)
  Add JOBS parameter to docs Makefile (python#101395)
  pythongh-98831: rewrite GET_LEN, GET_ITER, BEFORE_WITH and a few simple opcodes in the instruction definition DSL (python#101443)
  pythongh-77607: Improve accuracy of os.path.join docs (python#101406)
  Fixes typo in asyncio.TaskGroup context manager code example (python#101449)
  pythongh-98831: Clean up and add cache size static_assert to macro (python#101442)
  pythongh-99955: use SUCCESS/ERROR return values in optimizer and assembler. Use RETURN_IF_ERROR where appropriate. Fix a couple of bugs. (python#101412)
@carljm
Copy link
Contributor Author

carljm commented Jan 31, 2023

You don't seem to be clearing the inner variable before use

Your example code doesn't involve a comprehension, I'm assuming you meant something like

def f():
    l = [ None ]
    [ (l[0], l) for l in [[1,2]] ]
    print(l)

But this doesn't raise UnboundLocalError on main either, it works fine; the iteration variable l is always bound within the comprehension before the expression is evaluated.

So far I haven't been able to construct a test case where failing to clear an incoming variable bound inside the comprehension has any visible effect, because AFAICT any locals in a comprehension are always (necessarily by the syntactic limitations of how a comprehension is built) defined before use. Let me know if you can think of a test case I'm missing!

You can actually make the code more efficient by clearing the local when stashing it on the stack

But I think this is sufficient reason regardless to justify switching from LOAD_FAST_OR_NULL to LOAD_FAST_AND_CLEAR, so I'll do that!

@carljm
Copy link
Contributor Author

carljm commented Jan 31, 2023

Actually I don't think LOAD_FAST_AND_CLEAR will work, because of cases like this:

def f(x):
    return [x for x in x]

We need to save/restore the outer value of x, but we cannot clear it on entry because we are also relying on being able to load it when we evaluate the iterable part of the comprehension (which in the non-inlined case would have been evaluated outside the function and passed in as the .0 parameter.) So I think we need to stick with a non-clearing LOAD_FAST_OR_NULL on entry.

If there were a correctness reason for needing to clear in other cases, I could add an additional DELETE_FAST after the LOAD_FAST_OR_NULL, but so far I haven't found any such correctness need, and clearing this way is not an efficiency improvement.

EDIT: there is also the option of doing clearing pushes and also moving evaluation of the outermost iterable to before the pushes; this would require some non-elidable SWAPs to preserve it as TOS through the pushes, though...

@carljm
Copy link
Contributor Author

carljm commented Jan 31, 2023

@markshannon I think LOAD_FAST_OR_NULL could also become a pseudo-instruction that resolves to LOAD_FAST (which would get around having it turned into LOAD_FAST_CHECK), but we'd have to remove the assert(value != NULL) in the LOAD_FAST implementation. Do you prefer keeping that assertion at the cost of another (non-pseudo) opcode?

@carljm
Copy link
Contributor Author

carljm commented Jan 31, 2023

I've addressed the review comments. I have some TODOs for potential optimizations that could be done here or in a follow-up diff:

  • Eliminate some unnecessary push/pops in cases where the value is never loaded after the comprehension. This would fall into the general category of dead store elimination. I'm not sure if we can actually do this or if it violates some guarantees about introspectable frame state in tracing?
  • Better handling of SWAPs on the post-comprehension pops. apply_static_swaps can't help us here if the comprehension value is immediately returned (because we can't "swap" a RETURN_VALUE -- this case would go away if we can do the dead store elimination, since if it's right before a RETURN_VALUE it's clearly a dead store) or if it's immediately popped (because the POP_TOP will have a lineno of -1 and apply_static_swaps will thus refuse to swap it). Instead of having one SWAP 2 per pop, we could have a single SWAP N and do the last pop first.
  • Teach add_checks_for_loads_of_uninitialized_variables to understand and trace through the push/pop pairs.

* main:
  pythonGH-100288: Skip extra work when failing to specialize LOAD_ATTR (pythonGH-101354)
  pythongh-101409: Improve generated clinic code for self type checks (python#101411)
  pythongh-98831: rewrite BEFORE_ASYNC_WITH and END_ASYNC_FOR in the instruction definition DSL (python#101458)
  pythongh-101469: Optimise get_io_state() by using _PyModule_GetState() (pythonGH-101470)
@markshannon
Copy link
Member

Sorry for posting a gibberish example.
Here's what I should have posted:

def f():
    l = [ None ]
    [ 1 for (l[0], l) in [[1,2]] ]
    print(l)
f()

3.11

$ python3.11 ~/test/test.py 
Traceback (most recent call last):
   ...
UnboundLocalError: cannot access local variable 'l' where it is not associated with a value

This PR

$ ./python ~/test/test.py 
[1]

@markshannon
Copy link
Member

markshannon commented Feb 1, 2023

Actually I don't think LOAD_FAST_AND_CLEAR will work, because of cases like this:

def f(x):
    return [x for x in x]

This shouldn't be a problem. Let's rename for clarity:

def f(x0):
    return [x1 for x1 in x0]

The expression x0 should be fully evaluated before x1 exists.

This PR stashes the variable x0 before evaluating the expression x0, which seems wrong to me.

>>> dis.dis(f)
  1           0 RESUME                   0

  2           2 LOAD_FAST_OR_NULL        0 (x)
              4 BUILD_LIST               0
              6 LOAD_FAST                0 (x)
              8 GET_ITER
        >>   10 FOR_ITER                 4 (to 22)
             14 STORE_FAST               0 (x)
             16 LOAD_FAST                0 (x)
             18 LIST_APPEND              2
             20 JUMP_BACKWARD            6 (to 10)
        >>   22 END_FOR
             24 SWAP                     2
             26 STORE_FAST               0 (x)
             28 RETURN_VALUE

Note that the evaluation of x0 occurs at 6, but x0 is stashed at 2. The two operations should be reversed I think.

I.e.

def f(x):
    return [x for x in x]

should be compiled as

def f(x):
    $tmp = x
    return [x for x in $tmp]

(I think I'd prefer $tmp to be on the stack, but leaving it in local .0 would be fine if that produces cleaner code and makes for easier debugging)

@carljm
Copy link
Contributor Author

carljm commented Feb 1, 2023

Here's what I should have posted

Aha, thanks for the example!

The expression x0 should be fully evaluated before x1 exists.

Yes, I considered above the option of moving the evaluation of the iterable expression to before the stack pushes. I'll revisit that, since we do have to clear locals to maintain semantics. We'll have to either stash it in a local or use SWAP to keep it TOS.

* main:
  pythongh-98831: rewrite PUSH_EXC_INFO and conditional jumps in the instruction definition DSL (python#101481)
  pythongh-98831: Modernize the LOAD_ATTR family (python#101488)
  pythongh-101498 : Fix asyncio.Timeout example in docs (python#101499)
  pythongh-101454: fix documentation for END_ASYNC_FOR (python#101455)
  pythongh-101277: Isolate itertools, add group and _grouper types to module state (python#101302)
  pythongh-101317: Add `ssl_shutdown_timeout` parameter for `asyncio.StreamWriter.start_tls` (python#101335)
  datetime.rst: fix combine() signature (python#101490)
* main: (82 commits)
  pythongh-101670: typo fix in PyImport_ExtendInittab() (python#101723)
  pythonGH-99293: Document that `Py_TPFLAGS_VALID_VERSION_TAG` shouldn't be used. (#pythonGH-101736)
  no-issue: Add Dong-hee Na as the cjkcodecs codeowner (pythongh-101731)
  pythongh-101678: Merge math_1_to_whatever() and math_1() (python#101730)
  pythongh-101678: refactor the math module to use special functions from c11 (pythonGH-101679)
  pythongh-85984: Remove legacy Lib/pty.py code. (python#92365)
  pythongh-98831: Use opcode metadata for stack_effect() (python#101704)
  pythongh-101283: Version was just released, so should be changed in 3.11.3 (pythonGH-101719)
  pythongh-101283: Fix use of unbound variable (pythonGH-101712)
  pythongh-101283: Improved fallback logic for subprocess with shell=True on Windows (pythonGH-101286)
  pythongh-101277: Port more itertools static types to heap types (python#101304)
  pythongh-98831: Modernize CALL and family (python#101508)
  pythonGH-101696: invalidate type version tag in `_PyStaticType_Dealloc` (python#101697)
  pythongh-100221: Fix creating dirs in `make sharedinstall` (pythonGH-100329)
  pythongh-101670: typo fix in PyImport_AppendInittab() (pythonGH-101672)
  pythongh-101196: Make isdir/isfile/exists faster on Windows (pythonGH-101324)
  pythongh-101614: Don't treat python3_d.dll as a Python DLL when checking extension modules for incompatibility (pythonGH-101615)
  pythongh-100933: Improve `check_element` helper in `test_xml_etree` (python#100934)
  pythonGH-101578: Normalize the current exception (pythonGH-101607)
  pythongh-47937: Note that Popen attributes are read-only (python#93070)
  ...
@carljm carljm marked this pull request as draft February 9, 2023 19:25
* main:
  Fix some typos in asdl_c.py (pythonGH-101757)
  pythongh-101747: Fix refleak in new `OrderedDict` repr (pythonGH-101748)
  pythongh-101430: Update tracemalloc to handle presize properly. (pythongh-101745)
  pythonGH-101228: Fix typo in docstring for read method of `_io.TextIOWrapper` class (python#101227)
  Fix typo in `test_fstring.py` (python#101600)
  pythongh-101726: Update the OpenSSL version to 1.1.1t (pythonGH-101727)
  pythongh-101283: Fix 'versionchanged' for the shell=True fallback on Windows in 3.12 (pythonGH-101728)
  LibFFI build requires x64 Cygwin, and skip the ARM build (pythonGH-101743)
@carljm carljm added the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Feb 10, 2023
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @carljm for commit 17d5d84 🤖

If you want to schedule another build, you need to add the :hammer: test-with-refleak-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-refleak-buildbots Test PR w/ refleak buildbots; report in status section label Feb 10, 2023
@carljm carljm marked this pull request as ready for review February 10, 2023 06:14
@carljm
Copy link
Contributor Author

carljm commented Feb 10, 2023

@markshannon Ok, I think I've addressed all review comments, and CI looks good. Ready for review and pyperformance run.

Given the current usage of STORE_FAST_MAYBE_NULL, it isn't possible to
construct a repro where this is needed, because if the local was NULL
coming into the comprehension it will already be marked as unsafe coming
out of the comprehension (since the body of the comprehension is always
a loop which may be skipped entirely). But in case this opcode is used
in other contexts in future, it should mark the local unsafe.
* main:
  Docs: Fix getstatus() -> getcode() typos (python#101296)
  Docs: use parameter list for sqlite3.Cursor.execute* (python#101782)
  pythongh-101763: Update bundled copy of libffi to 3.4.4 on Windows (pythonGH-101784)
  pythongh-101517: make bdb avoid looking up in linecache with lineno=None (python#101787)
  pythongh-101759: Update Windows installer to SQLite 3.40.1 (python#101762)
  pythongh-101277: Finalise isolating itertools (pythonGH-101305)
  pythongh-101759: Update macOS installer to SQLite 3.40.1 (python#101761)
* main:
  pythongh-101810: Remove duplicated st_ino calculation (pythonGH-101811)
  pythongh-92547: Purge sqlite3_enable_shared_cache() detection from configure (python#101873)
  pythonGH-100987: Refactor `_PyInterpreterFrame` a bit, to assist generator improvement. (pythonGH-100988)
  pythonGH-87849: Simplify stack effect of SEND and specialize it for generators and coroutines. (pythonGH-101788)
  Correct trivial grammar in reset_mock docs (python#101861)
  pythongh-101845: pyspecific: Fix i18n for availability directive (pythonGH-101846)
  pythongh-89792: Limit test_tools freeze test build parallelism based on the number of cores (python#101841)
  pythongh-85984: Utilize new "winsize" functions from termios in pty tests. (python#101831)
  pythongh-89792: Prevent test_tools from copying 1000M of "source" in freeze test (python#101837)
  Fix typo in test_fstring.py (python#101823)
  pythonGH-101797: allocate `PyExpat_CAPI` capsule on heap (python#101798)
  pythongh-101390: Fix docs for `imporlib.util.LazyLoader.factory` to properly call it a class method (pythonGH-101391)
* main: (60 commits)
  pythongh-102056: Fix a few bugs in error handling of exception printing code (python#102078)
  pythongh-102011: use sys.exception() instead of sys.exc_info() in docs where possible (python#102012)
  pythongh-101566: Sync with zipp 3.14. (pythonGH-102018)
  pythonGH-99818: improve the documentation for zipfile.Path and Traversable (pythonGH-101589)
  pythongh-88233: zipfile: handle extras after a zip64 extra (pythonGH-96161)
  pythongh-101981: Apply HOMEBREW related environment variables (pythongh-102074)
  pythongh-101907: Stop using `_Py_OPCODE` and `_Py_OPARG` macros (pythonGH-101912)
  pythongh-101819: Adapt _io types to heap types, batch 1 (pythonGH-101949)
  pythongh-101981: Build macOS as recommended by the devguide (pythonGH-102070)
  pythongh-97786: Fix compiler warnings in pytime.c (python#101826)
  pythongh-101578: Amend PyErr_{Set,Get}RaisedException docs (python#101962)
  Misc improvements to the float tutorial (pythonGH-102052)
  pythongh-85417: Clarify behaviour on branch cuts in cmath module (python#102046)
  pythongh-100425: Update tutorial docs related to sum() accuracy (FH-101854)
  Add missing 'is' to `cmath.log()` docstring (python#102049)
  pythongh-100210: Correct the comment link for unescaping HTML (python#100212)
  pythongh-97930: Also include subdirectory in makefile. (python#102030)
  pythongh-99735: Use required=True in argparse subparsers example (python#100927)
  Fix incorrectly documented attribute in csv docs (python#101250)
  pythonGH-84783: Make the slice object hashable (pythonGH-101264)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting review performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inline dict/list/set comprehensions in the compiler for better performance
3 participants