
gh-98003: Inline call frames for CALL_FUNCTION_EX #98004

Open
wants to merge 15 commits into base: main

Conversation


@Fidget-Spinner Fidget-Spinner commented Oct 6, 2022


brandtbucher commented Oct 7, 2022

Looks like not much difference:

Benchmarks with tag 'apps':
===========================

Slower (2):
- html5lib: 59.3 ms +- 2.4 ms -> 60.4 ms +- 2.4 ms: 1.02x slower
- tornado_http: 92.9 ms +- 1.3 ms -> 93.4 ms +- 1.4 ms: 1.01x slower

Faster (1):
- 2to3: 247 ms +- 1 ms -> 246 ms +- 0 ms: 1.00x faster

Benchmark hidden because not significant (1): chameleon

Geometric mean: 1.01x slower

Benchmarks with tag 'math':
===========================

Slower (1):
- pidigits: 193 ms +- 0 ms -> 199 ms +- 0 ms: 1.03x slower

Faster (2):
- nbody: 96.2 ms +- 2.0 ms -> 94.9 ms +- 1.9 ms: 1.01x faster
- float: 71.7 ms +- 0.9 ms -> 71.3 ms +- 1.0 ms: 1.01x faster

Geometric mean: 1.00x slower

Benchmarks with tag 'regex':
============================

Slower (2):
- regex_effbot: 3.33 ms +- 0.02 ms -> 3.39 ms +- 0.02 ms: 1.02x slower
- regex_v8: 21.1 ms +- 0.1 ms -> 21.3 ms +- 0.1 ms: 1.01x slower

Faster (2):
- regex_dna: 200 ms +- 1 ms -> 198 ms +- 1 ms: 1.01x faster
- regex_compile: 128 ms +- 1 ms -> 127 ms +- 1 ms: 1.01x faster

Geometric mean: 1.00x slower

Benchmarks with tag 'serialize':
================================

Slower (2):
- pickle_list: 3.92 us +- 0.03 us -> 4.11 us +- 0.10 us: 1.05x slower
- pickle_dict: 30.3 us +- 0.1 us -> 31.1 us +- 0.1 us: 1.03x slower

Faster (3):
- xml_etree_iterparse: 105 ms +- 2 ms -> 101 ms +- 1 ms: 1.03x faster
- pickle: 10.3 us +- 0.3 us -> 10.1 us +- 0.1 us: 1.02x faster
- xml_etree_process: 53.2 ms +- 0.5 ms -> 52.4 ms +- 0.5 ms: 1.01x faster

Benchmark hidden because not significant (8): json_dumps, json_loads, pickle_pure_python, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_generate

Geometric mean: 1.00x faster

Benchmarks with tag 'startup':
==============================

Slower (2):
- python_startup_no_site: 6.08 ms +- 0.01 ms -> 6.08 ms +- 0.01 ms: 1.00x slower
- python_startup: 8.45 ms +- 0.01 ms -> 8.45 ms +- 0.01 ms: 1.00x slower

Geometric mean: 1.00x slower

Benchmarks with tag 'template':
===============================

Slower (3):
- mako: 9.36 ms +- 0.05 ms -> 9.69 ms +- 0.04 ms: 1.04x slower
- genshi_xml: 48.5 ms +- 0.4 ms -> 49.5 ms +- 0.6 ms: 1.02x slower
- genshi_text: 20.9 ms +- 0.3 ms -> 21.2 ms +- 0.3 ms: 1.02x slower

Faster (1):
- django_template: 32.5 ms +- 0.4 ms -> 32.2 ms +- 0.3 ms: 1.01x faster

Geometric mean: 1.02x slower

All benchmarks:
===============

Slower (31):
- pickle_list: 3.92 us +- 0.03 us -> 4.11 us +- 0.10 us: 1.05x slower
- unpack_sequence: 45.7 ns +- 0.9 ns -> 47.9 ns +- 0.7 ns: 1.05x slower
- mako: 9.36 ms +- 0.05 ms -> 9.69 ms +- 0.04 ms: 1.04x slower
- deepcopy_reduce: 2.87 us +- 0.04 us -> 2.97 us +- 0.04 us: 1.04x slower
- go: 132 ms +- 1 ms -> 136 ms +- 1 ms: 1.03x slower
- deepcopy_memo: 33.2 us +- 0.3 us -> 34.3 us +- 0.4 us: 1.03x slower
- pidigits: 193 ms +- 0 ms -> 199 ms +- 0 ms: 1.03x slower
- pickle_dict: 30.3 us +- 0.1 us -> 31.1 us +- 0.1 us: 1.03x slower
- json: 4.62 ms +- 0.08 ms -> 4.73 ms +- 0.12 ms: 1.03x slower
- genshi_xml: 48.5 ms +- 0.4 ms -> 49.5 ms +- 0.6 ms: 1.02x slower
- html5lib: 59.3 ms +- 2.4 ms -> 60.4 ms +- 2.4 ms: 1.02x slower
- regex_effbot: 3.33 ms +- 0.02 ms -> 3.39 ms +- 0.02 ms: 1.02x slower
- logging_silent: 90.3 ns +- 0.9 ns -> 91.9 ns +- 1.8 ns: 1.02x slower
- richards: 43.0 ms +- 1.3 ms -> 43.6 ms +- 1.1 ms: 1.02x slower
- genshi_text: 20.9 ms +- 0.3 ms -> 21.2 ms +- 0.3 ms: 1.02x slower
- generators: 73.3 ms +- 0.5 ms -> 74.3 ms +- 0.2 ms: 1.01x slower
- crypto_pyaes: 73.0 ms +- 0.6 ms -> 73.9 ms +- 0.5 ms: 1.01x slower
- deepcopy: 324 us +- 3 us -> 327 us +- 4 us: 1.01x slower
- sqlglot_parse: 1.32 ms +- 0.01 ms -> 1.33 ms +- 0.01 ms: 1.01x slower
- logging_format: 6.32 us +- 0.07 us -> 6.38 us +- 0.12 us: 1.01x slower
- regex_v8: 21.1 ms +- 0.1 ms -> 21.3 ms +- 0.1 ms: 1.01x slower
- thrift: 736 us +- 13 us -> 743 us +- 17 us: 1.01x slower
- chaos: 67.4 ms +- 0.8 ms -> 67.9 ms +- 0.9 ms: 1.01x slower
- sympy_sum: 162 ms +- 2 ms -> 163 ms +- 1 ms: 1.01x slower
- tornado_http: 92.9 ms +- 1.3 ms -> 93.4 ms +- 1.4 ms: 1.01x slower
- coroutines: 23.6 ms +- 0.2 ms -> 23.8 ms +- 0.1 ms: 1.01x slower
- aiohttp: 1.00 ms +- 0.00 ms -> 1.01 ms +- 0.00 ms: 1.01x slower
- dulwich_log: 60.9 ms +- 0.4 ms -> 61.1 ms +- 0.4 ms: 1.00x slower
- gunicorn: 1.09 ms +- 0.01 ms -> 1.09 ms +- 0.01 ms: 1.00x slower
- python_startup_no_site: 6.08 ms +- 0.01 ms -> 6.08 ms +- 0.01 ms: 1.00x slower
- python_startup: 8.45 ms +- 0.01 ms -> 8.45 ms +- 0.01 ms: 1.00x slower

Faster (29):
- mdp: 2.70 sec +- 0.01 sec -> 2.53 sec +- 0.01 sec: 1.07x faster
- fannkuch: 382 ms +- 4 ms -> 366 ms +- 5 ms: 1.05x faster
- nqueens: 80.4 ms +- 1.0 ms -> 77.7 ms +- 1.2 ms: 1.03x faster
- xml_etree_iterparse: 105 ms +- 2 ms -> 101 ms +- 1 ms: 1.03x faster
- scimark_sparse_mat_mult: 4.10 ms +- 0.08 ms -> 4.00 ms +- 0.05 ms: 1.03x faster
- spectral_norm: 94.1 ms +- 2.9 ms -> 91.8 ms +- 1.2 ms: 1.03x faster
- telco: 6.50 ms +- 0.11 ms -> 6.36 ms +- 0.12 ms: 1.02x faster
- scimark_fft: 311 ms +- 5 ms -> 305 ms +- 3 ms: 1.02x faster
- pickle: 10.3 us +- 0.3 us -> 10.1 us +- 0.1 us: 1.02x faster
- xml_etree_process: 53.2 ms +- 0.5 ms -> 52.4 ms +- 0.5 ms: 1.01x faster
- pathlib: 17.7 ms +- 0.2 ms -> 17.5 ms +- 0.3 ms: 1.01x faster
- nbody: 96.2 ms +- 2.0 ms -> 94.9 ms +- 1.9 ms: 1.01x faster
- sympy_str: 282 ms +- 6 ms -> 279 ms +- 4 ms: 1.01x faster
- regex_dna: 200 ms +- 1 ms -> 198 ms +- 1 ms: 1.01x faster
- scimark_sor: 103 ms +- 2 ms -> 102 ms +- 1 ms: 1.01x faster
- async_tree_none: 517 ms +- 10 ms -> 512 ms +- 9 ms: 1.01x faster
- django_template: 32.5 ms +- 0.4 ms -> 32.2 ms +- 0.3 ms: 1.01x faster
- sympy_expand: 453 ms +- 7 ms -> 450 ms +- 6 ms: 1.01x faster
- async_tree_io: 1.31 sec +- 0.01 sec -> 1.30 sec +- 0.01 sec: 1.01x faster
- meteor_contest: 103 ms +- 1 ms -> 102 ms +- 2 ms: 1.01x faster
- regex_compile: 128 ms +- 1 ms -> 127 ms +- 1 ms: 1.01x faster
- scimark_monte_carlo: 66.2 ms +- 1.0 ms -> 65.8 ms +- 0.9 ms: 1.01x faster
- float: 71.7 ms +- 0.9 ms -> 71.3 ms +- 1.0 ms: 1.01x faster
- sqlite_synth: 2.59 us +- 0.02 us -> 2.58 us +- 0.03 us: 1.00x faster
- 2to3: 247 ms +- 1 ms -> 246 ms +- 0 ms: 1.00x faster
- hexiom: 5.99 ms +- 0.04 ms -> 5.97 ms +- 0.04 ms: 1.00x faster
- sqlglot_normalize: 104 ms +- 1 ms -> 104 ms +- 1 ms: 1.00x faster
- sympy_integrate: 20.3 ms +- 0.1 ms -> 20.3 ms +- 0.1 ms: 1.00x faster
- sqlglot_optimize: 50.9 ms +- 0.3 ms -> 50.8 ms +- 0.2 ms: 1.00x faster

Benchmark hidden because not significant (26): async_tree_cpu_io_mixed, async_tree_memoization, chameleon, coverage, deltablue, flaskblogging, json_dumps, json_loads, logging_simple, mypy, pickle_pure_python, pprint_safe_repr, pprint_pformat, pycparser, pyflate, pylint, raytrace, scimark_lu, sqlalchemy_declarative, sqlalchemy_imperative, sqlglot_transpile, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_generate

Geometric mean: 1.00x slower

Perhaps we should try inlining all CALL_FUNCTION_EX, rather than specializing? Your call.
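For context on the opcode being discussed: CALL_FUNCTION_EX is what CPython emits for calls that use star-unpacking, unlike plain calls, which go through the (specializable) CALL opcodes. A quick way to see it, assuming a CPython with the `dis` module (exact disassembly output varies across versions):

```python
import dis

def call_with_unpacking(f, args, kwargs):
    # A call with *args/**kwargs unpacking compiles to CALL_FUNCTION_EX.
    return f(*args, **kwargs)

opnames = [ins.opname for ins in dis.get_instructions(call_with_unpacking)]
print("CALL_FUNCTION_EX" in opnames)  # → True
```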

@Fidget-Spinner Fidget-Spinner changed the title gh-98003: Specialize CALL_FUNCTION_EX gh-98003: Inline call frames for CALL_FUNCTION_EX Oct 7, 2022

Fidget-Spinner commented Oct 26, 2022

@markshannon gentle ping for a review please.


markshannon commented Oct 26, 2022

The approach I would prefer is to:

  1. Refactor PyFunction_Type.tp_call so that it creates a frame, then calls PyEval_EvalFrame()
  2. Use the factored out code for creating the frame in CALL_FUNCTION_EX

There is a lot of unnecessary dismantling of tuples and dicts in the current approach. This is not a problem with this PR, but I think it needs to be sorted out before "inlining" the call in CALL_FUNCTION_EX.

Ideally CALL_FUNCTION_EX would look something like this:

CALL_FUNCTION_EX:
   if (Py_IS_TYPE(func, &PyFunction_Type) &&
        tstate->interp->eval_frame == NULL &&
        ((PyFunctionObject *)func)->vectorcall == _PyFunction_Vectorcall) {
                _PyInterpreterFrame *new_frame = _PyEvalFramePushAndInit_TD(
                    tstate, (PyFunctionObject *)func, locals, callargs, kwargs);
                Py_DECREF(callargs);
                Py_DECREF(kwargs);
                if (new_frame == NULL) {
                    goto error;
                }
                /* Enter frame */
               ...
   }
   ...
Where _PyEvalFramePushAndInit_TD pushes the frame, and calls the factored out function for initializing the frame from *args, **kwargs.


Fidget-Spinner commented Oct 27, 2022

The approach I would prefer is to:

  1. Refactor PyFunction_Type.tp_call so that it creates a frame, then calls PyEval_EvalFrame()
  2. Use the factored out code for creating the frame in CALL_FUNCTION_EX

I'm confused. The first step is to create a frame, but then your example code shows it creating another frame. Why does it do that?


markshannon commented Oct 27, 2022

Because we don't really create a frame. We push a frame and then initialize it.

Pushing a frame is simple. It is initializing that is complex and we want to factor out.
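A toy model of that split, in plain Python rather than CPython's actual C frame machinery (the names `push_frame`/`init_frame` are illustrative, not real APIs): pushing just reserves the frame's slots, while initializing does the complex work of binding *args/**kwargs into them.

```python
def push_frame(frame_stack, n_slots):
    # Pushing is simple: reserve space for the frame's local slots.
    frame = {"localsplus": [None] * n_slots}
    frame_stack.append(frame)
    return frame

def init_frame(frame, param_names, args, kwargs):
    # Initializing is the complex part: bind positional and keyword
    # arguments to slots, checking for duplicates along the way.
    bound = dict(zip(param_names, args))
    for name, value in kwargs.items():
        if name in bound:
            raise TypeError(f"got multiple values for argument {name!r}")
        bound[name] = value
    for i, name in enumerate(param_names):
        frame["localsplus"][i] = bound[name]

stack = []
frame = push_frame(stack, 2)
init_frame(frame, ["a", "b"], (1,), {"b": 2})
print(frame["localsplus"])  # → [1, 2]
```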


brandtbucher commented Nov 18, 2022

Just curious, has this stalled out for now? At the very least, the changes in ceval.c will need to move to the new bytecodes.c, and you'll probably want to use the new DISPATCH_INLINED macro for inlining the call.

Totally get it if you're busy with school or something right now though! :) Just making sure that you aren't blocked on reviews or design discussions. If so, maybe we can move this forward (I'd really like to see it in)!


Fidget-Spinner commented Nov 19, 2022

Totally get it if you're busy with school or something right now though! :) Just making sure that you aren't blocked on reviews or design discussions. If so, maybe we can move this forward (I'd really like to see it in)!

Yeah, my exams end next week. I plan to start contributing actively again after that!


Fidget-Spinner commented Dec 5, 2022

@brandtbucher and @markshannon I think this is ready for your reviews (when you have time of course).


@markshannon markshannon left a comment

This approach creates a lot of intermediate objects.
I'm not sure if that is a result of the transformations in this PR, or if the original code also created those intermediate objects.

If the objects were already being created, then maybe we should streamline it in another PR.
@Fidget-Spinner what do you think?

int code_flags = ((PyCodeObject *)PyFunction_GET_CODE(func))->co_flags;
PyObject *locals = code_flags & CO_OPTIMIZED ? NULL : Py_NewRef(PyFunction_GET_GLOBALS(func));

_PyInterpreterFrame *new_frame = _PyEvalFramePushAndInit_Ex(tstate,
@markshannon markshannon Dec 9, 2022
I think it would make sense to move the unpacking of the tuple into _PyEvalFramePushAndInit_Ex.

_PyInterpreterFrame *new_frame = _PyEvalFramePushAndInit_Ex(tstate, func, locals, callargs, kwargs).

CALL_FUNCTION_EX is quite a large instruction. Pushing more code into _PyEvalFramePushAndInit_Ex would keep it smaller.

@Fidget-Spinner Fidget-Spinner Dec 10, 2022
Unpacking which tuple?

If you mean moving the code above in, that would break convention with the rest of the code base. The other inlined calls do this

                int code_flags = ((PyCodeObject *)PyFunction_GET_CODE(func))->co_flags;
                PyObject *locals = code_flags & CO_OPTIMIZED ? NULL : Py_NewRef(PyFunction_GET_GLOBALS(func));

before creating the new frame.

Py_INCREF(PyTuple_GET_ITEM(callargs, i));
}
}
_PyInterpreterFrame *new_frame = _PyEvalFramePushAndInit(
@markshannon markshannon Dec 9, 2022
Rather than transforming callargs and kwargs into a shape suitable for _PyEvalFramePushAndInit, it might make sense to push the frame, then unpack the tuple and dict into the frame without the intermediate objects.
It complicates the error handling a bit, and the frame will need to be cleaned up if there is an error.

@Fidget-Spinner Fidget-Spinner Dec 9, 2022
I can work on this in another PR.


markshannon commented Dec 9, 2022

Thanks for this PR. Avoiding C stack consumption for these calls is definitely something we want.


Fidget-Spinner commented Dec 9, 2022

This approach creates a lot of intermediate objects.
I'm not sure if that is a result of the transformations in this PR, or if the original code also created those intermediate objects.

If the objects were already being created, then maybe we should streamline it in another PR.
@Fidget-Spinner what do you think?

Yes, these are the semantics of the original CALL_FUNCTION_EX. Most of this code was copy-pasted. I was surprised at how allocation-heavy CALL_FUNCTION_EX is and have been thinking of ways to improve it.
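Part of why it has to allocate: the language guarantees that a callee's `**kwargs` parameter receives a new mapping, so even a pure pass-through call cannot simply reuse the caller's dict. A small demonstration of the semantics any inlined implementation must preserve:

```python
def callee(*a, **k):
    return a, k

args = (1, 2)
kwargs = {"x": 3}
a, k = callee(*args, **kwargs)

# Values round-trip unchanged...
assert a == args and k == kwargs
# ...but the **kwargs dict the callee sees is a fresh object, so
# mutating it cannot affect the caller's dict.
assert k is not kwargs
k["y"] = 4
print("y" in kwargs)  # → False
```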

4 participants