
gh-98003: Inline call frames for CALL_FUNCTION_EX #98004

Open
wants to merge 15 commits into base: main

Conversation


@Fidget-Spinner Fidget-Spinner commented Oct 6, 2022


brandtbucher commented Oct 7, 2022

Looks like not much difference:

Benchmarks with tag 'apps':
===========================

Slower (2):
- html5lib: 59.3 ms +- 2.4 ms -> 60.4 ms +- 2.4 ms: 1.02x slower
- tornado_http: 92.9 ms +- 1.3 ms -> 93.4 ms +- 1.4 ms: 1.01x slower

Faster (1):
- 2to3: 247 ms +- 1 ms -> 246 ms +- 0 ms: 1.00x faster

Benchmark hidden because not significant (1): chameleon

Geometric mean: 1.01x slower

Benchmarks with tag 'math':
===========================

Slower (1):
- pidigits: 193 ms +- 0 ms -> 199 ms +- 0 ms: 1.03x slower

Faster (2):
- nbody: 96.2 ms +- 2.0 ms -> 94.9 ms +- 1.9 ms: 1.01x faster
- float: 71.7 ms +- 0.9 ms -> 71.3 ms +- 1.0 ms: 1.01x faster

Geometric mean: 1.00x slower

Benchmarks with tag 'regex':
============================

Slower (2):
- regex_effbot: 3.33 ms +- 0.02 ms -> 3.39 ms +- 0.02 ms: 1.02x slower
- regex_v8: 21.1 ms +- 0.1 ms -> 21.3 ms +- 0.1 ms: 1.01x slower

Faster (2):
- regex_dna: 200 ms +- 1 ms -> 198 ms +- 1 ms: 1.01x faster
- regex_compile: 128 ms +- 1 ms -> 127 ms +- 1 ms: 1.01x faster

Geometric mean: 1.00x slower

Benchmarks with tag 'serialize':
================================

Slower (2):
- pickle_list: 3.92 us +- 0.03 us -> 4.11 us +- 0.10 us: 1.05x slower
- pickle_dict: 30.3 us +- 0.1 us -> 31.1 us +- 0.1 us: 1.03x slower

Faster (3):
- xml_etree_iterparse: 105 ms +- 2 ms -> 101 ms +- 1 ms: 1.03x faster
- pickle: 10.3 us +- 0.3 us -> 10.1 us +- 0.1 us: 1.02x faster
- xml_etree_process: 53.2 ms +- 0.5 ms -> 52.4 ms +- 0.5 ms: 1.01x faster

Benchmark hidden because not significant (8): json_dumps, json_loads, pickle_pure_python, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_generate

Geometric mean: 1.00x faster

Benchmarks with tag 'startup':
==============================

Slower (2):
- python_startup_no_site: 6.08 ms +- 0.01 ms -> 6.08 ms +- 0.01 ms: 1.00x slower
- python_startup: 8.45 ms +- 0.01 ms -> 8.45 ms +- 0.01 ms: 1.00x slower

Geometric mean: 1.00x slower

Benchmarks with tag 'template':
===============================

Slower (3):
- mako: 9.36 ms +- 0.05 ms -> 9.69 ms +- 0.04 ms: 1.04x slower
- genshi_xml: 48.5 ms +- 0.4 ms -> 49.5 ms +- 0.6 ms: 1.02x slower
- genshi_text: 20.9 ms +- 0.3 ms -> 21.2 ms +- 0.3 ms: 1.02x slower

Faster (1):
- django_template: 32.5 ms +- 0.4 ms -> 32.2 ms +- 0.3 ms: 1.01x faster

Geometric mean: 1.02x slower

All benchmarks:
===============

Slower (31):
- pickle_list: 3.92 us +- 0.03 us -> 4.11 us +- 0.10 us: 1.05x slower
- unpack_sequence: 45.7 ns +- 0.9 ns -> 47.9 ns +- 0.7 ns: 1.05x slower
- mako: 9.36 ms +- 0.05 ms -> 9.69 ms +- 0.04 ms: 1.04x slower
- deepcopy_reduce: 2.87 us +- 0.04 us -> 2.97 us +- 0.04 us: 1.04x slower
- go: 132 ms +- 1 ms -> 136 ms +- 1 ms: 1.03x slower
- deepcopy_memo: 33.2 us +- 0.3 us -> 34.3 us +- 0.4 us: 1.03x slower
- pidigits: 193 ms +- 0 ms -> 199 ms +- 0 ms: 1.03x slower
- pickle_dict: 30.3 us +- 0.1 us -> 31.1 us +- 0.1 us: 1.03x slower
- json: 4.62 ms +- 0.08 ms -> 4.73 ms +- 0.12 ms: 1.03x slower
- genshi_xml: 48.5 ms +- 0.4 ms -> 49.5 ms +- 0.6 ms: 1.02x slower
- html5lib: 59.3 ms +- 2.4 ms -> 60.4 ms +- 2.4 ms: 1.02x slower
- regex_effbot: 3.33 ms +- 0.02 ms -> 3.39 ms +- 0.02 ms: 1.02x slower
- logging_silent: 90.3 ns +- 0.9 ns -> 91.9 ns +- 1.8 ns: 1.02x slower
- richards: 43.0 ms +- 1.3 ms -> 43.6 ms +- 1.1 ms: 1.02x slower
- genshi_text: 20.9 ms +- 0.3 ms -> 21.2 ms +- 0.3 ms: 1.02x slower
- generators: 73.3 ms +- 0.5 ms -> 74.3 ms +- 0.2 ms: 1.01x slower
- crypto_pyaes: 73.0 ms +- 0.6 ms -> 73.9 ms +- 0.5 ms: 1.01x slower
- deepcopy: 324 us +- 3 us -> 327 us +- 4 us: 1.01x slower
- sqlglot_parse: 1.32 ms +- 0.01 ms -> 1.33 ms +- 0.01 ms: 1.01x slower
- logging_format: 6.32 us +- 0.07 us -> 6.38 us +- 0.12 us: 1.01x slower
- regex_v8: 21.1 ms +- 0.1 ms -> 21.3 ms +- 0.1 ms: 1.01x slower
- thrift: 736 us +- 13 us -> 743 us +- 17 us: 1.01x slower
- chaos: 67.4 ms +- 0.8 ms -> 67.9 ms +- 0.9 ms: 1.01x slower
- sympy_sum: 162 ms +- 2 ms -> 163 ms +- 1 ms: 1.01x slower
- tornado_http: 92.9 ms +- 1.3 ms -> 93.4 ms +- 1.4 ms: 1.01x slower
- coroutines: 23.6 ms +- 0.2 ms -> 23.8 ms +- 0.1 ms: 1.01x slower
- aiohttp: 1.00 ms +- 0.00 ms -> 1.01 ms +- 0.00 ms: 1.01x slower
- dulwich_log: 60.9 ms +- 0.4 ms -> 61.1 ms +- 0.4 ms: 1.00x slower
- gunicorn: 1.09 ms +- 0.01 ms -> 1.09 ms +- 0.01 ms: 1.00x slower
- python_startup_no_site: 6.08 ms +- 0.01 ms -> 6.08 ms +- 0.01 ms: 1.00x slower
- python_startup: 8.45 ms +- 0.01 ms -> 8.45 ms +- 0.01 ms: 1.00x slower

Faster (29):
- mdp: 2.70 sec +- 0.01 sec -> 2.53 sec +- 0.01 sec: 1.07x faster
- fannkuch: 382 ms +- 4 ms -> 366 ms +- 5 ms: 1.05x faster
- nqueens: 80.4 ms +- 1.0 ms -> 77.7 ms +- 1.2 ms: 1.03x faster
- xml_etree_iterparse: 105 ms +- 2 ms -> 101 ms +- 1 ms: 1.03x faster
- scimark_sparse_mat_mult: 4.10 ms +- 0.08 ms -> 4.00 ms +- 0.05 ms: 1.03x faster
- spectral_norm: 94.1 ms +- 2.9 ms -> 91.8 ms +- 1.2 ms: 1.03x faster
- telco: 6.50 ms +- 0.11 ms -> 6.36 ms +- 0.12 ms: 1.02x faster
- scimark_fft: 311 ms +- 5 ms -> 305 ms +- 3 ms: 1.02x faster
- pickle: 10.3 us +- 0.3 us -> 10.1 us +- 0.1 us: 1.02x faster
- xml_etree_process: 53.2 ms +- 0.5 ms -> 52.4 ms +- 0.5 ms: 1.01x faster
- pathlib: 17.7 ms +- 0.2 ms -> 17.5 ms +- 0.3 ms: 1.01x faster
- nbody: 96.2 ms +- 2.0 ms -> 94.9 ms +- 1.9 ms: 1.01x faster
- sympy_str: 282 ms +- 6 ms -> 279 ms +- 4 ms: 1.01x faster
- regex_dna: 200 ms +- 1 ms -> 198 ms +- 1 ms: 1.01x faster
- scimark_sor: 103 ms +- 2 ms -> 102 ms +- 1 ms: 1.01x faster
- async_tree_none: 517 ms +- 10 ms -> 512 ms +- 9 ms: 1.01x faster
- django_template: 32.5 ms +- 0.4 ms -> 32.2 ms +- 0.3 ms: 1.01x faster
- sympy_expand: 453 ms +- 7 ms -> 450 ms +- 6 ms: 1.01x faster
- async_tree_io: 1.31 sec +- 0.01 sec -> 1.30 sec +- 0.01 sec: 1.01x faster
- meteor_contest: 103 ms +- 1 ms -> 102 ms +- 2 ms: 1.01x faster
- regex_compile: 128 ms +- 1 ms -> 127 ms +- 1 ms: 1.01x faster
- scimark_monte_carlo: 66.2 ms +- 1.0 ms -> 65.8 ms +- 0.9 ms: 1.01x faster
- float: 71.7 ms +- 0.9 ms -> 71.3 ms +- 1.0 ms: 1.01x faster
- sqlite_synth: 2.59 us +- 0.02 us -> 2.58 us +- 0.03 us: 1.00x faster
- 2to3: 247 ms +- 1 ms -> 246 ms +- 0 ms: 1.00x faster
- hexiom: 5.99 ms +- 0.04 ms -> 5.97 ms +- 0.04 ms: 1.00x faster
- sqlglot_normalize: 104 ms +- 1 ms -> 104 ms +- 1 ms: 1.00x faster
- sympy_integrate: 20.3 ms +- 0.1 ms -> 20.3 ms +- 0.1 ms: 1.00x faster
- sqlglot_optimize: 50.9 ms +- 0.3 ms -> 50.8 ms +- 0.2 ms: 1.00x faster

Benchmark hidden because not significant (26): async_tree_cpu_io_mixed, async_tree_memoization, chameleon, coverage, deltablue, flaskblogging, json_dumps, json_loads, logging_simple, mypy, pickle_pure_python, pprint_safe_repr, pprint_pformat, pycparser, pyflate, pylint, raytrace, scimark_lu, sqlalchemy_declarative, sqlalchemy_imperative, sqlglot_transpile, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_generate

Geometric mean: 1.00x slower

Perhaps we should try inlining all CALL_FUNCTION_EX, rather than specializing? Your call.
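For context on the opcode being discussed: CALL_FUNCTION_EX is what CPython emits for calls that use star-unpacking, unlike plain calls, which go through the (specializable) CALL opcodes. A quick way to see it, assuming a CPython with the `dis` module (exact disassembly output varies across versions):

```python
import dis

def call_with_unpacking(f, args, kwargs):
    # A call with *args/**kwargs unpacking compiles to CALL_FUNCTION_EX.
    return f(*args, **kwargs)

opnames = [ins.opname for ins in dis.get_instructions(call_with_unpacking)]
print("CALL_FUNCTION_EX" in opnames)  # → True
```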

@Fidget-Spinner Fidget-Spinner changed the title gh-98003: Specialize CALL_FUNCTION_EX gh-98003: Inline call frames for CALL_FUNCTION_EX Oct 7, 2022

Fidget-Spinner commented Oct 26, 2022

@markshannon gentle ping for a review please.


markshannon commented Oct 26, 2022

The approach I would prefer is to:

  1. Refactor PyFunction_Type.tp_call so that it creates a frame, then calls PyEval_EvalFrame()
  2. Use the factored out code for creating the frame in CALL_FUNCTION_EX

There is a lot of unnecessary dismantling of tuples and dicts in the current approach. This is not a problem with this PR, but I think it needs to be sorted out before "inlining" the call in CALL_FUNCTION_EX.

Ideally CALL_FUNCTION_EX would look something like this:

CALL_FUNCTION_EX:
   if (Py_IS_TYPE(func, &PyFunction_Type) &&
        tstate->interp->eval_frame == NULL &&
        ((PyFunctionObject *)func)->vectorcall == _PyFunction_Vectorcall) {
                _PyInterpreterFrame *new_frame = _PyEvalFramePushAndInit_TD(
                    tstate, (PyFunctionObject *)func, locals, callargs, kwargs);
                Py_DECREF(callargs);
                Py_DECREF(kwargs);
                if (new_frame == NULL) {
                    goto error;
                }
                /* Enter frame */
               ...
   }
   ...
Where _PyEvalFramePushAndInit_TD pushes the frame, and calls the factored out function for initializing the frame from *args, **kwargs.


Fidget-Spinner commented Oct 27, 2022

The approach I would prefer is to:

  1. Refactor PyFunction_Type.tp_call so that it creates a frame, then calls PyEval_EvalFrame()
  2. Use the factored out code for creating the frame in CALL_FUNCTION_EX

I'm confused. The first step is to create a frame, but then your example code shows it creating another frame. Why does it do that?


markshannon commented Oct 27, 2022

Because we don't really create a frame. We push a frame and then initialize it.

Pushing a frame is simple. It is initializing that is complex and we want to factor out.
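A toy model of that split, in plain Python rather than CPython's actual C frame machinery (the names `push_frame`/`init_frame` are illustrative, not real APIs): pushing just reserves the frame's slots, while initializing does the complex work of binding *args/**kwargs into them.

```python
def push_frame(frame_stack, n_slots):
    # Pushing is simple: reserve space for the frame's local slots.
    frame = {"localsplus": [None] * n_slots}
    frame_stack.append(frame)
    return frame

def init_frame(frame, param_names, args, kwargs):
    # Initializing is the complex part: bind positional and keyword
    # arguments to slots, checking for duplicates along the way.
    bound = dict(zip(param_names, args))
    for name, value in kwargs.items():
        if name in bound:
            raise TypeError(f"got multiple values for argument {name!r}")
        bound[name] = value
    for i, name in enumerate(param_names):
        frame["localsplus"][i] = bound[name]

stack = []
frame = push_frame(stack, 2)
init_frame(frame, ["a", "b"], (1,), {"b": 2})
print(frame["localsplus"])  # → [1, 2]
```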


brandtbucher commented Nov 18, 2022

Just curious, has this stalled out for now? At the very least, the changes in ceval.c will need to move to the new bytecodes.c, and you'll probably want to use the new DISPATCH_INLINED macro for inlining the call.

Totally get it if you're busy with school or something right now though! :) Just making sure that you aren't blocked on reviews or design discussions. If so, maybe we can move this forward (I'd really like to see it in)!


Fidget-Spinner commented Nov 19, 2022

Totally get it if you're busy with school or something right now though! :) Just making sure that you aren't blocked on reviews or design discussions. If so, maybe we can move this forward (I'd really like to see it in)!

Yeah, my exams end next week. I plan to start contributing actively again after that!


Fidget-Spinner commented Dec 5, 2022

@brandtbucher and @markshannon I think this is ready for your reviews (when you have time of course).


@markshannon markshannon left a comment

This approach creates a lot of intermediate objects.
I'm not sure if that is a result of the transformations in this PR, or if the original code also created those intermediate objects.

If the objects were already being created, then maybe we should streamline it in another PR.
@Fidget-Spinner what do you think?

int code_flags = ((PyCodeObject *)PyFunction_GET_CODE(func))->co_flags;
PyObject *locals = code_flags & CO_OPTIMIZED ? NULL : Py_NewRef(PyFunction_GET_GLOBALS(func));

_PyInterpreterFrame *new_frame = _PyEvalFramePushAndInit_Ex(tstate,
@markshannon markshannon Dec 9, 2022
I think it would make sense to move the unpacking of the tuple into _PyEvalFramePushAndInit_Ex.

_PyInterpreterFrame *new_frame = _PyEvalFramePushAndInit_Ex(tstate, func, locals, callargs, kwargs).

CALL_FUNCTION_EX is quite a large instruction. Pushing more code into _PyEvalFramePushAndInit_Ex would keep it smaller.

@Fidget-Spinner Fidget-Spinner Dec 10, 2022
Unpacking which tuple?

If you mean moving the code above in, that would break convention with the rest of the code base. The other inlined calls do this

                int code_flags = ((PyCodeObject *)PyFunction_GET_CODE(func))->co_flags;
                PyObject *locals = code_flags & CO_OPTIMIZED ? NULL : Py_NewRef(PyFunction_GET_GLOBALS(func));

before creating the new frame.

Py_INCREF(PyTuple_GET_ITEM(callargs, i));
}
}
_PyInterpreterFrame *new_frame = _PyEvalFramePushAndInit(
@markshannon markshannon Dec 9, 2022
Rather than transforming callargs and kwargs into a shape suitable for _PyEvalFramePushAndInit, it might make sense to push the frame, then unpack the tuple and dict into the frame without the intermediate objects.
It complicates the error handling a bit, and the frame will need to be cleaned up if there is an error.

@Fidget-Spinner Fidget-Spinner Dec 9, 2022
I can work on this in another PR.


markshannon commented Dec 9, 2022

Thanks for this PR. Avoiding C stack consumption for these calls is definitely something we want.


Fidget-Spinner commented Dec 9, 2022

This approach creates a lot of intermediate objects.
I'm not sure if that is a result of the transformations in this PR, or if the original code also created those intermediate objects.

If the objects were already being created, then maybe we should streamline it in another PR.
@Fidget-Spinner what do you think?

Yes, these are the semantics of the original CALL_FUNCTION_EX. Most of this code was copy-pasted. I was surprised at how allocation-heavy CALL_FUNCTION_EX is and have been thinking of ways to improve it.
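Part of why it has to allocate: the language guarantees that a callee's `**kwargs` parameter receives a new mapping, so even a pure pass-through call cannot simply reuse the caller's dict. A small demonstration of the semantics any inlined implementation must preserve:

```python
def callee(*a, **k):
    return a, k

args = (1, 2)
kwargs = {"x": 3}
a, k = callee(*args, **kwargs)

# Values round-trip unchanged...
assert a == args and k == kwargs
# ...but the **kwargs dict the callee sees is a fresh object, so
# mutating it cannot affect the caller's dict.
assert k is not kwargs
k["y"] = 4
print("y" in kwargs)  # → False
```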

4 participants