bpo-45256: Remove the usage of the C stack in Python to Python calls #28488

pablogsal · 2021-09-21T10:42:58Z

Remove the usage of the cstack in Python to Python calls
Fix gdb
Move depth to frame structure for easier retrieval in debugging tools
Refactor and simplifications

https://bugs.python.org/issue45256

bedevere-bot · 2021-09-21T10:47:52Z

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit b4149b6 🤖

If you want to schedule another build, you need to add the "🔨 test-with-buildbots" label again.

pablogsal · 2021-09-21T10:49:11Z

Include/internal/pycore_frame.h

@@ -31,6 +31,7 @@ typedef struct _interpreter_frame {
    int f_lasti;       /* Last instruction if called */
    int stacktop;     /* Offset of TOS from localsplus  */
    PyFrameState f_state;  /* What state the frame is in */
+    int depth; /* Depth of the frame in a ceval loop */


The depth is on the frame object and not on the cframe because this is much easier to fetch from gdb and other debugging tools.

Why is tstate->cframe->depth any harder than tstate->frame->depth?

Because is not tstate->cframe->depth or tstate->frame->depth what gdb fetches. gdb gets the frame from a local variable from the stack so you just start with the frame pointer. In general tools fetch the frame because that is what they are interested into. If we place it in the cframe now we need a way to correlate the frames with the number, and that relationship is one to many, which is more difficult to maintain correctly. Gdb also constructs the frame from a pointer, but if we now need the cframe this forces it to also take it as an argument, and that is not always available in the places where frames are constructed, like when you walk the list backwards or forwards, creating new objects.

Also, if you place it in tstate->cframe->depth now you need to fetch both (cframe and frame) and match them accordingly, which is much more complicated that it needs to be, and this also forces the tool to keep the cframe struct up to date (when not using gdb, but when using DWARF or the offsets in the struct).

markshannon

Looking good. A few minor issues.

Tools/gdb/libpython.py

markshannon · 2021-09-21T11:04:50Z

Python/ceval.c

+static InterpreterFrame*
+_PyEval_FrameFromPyFunctionAndArgs(PyThreadState *tstate, PyObject* const *args, int nargs, PyObject *function) {
+    assert(PyFunction_Check(function));
+    size_t nargsf = nargs | PY_VECTORCALL_ARGUMENTS_OFFSET;


nargsf and vector_nargs below are only used in asserts.
I think all the asserts and assignments to nargsf and vector_nargs could be replaced with assert(nargs == 0 || args != NULL);

markshannon · 2021-09-21T11:04:50Z

Include/internal/pycore_frame.h

@@ -31,6 +31,7 @@ typedef struct _interpreter_frame {
    int f_lasti;       /* Last instruction if called */
    int stacktop;     /* Offset of TOS from localsplus  */
    PyFrameState f_state;  /* What state the frame is in */
+    int depth; /* Depth of the frame in a ceval loop */


This is not really a property of the frame, but its place in the call stack. A generator frame could have different depths during its lifetime.
It also takes space and time on the fast path of calls.
Could you move the depth back to the CFrame?

No, check this comment:

#28488 (comment)

This is here because using this information from debugging tools is much much easier this way. While I was fixing the gdb stuff, it became clear that having it into the cframe generates a considerable amount of complexity.

I profiled this and it has no measurable impact on runtime performance.

That's probably more a comment on our ability to measure small changes 😞
We can always change it later if it is slower.

What I'm more worried about is the redundancy and bugs due to out of date depth information on frames that outlive the call.

What we could do is, instead of having a depth value, have an entry flag.
On entry to _PyEval_EvalFrameDefault, set the entry flag, and unset it on exit.
I think that would be safer, and no slower.

What I'm more worried about is the redundancy and bugs due to out of date depth information on frames that outlive the call.

As in generators or coroutines? Those will always have depth=0

At the moment sure, but we will want to flatten next/yield in the same way as we are doing with call/return here.

but we will want to flatten next/yield in the same way as we are doing with call/return here.

But then it will be easy: on the exit path set it to -1 or 0.

markshannon · 2021-09-21T11:04:50Z

Python/ceval.c

@@ -5309,7 +5382,7 @@ get_exception_handler(PyCodeObject *code, int index, int *level, int *handler, i
 static int
 initialize_locals(PyThreadState *tstate, PyFrameConstructor *con,
    PyObject **localsplus, PyObject *const *args,
-    Py_ssize_t argcount, PyObject *kwnames)
+    Py_ssize_t argcount, PyObject *kwnames, int steal_args)


When do we not steal references? Would it be cleaner to have this function always steal the references, and increment them in the caller when necessary?

The cleanup would be more complex, see #28488 (comment)

I don't agree.
Cleanup is simpler if initialize_locals is consistent and always steals the references to args regardless of whether it succeeds or fails.
initialize_locals (or any other function for that matter) should leave things in an consistent state.

But then what do you suggest? That initialize_locals always steals and _PyEvalFramePushAndInit is the one that increases or decreases the refcount?

I'm suggesting that both functions have consistent behavior w.r.t. reference counts regardless of whether they succeed or fail.
Otherwise they are just traps for the unwary.

I think it will make the code more efficient, and probably cleaner, if you make both steal the references to args.

I am not fully convinced of what you want to achieve here. Who is in charge of deciding if the caller owns back the references? The caller of _PyEvalFramePushAndInit?

Also, the code to retake ownership will be a bit complex because part of what we steal is not flattened in frame->localsplus but sometimes in a dictionary on frame->localsplus (in case of the variables passed by keywords).
So what you propose forces us to increment the reference count on the frame->localsplus array + handle the keyword dict and iterate over that using the dict API, which is not optimal.

Also, this function doesn't leave anything "inconsistent". The contract is easy:

On failure, the caller always owns all references to the arguments. On success, the caller owns them if steal is false.

No other function has such a contract. They either steal all the references, or they steal none of them.
Anything more complicated than that is too easy to misuse.

These functions will get more use as we extend this to CALL_FUNCTION_KW, CALL_METHOD and CALL_METHOD_KW, so they need an easy to use interface.

Ok, I trust your judgement but I would like to discuss this together so we are on the same page because I still have some concerns. Let's schedule a call.

May I add my 2 cent? I have been bitten very much with such inconsistencies, a few (20) years ago. Consistency with simplicity is a very good concept.

markshannon · 2021-09-21T11:04:50Z

Python/ceval.c

+    if (initialize_locals(tstate, con, localsarray, args, argcount, kwnames, steal_args)) {
+        if (steal_args) {
+            for (int i = 0; i < frame->stacktop; i++) {
+                Py_XINCREF(frame->localsplus[i]);


Why is this necessary?

So if the function fails the caller still owns all references. This simplifies a lot of failure scenarios. For instance, imagine that yo have a function that takes 5 parameters but you pass 10. In this case, the function will return NULL but it would have only stolen 5 parameters, while the caller is still in charge of decrementing the other 5. But the problem is that the caller needs to know if this was the problem or something else (because the error could have been something different). THis way, on failure, the caller always have to decrement the references, which is very easy to do in the normal paths

When you say "the function" do you mean _PyEvalFramePushAndInit or initialize_locals?

_PyEvalFramePushAndInit with steal = 1

But the argument can be extended to initialize_locals because _PyEvalFramePushAndInit doesn't know what went wrong if it returns NULL

bedevere-bot · 2021-09-21T11:04:54Z

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

pablogsal · 2021-09-21T11:12:06Z

@markshannon Damn, unfortunately something is going on with Windows:

D:\a\cpython\cpython\PCbuild\python.vcxproj(125,5): warning MSB3073: The command "setlocal
D:\a\cpython\cpython\PCbuild\python.vcxproj(125,5): warning MSB3073: set PYTHONPATH=D:\a\cpython\cpython\Lib
D:\a\cpython\cpython\PCbuild\python.vcxproj(125,5): warning MSB3073: "D:\a\cpython\cpython\PCbuild\win32\python.exe" "D:\a\cpython\cpython\PC\validate_ucrtbase.py" ucrtbase" exited with code -1073741819.
D:\a\cpython\cpython\PCbuild\python.vcxproj(125,5): warning MSB4181: The "Exec" task returned false but did not log an error.
D:\a\cpython\cpython\PCbuild\regen.targets(107,5): error MSB3073: The command "D:\a\cpython\cpython\PCbuild\win32\python.exe Programs\freeze_test_frozenmain.py Programs/test_frozenmain.h" exited with code -1073741819. [D:\a\cpython\cpython\PCbuild\python.vcxproj]
    6 Warning(s)
    1 Error(s)

Time Elapsed 00:04:54.99

Apparently that is a File System error but I don't understand why this may be happening :(

pablogsal · 2021-09-21T13:24:45Z

@markshannon Oh, I found the source of the Windows failure: it was an existing bug in the PREDICT() macro for the code path without computed gotos. Fixed it in 50276ad

That was challenging to find indeed!

bedevere-bot · 2021-09-21T13:38:11Z

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 50276ad 🤖

If you want to schedule another build, you need to add the "🔨 test-with-buildbots" label again.

Fidget-Spinner

This is really cool. I just have a few comments on readability. No idea why ASAN is failing tbh.

One other concern I have is that this slows down non-python function calls very slightly (with the two additional checks). I highly doubt it's measurable on pyperformance though.

Python/ceval.c

Fidget-Spinner · 2021-09-21T14:46:23Z

Python/ceval.c

+            Py_DECREF(function);
+            _PyFrame_SetStackPointer(frame, stack_pointer);
+            new_frame->depth = frame->depth + 1;
+            tstate->frame = frame = new_frame;


Note to self, no need to pop frame here because it's done on eval frame exit at exit_eval_frame.

Python/ceval.c

Tools/gdb/libpython.py

pablogsal · 2021-09-21T14:57:16Z

I cannot reproduce the ASAN bug. :(

Fidget-Spinner · 2021-09-21T16:15:46Z

There're refleaks but I can't pinpoint where :(

python.exe -m test test_grammar test_builtin -R 3:3
0:00:00 Run tests sequentially
0:00:00 [1/2] test_grammar
beginning 6 repetitions
123456
......
test_grammar leaked [12, 12, 12] references, sum=36
0:00:01 [2/2/1] test_builtin -- test_grammar failed (reference leak)
beginning 6 repetitions
123456
......
test_builtin leaked [13, 13, 13] references, sum=39
test_builtin failed (reference leak)

== Tests result: FAILURE ==

Ths commit inlines calls to Python functions in the eval loop and steals all the arguments in the call from the caller for performance.

…uted gotos

pablogsal · 2021-09-21T16:31:47Z

Doesn't work for me either
Line 1618 this file, this commit.

Updated

pablogsal · 2021-09-21T16:43:08Z

There're refleaks but I can't pinpoint where :(

The refleaks are on the main branch, opened: #28493

Tools/gdb/libpython.py

Co-authored-by: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com> Co-authored-by: Dennis Sweeney <36520290+sweeneyde@users.noreply.github.com>

pablogsal · 2021-09-22T14:58:16Z

+-------------------------+------------------------------------+---------------------------------------------+
| Benchmark               | 2021-09-21_23-09-main-8f943ca25732 | 2021-09-21_16-54-simple_cframe-d7a99c1f6ccb |
+=========================+====================================+=============================================+
| spectral_norm           | 153 ms                             | 140 ms: 1.09x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| scimark_sparse_mat_mult | 5.80 ms                            | 5.47 ms: 1.06x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| unpack_sequence         | 55.0 ns                            | 51.8 ns: 1.06x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| scimark_fft             | 442 ms                             | 425 ms: 1.04x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| nbody                   | 153 ms                             | 147 ms: 1.04x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| chaos                   | 102 ms                             | 98 ms: 1.04x faster                         |
+-------------------------+------------------------------------+---------------------------------------------+
| regex_dna               | 252 ms                             | 243 ms: 1.04 faster                         |
+-------------------------+------------------------------------+---------------------------------------------+
| fannkuch                | 513 ms                             | 493 ms: 1.04x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| scimark_sor             | 196 ms                             | 190 ms: 1.03x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| chameleon               | 9.96 ms                            | 9.66 ms: 1.03x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| regex_effbot            | 3.75 ms                            | 3.64 ms: 1.03x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| pickle_pure_python      | 476 us                             | 462 us: 1.03x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| scimark_monte_carlo     | 102 ms                             | 99 ms: 1.03x faster                         |
+-------------------------+------------------------------------+---------------------------------------------+
| json_loads              | 33.5 us                            | 32.0 us: 1.03x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| pickle_list             | 4.81 us                            | 4.66 us: 1.03x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| unpickle_list           | 5.91 us                            | 5.73 us: 1.03x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| regex_compile           | 184 ms                             | 178 ms: 1.03x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| json_dumps              | 16.4 ms                            | 15.9 ms: 1.03x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| xml_etree_process       | 82.1 ms                            | 81.1 ms: 1.01x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| telco                   | 7.70 ms                            | 7.62 ms: 1.01x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| nqueens                 | 109 ms                             | 104 ms: 1.04x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| richards                | 77.0 ms                            | 74.3 ms: 1.04x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| xml_etree_generate      | 111 ms                             | 108 ms: 1.02x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| sqlalchemy_imperative   | 23.5 ms                            | 23.0 ms: 1.02x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| xml_etree_parse         | 185 ms                             | 181 ms: 1.02x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| pickle                  | 12.5 us                            | 12.4 us: 1.01x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| mako                    | 16.7 ms                            | 16.6 ms: 1.01x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| pidigits                | 236 ms                             | 234 ms: 1.01x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| hexiom                  | 9.39 ms                            | 9.34 ms: 1.01x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| django_template         | 49.8 ms                            | 49.7 ms: 1.00x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| pickle_dict             | 31.9 us                            | 32.0 us: 1.00x slower                       |
+-------------------------+------------------------------------+---------------------------------------------+
| python_startup          | 11.8 ms                            | 11.8 ms: 1.00x slower                       |
+-------------------------+------------------------------------+---------------------------------------------+
| python_startup_no_site  | 8.41 ms                            | 8.44 ms: 1.00x slower                       |
+-------------------------+------------------------------------+---------------------------------------------+
| sympy_sum               | 207 ms                             | 208 ms: 1.00x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| sympy_expand            | 605 ms                             | 608 ms: 1.00x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| xml_etree_iterparse     | 124 ms                             | 125 ms: 1.01x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| sympy_str               | 362 ms                             | 366 ms: 1.01x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| raytrace                | 434 ms                             | 440 ms: 1.01x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| unpickle_pure_python    | 339 us                             | 344 us: 1.02x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| logging_format          | 7.66 us                            | 7.82 us: 1.02x slower                       |
+-------------------------+------------------------------------+---------------------------------------------+
| go                      | 215 ms                             | 220 ms: 1.02x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| deltablue               | 6.50 ms                            | 6.67 ms: 1.03x slower                       |
+-------------------------+------------------------------------+---------------------------------------------+
| Geometric mean          | (ref)                              | 1.03x faster                                |
+-------------------------+------------------------------------+---------------------------------------------+

Benchmark hidden because not significant (16): sqlite_synth, unpickle, scimark_lu, float, regex_v8, crypto_pyaes, tornado_http, pyflate, dulwich_log, pathlib, meteor_contest, 2to3, sqlalchemy_declarative, sympy_integrate, logging_simple, logging_silent

Drop test for tracing and inline coroutine creation

Fidget-Spinner · 2021-09-23T15:26:04Z

Python/ceval.c

+                if ((code->co_flags & (CO_GENERATOR | CO_COROUTINE | CO_ASYNC_GENERATOR)) == 0) {
+                    InterpreterFrame *new_frame = _PyEval_FrameFromPyFunctionAndArgs(tstate, stack_pointer+1, oparg, function);
+                    if (new_frame == NULL) {
+                        // When we exit here, we own all variables in the stack (the frame creation has not stolen


Wait, since you called STACK_SHRINK(oparg+1) above, don't you need to Py_DECREF all the function arguments and function objects too? I don't think error will decref them automatically for you since it looks from TOS down.

That's true (and my fault).

We need to make functions that steal references, always steal those references.
Otherwise this sort of error is just too easy to make.

Python/ceval.c

Co-authored-by: Brett Cannon <brett@python.org>

pablogsal requested a review from markshannon as a code owner Sep 21, 2021

the-knights-who-say-ni added the CLA signed label Sep 21, 2021

bedevere-bot added the awaiting core review label Sep 21, 2021

pablogsal added skip news 🔨 test-with-buildbots and removed 🔨 test-with-buildbots labels Sep 21, 2021

bedevere-bot removed the 🔨 test-with-buildbots label Sep 21, 2021

pablogsal reviewed Sep 21, 2021

View changes

pablogsal force-pushed the simple_cframe branch from b4149b6 to 415273d Sep 21, 2021

markshannon requested changes Sep 21, 2021

View changes

bedevere-bot added awaiting changes and removed awaiting core review labels Sep 21, 2021

pablogsal added the 🔨 test-with-buildbots label Sep 21, 2021

bedevere-bot removed the 🔨 test-with-buildbots label Sep 21, 2021

Fidget-Spinner reviewed Sep 21, 2021

View changes

pablogsal force-pushed the simple_cframe branch from 98fced8 to 21b7d29 Sep 21, 2021

pablogsal added 5 commits Sep 21, 2021

Remove the usage of the cstack in Python to Python calls

8a88fd3

Ths commit inlines calls to Python functions in the eval loop and steals all the arguments in the call from the caller for performance.

Fix gdb

799e366

Move depth to frame structure for easier retrieval in debugging tools

306e29f

Refactor and simplifications

fc36be1

Fix existing bug in the PREDICT() macro for the codepath without comp…

e1091e0

…uted gotos

pablogsal force-pushed the simple_cframe branch from 21b7d29 to e1091e0 Sep 21, 2021

sweeneyde reviewed Sep 21, 2021

View changes

Tools/gdb/libpython.py Outdated Show resolved Hide resolved

Apply suggestions from code review

d7a99c1

Co-authored-by: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com> Co-authored-by: Dennis Sweeney <36520290+sweeneyde@users.noreply.github.com>

pablogsal changed the title ~~bpo-45256: Remove the usage of the cstack in Python to Python calls~~ bpo-45256: Remove the usage of the C stack in Python to Python calls Sep 21, 2021

Fidget-Spinner mentioned this pull request Sep 22, 2021

Implement CALL_FUNCTION specialization "family" for PEP 659 adaptive interpreter faster-cpython/ideas#54

Open

markshannon and others added 5 commits Sep 23, 2021

Tracing of Python functions is handled at entry not call.

0721cea

Inline creation of coroutines.

a2c0994

Merge remote-tracking branch 'upstream/main' into simple_cframe

832bccf

Reorder flow in CALL_FUNCTION to make it a bit clearer.

9a63ee1

Merge pull request #16 from markshannon/simple_cframe

919b741

Drop test for tracing and inline coroutine creation

Fidget-Spinner reviewed Sep 23, 2021

View changes

brettcannon reviewed Sep 23, 2021

View changes

Python/ceval.c Outdated Show resolved Hide resolved

Update Python/ceval.c

9210541

Co-authored-by: Brett Cannon <brett@python.org>

python / cpython Public

bpo-45256: Remove the usage of the C stack in Python to Python calls #28488

Are you sure you want to change the base?

bpo-45256: Remove the usage of the C stack in Python to Python calls #28488

Conversation

pablogsal commented Sep 21, 2021 • edited by bedevere-bot

bedevere-bot commented Sep 21, 2021

pablogsal Sep 21, 2021 • edited

markshannon left a comment

markshannon Sep 21, 2021 • edited

pablogsal Sep 21, 2021 • edited

bedevere-bot commented Sep 21, 2021

pablogsal commented Sep 21, 2021

pablogsal commented Sep 21, 2021

bedevere-bot commented Sep 21, 2021

Fidget-Spinner left a comment

pablogsal commented Sep 21, 2021 • edited

Fidget-Spinner commented Sep 21, 2021

pablogsal commented Sep 21, 2021

pablogsal commented Sep 21, 2021 • edited

pablogsal commented Sep 22, 2021 • edited

Fidget-Spinner Sep 23, 2021 • edited

pablogsal commented Sep 21, 2021 •

edited by bedevere-bot

pablogsal Sep 21, 2021 •

edited

markshannon Sep 21, 2021 •

edited

pablogsal Sep 21, 2021 •

edited

pablogsal commented Sep 21, 2021 •

edited

pablogsal commented Sep 21, 2021 •

edited

pablogsal commented Sep 22, 2021 •

edited

Fidget-Spinner Sep 23, 2021 •

edited