Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-45256: Remove the usage of the C stack in Python to Python calls #28488

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

@pablogsal
Copy link
Member

@pablogsal pablogsal commented Sep 21, 2021

  • Remove the usage of the cstack in Python to Python calls
  • Fix gdb
  • Move depth to frame structure for easier retrieval in debugging tools
  • Refactor and simplifications

https://bugs.python.org/issue45256

@bedevere-bot
Copy link

@bedevere-bot bedevere-bot commented Sep 21, 2021

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit b4149b6 🤖

If you want to schedule another build, you need to add the "🔨 test-with-buildbots" label again.

@@ -31,6 +31,7 @@ typedef struct _interpreter_frame {
int f_lasti; /* Last instruction if called */
int stacktop; /* Offset of TOS from localsplus */
PyFrameState f_state; /* What state the frame is in */
int depth; /* Depth of the frame in a ceval loop */
Copy link
Member Author

@pablogsal pablogsal Sep 21, 2021

The depth is on the frame object and not on the cframe because this is much easier to fetch from gdb and other debugging tools.

Copy link
Contributor

@markshannon markshannon Sep 21, 2021

Why is tstate->cframe->depth any harder than tstate->frame->depth?

Copy link
Member Author

@pablogsal pablogsal Sep 21, 2021

Because is not tstate->cframe->depth or tstate->frame->depth what gdb fetches. gdb gets the frame from a local variable from the stack so you just start with the frame pointer. In general tools fetch the frame because that is what they are interested into. If we place it in the cframe now we need a way to correlate the frames with the number, and that relationship is one to many, which is more difficult to maintain correctly. Gdb also constructs the frame from a pointer, but if we now need the cframe this forces it to also take it as an argument, and that is not always available in the places where frames are constructed, like when you walk the list backwards or forwards, creating new objects.

Also, if you place it in tstate->cframe->depth now you need to fetch both (cframe and frame) and match them accordingly, which is much more complicated that it needs to be, and this also forces the tool to keep the cframe struct up to date (when not using gdb, but when using DWARF or the offsets in the struct).

Copy link
Contributor

@markshannon markshannon left a comment

Looking good. A few minor issues.

Tools/gdb/libpython.py Outdated Show resolved Hide resolved
static InterpreterFrame*
_PyEval_FrameFromPyFunctionAndArgs(PyThreadState *tstate, PyObject* const *args, int nargs, PyObject *function) {
assert(PyFunction_Check(function));
size_t nargsf = nargs | PY_VECTORCALL_ARGUMENTS_OFFSET;
Copy link
Contributor

@markshannon markshannon Sep 21, 2021

nargsf and vector_nargs below are only used in asserts.
I think all the asserts and assignments to nargsf and vector_nargs could be replaced with assert(nargs == 0 || args != NULL);

@@ -31,6 +31,7 @@ typedef struct _interpreter_frame {
int f_lasti; /* Last instruction if called */
int stacktop; /* Offset of TOS from localsplus */
PyFrameState f_state; /* What state the frame is in */
int depth; /* Depth of the frame in a ceval loop */
Copy link
Contributor

@markshannon markshannon Sep 21, 2021

This is not really a property of the frame, but its place in the call stack. A generator frame could have different depths during its lifetime.
It also takes space and time on the fast path of calls.
Could you move the depth back to the CFrame?

Copy link
Member Author

@pablogsal pablogsal Sep 21, 2021

No, check this comment:

#28488 (comment)

This is here because using this information from debugging tools is much much easier this way. While I was fixing the gdb stuff, it became clear that having it into the cframe generates a considerable amount of complexity.

Copy link
Member Author

@pablogsal pablogsal Sep 21, 2021

I profiled this and it has no measurable impact on runtime performance.

Copy link
Contributor

@markshannon markshannon Sep 21, 2021

That's probably more a comment on our ability to measure small changes 😞
We can always change it later if it is slower.

What I'm more worried about is the redundancy and bugs due to out of date depth information on frames that outlive the call.

Copy link
Contributor

@markshannon markshannon Sep 21, 2021

What we could do is, instead of having a depth value, have an entry flag.
On entry to _PyEval_EvalFrameDefault, set the entry flag, and unset it on exit.
I think that would be safer, and no slower.

Copy link
Member Author

@pablogsal pablogsal Sep 21, 2021

What I'm more worried about is the redundancy and bugs due to out of date depth information on frames that outlive the call.

As in generators or coroutines? Those will always have depth=0

Copy link
Contributor

@markshannon markshannon Sep 21, 2021

At the moment sure, but we will want to flatten next/yield in the same way as we are doing with call/return here.

Copy link
Member Author

@pablogsal pablogsal Sep 21, 2021

but we will want to flatten next/yield in the same way as we are doing with call/return here.

But then it will be easy: on the exit path set it to -1 or 0.

@@ -5309,7 +5382,7 @@ get_exception_handler(PyCodeObject *code, int index, int *level, int *handler, i
static int
initialize_locals(PyThreadState *tstate, PyFrameConstructor *con,
PyObject **localsplus, PyObject *const *args,
Py_ssize_t argcount, PyObject *kwnames)
Py_ssize_t argcount, PyObject *kwnames, int steal_args)
Copy link
Contributor

@markshannon markshannon Sep 21, 2021

When do we not steal references? Would it be cleaner to have this function always steal the references, and increment them in the caller when necessary?

Copy link
Member Author

@pablogsal pablogsal Sep 21, 2021

The cleanup would be more complex, see #28488 (comment)

Copy link
Contributor

@markshannon markshannon Sep 21, 2021

I don't agree.
Cleanup is simpler if initialize_locals is consistent and always steals the references to args regardless of whether it succeeds or fails.
initialize_locals (or any other function for that matter) should leave things in an consistent state.

Copy link
Member Author

@pablogsal pablogsal Sep 21, 2021

But then what do you suggest? That initialize_locals always steals and _PyEvalFramePushAndInit is the one that increases or decreases the refcount?

Copy link
Contributor

@markshannon markshannon Sep 21, 2021

I'm suggesting that both functions have consistent behavior w.r.t. reference counts regardless of whether they succeed or fail.
Otherwise they are just traps for the unwary.

I think it will make the code more efficient, and probably cleaner, if you make both steal the references to args.

Copy link
Member Author

@pablogsal pablogsal Sep 21, 2021

I am not fully convinced of what you want to achieve here. Who is in charge of deciding if the caller owns back the references? The caller of _PyEvalFramePushAndInit?

Also, the code to retake ownership will be a bit complex because part of what we steal is not flattened in frame->localsplus but sometimes in a dictionary on frame->localsplus (in case of the variables passed by keywords).
So what you propose forces us to increment the reference count on the frame->localsplus array + handle the keyword dict and iterate over that using the dict API, which is not optimal.

Copy link
Member Author

@pablogsal pablogsal Sep 21, 2021

Also, this function doesn't leave anything "inconsistent". The contract is easy:

On failure, the caller always owns all references to the arguments. On success, the caller owns them if steal is false.

Copy link
Contributor

@markshannon markshannon Sep 22, 2021

No other function has such a contract. They either steal all the references, or they steal none of them.
Anything more complicated than that is too easy to misuse.

These functions will get more use as we extend this to CALL_FUNCTION_KW, CALL_METHOD and CALL_METHOD_KW, so they need an easy to use interface.

Copy link
Member Author

@pablogsal pablogsal Sep 22, 2021

Ok, I trust your judgement but I would like to discuss this together so we are on the same page because I still have some concerns. Let's schedule a call.

Copy link
Contributor

@ctismer ctismer Sep 23, 2021

May I add my 2 cent? I have been bitten very much with such inconsistencies, a few (20) years ago. Consistency with simplicity is a very good concept.

if (initialize_locals(tstate, con, localsarray, args, argcount, kwnames, steal_args)) {
if (steal_args) {
for (int i = 0; i < frame->stacktop; i++) {
Py_XINCREF(frame->localsplus[i]);
Copy link
Contributor

@markshannon markshannon Sep 21, 2021

Why is this necessary?

Copy link
Member Author

@pablogsal pablogsal Sep 21, 2021

So if the function fails the caller still owns all references. This simplifies a lot of failure scenarios. For instance, imagine that yo have a function that takes 5 parameters but you pass 10. In this case, the function will return NULL but it would have only stolen 5 parameters, while the caller is still in charge of decrementing the other 5. But the problem is that the caller needs to know if this was the problem or something else (because the error could have been something different). THis way, on failure, the caller always have to decrement the references, which is very easy to do in the normal paths

Copy link
Contributor

@markshannon markshannon Sep 21, 2021

When you say "the function" do you mean _PyEvalFramePushAndInit or initialize_locals?

Copy link
Member Author

@pablogsal pablogsal Sep 21, 2021

_PyEvalFramePushAndInit with steal = 1

Copy link
Member Author

@pablogsal pablogsal Sep 21, 2021

But the argument can be extended to initialize_locals because _PyEvalFramePushAndInit doesn't know what went wrong if it returns NULL

@bedevere-bot
Copy link

@bedevere-bot bedevere-bot commented Sep 21, 2021

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@pablogsal
Copy link
Member Author

@pablogsal pablogsal commented Sep 21, 2021

@markshannon Damn, unfortunately something is going on with Windows:

D:\a\cpython\cpython\PCbuild\python.vcxproj(125,5): warning MSB3073: The command "setlocal
D:\a\cpython\cpython\PCbuild\python.vcxproj(125,5): warning MSB3073: set PYTHONPATH=D:\a\cpython\cpython\Lib
D:\a\cpython\cpython\PCbuild\python.vcxproj(125,5): warning MSB3073: "D:\a\cpython\cpython\PCbuild\win32\python.exe" "D:\a\cpython\cpython\PC\validate_ucrtbase.py" ucrtbase" exited with code -1073741819.
D:\a\cpython\cpython\PCbuild\python.vcxproj(125,5): warning MSB4181: The "Exec" task returned false but did not log an error.
D:\a\cpython\cpython\PCbuild\regen.targets(107,5): error MSB3073: The command "D:\a\cpython\cpython\PCbuild\win32\python.exe Programs\freeze_test_frozenmain.py Programs/test_frozenmain.h" exited with code -1073741819. [D:\a\cpython\cpython\PCbuild\python.vcxproj]
    6 Warning(s)
    1 Error(s)

Time Elapsed 00:04:54.99

Apparently that is a File System error but I don't understand why this may be happening :(

@pablogsal
Copy link
Member Author

@pablogsal pablogsal commented Sep 21, 2021

@markshannon Oh, I found the source of the Windows failure: it was an existing bug in the PREDICT() macro for the code path without computed gotos. Fixed it in 50276ad

That was challenging to find indeed!

@bedevere-bot
Copy link

@bedevere-bot bedevere-bot commented Sep 21, 2021

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 50276ad 🤖

If you want to schedule another build, you need to add the "🔨 test-with-buildbots" label again.

Copy link
Contributor

@Fidget-Spinner Fidget-Spinner left a comment

This is really cool. I just have a few comments on readability. No idea why ASAN is failing tbh.

One other concern I have is that this slows down non-python function calls very slightly (with the two additional checks). I highly doubt it's measurable on pyperformance though.

Python/ceval.c Outdated Show resolved Hide resolved
Python/ceval.c Outdated Show resolved Hide resolved
Python/ceval.c Outdated
Py_DECREF(function);
_PyFrame_SetStackPointer(frame, stack_pointer);
new_frame->depth = frame->depth + 1;
tstate->frame = frame = new_frame;
Copy link
Contributor

@Fidget-Spinner Fidget-Spinner Sep 21, 2021

Note to self, no need to pop frame here because it's done on eval frame exit at exit_eval_frame.

Python/ceval.c Outdated Show resolved Hide resolved
Python/ceval.c Outdated Show resolved Hide resolved
Tools/gdb/libpython.py Outdated Show resolved Hide resolved
@pablogsal
Copy link
Member Author

@pablogsal pablogsal commented Sep 21, 2021

I cannot reproduce the ASAN bug. :(

@Fidget-Spinner
Copy link
Contributor

@Fidget-Spinner Fidget-Spinner commented Sep 21, 2021

There're refleaks but I can't pinpoint where :(

python.exe -m test test_grammar test_builtin -R 3:3
0:00:00 Run tests sequentially
0:00:00 [1/2] test_grammar
beginning 6 repetitions
123456
......
test_grammar leaked [12, 12, 12] references, sum=36
0:00:01 [2/2/1] test_builtin -- test_grammar failed (reference leak)
beginning 6 repetitions
123456
......
test_builtin leaked [13, 13, 13] references, sum=39
test_builtin failed (reference leak)

== Tests result: FAILURE ==

@pablogsal
Copy link
Member Author

@pablogsal pablogsal commented Sep 21, 2021

Doesn't work for me either
Line 1618 this file, this commit.

Updated

@pablogsal
Copy link
Member Author

@pablogsal pablogsal commented Sep 21, 2021

There're refleaks but I can't pinpoint where :(

The refleaks are on the main branch, opened: #28493

Tools/gdb/libpython.py Outdated Show resolved Hide resolved
Co-authored-by: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com>
Co-authored-by: Dennis Sweeney <36520290+sweeneyde@users.noreply.github.com>
@pablogsal pablogsal changed the title bpo-45256: Remove the usage of the cstack in Python to Python calls bpo-45256: Remove the usage of the C stack in Python to Python calls Sep 21, 2021
@pablogsal
Copy link
Member Author

@pablogsal pablogsal commented Sep 22, 2021

+-------------------------+------------------------------------+---------------------------------------------+
| Benchmark               | 2021-09-21_23-09-main-8f943ca25732 | 2021-09-21_16-54-simple_cframe-d7a99c1f6ccb |
+=========================+====================================+=============================================+
| spectral_norm           | 153 ms                             | 140 ms: 1.09x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| scimark_sparse_mat_mult | 5.80 ms                            | 5.47 ms: 1.06x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| unpack_sequence         | 55.0 ns                            | 51.8 ns: 1.06x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| scimark_fft             | 442 ms                             | 425 ms: 1.04x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| nbody                   | 153 ms                             | 147 ms: 1.04x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| chaos                   | 102 ms                             | 98 ms: 1.04x faster                         |
+-------------------------+------------------------------------+---------------------------------------------+
| regex_dna               | 252 ms                             | 243 ms: 1.04 faster                         |
+-------------------------+------------------------------------+---------------------------------------------+
| fannkuch                | 513 ms                             | 493 ms: 1.04x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| scimark_sor             | 196 ms                             | 190 ms: 1.03x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| chameleon               | 9.96 ms                            | 9.66 ms: 1.03x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| regex_effbot            | 3.75 ms                            | 3.64 ms: 1.03x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| pickle_pure_python      | 476 us                             | 462 us: 1.03x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| scimark_monte_carlo     | 102 ms                             | 99 ms: 1.03x faster                         |
+-------------------------+------------------------------------+---------------------------------------------+
| json_loads              | 33.5 us                            | 32.0 us: 1.03x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| pickle_list             | 4.81 us                            | 4.66 us: 1.03x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| unpickle_list           | 5.91 us                            | 5.73 us: 1.03x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| regex_compile           | 184 ms                             | 178 ms: 1.03x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| json_dumps              | 16.4 ms                            | 15.9 ms: 1.03x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| xml_etree_process       | 82.1 ms                            | 81.1 ms: 1.01x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| telco                   | 7.70 ms                            | 7.62 ms: 1.01x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| nqueens                 | 109 ms                             | 104 ms: 1.04x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| richards                | 77.0 ms                            | 74.3 ms: 1.04x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| xml_etree_generate      | 111 ms                             | 108 ms: 1.02x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| sqlalchemy_imperative   | 23.5 ms                            | 23.0 ms: 1.02x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| xml_etree_parse         | 185 ms                             | 181 ms: 1.02x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| pickle                  | 12.5 us                            | 12.4 us: 1.01x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| mako                    | 16.7 ms                            | 16.6 ms: 1.01x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| pidigits                | 236 ms                             | 234 ms: 1.01x faster                        |
+-------------------------+------------------------------------+---------------------------------------------+
| hexiom                  | 9.39 ms                            | 9.34 ms: 1.01x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| django_template         | 49.8 ms                            | 49.7 ms: 1.00x faster                       |
+-------------------------+------------------------------------+---------------------------------------------+
| pickle_dict             | 31.9 us                            | 32.0 us: 1.00x slower                       |
+-------------------------+------------------------------------+---------------------------------------------+
| python_startup          | 11.8 ms                            | 11.8 ms: 1.00x slower                       |
+-------------------------+------------------------------------+---------------------------------------------+
| python_startup_no_site  | 8.41 ms                            | 8.44 ms: 1.00x slower                       |
+-------------------------+------------------------------------+---------------------------------------------+
| sympy_sum               | 207 ms                             | 208 ms: 1.00x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| sympy_expand            | 605 ms                             | 608 ms: 1.00x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| xml_etree_iterparse     | 124 ms                             | 125 ms: 1.01x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| sympy_str               | 362 ms                             | 366 ms: 1.01x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| raytrace                | 434 ms                             | 440 ms: 1.01x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| unpickle_pure_python    | 339 us                             | 344 us: 1.02x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| logging_format          | 7.66 us                            | 7.82 us: 1.02x slower                       |
+-------------------------+------------------------------------+---------------------------------------------+
| go                      | 215 ms                             | 220 ms: 1.02x slower                        |
+-------------------------+------------------------------------+---------------------------------------------+
| deltablue               | 6.50 ms                            | 6.67 ms: 1.03x slower                       |
+-------------------------+------------------------------------+---------------------------------------------+
| Geometric mean          | (ref)                              | 1.03x faster                                |
+-------------------------+------------------------------------+---------------------------------------------+

Benchmark hidden because not significant (16): sqlite_synth, unpickle, scimark_lu, float, regex_v8, crypto_pyaes, tornado_http, pyflate, dulwich_log, pathlib, meteor_contest, 2to3, sqlalchemy_declarative, sympy_integrate, logging_simple, logging_silent

if ((code->co_flags & (CO_GENERATOR | CO_COROUTINE | CO_ASYNC_GENERATOR)) == 0) {
InterpreterFrame *new_frame = _PyEval_FrameFromPyFunctionAndArgs(tstate, stack_pointer+1, oparg, function);
if (new_frame == NULL) {
// When we exit here, we own all variables in the stack (the frame creation has not stolen
Copy link
Contributor

@Fidget-Spinner Fidget-Spinner Sep 23, 2021

Wait, since you called STACK_SHRINK(oparg+1) above, don't you need to Py_DECREF all the function arguments and function objects too? I don't think error will decref them automatically for you since it looks from TOS down.

Copy link
Contributor

@markshannon markshannon Sep 23, 2021

That's true (and my fault).

We need to make functions that steal references, always steal those references.
Otherwise this sort of error is just too easy to make.

Python/ceval.c Outdated Show resolved Hide resolved
Co-authored-by: Brett Cannon <brett@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment