
Slower string concatenation in CPython 3.11 #99862

Closed
zephyr111 opened this issue Nov 29, 2022 · 6 comments
Labels
performance Performance or resource usage type-bug An unexpected behavior, bug, or error

Comments

@zephyr111

zephyr111 commented Nov 29, 2022

Hello,

We have found a regression between CPython 3.10.8 and CPython 3.11 that makes string concatenation in loops significantly slower on Windows 10. This is described in detail in this StackOverflow post.

Here is a minimal, reproducible example of benchmarking code:

import time
a = 'a'

start = time.time()
for _ in range(1000000):
    a += 'a'
end = time.time()

print(a[:5], (end-start) * 1000)

CPython 3.11.0 is about 100 times slower than CPython 3.10.8, due to a quadratic running time (as opposed to the linear running time of CPython 3.10.8).

The analysis shows that CPython 3.10.8 generated an INPLACE_ADD instruction, so PyUnicode_Append was called at runtime, while CPython 3.11.0 now generates a BINARY_OP instruction, so PyUnicode_Concat is called instead. The latter function creates a new, bigger string, which drastically reduces the performance of the string-appending loop in the provided code. This appears to be related to issue #89799. I think if we want to replace INPLACE_ADD with a BINARY_OP, then an optimization that checks the reference count (so the operation can eventually be done in place) is missing from CPython 3.11.0. What do you think about it?
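The bytecode difference described above can be observed directly with the standard `dis` module; a minimal sketch (the exact opcode depends on the interpreter version: 3.11+ emits `BINARY_OP`, while 3.10 and earlier emit `INPLACE_ADD`):

```python
import dis

# Compile a bare augmented assignment at module scope and list its opcodes.
code = compile("a += 'a'", "<example>", "exec")
opnames = [instr.opname for instr in dis.Bytecode(code)]
print(opnames)
# On CPython 3.11+ the addition appears as BINARY_OP;
# on 3.10 and earlier it appears as INPLACE_ADD.
```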

My environment is an embedded CPython 3.10.8 and an embedded CPython 3.11.0, both running on Windows 10 (22H2) with a x86-64 processor (i5-9600KF).

@zephyr111 zephyr111 added the type-bug An unexpected behavior, bug, or error label Nov 29, 2022
@AlexWaygood AlexWaygood added the performance Performance or resource usage label Nov 29, 2022
@ned-deily
Member

ned-deily commented Nov 29, 2022

@brandtbucher @pablogsal

@sweeneyde
Member

sweeneyde commented Nov 29, 2022

Python 3.11 has a specialized (a la PEP 659) instruction called BINARY_OP_INPLACE_ADD_UNICODE:

cpython/Python/ceval.c

Lines 2064 to 2098 in 5bbf8ed

TARGET(BINARY_OP_INPLACE_ADD_UNICODE) {
    assert(cframe.use_tracing == 0);
    PyObject *left = SECOND();
    PyObject *right = TOP();
    DEOPT_IF(!PyUnicode_CheckExact(left), BINARY_OP);
    DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
    _Py_CODEUNIT true_next = next_instr[INLINE_CACHE_ENTRIES_BINARY_OP];
    assert(_Py_OPCODE(true_next) == STORE_FAST ||
           _Py_OPCODE(true_next) == STORE_FAST__LOAD_FAST);
    PyObject **target_local = &GETLOCAL(_Py_OPARG(true_next));
    DEOPT_IF(*target_local != left, BINARY_OP);
    STAT_INC(BINARY_OP, hit);
    /* Handle `left = left + right` or `left += right` for str.
     *
     * When possible, extend `left` in place rather than
     * allocating a new PyUnicodeObject. This attempts to avoid
     * quadratic behavior when one neglects to use str.join().
     *
     * If `left` has only two references remaining (one from
     * the stack, one in the locals), DECREFing `left` leaves
     * only the locals reference, so PyUnicode_Append knows
     * that the string is safe to mutate.
     */
    assert(Py_REFCNT(left) >= 2);
    _Py_DECREF_NO_DEALLOC(left);
    STACK_SHRINK(2);
    PyUnicode_Append(target_local, right);
    _Py_DECREF_SPECIALIZED(right, _PyUnicode_ExactDealloc);
    if (*target_local == NULL) {
        goto error;
    }
    // The STORE_FAST is already done.
    JUMPBY(INLINE_CACHE_ENTRIES_BINARY_OP + 1);
    DISPATCH();
}

However, this optimization is restricted to local variables within a function; it does not apply to global/module-scope variables.

Does the regression go away if you put the code in a function?

def f():
    a = ''
    for _ in range(1000000):
        a += 'a'
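For reference, the snippet above can be turned into a self-contained timing harness (a sketch only; the absolute numbers will vary by machine, and `time.perf_counter` is used here purely for illustration):

```python
import time

def f():
    # Local-variable concatenation: eligible for the
    # BINARY_OP_INPLACE_ADD_UNICODE specialization in 3.11.
    a = ''
    for _ in range(1000000):
        a += 'a'
    return a

start = time.perf_counter()
result = f()
elapsed_ms = (time.perf_counter() - start) * 1000
print(len(result), round(elapsed_ms, 1), 'ms')
```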

@zephyr111
Author

zephyr111 commented Nov 29, 2022

Does the regression go away if you put the code in a function?

Indeed, this is significantly faster when the code is put in a function: 90 ms in CPython 3.11.0 versus 140 ms in CPython 3.10.8. Thank you for this information. So the problem only affects global variables. Solving it could still be useful for people working directly in an interactive interpreter console (like IPython).

@pablogsal
Member

pablogsal commented Nov 29, 2022

for _ in range(1000000):
    a += 'a'
end = time.time()

that code is quadratic, and the only reason it is not quadratic in practice is that we do some hacks in the eval loop to reuse the same string; because strings are immutable, it remains quadratic in nature. As I see it, there is not much reason to spend a lot of resources speeding up this particular code, especially if the regression only involves globals.

The correct idiom for this should be:

parts = []
for _ in range(1000000):
    parts.append('a')
a = "".join(parts)
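As a sanity check, the join idiom produces exactly the same string as the concatenation loop; a small sketch (the loop count here is illustrative):

```python
n = 100000  # illustrative count

# Linear-time build: collect the pieces, join once at the end.
parts = []
for _ in range(n):
    parts.append('a')
joined = ''.join(parts)

# Equivalent result via repeated concatenation
# (quadratic in the worst case, as discussed above).
concatenated = ''
for _ in range(n):
    concatenated += 'a'

print(joined == concatenated, len(joined))
```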

I am inclined to close this as "won't fix" unless someone feels strongly otherwise.

@eendebakpt
Contributor

eendebakpt commented Nov 30, 2022

Closing this issue is fine with me as well, but I was surprised to find that global/module-level code can be slower in 3.11 than local code. Is this behaviour documented somewhere? I could not find it in the 3.11 release notes or in https://github.com/python/cpython/blob/main/Python/adaptive.md.

@gvanrossum
Member

gvanrossum commented Dec 21, 2022

There are many situations where globals are slower than locals, due to the nature of locals. This is nothing new.

(Update: I initially wrote faster, should be slower.)

@gvanrossum gvanrossum closed this as not planned Won't fix, can't repro, duplicate, stale Dec 21, 2022