
Slower string concatenation in CPython 3.11 #99862

Closed
zephyr111 opened this issue Nov 29, 2022 · 6 comments
Labels
performance Performance or resource usage type-bug An unexpected behavior, bug, or error

Comments

@zephyr111

zephyr111 commented Nov 29, 2022

Hello,

We have found a regression between CPython 3.10.8 and CPython 3.11 that makes string concatenation in loops significantly slower on Windows 10. This is described in detail in this StackOverflow post.

Here is a minimal, reproducible example of benchmarking code:

import time
a = 'a'

start = time.time()
for _ in range(1000000):
    a += 'a'
end = time.time()

print(a[:5], (end-start) * 1000)

CPython 3.11.0 is about 100 times slower than CPython 3.10.8, due to a quadratic running time (as opposed to the linear running time of CPython 3.10.8).

The analysis shows that CPython 3.10.8 generated an INPLACE_ADD instruction, so PyUnicode_Append was called at runtime, while CPython 3.11.0 now generates a BINARY_OP instruction, so PyUnicode_Concat is called instead. The latter function creates a new, bigger string, which drastically reduces the performance of the string-appending loop in the provided code. This appears to be related to issue #89799. I think if we want to replace INPLACE_ADD with a BINARY_OP, then an optimization that checks the reference count (so the operation can eventually be done in place) is missing from CPython 3.11.0. What do you think about it?
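The bytecode difference described above can be observed directly with the standard `dis` module; a minimal sketch (the exact opcode depends on the interpreter version: 3.11+ emits `BINARY_OP`, while 3.10 and earlier emit `INPLACE_ADD`):

```python
import dis

# Compile a bare augmented assignment at module scope and list its opcodes.
code = compile("a += 'a'", "<example>", "exec")
opnames = [instr.opname for instr in dis.Bytecode(code)]
print(opnames)
# On CPython 3.11+ the addition appears as BINARY_OP;
# on 3.10 and earlier it appears as INPLACE_ADD.
```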

My environment is an embedded CPython 3.10.8 and an embedded CPython 3.11.0, both running on Windows 10 (22H2) with a x86-64 processor (i5-9600KF).

@zephyr111 zephyr111 added the type-bug An unexpected behavior, bug, or error label Nov 29, 2022
@AlexWaygood AlexWaygood added the performance Performance or resource usage label Nov 29, 2022
@ned-deily
Member

ned-deily commented Nov 29, 2022

@brandtbucher @pablogsal

@sweeneyde
Member

sweeneyde commented Nov 29, 2022

Python 3.11 has a specialized (a la PEP 659) instruction called BINARY_OP_INPLACE_ADD_UNICODE:

cpython/Python/ceval.c

Lines 2064 to 2098 in 5bbf8ed

TARGET(BINARY_OP_INPLACE_ADD_UNICODE) {
    assert(cframe.use_tracing == 0);
    PyObject *left = SECOND();
    PyObject *right = TOP();
    DEOPT_IF(!PyUnicode_CheckExact(left), BINARY_OP);
    DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
    _Py_CODEUNIT true_next = next_instr[INLINE_CACHE_ENTRIES_BINARY_OP];
    assert(_Py_OPCODE(true_next) == STORE_FAST ||
           _Py_OPCODE(true_next) == STORE_FAST__LOAD_FAST);
    PyObject **target_local = &GETLOCAL(_Py_OPARG(true_next));
    DEOPT_IF(*target_local != left, BINARY_OP);
    STAT_INC(BINARY_OP, hit);
    /* Handle `left = left + right` or `left += right` for str.
     *
     * When possible, extend `left` in place rather than
     * allocating a new PyUnicodeObject. This attempts to avoid
     * quadratic behavior when one neglects to use str.join().
     *
     * If `left` has only two references remaining (one from
     * the stack, one in the locals), DECREFing `left` leaves
     * only the locals reference, so PyUnicode_Append knows
     * that the string is safe to mutate.
     */
    assert(Py_REFCNT(left) >= 2);
    _Py_DECREF_NO_DEALLOC(left);
    STACK_SHRINK(2);
    PyUnicode_Append(target_local, right);
    _Py_DECREF_SPECIALIZED(right, _PyUnicode_ExactDealloc);
    if (*target_local == NULL) {
        goto error;
    }
    // The STORE_FAST is already done.
    JUMPBY(INLINE_CACHE_ENTRIES_BINARY_OP + 1);
    DISPATCH();
}

However, this optimization is restricted to local variables within a function; it does not apply to global/module-scope variables.

Does the regression go away if you put the code in a function?

def f():
    a = ''
    for _ in range(1000000):
        a += 'a'
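For reference, the snippet above can be turned into a self-contained timing harness (a sketch only; the absolute numbers will vary by machine, and `time.perf_counter` is used here purely for illustration):

```python
import time

def f():
    # Local-variable concatenation: eligible for the
    # BINARY_OP_INPLACE_ADD_UNICODE specialization in 3.11.
    a = ''
    for _ in range(1000000):
        a += 'a'
    return a

start = time.perf_counter()
result = f()
elapsed_ms = (time.perf_counter() - start) * 1000
print(len(result), round(elapsed_ms, 1), 'ms')
```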

@zephyr111
Author

zephyr111 commented Nov 29, 2022

Does the regression go away if you put the code in a function?

Indeed, this is significantly faster when the code is put in a function: 90 ms in CPython 3.11.0 versus 140 ms in CPython 3.10.8. Thank you for this information. So the problem only affects global variables. Solving it could still be useful for people working directly in an interactive interpreter console (like IPython).

@pablogsal
Member

pablogsal commented Nov 29, 2022

for _ in range(1000000):
    a += 'a'
end = time.time()

that code is quadratic, and the only reason it is not quadratic in practice is that we do some hacks in the eval loop to reuse the same string; because strings are immutable, it remains quadratic in nature. As I see it, there is not much reason to spend a lot of resources speeding up this particular code, especially if the regression only involves globals.

The correct idiom for this should be:

parts = []
for _ in range(1000000):
    parts.append('a')
a = "".join(parts)
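As a sanity check, the join idiom produces exactly the same string as the concatenation loop; a small sketch (the loop count here is illustrative):

```python
n = 100000  # illustrative count

# Linear-time build: collect the pieces, join once at the end.
parts = []
for _ in range(n):
    parts.append('a')
joined = ''.join(parts)

# Equivalent result via repeated concatenation
# (quadratic in the worst case, as discussed above).
concatenated = ''
for _ in range(n):
    concatenated += 'a'

print(joined == concatenated, len(joined))
```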

I am inclined to close this as "won't fix" unless someone feels strongly otherwise.

@eendebakpt
Contributor

eendebakpt commented Nov 30, 2022

Closing this issue is fine with me as well, but I was surprised to find that global/module-level code can be slower in 3.11 than local code. Is this behaviour documented somewhere? I could not find it in the 3.11 release notes or in https://github.com/python/cpython/blob/main/Python/adaptive.md.

@gvanrossum
Member

gvanrossum commented Dec 21, 2022

There are many situations where globals are slower than locals, due to the nature of locals. This is nothing new.

(Update: I initially wrote faster, should be slower.)

@gvanrossum gvanrossum closed this as not planned Won't fix, can't repro, duplicate, stale Dec 21, 2022