
Substantial Performance Regression of Dict operations in Python 3.12.0rc1 versus Python 3.11.4 #109049

Open
chrisgmorton opened this issue Sep 7, 2023 · 15 comments
Labels: 3.12 (bugs and security fixes), 3.13 (new features, bugs and security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), type-bug (An unexpected behavior, bug, or error)

Comments

@chrisgmorton

chrisgmorton commented Sep 7, 2023

Bug report

Bug description:

Comparing Python 3.11.4 with 3.12.0rc1, I see a substantial slowdown in dictionary operations (update, copy, items, get). Both builds are from source on Ubuntu 20.04 using gcc/g++ 9.4.0, with configure options --enable-shared and --enable-optimizations. The NumPy version in both cases is 1.25.2, linked against MKL 2023.2.0.

Profiling results for my application are as follows:
Python 3.12.0rc1:

Wed Sep  6 18:55:23 2023    profile.binfile

         27271926 function calls (26072328 primitive calls) in 20.862 seconds

   Ordered by: internal time
   List reduced from 520 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7045    4.000    0.001    5.031    0.001 digraph.py:469(add_nodes_from)
     7045    3.003    0.000    5.198    0.001 digraph.py:713(add_edges_from)
  7056013    1.952    0.000    2.093    0.000 {method 'update' of 'dict' objects}
     1765    1.304    0.001    1.351    0.001 core.py:7090(concatenate)
  1232480    0.752    0.000    1.121    0.000 reportviews.py:788(<genexpr>)
    91026    0.699    0.000    1.069    0.000 resolver.py:92(Define)
   203489    0.645    0.000    1.167    0.000 core.py:2952(_update_from)
    41601    0.568    0.000    0.581    0.000 device.py:157(Default)
    41601    0.479    0.000    0.917    0.000 mnaindexer.py:238(SetIndexing)
        2    0.405    0.203    0.469    0.234 mnaloader.py:95(SetLinearTerms)
  1495228    0.326    0.000    0.326    0.000 {method 'copy' of 'dict' objects}
    83202    0.320    0.000    0.558    0.000 device.py:40(SetSocket)
  1496872    0.311    0.000    0.311    0.000 {method 'items' of 'dict' objects}
  1626268    0.308    0.000    0.308    0.000 {method 'get' of 'dict' objects}

Python 3.11.4:

Wed Sep  6 18:54:04 2023    profile.binfile

         27569104 function calls (26369506 primitive calls) in 16.836 seconds

   Ordered by: internal time
   List reduced from 541 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7045    4.409    0.001    4.802    0.001 digraph.py:469(add_nodes_from)
     7045    1.576    0.000    2.972    0.000 digraph.py:713(add_edges_from)
     1765    1.337    0.001    1.380    0.001 core.py:7090(concatenate)
  1232480    0.767    0.000    0.961    0.000 reportviews.py:788(<genexpr>)
  7056013    0.765    0.000    0.881    0.000 {method 'update' of 'dict' objects}
    91026    0.520    0.000    0.787    0.000 resolver.py:92(Define)
   203489    0.486    0.000    0.721    0.000 core.py:2952(_update_from)
    41601    0.464    0.000    0.464    0.000 device.py:37(<dictcomp>)
    41601    0.369    0.000    0.704    0.000 mnaindexer.py:238(SetIndexing)
        2    0.331    0.165    0.413    0.206 mnaloader.py:95(SetLinearTerms)
    83202    0.312    0.000    0.402    0.000 device.py:40(SetSocket)
    93800    0.270    0.000    1.559    0.000 core.py:3217(__getitem__)
    43368    0.206    0.000    1.698    0.000 resolver.py:233(Connect)
1332528/332816    0.186    0.000    0.187    0.000 cell.py:20(GetDevices)
   267312    0.182    0.000    0.182    0.000 {built-in method numpy.array}
   104392    0.178    0.000    0.722    0.000 core.py:2978(__array_finalize__)
        1    0.148    0.148    0.921    0.921 mnamethod.py:276(SetLinearDeviceRulesIndexing)
   351240    0.129    0.000    0.839    0.000 {function MaskedArray.view at 0x7fc262658540}
  1227200    0.112    0.000    0.112    0.000 reportviews.py:774(<lambda>)
   1765/1    0.110    0.000   12.163   12.163 resolver.py:350(Build)
  1663235    0.109    0.000    0.109    0.000 {built-in method builtins.getattr}
    41601    0.105    0.000    0.162    0.000 mnaindexer.py:268(SetMatrixConstructionIndexing)
  1496872    0.101    0.000    0.101    0.000 {method 'items' of 'dict' objects}
  1643874    0.099    0.000    0.099    0.000 {method 'get' of 'dict' objects}
  5292/12    0.098    0.000    0.743    0.062 cell.py:275(SetParameters)
  1495228    0.098    0.000    0.098    0.000 {method 'copy' of 'dict' objects}

The slowdowns in the networkx DiGraph methods add_edges_from() (in particular) and add_nodes_from() are likely caused by the performance degradation of the Python dictionary methods.

CPython versions tested on:

3.12

Operating systems tested on:

Linux

@chrisgmorton chrisgmorton added the type-bug An unexpected behavior, bug, or error label Sep 7, 2023
@gaogaotiantian
Contributor

gaogaotiantian commented Sep 7, 2023

Sorry but I don't think this proves anything.

There could be a LOT of reasons why a complicated function slows down on a different interpreter. If this is indeed a performance regression in dict, you should be able to reproduce the result with ONLY dictionary operations. As it stands, the issue is speculation.

In fact, cProfile itself changed between 3.11 and 3.12, and it's a deterministic profiler, which can introduce significant overhead if you have many small function calls.

Like I said, if you believe this is a dictionary performance regression, you should be able to isolate the dict operation, do it in a loop, and stop-watch it - it should show the regression. Otherwise, I don't think this is worth investigating from CPython's perspective.
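A minimal sketch of that kind of isolated measurement, using timeit rather than manual wall-clock calls (the loop count and key scheme here are illustrative assumptions, not prescribed by the thread):

import timeit

# Isolate dict.update() in a tight loop. The setup runs once; the statement
# runs `number` times in the same namespace, so `d` and `i` persist.
elapsed = timeit.timeit(
    "d.update({str(i): i}); i += 1",
    setup="d = {}; i = 0",
    number=1_000_000,
)
print(f"dict.update(): {elapsed:.3f}s")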

@hugovk hugovk added performance Performance or resource usage 3.12 bugs and security fixes 3.13 new features, bugs and security fixes labels Sep 7, 2023
@chrisgmorton
Author

chrisgmorton commented Sep 10, 2023

import time
from array import array
import sys

print("\nPython", sys.version.replace("\n",""), "\n")

a = array('d')
start = time.time()
for i in range(100000000): a.append(i)
end = time.time()

print('\tarray.append(): {0:3.1f}s'.format(end-start))

l = []
start = time.time()
for i in range(200000000): l.append(i)
end = time.time()

print('\tlist.append():  {0:3.1f}s'.format(end-start))

d = {}
start = time.time()
for i in range(25000000): d.update({str(i):i})
end = time.time()

print('\tdict.update():  {0:3.1f}s'.format(end-start))

s = set()
start = time.time()
for i in range(160000000): s.add(i)
end = time.time()

print('\tset.add():      {0:3.1f}s'.format(end-start))

Five manual, consecutive runs with Python 3.11:

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 9.4s
        list.append():  10.0s
        dict.update():  11.8s
        set.add():      9.2s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 10.6s
        list.append():  9.4s
        dict.update():  12.2s
        set.add():      10.2s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 10.4s
        list.append():  9.0s
        dict.update():  11.9s
        set.add():      10.0s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 9.7s
        list.append():  8.7s
        dict.update():  12.0s
        set.add():      9.2s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 9.5s
        list.append():  9.4s
        dict.update():  12.2s
        set.add():      9.6s

Five manual, consecutive runs with Python 3.12:

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 11.5s
        list.append():  10.4s
        dict.update():  13.1s
        set.add():      10.3s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 10.9s
        list.append():  11.9s
        dict.update():  12.7s
        set.add():      11.1s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 11.7s
        list.append():  10.8s
        dict.update():  13.3s
        set.add():      10.5s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 11.4s
        list.append():  11.2s
        dict.update():  12.9s
        set.add():      10.7s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 10.8s
        list.append():  9.6s
        dict.update():  12.8s
        set.add():      11.2s

I would say the results are statistically significant, and the slowdown in Python 3.12 versus Python 3.11 is evident here. It seems plausible that the root cause is related to memory management rather than being specific to the dict built-in, as I had originally thought: this test case consumes 20GB+ of memory.
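A quick way to confirm the footprint claim on Linux (a sketch; note that ru_maxrss is reported in kilobytes on Linux, in bytes on macOS):

import resource

# Peak resident set size of the current process so far; Linux reports KiB.
peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak RSS: {0:.1f} GiB'.format(peak_kib / (1024 * 1024)))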

@chrisgmorton
Author

chrisgmorton commented Sep 10, 2023

I should check the performance of the other dict methods I referenced in the original submission, as there may be multiple issues here. I'll wait for feedback on these tests first.

@gaogaotiantian
Contributor

Yes, this is an observable slowdown. What's the result when the loop is smaller? Or when the memory consumption is smaller (same loop count, but without accumulating that much memory)?

Accumulating very small pieces of memory into a very large structure is not a common pattern (not saying that the slowdown means nothing). It could be a tradeoff that makes common patterns faster.

It could be a memory-management-related cause, but it could also be something else. Not sure who the expert in this area is, but it would be nice to have more information (the tests mentioned above) so we can narrow down the cause.
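One way to realize the "same loop count, no accumulation" variant of the dict test (a sketch; reusing a single key is an assumption about what's being asked for):

import time

# Same number of dict.update() calls as the original test, but the dict
# never grows because the same key is overwritten on every iteration.
d = {}
start = time.time()
for i in range(25000000):
    d.update({"k": i})
end = time.time()
print('\tdict.update() (no growth): {0:3.1f}s'.format(end - start))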

@chrisgmorton
Author

chrisgmorton commented Sep 11, 2023

Here's a quick response for the case where the loop is 100 times smaller for each datatype (again, a slowdown is observed):

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 95.3ms
        list.append():  77.9ms
        dict.update():  74.0ms
        set.add():      99.2ms

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 97.6ms
        list.append():  82.8ms
        dict.update():  83.3ms
        set.add():      105.2ms

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 112.2ms
        list.append():  86.6ms
        dict.update():  75.9ms
        set.add():      104.6ms

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 95.3ms
        list.append():  82.1ms
        dict.update():  72.0ms
        set.add():      99.8ms

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 98.5ms
        list.append():  79.1ms
        dict.update():  74.8ms
        set.add():      108.0ms

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 112.6ms
        list.append():  93.8ms
        dict.update():  82.1ms
        set.add():      112.5ms

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 112.6ms
        list.append():  105.6ms
        dict.update():  82.7ms
        set.add():      115.7ms

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 109.3ms
        list.append():  90.1ms
        dict.update():  83.5ms
        set.add():      117.1ms

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 108.1ms
        list.append():  90.6ms
        dict.update():  80.6ms
        set.add():      106.5ms

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 110.5ms
        list.append():  91.9ms
        dict.update():  84.2ms
        set.add():      110.8ms

@chrisgmorton
Author

chrisgmorton commented Sep 11, 2023

Here's the original test case but with immediate removal of each data entry inside the loop, either with pop() or del d[key]. In this case there is no growth in memory usage (as expected). It's unclear to me that this tells us much more; the previous tests suggest we simply have a per-operation slowdown.
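The modified loop isn't shown above; a minimal reconstruction of the dict case, assuming this is what's meant:

import time

# Insert and immediately delete each entry, so the dict stays empty and
# memory usage stays flat while the call count matches the original test.
d = {}
start = time.time()
for i in range(25000000):
    key = str(i)
    d.update({key: i})
    del d[key]  # or: d.pop(key)
end = time.time()
print('\tdict.update():  {0:3.1f}s'.format(end - start))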

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 12.3s
        list.append():  15.7s
        dict.update():   9.7s
        set.add():      10.7s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 12.9s
        list.append():  16.8s
        dict.update():   9.9s
        set.add():      12.1s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 12.3s
        list.append():  15.2s
        dict.update():   9.6s
        set.add():      11.9s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 12.9s
        list.append():  15.1s
        dict.update():   9.6s
        set.add():      12.1s

Python 3.11.4 (main, Sep 10 2023, 06:39:02) [GCC 9.4.0]

        array.append(): 12.3s
        list.append():  14.8s
        dict.update():   9.4s
        set.add():      12.5s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 14.9s
        list.append():  16.8s
        dict.update():  11.6s
        set.add():      12.2s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 15.4s
        list.append():  17.6s
        dict.update():  12.1s
        set.add():      13.3s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 14.3s
        list.append():  16.1s
        dict.update():  11.7s
        set.add():      14.1s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 16.2s
        list.append():  16.2s
        dict.update():  12.1s
        set.add():      12.9s

Python 3.12.0rc2 (main, Sep 10 2023, 07:51:16) [GCC 9.4.0]

        array.append(): 15.4s
        list.append():  19.3s
        dict.update():  11.7s
        set.add():      15.3s

@chrisgmorton
Author

The performance regression seems to occur between 3.12.0a7 and 3.12.0b1. I'm not sure it's a single commit; it may be an accumulation of a number of commits, possibly in the area of the sub-interpreter work.

@eendebakpt
Contributor

I can confirm there is some kind of regression. Performing this benchmark:

import pyperf

runner = pyperf.Runner()

setup = """
import copy

a = {'list': [1, 2, 3, 43], 't': (1, 2, 3), 'str': 'hello', 'subdict': {'a': True}}
"""

runner.timeit(name="deepcopy dict", stmt="b = copy.deepcopy(a)", setup=setup)

results in

Mean +- std dev: [3114] 7.18 us +- 0.09 us -> [main] 7.91 us +- 0.22 us: 1.10x slower

This is for comparison of 3.11.4 with current main.
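(For reference: output in this [label] -> [label] form comes from pyperf's compare_to mode. Assuming each run was saved with -o, e.g. python3.11 bench.py -o 3114.json and ./python bench.py -o main.json, a comparison like the above can be reproduced with python -m pyperf compare_to 3114.json main.json.)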

@eendebakpt
Contributor

From speed.python.org:

[Chart from speed.python.org: benchmark timings across recent CPython revisions]

The performance regression seems to occur around the merge of #19474, which is known to reduce performance a bit.

@chrisgmorton
Author

For the bigger picture, my application belongs to a class of scientific-computing applications that numerically solve coupled non-linear PDEs, making extensive use of NumPy, SciPy, H5Py, and sparse-matrix linear algebra.

The 3.11.4 and 3.12.0rc2 installations use identical Python module versions (NumPy, SciPy, H5Py/HDF5 built from source for each Python version).

These are my performance test results, benchmarked against Python 3.8:

Python 3.11.4:

 > Testing circperformance

Test 0: Time is 3.20771s, reference is 4.10, performance is 1.28x or 27.8% speed-up. PASS
Test 1: Time is 1.11444s, reference is 1.40, performance is 1.26x or 25.6% speed-up. PASS
Test 2: Time is 1.45542s, reference is 1.70, performance is 1.17x or 16.8% speed-up. PASS
Test 3: Time is 2.01886s, reference is 2.40, performance is 1.19x or 18.9% speed-up. PASS

100% pass: 4 Tests, failures[], skipped[]

Python 3.12.0rc2:

 > Testing circperformance

Test 0: Time is 3.78364s, reference is 4.10, performance is 1.08x or 8.4% speed-up. PASS
Test 1: Time is 1.30262s, reference is 1.40, performance is 1.07x or 7.5% speed-up. PASS
Test 2: Time is 1.61533s, reference is 1.70, performance is 1.05x or 5.2% speed-up. PASS
Test 3: Time is 2.41445s, reference is 2.40, performance is 0.99x or 0.6% slow-down. FAIL

75% pass: 4 Tests, failures[3], skipped[]

Python 3.11.4:

 > Testing devperformance

Test 0: Time is 1.64307s, reference is 1.85, performance is 1.13x or 12.6% speed-up. PASS
Test 1: Time is 2.28619s, reference is 2.60, performance is 1.14x or 13.7% speed-up. PASS
Test 2: Time is 1.62358s, reference is 1.90, performance is 1.17x or 17.0% speed-up. PASS
Test 3: Time is 1.75078s, reference is 2.00, performance is 1.14x or 14.2% speed-up. PASS
Test 4: Time is 2.06646s, reference is 2.20, performance is 1.06x or 6.5% speed-up. PASS
Test 5: Time is 1.57825s, reference is 1.75, performance is 1.11x or 10.9% speed-up. PASS
Test 6: Time is 2.35982s, reference is 2.80, performance is 1.19x or 18.7% speed-up. PASS
Test 7: Time is 2.03836s, reference is 2.20, performance is 1.08x or 7.9% speed-up. PASS
Test 8: Time is 4.01285s, reference is 4.50, performance is 1.12x or 12.1% speed-up. PASS

100% pass: 9 Tests, failures[], skipped[]

Python 3.12.0rc2

> Testing devperformance

Test 0: Time is 1.85779s, reference is 1.85, performance is 1.00x or 0.4% slow-down. FAIL
Test 1: Time is 2.53469s, reference is 2.60, performance is 1.03x or 2.6% speed-up. PASS
Test 2: Time is 1.89236s, reference is 1.90, performance is 1.00x or 0.4% speed-up. PASS
Test 3: Time is 1.98677s, reference is 2.00, performance is 1.01x or 0.7% speed-up. PASS
Test 4: Time is 2.25508s, reference is 2.20, performance is 0.98x or 2.5% slow-down. FAIL
Test 5: Time is 1.77294s, reference is 1.75, performance is 0.99x or 1.3% slow-down. FAIL
Test 6: Time is 2.61213s, reference is 2.80, performance is 1.07x or 7.2% speed-up. PASS
Test 7: Time is 2.19335s, reference is 2.20, performance is 1.00x or 0.3% speed-up. PASS
Test 8: Time is 4.52551s, reference is 4.50, performance is 0.99x or 0.6% slow-down. FAIL

55% pass: 9 Tests, failures[0, 4, 5, 8], skipped[]

As expected, we see substantial performance improvements for Python 3.11 over Python 3.8. However, Python 3.12.0rc2 appears to give back much of those gains.

@eendebakpt
Contributor

eendebakpt commented Sep 22, 2023

I did some more tests with publicly available packages (sympy, lmfit, and lark) and compared Python 3.10, 3.11, and 3.12. The results are similar to the OP's: a significant performance improvement from 3.10 to 3.11, but a regression from 3.11 to 3.12 in two of the three test cases. For sympy the performance loss is about 10%.

3.10.8 (tags/v3.10.8-dirty:aaaf517424, May  9 2023, 23:40:53) [GCC 9.4.0]
sympy 0.35402853199775564
lmfit 0.3148488170008932
lark 0.3982844579986704
3.11.5 (main, Sep 22 2023, 16:56:56) [GCC 9.4.0]
sympy 0.2964389949993347
lmfit 0.281992337997508
lark 0.3138663429999724
3.12.0rc3 (main, Sep 22 2023, 21:36:02) [GCC 9.4.0]
sympy 0.32691953900211956
lmfit 0.29783570200015674
lark 0.2835550689997035
Full code for benchmarks

Source was downloaded from python.org and compiled with ./configure --enable-optimizations. Required packages were installed with

python -m pip install sympy==1.12 numpy==1.26.0 scipy==1.11.2 lmfit==1.2.2 pyperformance==1.0.9 lark==1.1.7

Benchmark script

import sys
import time

import sympy
from lark import Lark, Transformer, v_args
from lmfit import Model
from numpy import exp, linspace, random

print(sys.version)


# %%


def gaussian(x, amp, cen, wid):
    return amp * exp(-(x-cen)**2 / wid)


def bench_sympy():
    x = sympy.symbols('x')

    for n in [1, 10, 20]:
        expr = x**n*sympy.exp(x)
        integral = sympy.Integral(expr, x)
        r = integral.doit()

        expr = sympy.log(sympy.Add(*[sympy.exp(i*x)
                         for i in range(n)])).diff(x)
        integral = sympy.Integral(expr, x)
        r = integral.doit()


# activate caches in sympy
bench_sympy()

t0 = time.perf_counter()
bench_sympy()
dt = time.perf_counter()-t0
print(f'sympy {dt}')

# %%


def bench_lmfit():

    x = linspace(-10, 10, 101)
    y = gaussian(x, 2.33, 0.21, 1.51) + random.normal(0, 0.2, x.size)

    gmodel = Model(gaussian)
    params = gmodel.make_params(cen=0.3, amp=3, wid=1.25)
    result = gmodel.fit(y, params, x=x)

    gmodel.set_param_hint('amp', min=2.34)
    result = gmodel.fit(y, params, x=x)


t0 = time.perf_counter()
for ii in range(40):
    bench_lmfit()
dt = time.perf_counter()-t0
print(f'lmfit {dt}')


# %%


json_grammar = r"""
    ?start: value

    ?value: object
          | array
          | string
          | SIGNED_NUMBER      -> number
          | "true"             -> true
          | "false"            -> false
          | "null"             -> null

    array  : "[" [value ("," value)*] "]"
    object : "{" [pair ("," pair)*] "}"
    pair   : string ":" value

    string : ESCAPED_STRING

    %import common.ESCAPED_STRING
    %import common.SIGNED_NUMBER
    %import common.WS

    %ignore WS
"""


class TreeToJson(Transformer):
    @v_args(inline=True)
    def string(self, s):
        return s[1:-1].replace('\\"', '"')

    array = list
    pair = tuple
    object = dict
    number = v_args(inline=True)(float)

    def null(self, _): return None
    def true(self, _): return True
    def false(self, _): return False


def bench_lark():
    json_parser = Lark(json_grammar, parser='lalr',
                       lexer='basic',
                       propagate_positions=False,
                       maybe_placeholders=False,
                       transformer=TreeToJson())
    parse = json_parser.parse

    test_json = '''
        {
            "empty_object" : {},
            "empty_array"  : [],
            "booleans"     : { "YES" : true, "NO" : false },
            "numbers"      : [ 0, 1, -2, 3.3, 4.4e5, 6.6e-7 ],
            "strings"      : [ "This", [ "And" , "That", "And a \\"b" ] ],
            "nothing"      : null
        }
    '''

    j = parse(test_json)


t0 = time.perf_counter()
for ii in range(20):
    bench_lark()
dt = time.perf_counter()-t0
print(f'lark {dt}')

Update: I benchmarked sympy for ea2c001 (the immortal instances PR) and 916de04 (the commit before).
The performance regression is 5% (averaged over 10 runs: base 0.29103303970023264, immortal instances 0.30650054839989027). That means the other ~5% of the regression is due to other commits between 3.11 and 3.12.

@nijel
Contributor

nijel commented Oct 12, 2023

I found this issue while noticing that our CI tests occasionally time out on GitHub with Python 3.12 while they worked fine on 3.11. We saw performance gains over several Python releases, but 3.12 seems to go back to the performance of 3.9 (at least CI-wise).

The slowdown can be seen on other projects as well. For example, for the above-mentioned sympy, some parts of the test suite (they run it split into 4 parts) are 20% slower on Python 3.12 compared to 3.11, and Django tests seem about 10% slower. I know that CI timing can vary a lot, but the regression seems to happen across projects, and this issue shows some synthetic benchmarks as well.

Links to CI runs

@Voultapher

Voultapher commented Oct 27, 2023

The user sahnehaeubchen linked this issue to PEP 683 in this discussion; so far that possible connection hasn't been discussed here. I don't know what they are basing the claim on, but it seems plausible to me that an additional branch on every object refcount check could have wide-ranging impacts. The PEP claims:

A naive implementation of the approach described below makes CPython roughly 4% slower. However, the implementation is performance-neutral once known mitigations are applied.

I could imagine that the benchmarks performed might not account for branch misprediction due to BTB misses, for generally larger code sizes that no longer fit into the L1 i-cache and L0/L1 BTB, and for other cache effects of larger programs that increase branch-prediction cost.
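A quick way to observe the PEP 683 change from Python itself (a sketch; the exact saturated value is an implementation detail and may differ between builds):

import sys

# Immortal objects (PEP 683, new in 3.12) report a fixed, saturated refcount;
# on 3.11 the same calls return small counts that change as references are taken.
print(sys.version)
print(sys.getrefcount(None))  # 3.12: a large constant such as 4294967295
print(sys.getrefcount(42))    # small ints are immortal in 3.12 as well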

@iritkatriel iritkatriel added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 28, 2023
@ericsnowcurrently
Member

CC @eduardo-elizondo

@eduardo-elizondo
Contributor

Hey, just catching up here! Thanks for the ping @ericsnowcurrently. Following on the above, there is a link between the regression and the implementation of PEP 683. There are benchmarks that perform better and benchmarks that perform worse; for the full list, take a look here, which shows a ~1.02x regression on the geometric mean.

One thing to add, though: the build here uses gcc/g++ 9.4, which I empirically found to perform slightly worse with PEP 683. Instead, I'd recommend trying this out again with GCC 11.4 (or LLVM 15+), which seems to fare much better with PEP 683 than GCC 9.4.

Sorry for the delay in replies and happy to answer any follow-up questions 🙂
