Severe performance degradation for tracing under 3.11 #93516

Open · nedbat opened this issue Jun 5, 2022 · 37 comments
Labels: 3.11, release-blocker, type-bug

Comments

@nedbat (Member) commented Jun 5, 2022

Bug report

Coverage.py is seeing a significant increase in overhead for tracing code in 3.11 compared to 3.10: nedbat/coveragepy#1287

As an example:

| cov | proj | python3.10 | python3.11 | 3.11 vs 3.10 |
|-----|------|------------|------------|--------------|
| none | bug1339.py | 0.184 s | 0.142 s | 76% |
| none | bm_sudoku.py | 10.789 s | 9.901 s | 91% |
| none | bm_spectral_norm.py | 14.305 s | 9.185 s | 64% |
| 6.4.1 | bug1339.py | 0.450 s | 0.854 s | 189% |
| 6.4.1 | bm_sudoku.py | 27.141 s | 55.504 s | 204% |
| 6.4.1 | bm_spectral_norm.py | 36.793 s | 67.970 s | 184% |

(This is the output of lab/benchmark.py.)

Your environment

  • CPython versions tested on: 3.10, 3.11
  • Operating system and architecture: MacOS, Intel
@nedbat added the type-bug label Jun 5, 2022

@nedbat (Member, Author) commented Jun 5, 2022

@pablogsal @markshannon

@Fidget-Spinner (Member) commented Jun 5, 2022

Related issue: #93381, where a user reported that code run under cProfile slowed down 1.6x, and the only thing I could pinpoint was that tracing itself slowed down, not cProfile.

(Two comments from @sweeneyde were marked as off-topic.)

@sweeneyde (Member) commented Jun 6, 2022

I now realize this is probably due to using the Python tracer rather than the C tracer, will try again.

@sweeneyde (Member) commented Jun 6, 2022

Okay, now with Code coverage for Python, version 6.4.1 with C extension.

Script used:

```python
import coverage
from time import perf_counter

def fib(n):
    if n <= 1:
        return n
    else:
        return fib(n-1) + fib(n-2)

t0 = perf_counter()
cov = coverage.Coverage()
cov.start()
fib(35)
cov.stop()
cov.save()
t1 = perf_counter()
print(t1 - t0)
```

Profile on 3.10 branch: took 13.3 seconds

Functions taking more than 1% of CPU:

| Function Name | Total CPU [unit, %] | Self CPU [unit, %] | Module |
|---|---|---|---|
| _PyEval_EvalFrameDefault | 12372 (99.78%) | 3540 (28.55%) | python310.dll |
| [External Call] tracer.cp310-win_amd64.pyd | 2769 (22.33%) | 1474 (11.89%) | tracer.cp310-win_amd64.pyd |
| maybe_call_line_trace | 3872 (31.23%) | 1206 (9.73%) | python310.dll |
| _PyCode_CheckLineNumber | 1168 (9.42%) | 988 (7.97%) | python310.dll |
| call_trace | 3431 (27.67%) | 548 (4.42%) | python310.dll |
| PyLong_FromLong | 370 (2.98%) | 369 (2.98%) | python310.dll |
| _PyFrame_New_NoTrack | 385 (3.11%) | 338 (2.73%) | python310.dll |
| lookdict_unicode_nodummy | 324 (2.61%) | 324 (2.61%) | python310.dll |
| call_trace_protected | 2107 (16.99%) | 287 (2.31%) | python310.dll |
| frame_dealloc | 270 (2.18%) | 270 (2.18%) | python310.dll |
| set_add_entry | 264 (2.13%) | 264 (2.13%) | python310.dll |
| PyDict_GetItem | 579 (4.67%) | 255 (2.06%) | python310.dll |
| _PyEval_MakeFrameVector | 633 (5.11%) | 248 (2.00%) | python310.dll |
| _PyEval_Vector | 12372 (99.78%) | 196 (1.58%) | python310.dll |
| PyLineTable_NextAddressRange | 180 (1.45%) | 180 (1.45%) | python310.dll |
| call_function | 12372 (99.78%) | 169 (1.36%) | python310.dll |
| PyObject_RichCompare | 457 (3.69%) | 169 (1.36%) | python310.dll |
| binary_op1 | 486 (3.92%) | 157 (1.27%) | python310.dll |
| long_richcompare | 207 (1.67%) | 157 (1.27%) | python310.dll |

Profile on main branch: took 23.0 seconds

Functions taking more than 1% of CPU:

| Function Name | Total CPU [unit, %] | Self CPU [unit, %] | Module |
|---|---|---|---|
| _PyEval_EvalFrameDefault | 22180 (99.87%) | 4448 (20.03%) | python312.dll |
| _PyCode_CheckLineNumber | 5389 (24.27%) | 2583 (11.63%) | python312.dll |
| maybe_call_line_trace | 6461 (29.09%) | 1966 (8.85%) | python312.dll |
| retreat | 2242 (10.10%) | 1644 (7.40%) | python312.dll |
| [External Call] tracer.cp312-win_amd64.pyd | 6274 (28.25%) | 1428 (6.43%) | tracer.cp312-win_amd64.pyd |
| get_line_delta | 1162 (5.23%) | 1162 (5.23%) | python312.dll |
| unicodekeys_lookup_unicode | 844 (3.80%) | 728 (3.28%) | python312.dll |
| _Py_dict_lookup | 1476 (6.65%) | 632 (2.85%) | python312.dll |
| call_trace | 9802 (44.14%) | 630 (2.84%) | python312.dll |
| siphash13 | 442 (1.99%) | 442 (1.99%) | python312.dll |
| PyUnicode_New | 745 (3.35%) | 376 (1.69%) | python312.dll |
| pymalloc_alloc | 362 (1.63%) | 362 (1.63%) | python312.dll |
| _PyType_Lookup | 1786 (8.04%) | 319 (1.44%) | python312.dll |
| set_add_entry | 271 (1.22%) | 271 (1.22%) | python312.dll |
| PyDict_GetItem | 824 (3.71%) | 242 (1.09%) | python312.dll |
| call_trace_protected | 8562 (38.55%) | 236 (1.06%) | python312.dll |
| PyNumber_Subtract | 449 (2.02%) | 221 (1.00%) | python312.dll |

@sweeneyde (Member) commented Jun 6, 2022

This seems to suggest to me that this could be made much faster via a dedicated C API function:

https://github.com/nedbat/coveragepy/blob/master/coverage/ctracer/util.h#L41

```c
#define MyCode_GetCode(co)      (PyObject_GetAttrString((PyObject *)(co), "co_code"))
```

@sweeneyde (Member) commented Jun 6, 2022

Indeed, on main, if I add `printf("lookup %s\n", (const char *)PyUnicode_DATA(key));` to the top of `unicodekeys_lookup_unicode` and run that fib script, I get the following hot loop repeated over and over:

```
lookup fib
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup co_code
lookup fib
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup co_code
...
```

Each of these calls to PyObject_GetAttrString(..., "co_code") requires PyUnicode_FromString->unicode_decode_utf8->PyUnicode_New->_PyObject_Malloc, followed by PyObject_GetAttr->_PyObject_GenericGetAttrWithDict->_PyType_Lookup->find_name_in_mro->(unicode_hash+_Py_Dict_Lookup)->unicodekeys_lookup_unicode, where it should be just a C API function call.
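
For illustration only (not what coverage.py does today), on versions without a direct accessor the string-creation part of that chain could be avoided by caching an interned key and calling PyObject_GetAttr; the helper name below is hypothetical, and the generic attribute lookup itself still remains:

```c
/* Sketch: cache an interned "co_code" key so repeated lookups skip
 * PyUnicode_FromString/unicode_decode_utf8 on every call. */
static PyObject *
my_get_co_code(PyCodeObject *co)
{
    static PyObject *co_code_key = NULL;
    if (co_code_key == NULL) {
        co_code_key = PyUnicode_InternFromString("co_code");
        if (co_code_key == NULL) {
            return NULL;
        }
    }
    return PyObject_GetAttr((PyObject *)co, co_code_key);  /* new reference */
}
```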

@sweeneyde (Member) commented Jun 6, 2022

Ah! PyCode_GetCode exists since #92168
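
So, assuming PyCode_GetCode() is available from 3.11 on (per the linked change), the util.h macro could branch on the version and call the accessor directly; this is only a sketch, not necessarily how coverage.py ends up fixing it:

```c
/* Sketch: prefer the direct accessor where it exists, keep the attribute
 * lookup as a fallback on older versions. Both forms return a new
 * reference to the bytes object. */
#if PY_VERSION_HEX >= 0x030B0000
#define MyCode_GetCode(co)      (PyCode_GetCode((PyCodeObject *)(co)))
#else
#define MyCode_GetCode(co)      (PyObject_GetAttrString((PyObject *)(co), "co_code"))
#endif
```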

@markshannon (Member) commented Jun 6, 2022

Some performance degradation under tracing is expected, but not as much as reported.
This is a deliberate tradeoff: faster execution when not tracing, for slower execution when tracing.

From the profiles that @sweeneyde gathered it looks like we are seeing slowdowns in:

  • _PyEval_EvalFrameDefault, which is expected
  • Calculation of line numbers. This is probably due to the new line table format.

The new line number table gives better error messages without taking up too much extra space, but it is more complex and thus slower to parse.
We should be able to speed it up, but don't expect things to be as fast as 3.10. Sorry.

Once nedbat/coveragepy#1394 is merged, we can look at the profile again to see if much has changed.

@sweeneyde (Member) commented Jun 6, 2022

New profile of fib_cover.py on main branch, with the PyCode_GetCode addition (took 18.1 seconds):

| Function Name | Total CPU [unit, %] | Self CPU [unit, %] | Module |
|---|---|---|---|
| _PyEval_EvalFrameDefault | 19239 (99.66%) | 4481 (23.21%) | python312.dll |
| _PyCode_CheckLineNumber | 5344 (27.68%) | 2676 (13.86%) | python312.dll |
| maybe_call_line_trace | 6397 (33.14%) | 1997 (10.34%) | python312.dll |
| retreat | 2197 (11.38%) | 1629 (8.44%) | python312.dll |
| [External Call] tracer.cp312-win_amd64.pyd | 3073 (15.92%) | 1360 (7.04%) | tracer.cp312-win_amd64.pyd |
| get_line_delta | 1038 (5.38%) | 1038 (5.38%) | python312.dll |
| call_trace | 6630 (34.34%) | 605 (3.13%) | python312.dll |
| unicodekeys_lookup_unicode | 549 (2.84%) | 548 (2.84%) | python312.dll |
| _Py_dict_lookup | 1060 (5.49%) | 511 (2.65%) | python312.dll |
| PyDict_GetItem | 929 (4.81%) | 266 (1.38%) | python312.dll |
| set_add_entry | 265 (1.37%) | 265 (1.37%) | python312.dll |
| PyNumber_Subtract | 447 (2.32%) | 226 (1.17%) | python312.dll |
| call_trace_protected | 5408 (28.01%) | 222 (1.15%) | python312.dll |
| initialize_locals | 199 (1.03%) | 199 (1.03%) | python312.dll |

I wonder: would there be any way to retain some specializations during tracing? Some specialized opcodes are certainly incorrect to use during tracing, e.g., STORE_FAST__LOAD_FAST. However, couldn't others be retained, e.g. LOAD_GLOBAL_MODULE, or any specialized opcode that covers only one unspecialized opcode? (EDIT: maybe not, if it uses NOTRACE_DISPATCH())

@pablogsal (Member) commented Jun 6, 2022

We should be able to speed it up, but don't expect things to be as fast as 3.10. Sorry.

Well, two times slower than 3.10, as some of @nedbat's benchmarks show, is not acceptable in my opinion. We should try to get this below 20% at the very least.

I marked this as a release blocker and this will block any future releases.

@markshannon (Member) commented Jun 7, 2022

The problem with saying that something is unacceptable is that it implies there is an acceptable alternative.
What is that alternative?

What does it mean to make this a release blocker? Are you not going to release 3.11 if coverage is >20% slower?

I don't think cutoffs, such as 20%, are constructive. We should do as well as we are reasonably able, regardless of whether we achieve some arbitrary number.
I expect that will be better than a 20% slowdown, but I'm not making any pronouncements.

@pablogsal (Member) commented Jun 7, 2022

What is that alternative?

I don't know, but there is always the possibility to revert the changes that made coverage slower (worst case scenario).

Are you not going to release 3.11 if coverage is >20% slower?

I will check with the SC, but I can already tell you that I would not be comfortable releasing 3.11 if it is two times slower in coverage, unless the SC mandates otherwise.

I don't think cutoffs, such as 20%, are constructive. We should do as well as we are reasonably able, regardless of whether we achieve some arbitrary number.

The problem here is that this tradeoff was not discussed anywhere or advertised, so no one had the chance to weigh in or propose different tradeoffs, and therefore we cannot just "decide" out of the blue what is and what is not acceptable, because the community didn't have the chance to object.

@markshannon (Member) commented Jun 7, 2022

The largest contributor to the slowdown is computing line numbers. The root cause of that is PEP 657. The proximate cause is my PR to reduce the memory consumed by PEP 657.
Would you revert PEP 657, or just accept a large increase in memory use?

I think it would be a lot more helpful if we fixed the issue, rather than making dire pronouncements.

the community didn't have the chance to object.

Everything we do is public. I even opened nedbat/coveragepy#1287 back in November, so that @nedbat was aware that there might be slowdowns to coverage.py

@pablogsal (Member) commented Jun 7, 2022

The largest contributor to the slowdown is computing line numbers. The root cause of that is PEP 657.

That's not true, and I feel it is misleading. When line numbers and column offsets were separated, PEP 657 had 0 overhead on computing line numbers, so the root cause is certainly not PEP 657.

The root cause was merging line numbers and column offsets so you need to pay for decoding both just to get line numbers.

Everything we do is public. I even opened nedbat/coveragepy#1287 back in November, so that @nedbat was aware that there might be slowdowns to coverage.py

That's not enough. If that were true, we would not need PEPs at all, because "what we do is already public". The point of having discussions on mailing lists is precisely so that everyone can be aware of the tradeoffs and give their opinions.

@pablogsal (Member) commented Jun 7, 2022

In any case, there is no point in discussing in these terms, because I am not the one making decisions here. As RM, my only authority is to consider the current slowdown a blocker and to ask for a bigger discussion if we cannot fix the issue.

Let's see what we can do and then let's discuss how to proceed once we understand our options.

@markshannon (Member) commented Jun 7, 2022

The root cause was merging line numbers and column offsets so you need to pay for decoding both just to get line numbers.

No, that was the proximate cause. It wouldn't have been necessary if not for PEP 657.
I didn't implement it to make things slower, I implemented it because column and endline tables were consuming a lot of memory.

Claiming that PEP 657 had zero overhead is not true. The original implementation had a large overhead in terms of memory use.

@pablogsal (Member) commented Jun 7, 2022

I didn't implement it to make things slower, I implemented it because column and endline tables were consuming a lot of memory.

Mark, I understand that and we are on the same page. I know why you made that change, and IIRC I even reviewed it. The problem here is that that change (or other changes) had unexpected consequences on timing, and we need to make decisions together: either fixing it if we can, or potentially reverting it if we cannot. That's all I am saying.

Claiming that PEP 657 had zero overhead is not true. The original implementation had a large overhead in terms of memory use.

The claim was that PEP 657 had zero overhead on the time spent computing line numbers, not on memory. My sentence was:

PEP 657 had 0 overhead on computing line numbers, so the root cause is certainly not PEP 657.

@gpshead (Member) commented Jun 8, 2022

Marking this as a release or deferred blocker is fine and entirely what I'd expect any RM to want to do. That just means we collect more information and make a decision before we exit beta stages.

I don't personally have a problem with coverage being slower, even by 2x. It's more about understanding what is impacted by this and how, so we can explain our measured rationale to users; i.e., look beyond just coverage, which should only be a developer/CI-time tool rather than something slowing down people's "production" release tools, applications, or notebook code.

Q1: What else uses tracing?

Q1b: What of those use it at other than development time?

These questions are informational and do not block performance work on reducing the impact to tracing.

@sweeneyde (Member) commented Jun 8, 2022

How bad would it be to store an extra field into code objects that is the "decompressed" line number table, lazily allocated for only those code objects which are actually traced? Could it occupy co_extra or similar?

@Fidget-Spinner (Member) commented Jun 8, 2022

How bad would it be to store an extra field into code objects that is the "decompressed" line number table, lazily allocated for only those code objects which are actually traced? Could it occupy co_extra or similar?

In #93383 (comment) I benchmarked adding an additional field to the code object and there was no slowdown in pyperformance. I do not know how memory consumption is affected.

I'm not sure co_extra is the right place for that info, considering it's for consumers of the PEP 523 API. Maybe I can merge your idea into a simple lazily-created `_PyCodeTracingInfo *` C struct?

```c
PyCodeObject {
    ...
    void *co_tracing_info;
};

struct _PyCodeTracingInfo {
    // cached code bytes object
    PyObject *co_code;
    // decompressed line number table
    ...
};
```
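
To make the lazy-allocation idea concrete, here is a self-contained sketch of the pattern (all names are illustrative, not actual CPython internals); only code objects that are actually traced would ever pay for the extra allocation:

```c
#include <Python.h>

/* Illustrative struct mirroring the sketch above. */
typedef struct {
    PyObject *co_code;   /* cached bytes object */
    int      *lines;     /* decompressed instruction-index -> line-number table */
} TracingInfo;

/* Hypothetical accessor: `slot` stands in for a lazily-filled pointer field
 * on the code object (e.g. the proposed co_tracing_info). */
static TracingInfo *
get_tracing_info(PyCodeObject *co, void **slot)
{
    TracingInfo *info = *slot;
    if (info == NULL) {
        info = PyMem_Calloc(1, sizeof(*info));
        if (info == NULL) {
            PyErr_NoMemory();
            return NULL;
        }
        info->co_code = PyCode_GetCode(co);   /* cache once; new reference */
        if (info->co_code == NULL) {
            PyMem_Free(info);
            return NULL;
        }
        /* ...decompress the line table into info->lines here... */
        *slot = info;
    }
    return info;
}
```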

@markshannon (Member) commented Jun 8, 2022

How bad would it be to store an extra field into code objects that is the "decompressed" line number table, lazily allocated for only those code objects which are actually traced? Could it occupy co_extra or similar?

An early design for the compressed line number table did just this.
My concern, then and now, is that the uncompressed table is a lot bigger, and never goes away.

It isn't just tracing that needs line numbers, tracebacks need them too.
Which means that the uncompressed table would be created every time a code object ends up in a traceback.

I think we should see how much faster we can make tracing with the compressed table first. If that isn't good enough, then we can reconsider the decompressed table.

@markshannon (Member) commented Jun 8, 2022

@gpshead

Q1: What else uses tracing?

Debuggers and profiling tools, like cProfile.

Q1b: What of those use it at other than development time?

Hopefully, no one. Tracing has always imposed a considerable overhead.
Anyone using a profiler in production should be using a statistical profiler.
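
For context, the "tracing" under discussion is the sys.settrace hook: coverage.py and debuggers such as pdb register a callback through it, while cProfile uses the related sys.setprofile hook, which only sees call/return events. A minimal illustration of the settrace side:

```python
import sys

def tracer(frame, event, arg):
    # Fires for "call", "line", "return" and "exception" events; returning
    # the function keeps line events enabled for the frame being traced.
    if event == "line":
        print(f"{frame.f_code.co_filename}:{frame.f_lineno}")
    return tracer

sys.settrace(tracer)

def add(a, b):
    return a + b

add(1, 2)          # each executed line in add() is reported to tracer()
sys.settrace(None)
```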

@pablogsal (Member) commented Jun 8, 2022

I think we should see how much faster we can make tracing with the compressed table first. If that isn't good enough, then we can reconsider the decompressed table.

Maybe we could leave line numbers uncompressed and compress the rest?

Alternatively, we could leave just the line numbers uncompressed if and only if we detect a tracing or profiling function. This way the extra memory is only paid if someone is profiling or tracing.

@markshannon (Member) commented Jun 8, 2022

Maybe we could leave line numbers uncompressed and compress the rest?

That's going to be complicated. Too complicated for a beta release, IMO.

Alternatively, we could leave just the line numbers uncompressed if and only if we detect a tracing or profiling function. This way the extra memory is only paid if someone is profiling or tracing.

If we add a small/large flag, then the extra memory will be 2 bytes per instruction for files < 64k lines (basically all code), and 4 bytes per instruction for giant files.
I'll try it out.

@markshannon (Member) commented Jun 8, 2022

I've been experimenting with streamlining the advance and maybe_call_line_trace functions. I'm getting roughly a 10% speedup, but a flat table lookup will be way faster.

@markshannon (Member) commented Jun 8, 2022

Actually, if we store the line delta from co_firstline, many functions will only need 1 byte per code unit, which should be fine.
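
A rough sketch of the memory cost being estimated here, under the assumptions above (1 byte per code unit when every line delta from the first line fits in a byte, wider entries otherwise); this is only a back-of-the-envelope estimate, not the actual implementation:

```python
def flat_line_table_cost(code):
    # Each code unit (instruction) is 2 bytes of co_code.
    n_units = len(code.co_code) // 2
    deltas = [line - code.co_firstlineno
              for _, _, line in code.co_lines()   # PEP 626 API, 3.10+
              if line is not None]
    if all(0 <= d < 256 for d in deltas):
        width = 1          # small functions: 1 byte per code unit
    elif all(0 <= d < 65536 for d in deltas):
        width = 2          # files under ~64k lines
    else:
        width = 4          # giant files
    return n_units * width

print(flat_line_table_cost(flat_line_table_cost.__code__))
```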

@nedbat (Member, Author) commented Jun 12, 2022

I'm very interested in helping to test people's proposed fixes, but am losing track of what is being changed where. We have #93493, which I have tried. Are there other pull requests in flight to try?

@pablogsal (Member) commented Jun 12, 2022

I would like to see how much we gain back by reverting 944fffe (not that we are planning to do it, I just want to know how much of the problem is due to that).

I can prepare a PR for you to test later today, but in the interim you can try reverting it yourself (it doesn't revert cleanly, but the conflicts are mainly in automatically generated files, which you can regenerate by running make regen-all).

@Fidget-Spinner (Member) commented Jun 12, 2022

I would like to see how much we gain back by reverting 944fffe (not that we are planning to do it, just want to know how much the problem is due to that).

IIUC, Kumar benchmarked that commit here and got a 1.3x slowdown before and after.

python/pyperformance#213 (comment)

@pablogsal (Member) commented Jun 12, 2022

IIUC, Kumar benchmarked that commit here and got a 1.3x slowdown before and after.

That's in pyperformance, no? I meant to see how it performs in coverage. Sorry, I checked the link in more detail; it seems this is what I wanted to check 😓 Sorry for the confusion :)

@markshannon (Member) commented Jun 13, 2022

Looking at the output of perf (using an optimized, but no pgo, lto build) and the code, I see the following reasons for the slowdown:

  1. Increased overhead in _PyEval_EvalFrameDefault when determining whether the event is a "call", a possible "line", or no event.
  2. Increased overhead in _PyEval_EvalFrameDefault because we no longer specialize when tracing.
  3. Increased overhead in maybe_call_line_trace when finding the first traceable instruction.
  4. A large slowdown in determining the line number from the instruction index.

All but 2 can be fixed without any negative impact on performance when not tracing.
Fixing 4 will increase memory use when tracing, but will only need a tiny bit of extra memory when not tracing.

Specializing during tracing increases the overhead of dispatching when not tracing, resulting in a ~1% slowdown.

@kumaraditya303 (Contributor) commented Jun 13, 2022

FYI, I have added a coverage benchmark [1] to pyperformance to ease benchmarking coverage / tracing performance.

Footnotes

  1. python/pyperformance#213

@markshannon (Member) commented Jun 13, 2022

#93763

@markshannon (Member) commented Jun 13, 2022

#93769

@markshannon (Member) commented Jun 14, 2022

#93769 is merged. #93818 is the backport to 3.11
