Severe performance degradation for tracing under 3.11 #93516

Open · nedbat opened this issue Jun 5, 2022 · 37 comments
Labels: 3.11, release-blocker, type-bug

Comments

@nedbat (Member) commented Jun 5, 2022

Bug report

Coverage.py is seeing a significant increase in overhead for tracing code in 3.11 compared to 3.10: nedbat/coveragepy#1287

As an example:

| cov | proj | python3.10 | python3.11 | 3.11 vs 3.10 |
|-----|------|------------|------------|--------------|
| none | bug1339.py | 0.184 s | 0.142 s | 76% |
| none | bm_sudoku.py | 10.789 s | 9.901 s | 91% |
| none | bm_spectral_norm.py | 14.305 s | 9.185 s | 64% |
| 6.4.1 | bug1339.py | 0.450 s | 0.854 s | 189% |
| 6.4.1 | bm_sudoku.py | 27.141 s | 55.504 s | 204% |
| 6.4.1 | bm_spectral_norm.py | 36.793 s | 67.970 s | 184% |

(This is the output of lab/benchmark.py.)

Your environment

  • CPython versions tested on: 3.10, 3.11
  • Operating system and architecture: MacOS, Intel
@nedbat added the type-bug label Jun 5, 2022

@nedbat (Member, Author) commented Jun 5, 2022

@pablogsal @markshannon

@Fidget-Spinner (Member) commented Jun 5, 2022

Related issue: #93381, where a user reported that code run under cProfile slowed down 1.6x, and the only thing I could pinpoint was that tracing itself slowed down, not cProfile.

(Two comments from @sweeneyde were marked as off-topic.)

@sweeneyde (Member) commented Jun 6, 2022

I now realize this is probably due to using the Python tracer rather than the C tracer, will try again.

@sweeneyde (Member) commented Jun 6, 2022

Okay, now with Code coverage for Python, version 6.4.1 with C extension.

Script used:

```python
import coverage
from time import perf_counter

def fib(n):
    if n <= 1:
        return n
    else:
        return fib(n-1) + fib(n-2)

t0 = perf_counter()
cov = coverage.Coverage()
cov.start()
fib(35)
cov.stop()
cov.save()
t1 = perf_counter()
print(t1 - t0)
```

Profile on 3.10 branch: took 13.3 seconds

Functions taking more than 1% of CPU:

| Function Name | Total CPU [unit, %] | Self CPU [unit, %] | Module |
|---|---|---|---|
| _PyEval_EvalFrameDefault | 12372 (99.78%) | 3540 (28.55%) | python310.dll |
| [External Call] tracer.cp310-win_amd64.pyd | 2769 (22.33%) | 1474 (11.89%) | tracer.cp310-win_amd64.pyd |
| maybe_call_line_trace | 3872 (31.23%) | 1206 (9.73%) | python310.dll |
| _PyCode_CheckLineNumber | 1168 (9.42%) | 988 (7.97%) | python310.dll |
| call_trace | 3431 (27.67%) | 548 (4.42%) | python310.dll |
| PyLong_FromLong | 370 (2.98%) | 369 (2.98%) | python310.dll |
| _PyFrame_New_NoTrack | 385 (3.11%) | 338 (2.73%) | python310.dll |
| lookdict_unicode_nodummy | 324 (2.61%) | 324 (2.61%) | python310.dll |
| call_trace_protected | 2107 (16.99%) | 287 (2.31%) | python310.dll |
| frame_dealloc | 270 (2.18%) | 270 (2.18%) | python310.dll |
| set_add_entry | 264 (2.13%) | 264 (2.13%) | python310.dll |
| PyDict_GetItem | 579 (4.67%) | 255 (2.06%) | python310.dll |
| _PyEval_MakeFrameVector | 633 (5.11%) | 248 (2.00%) | python310.dll |
| _PyEval_Vector | 12372 (99.78%) | 196 (1.58%) | python310.dll |
| PyLineTable_NextAddressRange | 180 (1.45%) | 180 (1.45%) | python310.dll |
| call_function | 12372 (99.78%) | 169 (1.36%) | python310.dll |
| PyObject_RichCompare | 457 (3.69%) | 169 (1.36%) | python310.dll |
| binary_op1 | 486 (3.92%) | 157 (1.27%) | python310.dll |
| long_richcompare | 207 (1.67%) | 157 (1.27%) | python310.dll |

Profile on main branch: took 23.0 seconds

Functions taking more than 1% of CPU:

| Function Name | Total CPU [unit, %] | Self CPU [unit, %] | Module |
|---|---|---|---|
| _PyEval_EvalFrameDefault | 22180 (99.87%) | 4448 (20.03%) | python312.dll |
| _PyCode_CheckLineNumber | 5389 (24.27%) | 2583 (11.63%) | python312.dll |
| maybe_call_line_trace | 6461 (29.09%) | 1966 (8.85%) | python312.dll |
| retreat | 2242 (10.10%) | 1644 (7.40%) | python312.dll |
| [External Call] tracer.cp312-win_amd64.pyd | 6274 (28.25%) | 1428 (6.43%) | tracer.cp312-win_amd64.pyd |
| get_line_delta | 1162 (5.23%) | 1162 (5.23%) | python312.dll |
| unicodekeys_lookup_unicode | 844 (3.80%) | 728 (3.28%) | python312.dll |
| _Py_dict_lookup | 1476 (6.65%) | 632 (2.85%) | python312.dll |
| call_trace | 9802 (44.14%) | 630 (2.84%) | python312.dll |
| siphash13 | 442 (1.99%) | 442 (1.99%) | python312.dll |
| PyUnicode_New | 745 (3.35%) | 376 (1.69%) | python312.dll |
| pymalloc_alloc | 362 (1.63%) | 362 (1.63%) | python312.dll |
| _PyType_Lookup | 1786 (8.04%) | 319 (1.44%) | python312.dll |
| set_add_entry | 271 (1.22%) | 271 (1.22%) | python312.dll |
| PyDict_GetItem | 824 (3.71%) | 242 (1.09%) | python312.dll |
| call_trace_protected | 8562 (38.55%) | 236 (1.06%) | python312.dll |
| PyNumber_Subtract | 449 (2.02%) | 221 (1.00%) | python312.dll |

@sweeneyde (Member) commented Jun 6, 2022

This seems to suggest to me that this could be made much faster via a dedicated C API function:

https://github.com/nedbat/coveragepy/blob/master/coverage/ctracer/util.h#L41

```c
#define MyCode_GetCode(co)      (PyObject_GetAttrString((PyObject *)(co), "co_code"))
```

@sweeneyde (Member) commented Jun 6, 2022

Indeed, on main, if I add `printf("lookup %s\n", (const char *)PyUnicode_DATA(key));` to the top of `unicodekeys_lookup_unicode` and run that fib script, I get the following hot loop repeated over and over:

```
lookup fib
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup co_code
lookup fib
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup C:\Users\sween\Source\Repos\cpython2\cpython\cover.py
lookup co_code
...
```

Each of these calls to PyObject_GetAttrString(..., "co_code") requires PyUnicode_FromString->unicode_decode_utf8->PyUnicode_New->_PyObject_Malloc, followed by PyObject_GetAttr->_PyObject_GenericGetAttrWithDict->_PyType_Lookup->find_name_in_mro->(unicode_hash+_Py_Dict_Lookup)->unicodekeys_lookup_unicode, where it should be just a C API function call.
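
For illustration only (not what coverage.py does today), on versions without a direct accessor the string-creation part of that chain could be avoided by caching an interned key and calling PyObject_GetAttr; the helper name below is hypothetical, and the generic attribute lookup itself still remains:

```c
/* Sketch: cache an interned "co_code" key so repeated lookups skip
 * PyUnicode_FromString/unicode_decode_utf8 on every call. */
static PyObject *
my_get_co_code(PyCodeObject *co)
{
    static PyObject *co_code_key = NULL;
    if (co_code_key == NULL) {
        co_code_key = PyUnicode_InternFromString("co_code");
        if (co_code_key == NULL) {
            return NULL;
        }
    }
    return PyObject_GetAttr((PyObject *)co, co_code_key);  /* new reference */
}
```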

@sweeneyde (Member) commented Jun 6, 2022

Ah! PyCode_GetCode exists since #92168
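
So, assuming PyCode_GetCode() is available from 3.11 on (per the linked change), the util.h macro could branch on the version and call the accessor directly; this is only a sketch, not necessarily how coverage.py ends up fixing it:

```c
/* Sketch: prefer the direct accessor where it exists, keep the attribute
 * lookup as a fallback on older versions. Both forms return a new
 * reference to the bytes object. */
#if PY_VERSION_HEX >= 0x030B0000
#define MyCode_GetCode(co)      (PyCode_GetCode((PyCodeObject *)(co)))
#else
#define MyCode_GetCode(co)      (PyObject_GetAttrString((PyObject *)(co), "co_code"))
#endif
```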

@markshannon (Member) commented Jun 6, 2022

Some performance degradation under tracing is expected, but not as much as reported.
This is a deliberate tradeoff: faster execution when not tracing, for slower execution when tracing.

From the profiles that @sweeneyde gathered it looks like we are seeing slowdowns in:

  • _PyEval_EvalFrameDefault, which is expected
  • Calculation of line numbers. This is probably due to the new line table format.

The new line number table gives better error messages without taking up too much extra space, but it is more complex and thus slower to parse.
We should be able to speed it up, but don't expect things to be as fast as 3.10. Sorry.

Once nedbat/coveragepy#1394 is merged, we can look at the profile again to see if much has changed.

@sweeneyde (Member) commented Jun 6, 2022

New profile of fib_cover.py on main branch, with the PyCode_GetCode addition (took 18.1 seconds):

| Function Name | Total CPU [unit, %] | Self CPU [unit, %] | Module |
|---|---|---|---|
| _PyEval_EvalFrameDefault | 19239 (99.66%) | 4481 (23.21%) | python312.dll |
| _PyCode_CheckLineNumber | 5344 (27.68%) | 2676 (13.86%) | python312.dll |
| maybe_call_line_trace | 6397 (33.14%) | 1997 (10.34%) | python312.dll |
| retreat | 2197 (11.38%) | 1629 (8.44%) | python312.dll |
| [External Call] tracer.cp312-win_amd64.pyd | 3073 (15.92%) | 1360 (7.04%) | tracer.cp312-win_amd64.pyd |
| get_line_delta | 1038 (5.38%) | 1038 (5.38%) | python312.dll |
| call_trace | 6630 (34.34%) | 605 (3.13%) | python312.dll |
| unicodekeys_lookup_unicode | 549 (2.84%) | 548 (2.84%) | python312.dll |
| _Py_dict_lookup | 1060 (5.49%) | 511 (2.65%) | python312.dll |
| PyDict_GetItem | 929 (4.81%) | 266 (1.38%) | python312.dll |
| set_add_entry | 265 (1.37%) | 265 (1.37%) | python312.dll |
| PyNumber_Subtract | 447 (2.32%) | 226 (1.17%) | python312.dll |
| call_trace_protected | 5408 (28.01%) | 222 (1.15%) | python312.dll |
| initialize_locals | 199 (1.03%) | 199 (1.03%) | python312.dll |

I wonder: would there be any way to retain some specializations during tracing? Some specialized opcodes are certainly incorrect to use during tracing, e.g., STORE_FAST__LOAD_FAST. However, couldn't others be retained, e.g. LOAD_GLOBAL_MODULE, or any specialized opcode that covers only one unspecialized opcode? (EDIT: maybe not, if it uses NOTRACE_DISPATCH())

@pablogsal (Member) commented Jun 6, 2022

We should be able to speed it up, but don't expect things to be as fast as 3.10. Sorry.

Well, two times slower than 3.10, as some of @nedbat's benchmarks show, is not acceptable in my opinion. We should try to get this below 20% at the very least.

I marked this as a release blocker and this will block any future releases.

@markshannon (Member) commented Jun 7, 2022

The problem with saying that something is unacceptable is that it implies there is an acceptable alternative.
What is that alternative?

What does it mean to make this a release blocker? Are you not going to release 3.11 if coverage is >20% slower?

I don't think cutoffs, such as 20%, are constructive. We should do as well as we are reasonably able, regardless of whether we achieve some arbitrary number.
I expect that will be better than a 20% slowdown, but I'm not making any pronouncements.

@pablogsal (Member) commented Jun 7, 2022

What is that alternative?

I don't know, but there is always the possibility to revert the changes that made coverage slower (worst case scenario).

Are you not going to release 3.11 if coverage is >20% slower?

I will check with the SC, but I can already tell you that I would not be comfortable releasing 3.11 if it is two times slower in coverage, unless the SC mandates otherwise.

I don't think cutoffs, such as 20%, are constructive. We should do as well as we are reasonably able, regardless of whether we achieve some arbitrary number.

The problem here is that this tradeoff was not discussed anywhere or advertised, so no one had the chance to weigh in or propose different tradeoffs, and therefore we cannot just "decide" out of the blue what is and what is not acceptable, because the community didn't have the chance to object.

@markshannon (Member) commented Jun 7, 2022

The largest contributor to the slowdown is computing line numbers. The root cause of that is PEP 657. The proximate cause is my PR to reduce the memory consumed by PEP 657.
Would you revert PEP 657, or just accept a large increase in memory use?

I think it would be a lot more helpful if we fixed the issue, rather than making dire pronouncements.

the community didn't have the chance to object.

Everything we do is public. I even opened nedbat/coveragepy#1287 back in November, so that @nedbat was aware that there might be slowdowns to coverage.py

@pablogsal (Member) commented Jun 7, 2022

The largest contributor to the slowdown is computing line numbers. The root cause of that is PEP 657.

That's not true, and I feel it is misleading. When line numbers and column offsets were separated, PEP 657 had 0 overhead on computing line numbers, so the root cause is certainly not PEP 657.

The root cause was merging line numbers and column offsets so you need to pay for decoding both just to get line numbers.

Everything we do is public. I even opened nedbat/coveragepy#1287 back in November, so that @nedbat was aware that there might be slowdowns to coverage.py

That's not enough. If that were true, we would not need PEPs at all, because "what we do is already public". The point of having discussions on mailing lists is precisely so that everyone can be aware of the tradeoffs and give their opinions.

@pablogsal (Member) commented Jun 7, 2022

In any case, there is no point in discussing in these terms, because I am not the one making decisions here. As RM, my only authority is to consider the current slowdown a blocker and to ask for a bigger discussion if we cannot fix the issue.

Let's see what we can do and then let's discuss how to proceed once we understand our options.

@markshannon (Member) commented Jun 7, 2022

The root cause was merging line numbers and column offsets so you need to pay for decoding both just to get line numbers.

No, that was the proximate cause. It wouldn't have been necessary if not for PEP 657.
I didn't implement it to make things slower, I implemented it because column and endline tables were consuming a lot of memory.

Claiming that PEP 657 had zero overhead is not true. The original implementation had a large overhead in terms of memory use.

@pablogsal (Member) commented Jun 7, 2022

I didn't implement it to make things slower, I implemented it because column and endline tables were consuming a lot of memory.

Mark, I understand that and we are on the same page. I know why you made that change, and IIRC I even reviewed it. The problem here is that that change (or other changes) had unexpected consequences on timing, and we need to make decisions together: either fixing it if we can, or potentially reverting it if we cannot. That's all I am saying.

Claiming that PEP 657 had zero overhead is not true. The original implementation had a large overhead in terms of memory use.

The claim was that PEP 657 had zero overhead on the time spent computing line numbers, not on memory. My sentence was:

PEP 657 had 0 overhead on computing line numbers, so the root cause is certainly not PEP 657.

@gpshead (Member) commented Jun 8, 2022

Marking this as a release or deferred blocker is fine and entirely what I'd expect any RM to want to do. That just means we collect more information and make a decision before we exit beta stages.

I don't personally have a problem with coverage being slower, even by 2x. It's more about understanding what is impacted by this and how, so we can explain our measured rationale to users; i.e., look beyond just coverage, which should only be a developer/CI-time tool rather than something slowing down people's "production" release tools, applications, or notebook code.

Q1: What else uses tracing?

Q1b: What of those use it at other than development time?

These questions are informational and do not block performance work on reducing the impact to tracing.

@sweeneyde (Member) commented Jun 8, 2022

How bad would it be to store an extra field into code objects that is the "decompressed" line number table, lazily allocated for only those code objects which are actually traced? Could it occupy co_extra or similar?

@Fidget-Spinner (Member) commented Jun 8, 2022

How bad would it be to store an extra field into code objects that is the "decompressed" line number table, lazily allocated for only those code objects which are actually traced? Could it occupy co_extra or similar?

In #93383 (comment) I benchmarked adding an additional field to the code object and there was no slowdown in pyperformance. I do not know how memory consumption is affected.

I'm not sure co_extra is the right place for that info, considering it's for consumers of the PEP 523 API. Maybe I can merge your idea into a simple lazily-created `_PyCodeTracingInfo *` C struct?

```c
PyCodeObject {
    ...
    void *co_tracing_info;
};

struct _PyCodeTracingInfo {
    // cached code bytes object
    PyObject *co_code;
    // decompressed line number table
    ...
};
```
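
To make the lazy-allocation idea concrete, here is a self-contained sketch of the pattern (all names are illustrative, not actual CPython internals); only code objects that are actually traced would ever pay for the extra allocation:

```c
#include <Python.h>

/* Illustrative struct mirroring the sketch above. */
typedef struct {
    PyObject *co_code;   /* cached bytes object */
    int      *lines;     /* decompressed instruction-index -> line-number table */
} TracingInfo;

/* Hypothetical accessor: `slot` stands in for a lazily-filled pointer field
 * on the code object (e.g. the proposed co_tracing_info). */
static TracingInfo *
get_tracing_info(PyCodeObject *co, void **slot)
{
    TracingInfo *info = *slot;
    if (info == NULL) {
        info = PyMem_Calloc(1, sizeof(*info));
        if (info == NULL) {
            PyErr_NoMemory();
            return NULL;
        }
        info->co_code = PyCode_GetCode(co);   /* cache once; new reference */
        if (info->co_code == NULL) {
            PyMem_Free(info);
            return NULL;
        }
        /* ...decompress the line table into info->lines here... */
        *slot = info;
    }
    return info;
}
```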

@markshannon (Member) commented Jun 8, 2022

How bad would it be to store an extra field into code objects that is the "decompressed" line number table, lazily allocated for only those code objects which are actually traced? Could it occupy co_extra or similar?

An early design for the compressed line number table did just this.
My concern, then and now, is that the uncompressed table is a lot bigger, and never goes away.

It isn't just tracing that needs line numbers, tracebacks need them too.
Which means that the uncompressed table would be created every time a code object ends up in a traceback.

I think we should see how much faster we can make tracing with the compressed table first. If that isn't good enough, then we can reconsider the decompressed table.

@markshannon (Member) commented Jun 8, 2022

@gpshead

Q1: What else uses tracing?

Debuggers and profiling tools, like cProfile.

Q1b: What of those use it at other than development time?

Hopefully, no one. Tracing has always imposed a considerable overhead.
Anyone using a profiler in production should be using a statistical profiler.
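
For context, the "tracing" under discussion is the sys.settrace hook: coverage.py and debuggers such as pdb register a callback through it, while cProfile uses the related sys.setprofile hook, which only sees call/return events. A minimal illustration of the settrace side:

```python
import sys

def tracer(frame, event, arg):
    # Fires for "call", "line", "return" and "exception" events; returning
    # the function keeps line events enabled for the frame being traced.
    if event == "line":
        print(f"{frame.f_code.co_filename}:{frame.f_lineno}")
    return tracer

sys.settrace(tracer)

def add(a, b):
    return a + b

add(1, 2)          # each executed line in add() is reported to tracer()
sys.settrace(None)
```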

@pablogsal (Member) commented Jun 8, 2022

I think we should see how much faster we can make tracing with the compressed table first. If that isn't good enough, then we can reconsider the decompressed table.

Maybe we could leave line numbers uncompressed and compress the rest?

Alternatively, we could leave just the line numbers uncompressed if and only if we detect a tracing or profiling function. This way the extra memory is only paid if someone is profiling or tracing.

@markshannon (Member) commented Jun 8, 2022

Maybe we could leave line numbers uncompressed and compress the rest?

That's going to be complicated. Too complicated for a beta release, IMO.

Alternatively, we could leave just the line numbers uncompressed if and only if we detect a tracing or profiling function. This way the extra memory is only paid if someone is profiling or tracing.

If we add a small/large flag, then the extra memory will be 2 bytes per instruction for files < 64k lines (basically all code), and 4 bytes per instruction for giant files.
I'll try it out.

@markshannon (Member) commented Jun 8, 2022

I've been experimenting with streamlining the advance and maybe_call_line_trace functions. I'm getting roughly a 10% speedup, but a flat table lookup will be way faster.

@markshannon (Member) commented Jun 8, 2022

Actually, if we store the line delta from co_firstline, many functions will only need 1 byte per code unit, which should be fine.
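
A rough sketch of the memory cost being estimated here, under the assumptions above (1 byte per code unit when every line delta from the first line fits in a byte, wider entries otherwise); this is only a back-of-the-envelope estimate, not the actual implementation:

```python
def flat_line_table_cost(code):
    # Each code unit (instruction) is 2 bytes of co_code.
    n_units = len(code.co_code) // 2
    deltas = [line - code.co_firstlineno
              for _, _, line in code.co_lines()   # PEP 626 API, 3.10+
              if line is not None]
    if all(0 <= d < 256 for d in deltas):
        width = 1          # small functions: 1 byte per code unit
    elif all(0 <= d < 65536 for d in deltas):
        width = 2          # files under ~64k lines
    else:
        width = 4          # giant files
    return n_units * width

print(flat_line_table_cost(flat_line_table_cost.__code__))
```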

@nedbat (Member, Author) commented Jun 12, 2022

I'm very interested in helping to test people's proposed fixes, but am losing track of what is being changed where. We have #93493, which I have tried. Are there other pull requests in flight to try?

@pablogsal (Member) commented Jun 12, 2022

I would like to see how much we gain back by reverting 944fffe (not that we are planning to do it, I just want to know how much of the problem is due to that).

I can prepare a PR for you to test later today, but in the interim you can try reverting it yourself (it doesn't revert cleanly, but the conflicts are mainly in automatically generated files, which you can regenerate by running make regen-all).

@Fidget-Spinner (Member) commented Jun 12, 2022

I would like to see how much we gain back by reverting 944fffe (not that we are planning to do it, just want to know how much the problem is due to that).

IIUC, Kumar benchmarked that commit here and got a 1.3x slowdown before and after.

python/pyperformance#213 (comment)

@pablogsal (Member) commented Jun 12, 2022

IIUC, Kumar benchmarked that commit here and got a 1.3x slowdown before and after.

That's in pyperformance, no? I meant to see how it performs in coverage. Sorry, I checked the link in more detail; it seems this is what I wanted to check 😓 Sorry for the confusion :)

@markshannon (Member) commented Jun 13, 2022

Looking at the output of perf (using an optimized, but no pgo, lto build) and the code, I see the following reasons for the slowdown:

  1. Increased overhead in _PyEval_EvalFrameDefault when determining whether the event is a "call", a possible "line", or no event.
  2. Increased overhead in _PyEval_EvalFrameDefault because we no longer specialize when tracing.
  3. Increased overhead in maybe_call_line_trace when finding the first traceable instruction.
  4. A large slowdown in determining the line number from the instruction index.

All but 2 can be fixed without any negative impact on performance when not tracing.
Fixing 4 will increase memory use when tracing, but will only need a tiny bit of extra memory when not tracing.

Specializing during tracing increases the overhead of dispatching when not tracing, resulting in a ~1% slowdown.

@kumaraditya303 (Contributor) commented Jun 13, 2022

FYI, I have added a coverage benchmark [1] to pyperformance to ease benchmarking coverage / tracing performance.

Footnotes

  1. python/pyperformance#213

@markshannon (Member) commented Jun 13, 2022

#93763

@markshannon (Member) commented Jun 13, 2022

#93769

@markshannon (Member) commented Jun 14, 2022

#93769 is merged. #93818 is the backport to 3.11
