
ast.get_source_segment is slower than it needs to be because it reads every line of the source. #103285

Closed
JBushagour opened this issue Apr 5, 2023 · 3 comments
Labels
3.12 bugs and security fixes performance Performance or resource usage stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@JBushagour

JBushagour commented Apr 5, 2023

Bug report

There is a private function _splitlines_no_ff which is only ever called from ast.get_source_segment. This function splits the entire source given to it, but ast.get_source_segment needs at most the first node.end_lineno lines to do its work.

cpython/Lib/ast.py

Lines 308 to 330 in 1acdfec

def _splitlines_no_ff(source):
    """Split a string into lines ignoring form feed and other chars.

    This mimics how the Python parser splits source code.
    """
    idx = 0
    lines = []
    next_line = ''
    while idx < len(source):
        c = source[idx]
        next_line += c
        idx += 1
        # Keep \r\n together
        if c == '\r' and idx < len(source) and source[idx] == '\n':
            next_line += '\n'
            idx += 1
        if c in '\r\n':
            lines.append(next_line)
            next_line = ''

    if next_line:
        lines.append(next_line)
    return lines

cpython/Lib/ast.py

Lines 344 to 378 in 1acdfec

def get_source_segment(source, node, *, padded=False):
    """Get source code segment of the *source* that generated *node*.

    If some location information (`lineno`, `end_lineno`, `col_offset`,
    or `end_col_offset`) is missing, return None.

    If *padded* is `True`, the first line of a multi-line statement will
    be padded with spaces to match its original position.
    """
    try:
        if node.end_lineno is None or node.end_col_offset is None:
            return None
        lineno = node.lineno - 1
        end_lineno = node.end_lineno - 1
        col_offset = node.col_offset
        end_col_offset = node.end_col_offset
    except AttributeError:
        return None

    lines = _splitlines_no_ff(source)
    if end_lineno == lineno:
        return lines[lineno].encode()[col_offset:end_col_offset].decode()

    if padded:
        padding = _pad_whitespace(lines[lineno].encode()[:col_offset].decode())
    else:
        padding = ''

    first = padding + lines[lineno].encode()[col_offset:].decode()
    last = lines[end_lineno].encode()[:end_col_offset].decode()
    lines = lines[lineno+1:end_lineno]

    lines.insert(0, first)
    lines.append(last)

    return ''.join(lines)
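
For reference, the public entry point being discussed can be exercised directly. A minimal sketch (the sample source and variable names are illustrative):

```python
import ast

source = "x = 1\ny = (2 +\n     3)\n"
tree = ast.parse(source)
assign = tree.body[1]  # the multi-line assignment to y
segment = ast.get_source_segment(source, assign)
print(segment)  # prints the original two-line text of the assignment
```

Internally, this call splits all of source via _splitlines_no_ff, even though only the first three lines are needed here.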

If, for example, you want to extract an import line from a very long file, this can seriously degrade performance.

Introducing a max_lines kwarg in _splitlines_no_ff, which would function like maxsplit in str.split, would avoid the unneeded work. An implementation of the proposed fix is below (it makes my use case twice as fast):

--- a/Lib/ast.py
+++ b/Lib/ast.py
@@ -305,11 +305,16 @@ def get_docstring(node, clean=True):
     return text
 
 
-def _splitlines_no_ff(source):
+def _splitlines_no_ff(source, max_lines=-1):
     """Split a string into lines ignoring form feed and other chars.
 
     This mimics how the Python parser splits source code.
+
+    If max_lines is given, at most max_lines will be returned. If max_lines is not
+    specified or negative, then there is no limit on the number of lines returned.
     """
+    if not max_lines:
+        return []
     idx = 0
     lines = []
     next_line = ''
@@ -323,6 +328,8 @@ def _splitlines_no_ff(source):
             idx += 1
         if c in '\r\n':
             lines.append(next_line)
+            if max_lines == len(lines):
+                return lines
             next_line = ''
 
     if next_line:
@@ -360,7 +367,7 @@ def get_source_segment(source, node, *, padded=False):
     except AttributeError:
         return None
 
-    lines = _splitlines_no_ff(source)
+    lines = _splitlines_no_ff(source, max_lines=end_lineno + 1)
     if end_lineno == lineno:
         return lines[lineno].encode()[col_offset:end_col_offset].decode()
 

Your environment

  • CPython versions tested on: 3.11

Linked PRs

@JBushagour JBushagour added the type-bug An unexpected behavior, bug, or error label Apr 5, 2023
@hugovk hugovk added the performance Performance or resource usage label Apr 5, 2023
@AlexWaygood AlexWaygood added stdlib Python modules in the Lib dir 3.12 bugs and security fixes labels Apr 5, 2023
@gaogaotiantian
Member

gaogaotiantian commented Apr 5, 2023

I wouldn't consider this a "bug", but it's something that we can definitely improve. Breaking on max_lines is nice, but I think iterating by characters (and creating new strings along the way) is a huge performance hit as well. I think we can just use built-in string operations here to improve performance. I can work on some benchmarks for this; I assume it would be faster.

@gaogaotiantian
Member

I tried to replicate the exact behavior of the current _splitlines_no_ff function, because I don't know whether the author had any specific concerns about line breakers. That behavior is:

  • Consider \n, \r\n and \r as line breakers
  • Convert \r\n to \n
  • Keep the line breaker as it is (do not convert \r)

This is not trivial with pure string manipulation, so I used re, which provides convenient iterative matching.

Core code is

import re

_line_pattern = re.compile(r"(.*?(?:\r\n|\n|\r|$))")

def _splitlines_no_ff(source, maxlines=-1):
    lines = []
    for lineno, match in enumerate(_line_pattern.finditer(source), 1):
        if maxlines > 0 and lineno > maxlines:
            break
        lines.append(match[0].replace("\r\n", "\n"))
    return lines

Pre-compiling the regex helped a bit in benchmarks; we can switch to an inline compile if that's considered ugly. The regex + replace replicates the behavior almost exactly, except that it ends up with an empty string at the end when the source is not terminated by \n. I think that's completely acceptable: with a maxlines input, that line should not be used at all anyway.

>>> re.findall(r"(.*?(?:\r\n|\r|\n|$))", "a\f\rb\n\nc\r\ndd")
['a\x0c\r', 'b\n', '\n', 'c\r\n', 'dd', '']
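
The \r\n-to-\n conversion comes from the replace step; applying it to the matches above gives (a minimal standalone check):

```python
import re

matches = re.findall(r"(.*?(?:\r\n|\n|\r|$))", "a\f\rb\n\nc\r\ndd")
lines = [m.replace("\r\n", "\n") for m in matches]
print(lines)  # ['a\x0c\r', 'b\n', '\n', 'c\n', 'dd', '']
```

Note that the form feed stays inside its line, \r and \n are kept, and only \r\n is rewritten to \n.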

Benchmark code as below:

import timeit

short_setup = f"""
import ast
code = \"\"\"def fib(x):
    if x < 2:
        return 1
    return fib(x - 1) + fib(x - 2)
\"\"\"
module_node = ast.parse(code)
function_node = module_node.body[0]
"""

long_setup_start = f"""
import ast
with open("Lib/inspect.py") as f:
    code = f.read()
module_node = ast.parse(code)
function_node = module_node.body[2]
"""

long_setup_end = f"""
import ast
with open("Lib/inspect.py") as f:
    code = f.read()
module_node = ast.parse(code)
function_node = module_node.body[-2]
"""

test = """
ast.get_source_segment(code, function_node)
"""

print(f"short:      {timeit.timeit(test, setup=short_setup, number=10000)}")
print(f"long+start: {timeit.timeit(test, setup=long_setup_start, number=10)}")
print(f"long+end:   {timeit.timeit(test, setup=long_setup_end, number=10)}")

We tested on a very short source and on a long one (inspect.py has about 3k lines). For the long one, we tested a node at the very beginning of the file, which should benefit from maxlines a lot, and a node at the end of the file.

The result of current implementation with no optimization:

short:      0.12191030000394676
long+start: 0.15196170000126585
long+end:   0.15174370000022464

The result of improved method using re and maxlines:

short:      0.041647400001238566
long+start: 0.0010708999980124645
long+end:   0.03278889999637613

As we can tell from the results, even on very short source code the new implementation has a ~3x speedup, due to eliminating the character-level loop. For the long source code, the speedup is even more pronounced, ~4x-5x.

maxlines provided a huge performance improvement as expected: far fewer loop iterations when we can stop early. The improvement should scale linearly with the ratio of the source length to the target end line number.

Overall, I believe this is a promising improvement to ast.get_source_segment().

@hauntsaninja
Contributor

Thanks again!


5 participants