
ast.get_source_segment is slower than it needs to be because it reads every line of the source. #103285

Closed
JBushagour opened this issue Apr 5, 2023 · 3 comments
Labels
3.12 bugs and security fixes performance Performance or resource usage stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@JBushagour

JBushagour commented Apr 5, 2023

Bug report

There is a private function _splitlines_no_ff which is only ever called from ast.get_source_segment. This function splits the entire source given to it, but ast.get_source_segment needs at most the first node.end_lineno lines to do its work.

cpython/Lib/ast.py

Lines 308 to 330 in 1acdfec

def _splitlines_no_ff(source):
    """Split a string into lines ignoring form feed and other chars.

    This mimics how the Python parser splits source code.
    """
    idx = 0
    lines = []
    next_line = ''
    while idx < len(source):
        c = source[idx]
        next_line += c
        idx += 1
        # Keep \r\n together
        if c == '\r' and idx < len(source) and source[idx] == '\n':
            next_line += '\n'
            idx += 1
        if c in '\r\n':
            lines.append(next_line)
            next_line = ''

    if next_line:
        lines.append(next_line)
    return lines

cpython/Lib/ast.py

Lines 344 to 378 in 1acdfec

def get_source_segment(source, node, *, padded=False):
    """Get source code segment of the *source* that generated *node*.

    If some location information (`lineno`, `end_lineno`, `col_offset`,
    or `end_col_offset`) is missing, return None.

    If *padded* is `True`, the first line of a multi-line statement will
    be padded with spaces to match its original position.
    """
    try:
        if node.end_lineno is None or node.end_col_offset is None:
            return None
        lineno = node.lineno - 1
        end_lineno = node.end_lineno - 1
        col_offset = node.col_offset
        end_col_offset = node.end_col_offset
    except AttributeError:
        return None

    lines = _splitlines_no_ff(source)
    if end_lineno == lineno:
        return lines[lineno].encode()[col_offset:end_col_offset].decode()

    if padded:
        padding = _pad_whitespace(lines[lineno].encode()[:col_offset].decode())
    else:
        padding = ''

    first = padding + lines[lineno].encode()[col_offset:].decode()
    last = lines[end_lineno].encode()[:end_col_offset].decode()
    lines = lines[lineno+1:end_lineno]

    lines.insert(0, first)
    lines.append(last)

    return ''.join(lines)
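
For reference, the public entry point being discussed can be exercised directly. A minimal sketch (the sample source and variable names are illustrative):

```python
import ast

source = "x = 1\ny = (2 +\n     3)\n"
tree = ast.parse(source)
assign = tree.body[1]  # the multi-line assignment to y
segment = ast.get_source_segment(source, assign)
print(segment)  # prints the original two-line text of the assignment
```

Internally, this call splits all of source via _splitlines_no_ff, even though only the first three lines are needed here.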

If, for example, you want to extract an import line from a very long file, this can seriously degrade performance.

Introducing a max_lines kwarg in _splitlines_no_ff, which would function like maxsplit in str.split, would avoid the unneeded work. An implementation of the proposed fix is below (it makes my use case twice as fast):

--- a/Lib/ast.py
+++ b/Lib/ast.py
@@ -305,11 +305,16 @@ def get_docstring(node, clean=True):
     return text
 
 
-def _splitlines_no_ff(source):
+def _splitlines_no_ff(source, max_lines=-1):
     """Split a string into lines ignoring form feed and other chars.
 
     This mimics how the Python parser splits source code.
+
+    If max_lines is given, at most max_lines will be returned. If max_lines is not
+    specified or negative, then there is no limit on the number of lines returned.
     """
+    if not max_lines:
+        return []
     idx = 0
     lines = []
     next_line = ''
@@ -323,6 +328,8 @@ def _splitlines_no_ff(source):
             idx += 1
         if c in '\r\n':
             lines.append(next_line)
+            if max_lines == len(lines):
+                return lines
             next_line = ''
 
     if next_line:
@@ -360,7 +367,7 @@ def get_source_segment(source, node, *, padded=False):
     except AttributeError:
         return None
 
-    lines = _splitlines_no_ff(source)
+    lines = _splitlines_no_ff(source, max_lines=end_lineno + 1)
     if end_lineno == lineno:
         return lines[lineno].encode()[col_offset:end_col_offset].decode()
 

Your environment

  • CPython versions tested on: 3.11

Linked PRs

@JBushagour JBushagour added the type-bug An unexpected behavior, bug, or error label Apr 5, 2023
@hugovk hugovk added the performance Performance or resource usage label Apr 5, 2023
@AlexWaygood AlexWaygood added stdlib Python modules in the Lib dir 3.12 bugs and security fixes labels Apr 5, 2023
@gaogaotiantian
Member

gaogaotiantian commented Apr 5, 2023

I wouldn't consider this a "bug", but it's something that we can definitely improve. Breaking on max_lines is nice, but I think iterating by characters (and creating new strings along the way) is a huge performance hit as well. I think we can just use built-in string operations here to improve performance. I can work on some benchmarks for this; I assume it would be faster.

@gaogaotiantian
Member

I tried to replicate the exact behavior of the current _splitlines_no_ff function, because I don't know whether the author had any specific concerns about line breakers. That behavior is:

  • Consider \n, \r\n and \r as line breakers
  • Convert \r\n to \n
  • Keep the line breaker as it is (do not convert \r)

This is not trivial with pure string manipulation, so I used re, which provides convenient iterative matching.

Core code is

import re

_line_pattern = re.compile(r"(.*?(?:\r\n|\n|\r|$))")

def _splitlines_no_ff(source, maxlines=-1):
    lines = []
    for lineno, match in enumerate(_line_pattern.finditer(source), 1):
        if maxlines > 0 and lineno > maxlines:
            break
        lines.append(match[0].replace("\r\n", "\n"))
    return lines

Pre-compiling the regex helped a bit in benchmarks; we can switch to an inline compile if that's considered ugly. The regex + replace replicates the behavior almost exactly, except that it ends up with an empty string at the end when the source is not terminated by \n. I think that's completely acceptable: with a maxlines input, that line should not be used at all anyway.

>>> re.findall(r"(.*?(?:\r\n|\r|\n|$))", "a\f\rb\n\nc\r\ndd")
['a\x0c\r', 'b\n', '\n', 'c\r\n', 'dd', '']
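
The \r\n-to-\n conversion comes from the replace step; applying it to the matches above gives (a minimal standalone check):

```python
import re

matches = re.findall(r"(.*?(?:\r\n|\n|\r|$))", "a\f\rb\n\nc\r\ndd")
lines = [m.replace("\r\n", "\n") for m in matches]
print(lines)  # ['a\x0c\r', 'b\n', '\n', 'c\n', 'dd', '']
```

Note that the form feed stays inside its line, \r and \n are kept, and only \r\n is rewritten to \n.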

Benchmark code as below:

import timeit

short_setup = f"""
import ast
code = \"\"\"def fib(x):
    if x < 2:
        return 1
    return fib(x - 1) + fib(x - 2)
\"\"\"
module_node = ast.parse(code)
function_node = module_node.body[0]
"""

long_setup_start = f"""
import ast
with open("Lib/inspect.py") as f:
    code = f.read()
module_node = ast.parse(code)
function_node = module_node.body[2]
"""

long_setup_end = f"""
import ast
with open("Lib/inspect.py") as f:
    code = f.read()
module_node = ast.parse(code)
function_node = module_node.body[-2]
"""

test = """
ast.get_source_segment(code, function_node)
"""

print(f"short:      {timeit.timeit(test, setup=short_setup, number=10000)}")
print(f"long+start: {timeit.timeit(test, setup=long_setup_start, number=10)}")
print(f"long+end:   {timeit.timeit(test, setup=long_setup_end, number=10)}")

We tested on a very short source and on a long one (inspect.py has about 3k lines). For the long one, we tested a node at the very beginning of the file, which should benefit from maxlines a lot, and a node at the end of the file.

The result of current implementation with no optimization:

short:      0.12191030000394676
long+start: 0.15196170000126585
long+end:   0.15174370000022464

The result of improved method using re and maxlines:

short:      0.041647400001238566
long+start: 0.0010708999980124645
long+end:   0.03278889999637613

As we can tell from the results, even on very short source code the new implementation has a ~3x speedup, due to eliminating the character-level loop. For the long source code, the speedup is even more pronounced, ~4x-5x.

maxlines provided a huge performance improvement as expected: far fewer loop iterations when we can stop early. The improvement should scale linearly with the ratio of the source length to the target end line number.

Overall, I believe this is a promising improvement to ast.get_source_segment().

@hauntsaninja
Contributor

Thanks again!


5 participants