ast.get_source_segment is slower than it needs to be because it reads every line of the source. #103285
Comments
I won't consider this as a "bug", but it's something that we can definitely improve. Breaking on
I tried to replicate the exact behavior of the current implementation.
This is not trivial in pure string manipulation, so I used a regular expression. The core code is:

    import re

    _line_pattern = re.compile(r"(.*?(?:\r\n|\n|\r|$))")

    def _splitlines_no_ff(source, maxlines=-1):
        # Split on \r\n, \n and \r only; stop after maxlines lines when maxlines > 0.
        lines = []
        for lineno, match in enumerate(_line_pattern.finditer(source), 1):
            if maxlines > 0 and lineno > maxlines:
                break
            lines.append(match[0].replace("\r\n", "\n"))
        return lines

Pre-compiling the regex helped a bit in benchmarks; we can switch to an inline compile if that's considered ugly. The regex + replace replicates the behavior almost exactly, except that it ends up with an empty string at the end, as in this example where the source is not terminated by a line break:

    >>> re.findall(r"(.*?(?:\r\n|\r|\n|$))", "a\f\rb\n\nc\r\ndd")
    ['a\x0c\r', 'b\n', '\n', 'c\r\n', 'dd', '']

Benchmark code is below:

    import timeit
    short_setup = f"""
    import ast
    code = \"\"\"def fib(x):
        if x < 2:
            return 1
        return fib(x - 1) + fib(x - 2)
    \"\"\"
    module_node = ast.parse(code)
    function_node = module_node.body[0]
    """
    long_setup_start = f"""
    import ast
    with open("Lib/inspect.py") as f:
        code = f.read()
    module_node = ast.parse(code)
    function_node = module_node.body[2]
    """
    long_setup_end = f"""
    import ast
    with open("Lib/inspect.py") as f:
        code = f.read()
    module_node = ast.parse(code)
    function_node = module_node.body[-2]
    """
    test = """
    ast.get_source_segment(code, function_node)
    """
    print(f"short: {timeit.timeit(test, setup=short_setup, number=10000)}")
    print(f"long+start: {timeit.timeit(test, setup=long_setup_start, number=10)}")
    print(f"long+end: {timeit.timeit(test, setup=long_setup_end, number=10)}")

We tested on a very short source and a long one (Lib/inspect.py, extracting one node near the start of the file and one near the end). The result of the current implementation with no optimization:
The result of improved method using
As we can tell from the results, even on very short source code the new implementation has a ~3x speedup, which is due to the elimination of the character-level loop. For the long source code the speedup is even more obvious: ~4x-5x.
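To make the comparison concrete, here is a minimal sketch that puts a character-by-character splitter next to a regex-based one and checks that they agree. Both functions (`split_by_char_loop`, `split_by_regex`) are hypothetical names written for illustration, not the code in Lib/ast.py; the regex variant here keeps line endings untouched and drops the trailing empty match mentioned above.

```python
import re

# Same pattern as in the comment above: a lazy run of characters followed by
# \r\n, \n, \r, or the end of the string.
_LINE = re.compile(r"(.*?(?:\r\n|\n|\r|$))")


def split_by_char_loop(source):
    # Illustrative character-level splitter: breaks on \n, \r and \r\n only,
    # so form feeds and other separators stay inside the line.
    lines, current, i = [], "", 0
    while i < len(source):
        ch = source[i]
        current += ch
        i += 1
        # Keep \r\n together as one line ending.
        if ch == "\r" and i < len(source) and source[i] == "\n":
            current += "\n"
            i += 1
        if ch in "\r\n":
            lines.append(current)
            current = ""
    if current:
        lines.append(current)
    return lines


def split_by_regex(source):
    # The matching loop runs inside the regex engine instead of in Python.
    lines = [match[0] for match in _LINE.finditer(source)]
    # The pattern always produces one final empty match; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


sample = "a\f\rb\n\nc\r\ndd"
print(split_by_regex(sample))
print(split_by_regex(sample) == split_by_char_loop(sample))
```

Note that `str.splitlines(keepends=True)` is not a drop-in replacement, because it also treats `\f` as a line boundary.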
Overall, I believe this is a promising improvement to
Thanks again!
Bug report
There is a private function _splitlines_no_ff which is only ever called in ast.get_source_segment. This function splits the entire source given to it, but ast.get_source_segment only needs at most node.end_lineno lines to work.
cpython/Lib/ast.py Lines 308 to 330 in 1acdfec
cpython/Lib/ast.py Lines 344 to 378 in 1acdfec
If, for example, you want to extract an import line from a very long file, this can seriously degrade performance.
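As a rough illustration of why only a prefix of the file matters (this is a sketch, not the patch from this issue), the snippet below cuts a long source down to its first `node.end_lineno` lines and checks that `ast.get_source_segment` returns the same text either way. The helper `first_lines` and its `max_lines` parameter are hypothetical names used only for this example.

```python
import ast
import re

# Lines end at \r\n, \n, \r, or the end of the string.
_LINE = re.compile(r"(.*?(?:\r\n|\n|\r|$))")


def first_lines(source, max_lines):
    # Hypothetical helper: collect at most max_lines leading lines and stop
    # scanning as soon as that many have been found.
    lines = []
    for lineno, match in enumerate(_LINE.finditer(source), 1):
        if lineno > max_lines:
            break
        lines.append(match[0])
    return lines


# A long module whose first statement is the import we want to extract.
source = "import os\n" + "x = 1\n" * 10_000
node = ast.parse(source).body[0]

# Only the lines up to node.end_lineno are needed to recover the segment.
prefix = "".join(first_lines(source, node.end_lineno))
full_segment = ast.get_source_segment(source, node)
prefix_segment = ast.get_source_segment(prefix, node)
print(full_segment == prefix_segment)
```

Here the prefix is a single line, while the full source has 10,001 lines that a splitter without a limit would have to walk through.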
The introduction of a max_lines kwarg in _splitlines_no_ff, which functions like maxsplit in str.split, would minimize unneeded work. An implementation of the proposed fix is below (which makes my use case twice as fast):
Your environment
Linked PRs