GH-113225: Speed up pathlib.Path.glob() #113226

Merged
merged 6 commits into python:main on Jan 4, 2024

barneygale
Contributor

@barneygale barneygale commented Dec 17, 2023

Use os.DirEntry.path as the string representation of child paths.

Corner case: when we call os.scandir('.'), the path attribute will resemble './foo', which isn't normalized by pathlib standards. In this case we use the entry name, which looks like 'foo'.
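
For illustration only, the corner case can be sketched outside pathlib. This is not the code from the patch: `child_path_strings` is a hypothetical helper, and the empty string is assumed to stand for the current directory.

```python
import os


def child_path_strings(parent):
    """Yield a string path for each entry directly under `parent`.

    Hypothetical helper, not part of pathlib. `parent` is a string;
    the empty string stands for the current directory.
    """
    # os.scandir('') fails, so scan '.' when the parent is empty.
    with os.scandir(parent or '.') as entries:
        for entry in entries:
            # After scandir('.'), entry.path looks like './foo', which is
            # not normalized by pathlib standards, so use the bare name.
            yield entry.path if parent else entry.name
```

With a non-empty parent such as 'src', each yielded string is the value of `entry.path` (for example 'src/foo' on POSIX), so no extra joining is needed.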

Use `os.DirEntry.path` as the string representation of child paths, unless
the parent path is empty, in which case we use the entry `name`.
@barneygale
Contributor Author

barneygale commented Dec 17, 2023

Up to 15% faster in some simple tests:

$ ./python -m timeit -s "from pathlib import Path" "list(Path().glob('**/*', follow_symlinks=True))"
5 loops, best of 5: 78.8 msec per loop  # before
5 loops, best of 5: 67.9 msec per loop  # after
# --> 1.16x faster

$ ./python -m timeit -s "from pathlib import Path" "list(Path.cwd().glob('**/*', follow_symlinks=True))"
5 loops, best of 5: 79.6 msec per loop  # before
5 loops, best of 5: 70.5 msec per loop  # after
# --> 1.13x faster

edit: patch revised; test cases for iterdir() removed.

@serhiy-storchaka
Member

How does it work for [p.name for p in Path.cwd().iterdir()]?

@barneygale
Contributor Author

> How does it work for [p.name for p in Path.cwd().iterdir()]?

Same result but much slower (!), as the .name attribute triggers path parsing:

$ ./python -m timeit -s "from pathlib import Path" "[p.name for p in Path.cwd().iterdir()]"
2000 loops, best of 5: 139 usec per loop  # before
1000 loops, best of 5: 371 usec per loop  # after

Will fix! Thanks.

@barneygale barneygale changed the title GH-113225: Speed up pathlib.Path.iterdir() and glob() GH-113225: Speed up pathlib.Path.glob() Dec 18, 2023
@barneygale
Contributor Author

I've undone the change to iterdir() as it's hard to predict usage patterns.

With glob() the goal is to maintain a normalized string path as cheaply as possible. The string path is passed to scandir() and matched against a regex when expanding ** patterns.
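
A rough sketch of that idea, assuming a simplified walker rather than pathlib's actual implementation: `select_recursive` is a hypothetical name, the regex is applied to the whole string path, and symlinked directories are not followed.

```python
import os
import re


def select_recursive(parent, pattern):
    """Yield string paths below `parent` whose full path matches `pattern`.

    Hypothetical helper, not pathlib's implementation. `parent` is '' for
    the current directory; `pattern` is a compiled regular expression.
    """
    stack = [parent]
    while stack:
        path = stack.pop()
        try:
            # The same string path is what gets passed back to scandir().
            with os.scandir(path or '.') as it:
                entries = list(it)
        except OSError:
            continue  # unreadable or vanished directory: skip it
        for entry in entries:
            # Reuse the string scandir() already built; fall back to the
            # bare name when the parent is empty to avoid a './' prefix.
            child = entry.path if path else entry.name
            if pattern.fullmatch(child):
                yield child
            try:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(child)
            except OSError:
                pass
```

For example, `list(select_recursive('', re.compile(r'.*\.txt')))` collects every path ending in '.txt' below the current directory without building any path objects along the way.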

@barneygale
Contributor Author

I've merged the ABC-specific changes separately (#113556) so this PR should be pretty laser-focused now :D

@barneygale
Contributor Author

Patch further revised so that we continue to set _drv, _root and _tail_cached as before, which might be important for some users. The speedup is reduced to about 3-5%:

$ ./python -m timeit -s "from pathlib import Path" "list(Path().glob('**/*', follow_symlinks=True))"
5 loops, best of 5: 84.3 msec per loop  # before
5 loops, best of 5: 80.1 msec per loop  # after

$ ./python -m timeit -s "from pathlib import Path" "list(Path.cwd().glob('**/*', follow_symlinks=True))"
5 loops, best of 5: 84.4 msec per loop  # before
5 loops, best of 5: 82   msec per loop  # after
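
For reference, the arithmetic behind these figures can be checked with a few lines of Python. The helper names are made up for this sketch; the timings are the ones quoted in this thread.

```python
def speedup(before_ms, after_ms):
    """Return how many times faster the 'after' run is (before / after)."""
    return before_ms / after_ms


def time_saved_percent(before_ms, after_ms):
    """Return the reduction in wall time as a percentage of 'before'."""
    return 100 * (before_ms - after_ms) / before_ms


# (before, after) timings in msec per loop, copied from the comments above.
timings = {
    "Path().glob, first patch": (78.8, 67.9),
    "Path.cwd().glob, first patch": (79.6, 70.5),
    "Path().glob, revised patch": (84.3, 80.1),
    "Path.cwd().glob, revised patch": (84.4, 82.0),
}

for label, (before, after) in timings.items():
    print(f"{label}: {speedup(before, after):.2f}x faster, "
          f"{time_saved_percent(before, after):.1f}% less time")
```

Note that "N times faster" and "percent less time" are different measures, so the two columns will not agree exactly.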

@barneygale barneygale enabled auto-merge (squash) January 4, 2024 20:38
@barneygale barneygale merged commit c2e8298 into python:main Jan 4, 2024
29 checks passed
Labels: performance, topic-pathlib