New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change shutil.rmtree and os.walk to support very deep hierarchies #89727
Comments
It is possible to create deep directory hierarchies that cannot be removed via shutil.rmtree or walked via os.walk, because these functions exceed the interpreter recursion limit. This may have security implications for web services (e.g. various webdisks) that have to clean up user-created mess or walk through it. [aep@aep-haswell ~]$ mkdir /tmp/badstuff
[aep@aep-haswell ~]$ cd /tmp/badstuff
[aep@aep-haswell badstuff]$ for x in `seq 2048` ; do mkdir $x ; cd $x ; done
[aep@aep-haswell 103]$ cd
[aep@aep-haswell ~]$ python
Python 3.9.7 (default, Oct 10 2021, 15:13:22)
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import shutil
>>> shutil.rmtree('/tmp/badstuff')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.9/shutil.py", line 726, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/usr/lib/python3.9/shutil.py", line 663, in _rmtree_safe_fd
_rmtree_safe_fd(dirfd, fullname, onerror)
File "/usr/lib/python3.9/shutil.py", line 663, in _rmtree_safe_fd
_rmtree_safe_fd(dirfd, fullname, onerror)
File "/usr/lib/python3.9/shutil.py", line 663, in _rmtree_safe_fd
_rmtree_safe_fd(dirfd, fullname, onerror)
[Previous line repeated 992 more times]
File "/usr/lib/python3.9/shutil.py", line 642, in _rmtree_safe_fd
fullname = os.path.join(path, entry.name)
File "/usr/lib/python3.9/posixpath.py", line 77, in join
sep = _get_sep(a)
File "/usr/lib/python3.9/posixpath.py", line 42, in _get_sep
if isinstance(path, bytes):
RecursionError: maximum recursion depth exceeded while calling a Python object
>>> import os
>>> list(os.walk('/tmp/badstuff'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.9/os.py", line 418, in _walk
yield from _walk(new_path, topdown, onerror, followlinks)
File "/usr/lib/python3.9/os.py", line 418, in _walk
yield from _walk(new_path, topdown, onerror, followlinks)
File "/usr/lib/python3.9/os.py", line 418, in _walk
yield from _walk(new_path, topdown, onerror, followlinks)
[Previous line repeated 993 more times]
File "/usr/lib/python3.9/os.py", line 412, in _walk
new_path = join(top, dirname)
File "/usr/lib/python3.9/posixpath.py", line 77, in join
sep = _get_sep(a)
File "/usr/lib/python3.9/posixpath.py", line 42, in _get_sep
if isinstance(path, bytes):
RecursionError: maximum recursion depth exceeded while calling a Python object
>>> |
Use a stack to implement os.walk iteratively instead of recursively to avoid hitting recursion limits on deeply nested trees.
Use a stack to implement os.walk iteratively instead of recursively to avoid hitting recursion limits on deeply nested trees.
This also affects |
Use a stack to implement os.walk iteratively instead of recursively to avoid hitting recursion limits on deeply nested trees.
Use a stack to implement os.walk iteratively instead of recursively to avoid hitting recursion limits on deeply nested trees.
…f github.com:ovsyanka83/cpython into pythongh-89727/fix-pathlib.Path.walk-recursion-depth
I am unsure on the bug vs feature classification here. On one side it seems like a clear problem and the fixes don’t change function signatures or the way they are used, but on the other side there is no guarantee that extremely deep hierachies are supported, and I wonder if there could be negative effects from the ocde changes (for example, can the (Adding PR reviewers: @carljm @brettcannon @serhiy-storchaka) |
I don't have strong feelings about calling it a bug vs a feature; I would tend to call it a bug because the function just fails in a subset of cases that aren't obviously outside its domain. I agree that it's a bug that we could just document as a known limitation, though I don't see any reason to do that when the fix is not that difficult. I guess the only real impact of how it's classified is whether the fix would be backported?
Sure, but in the current code the Python stack would instead grow very large in the same scenario. And the Python stack frames (plus everything they reference) are almost certainly larger than the stack elements tracked in the iterative version, so if anything I would expect the iterative version to also save memory in the case of a very deep traversal. |
Yes exactly, the impact of my question is backporting or not. I take it you feel ok about backporting the fix; let’s wait to see what a core dev thinks. |
I would classify this as a feature request. There are plenty of things in CPython for which you can you run out of stack space for and it isn't considered a bug in those instances (and that's just part of general limitations that CPython has). |
(To be clear, I don't think there's a strong case for backporting this, so reasoning backwards from the conclusion I'm quite happy to call it a feature :P ) |
Are |
Yes, it looks like it. |
I'll give this a shot. |
I spotted a problem with using This is a wart we removed when we added
I'm going to have a crack at this. |
Add a `follow_junctions` argument to `pathlib.Path.walk()`. When set to false, directory junctions are treated as files. Add an `fwalk()` method to `pathlib.Path`, analogous to `os.fwalk()`. Implement `shutil.rmtree()` using `pathlib.Path.walk()` and `fwalk()`.
Won't this be significantly slower? On my system, with a tree with ~5000 directories under it, Why not adjust |
Preferably |
Yep good point, it might be too slow. I also measure OTOH,
What does the API look like? |
In both Arguably
|
I was being serious. Maybe |
Even with
Sorry, I didn't mean to imply you weren't being serious, only that I thought the naming wasn't obvious and that |
Say |
You're totally right, of course Eryk :-). I must have made a mistake when I tried that approach previously, as when I try it now I get zero test failures from |
Maybe that is close enough to just use
It could simply mean introducing a kwarg like
Either of these options could also help avoid introducing Personally I don't see an ideal solution, which can also comfortably made to be public-facing. Even if |
What sort of thing would |
If new functions are added, I suggest |
It would simply replace members of Of course if we just want a separate implementation, public or not, then we could maybe just get rid of |
I'd be interested to see an implementation! You'd need to support For now I'm -1 on adding new public functions. I wrestled with the |
Returning
The split between I'd prefer to return |
I think my suggestion wasn't clear enough - I also might be missing something.
Right, I'm discussing these as separate matters. Imagine that we take these steps:
Maybe a simpler way to think about this: We want the behavior of I didn't mean to suggest that we remove the split between
You might be right. Anyways, I need to think a little more about it and maybe I can get an implementation done soon to demonstrate. |
My view (open to change) is that we don't need a I have a WIP implementation using pathlib in #103164 The only downside I can see is performance, and we haven't yet put a number to it. I'll try to do that soon. I think only Windows will be affected, if that. (Is it possible we'll make That said, I'm an ardent pathlib fanboy, so it would be good to get other opinions on this. |
Microsoft has improved this for the cases in which a file or directory is open with delete sharing (e.g. watching a directory for changes, or a malware scanner). In Windows 10, >>> os.mkdir('spam')
>>> h = win32file.CreateFile('spam', 0x0001_0000, 7, None, 3, 0x0200_0000, None)
>>> os.rmdir('spam')
>>> win32file.GetFinalPathNameByHandle(h, 0)
'\\\\?\\C:\\$Extend\\$Deleted\\000B000000041FC810B3AEBC' OTOH, if a file or directory is opened without delete sharing (e.g. via Python's In issue #101601, I proposed adding job object support to |
Here's a rough draft of walk/fwalk functions that return It needs a couple changes, including the @barneygale I think your |
I've raised this on discourse: https://discuss.python.org/t/using-iterative-filesystem-walk-instead-of-recursive/21955/22 |
…ythonGH-100282) Use a stack to implement `pathlib.Path.walk()` iteratively instead of recursively to avoid hitting recursion limits on deeply nested trees. Co-authored-by: Barney Gale <barney.gale@gmail.com> Co-authored-by: Brett Cannon <brett@python.org>
Add `pathlib.Path.fwalk()` method, which behaves exactly like `Path.walk()` except that it yields a 4-tuple `(dirpath, dirnames, filenames, dirfd)`, and it supports a `dir_fd`. This method provides safety from symlink attacks when walking directory trees; this is important for implementing functionality such as `rmtree()`.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
Linked PRs
shutil.rmtree()
recursion error on deep trees. #103164pathlib.Path.fwalk()
method #103566The text was updated successfully, but these errors were encountered: