Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rmtree & copy method to pathlib #92771

Closed
Conchylicultor opened this issue May 13, 2022 · 25 comments
Closed

Add rmtree & copy method to pathlib #92771

Conchylicultor opened this issue May 13, 2022 · 25 comments
Labels
stdlib Python modules in the Lib dir topic-pathlib type-feature A feature request or enhancement

Comments

@Conchylicultor
Copy link
Contributor

Conchylicultor commented May 13, 2022

Feature or enhancement

Currently pathlib is missing some shutil features: rmtree or copy

Pitch

pathlib.Path define an Interface / API that many third party modules implement:

Having a consistent API is great as I can write code for pathlib, and it is automatically compatible with all pathlib-like backends without any code changes.
(e.g. user can replace pathlib.Path by epath.Path and the function magically work with Butt storage likegs://, s3://,...).

However due to the lack of .rmtree and .copy supports, some operations are impossible. It means each pathlib-like backend who want to support recursive directory deletion or copy has to come up with their own API, leading to inconsistency.

(technically it is possible to implement some rmtree(path.Path()) implemented only using pathlib function, but this would lead in very inefficient performance on remote storage)

Defining them directly in pathlib would make them part of the standard, with well defined semantic. This would allow implementations to rely on them and to switch between pathlib-like objects.

This will be even more relevant when pathlib will provide an explicit interface that users an inherit from: https://discuss.python.org/t/make-pathlib-extensible/3428/7

@Conchylicultor Conchylicultor added the type-feature A feature request or enhancement label May 13, 2022
@AlexWaygood AlexWaygood added the stdlib Python modules in the Lib dir label May 13, 2022
@Ovsyanka83
Copy link
Contributor

Ovsyanka83 commented May 13, 2022

rmtree seems entirely possible, considering we will have walk() soon. I'll have to benchmark to see if we can use my walk() implementation for this (as it can be a bit slow due to the use of stat()).

Considering we have rename(), copy seems to be possible as well.

We could also just wrap shutil's functions in pathlib but then Path._scandir will not be utilized; also shutil is not currently used in pathlib at all -- and we might have a good reason for it.

I believe that @barneygale would also be interested in this conversation.

@Conchylicultor Conchylicultor changed the title Add rmtree, copy,... method to pathlib Add rmtree & copy method to pathlib May 14, 2022
@Ovsyanka83
Copy link
Contributor

@domdfcoding
Copy link
Contributor

I recall a discussion (although despite searching I can't find it) that pathlib shouldn't replicate (or wrap or otherwise provide) this sort of functionality, since the version in shutil is fine. The intention was solely to provide an alternative to os.path et al.

I sort of agree, but I do think there is a compelling case for the two methods you're proposing in the context of Barney's abstract path concept.

@Ovsyanka83
Copy link
Contributor

@domdfcoding I feel like pathlib can and should support most functionality of both os.path and shutil. It feels natural to use rmtree, rmdir, unlink, walk, etc on Path objects.

@brettcannon
Copy link
Member

To be clear, pathlib is not expected to fully supplant everything in os or shutil; they exist separately for a reason. It's going to be a per-feature use-case.

@brettcannon
Copy link
Member

And to add on to my last comment, each addition needs to somehow improve add to the situation, else it's just wrapping a function call in a method and that isn't helpful enough to need to support that API for decades.

/cc @barneygale

@barneygale
Copy link
Contributor

else it's just wrapping a function call in a method and that isn't helpful enough to need to support that API for decades.

Wrapping a function call in a method is 50% of pathlib, and it made pathlib wildly popular even before it was added to the standard library. The helpfulness comes from the object-oriented API and the DSL-like goodness of constructing paths with /. And I'm hoping to introduce an AbstractPath in future that will allow the likes of cloudpathlib to provide their own implementations, further increasing the utility.

I'd consider the following shutil functions candidates for wrapping in pathlib: move(), copy(), copytree(), rmtree(), chown(). In each case they have similar utility to existing pathlib methods like symlink_to() or rmdir().

@Conchylicultor do you think you could log a separate issue for copy(), and reduce this one to only covering rmtree()? Thanks

@Ovsyanka83
Copy link
Contributor

@brettcannon could you explain what reason there is for pathlib to not support, for example, a majority of functionality of shutil? I've been wondering about that for a while now.

@merwok
Copy link
Member

merwok commented May 20, 2022

In my recollection, pathlib’s goal is to provide a class abstraction to replace os.path modules, but it is not meant to encompass all operations that work with files in the stdlib. That is why the fspath protocol was added and many modules gained support for strings or Paths. So IMO it’s good and clean if you can call shutil.rmtree(some_path), but there is no strong need to add some_path.rmtree().

@Ovsyanka83
Copy link
Contributor

Good point. I still believe that reading functionality should be a part of pathlib (such as walk) but it makes sense that things like rmtree are not exactly necessary there.

@brettcannon
Copy link
Member

@barneygale You're right, making an object-oriented API that cleverly leveraged operator overloading is what has made pathlib successful. But there is a worry of scope creep and trying to keep the API small enough to fit in your head. For instance, .copy() and even more so for copytree() doesn't add anything as you are not manipulating the path object, you're just performing an action on it; the __fspath__ protocol makes this easily compatible with pathlib objects. So the question becomes how does the method add convenience for most users? (I mention your AbstractPath later 😉.)

In all of this, you have to also understand that pathlib quite possibly would not have a lot of what's in there today as it predates __fspath__. For instance, .open() quite possibly wouldn't exist on pathlib.Path since open() itself now natively supports pathlib objects. So things that are not directly related to path manipulation or don't somehow have an added benefit to being on pathlib (e.g. returning a pathlib object itself) very well wouldn't make sense today.

Now I fully admit this is all subjective, but this is what we have to think through with all of this. Remember, any API we add will need to be supported for literally decades and I have to assume I will be the one doing the support (I have personally been on Python's development team for 19 years and I still have code from my first commit being used in the stdlib). And since we are not deprecating shutil here as that's not really benefiting the preexisting users, you have to have a really good reason for increasing the amount of code you're asking me to support for a couple of decades (on top of everything else I already have to support and will need to support in the future).

Now, I do understand that @barneygale has plans for an AbstractPath so we can have a universal API for path and file manipulation that can transcend file systems. Great! But that dream has not come true yet, so that isn't an argument (yet) to increase the API service. Once again, I have been around long enough to see good intentions not be completed to rely on it to make other decisions. This isn't a race; Python 3.12, 3.13, 3.14, etc. will still get released, so there's no rush to get anything in now. But yes, AbstractPath could change how we look at a lot of this.

@Ovsyanka83 what @merwok said covers it nicely; common actions can/should be on pathlib, but things that are lower-level or more rare should lean on __fspath__ . But there's also the consideration that duplicated APIs are twice the APIs to support since tossing out shutil for pathlib is a wide-ranging breaking change.

@Conchylicultor
Copy link
Contributor Author

Conchylicultor commented May 21, 2022

That is why the fspath protocol was added and many modules gained support for strings or Paths.

The fspath protocol is not a solution here, because it assume local filesystem. For example, all the following will fail:

shutils.rmtree(epath.Path('gs://bucket/a'))  # Fail
shutils.rmtree(universal_pathlib.UPath('s3://bucket/a'))  # Fail
shutils.copy(zipfile.Path('x.zip', 'file.txt'), 'dst/file.txt')  # Fail

Copying my original message:

Having a consistent API is great as I can write code for pathlib, and it is automatically compatible with all pathlib-like backends without any code changes.
Defining them directly in pathlib would make them part of the standard, with well defined semantic. This would allow implementations to rely on them and to switch between pathlib-like objects.

So I can write path.rmtree(), path.copy(x, y) and it will automatically work with any pathlib-like object (supporting the protocol). Relying on os.path / shutils is not possible for us, because it means not supporting remote file system (gs://,...)

If all program where using the pathlib-protocol in their program instead of os.fspath, they would support Google Cloud & other by default, without any code change.

@Ovsyanka83
Copy link
Contributor

Ovsyanka83 commented May 21, 2022

To be honest, the more I read this discussion, the more I side with Brett.

Thinking realistically: imagine that google cloud library implements its own rmtree, copy, etc. It won't be very nice but it will give you +- the same interface. Additionally, PathLab can also add support for such things. If it works and becomes the de-facto approach, then maybe some day we can merge a portion of its methods with pathlib. And before that, pathlab will only be one pip install away.

The more I think about the ways we could reach this full vision with everyone following pathlib interface, the more complex the vision gets. Let's figure out exactly why shutil and os are divided, then try to come up with a vision of system-independent full (copy, move, rmtree, etc) path interface (ideally where only low-level methods would need to be implemented), and then turn it into a full blown PEP. That way we will either be disillusioned with the whole idea or we will have a solid motivation section.

Update: After some research, I consider shutil's and os's division slightly arbitrary. To put it simply: os tries to be a thin wrapper around c stdlib and is intended to have a similar api with it. Shutil, on the other hand, doesn't. If that's all there is, I feel like merging their functionality within pathlib isn't such a big issue. However, we will need to find (or become) a core developer who agrees to support it for decades :)

@merwok
Copy link
Member

merwok commented May 23, 2022

So I can write path.rmtree(), path.copy(x, y) and it will automatically work with any pathlib-like object (supporting the protocol)

Ah! That is indeed a valuable use case. I can easily imagine some web dev script that removes temporary files, not caring if it’s a local directory for local dev or some cloud storage bucket on the server.

@barneygale
Copy link
Contributor

barneygale commented Jul 14, 2022

I spotted this in PEP 428:

More operations could be provided, for example some of the functionality of the shutil module.

I've gone off the idea of wrapping the original functions or duplicating their implementations. I'm beginning to think we should actually move some functions from shutil to pathlib and deprecate the originals. But before we can do that, we'd need to make all kinds of paths representable in pathlib, particularly unnormalized and bytes paths.

@Ovsyanka83
Copy link
Contributor

I agree with the idea of moving them into pathlib. Though I believe that os.path should be deprecated first.

@merwok
Copy link
Member

merwok commented Jul 14, 2022

I don’t think there is a plan for deprecation.

os.path is a foundation module that works with strings (similar to basic syscalls)
shutil is higher-level functions (like the mv, cp, rm, etc programs)
pathlib provides classes for a different way to do os.path and some shutil operations

We have different levels for doing things with files. It’s not duplication (same things in multiple places) but different interfaces.

@barneygale
Copy link
Contributor

I don't agree with deprecating os.path. It's the right place for low-level path functionality. I'd like to eventually make a proper argument for moving certain things from shutil to pathlib, but it's a non-starter at the moment, and I'll probably change my mind before it becomes possible! :D

@Ovsyanka83
Copy link
Contributor

No problem with that :)

Nevertheless, I'm really happy that we're actively working on improving pathlib and I'm so excited for Barney's plan for extensible Paths.

@brettcannon
Copy link
Member

I believe that os.path should be deprecated first.

I don’t think there is a plan for deprecation.

I'll be more explicit and say there are no plans, nor will there ever be any plans, to deprecate os.path. There's a bigger chance of making os a proper package and that even has no chance of every occurring since way too many people do import os; os.path. The amount of code that would break with either change w/o a meaningful benefit would be astronomical (and that's a bit of wordplay as I bet the Python code now on Mars uses os.path somehow 😉).

@merwok
Copy link
Member

merwok commented Jul 15, 2022

(os could become a real package while still preserving auto-import of path submodule — but even that would have very small benefits!)

@Ovsyanka83
Copy link
Contributor

Okay-okay, I retract my comment :)

@merwok
Copy link
Member

merwok commented Jul 15, 2022

FTR in my message I was also reacting to this;

I'm beginning to think we should actually move some functions from shutil to pathlib and deprecate the originals

shutil is fine and good to work with string paths. Thinking about what methods to add to pathlib.Path is okay too.

@Ovsyanka83
Copy link
Contributor

Ovsyanka83 commented Oct 27, 2022

After much thought I tend to agree with @brettcannon and @merwok . Once the dream of universal API comes closer to fruition, implementing rmtree, copytree, etc will make a lot more sense and will be a safe change. So I'm inclined to vote for closing this issue and my pull request connected to it.

Are we in agreement here? If not, I can clean up my PR and await Brett's review on it.

@brettcannon
Copy link
Member

@Ovsyanka83 I appreciate the thought and work you have put into this! I think I'm personally convinced enough to close this at this point. Since both rmtree() and copy() are purely algorithmic I don't think there is something special being exposed here, nor is there any special ergonomics that is beneficial by having it in pathlib (like e.g. Path.walk()). Now if someone wanted to open an issue to consider using methods from pathlib in shutils then that could be considered for those abstracting file systems that way where __fspath__ does not work for them.

@brettcannon brettcannon closed this as not planned Won't fix, can't repro, duplicate, stale Oct 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir topic-pathlib type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

7 participants