-
-
Notifications
You must be signed in to change notification settings - Fork 32k
gh-51067: Add remove()
and repack()
to ZipFile
#134627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably would be better to raise an attributeError instead of a valueError here since you are trying to access an attribute a closed zipfile doesn’t have
This behavior simply resembles
|
Nicely inform @ubershmekel, @barneygale, @merwok, and @wimglenn about this PR. This should be more desirable and flexible than the previous PR, although cares must be taken as there might be a potential risk on the algorithm about reclaiming spaces. The previous PR is kept open in case some folks are interested in it. Will close when either one is accepted. |
- Separate individual validation tests. - Check underlying repacker not called in validation. - Use `unlink` to prevent FileNotFoundError. - Fix mode 'x' test.
- Set `_writing` to prevent `open('w').write()` during repacking. - Move the protection logic to `ZipFile.repack()`.
This comment was marked as resolved.
This comment was marked as resolved.
@sharktide I understand you are trying to be helpful, but please do not instruct contributors on process, especially if that process is not written down somewhere. There's no requirement for a short PR description, in fact, the long description here is a benefit in my opinion (I would like to review but there is a lot of context to consider!). |
Sorry! I didn’t type that. I stepped away from the laptop and my friend who has a bad sense of humor but knows his way around GitHub typed that. Deleting those comments. I just found out from these emails! |
@danny0838 thank you for sticking with this issue! Would love to see removal support for zipfiles. I'm a bit concerned about |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps ZipFile could track the offsets of removed entries and use that information to allow for repack to handle arbitrary zips?
Maybe, but I think that would just add extra complexity and a whole new plethora of issues. LGTM so far!
@danny0838 Thanks for the PR!
Added support of
Though the result should not change in most cases, except for being more performant by eliminating the need to scan file entries in the bytes. |
- According to ZIP spec, both uncompressed and compressed size should be 0xffffffff when zip64 is used.
This is a revised version of PR #103033, implementing two new methods in
zipfile.ZipFile
:remove()
andrepack()
, as suggested in this comment.Features
ZipFile.remove(zinfo_or_arcname)
str
path orZipInfo
) from the central directory.str
path is provided.ZipInfo
instance.'a'
,'w'
,'x'
.ZipFile.repack(removed=None)
removed
is passed (as a sequence of removedZipInfo
s), only their corresponding local file entry data are removed.'a'
.Rationales
Heuristics Used in
repack()
Since
repack()
does not immediately clean up removed entries at the time aremove()
is called, the header information of removed file entries may be missing, and thus it can be technically difficult to determine whether certain stale bytes are really previously removed files and safe to remove.While local file entries begin with the magic signature
PK\x03\x04
, this alone is not a reliable indicator. For instance, a self-extracting ZIP file may contain executable code before the actual archive, which could coincidentally include such a signature, especially if it embeds ZIP-based content.To safely reclaim space,
repack()
assumes that in a normal ZIP file, local file entries are stored consecutively:BadZipFile
error is raised and no changes are made.Check the doc in the source code of
_ZipRepacker.repack()
(which is internally called byZipFile.repack()
) for more details.Supported Modes
There has been opinions that a repacking should support mode
'w'
and'x'
(e. g. #51067 (comment)).This is NOT introduced since such modes do not truncate the file at the end of writing, and won't really shrink the file size after a removal has been made. Although we do can change the behavior for the existing API, some further care has to be made because mode
'w'
and'x'
may be used on an unseekable file and will be broken by such change. OTOH, mode'a'
is not expected to work with an unseekable file since an initial seek is made immediately when it is opened.📚 Documentation preview 📚: https://cpython-previews--134627.org.readthedocs.build/