Skip to content

gh-51067: Add remove() and repack() to ZipFile #134627

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

danny0838
Copy link

@danny0838 danny0838 commented May 24, 2025

This is a revised version of PR #103033, implementing two new methods in zipfile.ZipFile: remove() and repack(), as suggested in this comment.

Features

ZipFile.remove(zinfo_or_arcname)

  • Removes a file entry (by providing a str path or ZipInfo) from the central directory.
  • If there are multiple file entries with the same path, only one is removed when a str path is provided.
  • Returns the removed ZipInfo instance.
  • Supported in modes: 'a', 'w', 'x'.

ZipFile.repack(removed=None)

  • Physically removes stale local file entry data that is no longer referenced by the central directory.
  • Shrinks the archive file size.
  • If removed is passed (as a sequence of removed ZipInfos), only their corresponding local file entry data are removed.
  • Only supported in mode 'a'.

Rationales

Heuristics Used in repack()

Since repack() does not immediately clean up removed entries at the time a remove() is called, the header information of removed file entries may be missing, and thus it can be technically difficult to determine whether certain stale bytes are really previously removed files and safe to remove.

While local file entries begin with the magic signature PK\x03\x04, this alone is not a reliable indicator. For instance, a self-extracting ZIP file may contain executable code before the actual archive, which could coincidentally include such a signature, especially if it embeds ZIP-based content.

To safely reclaim space, repack() assumes that in a normal ZIP file, local file entries are stored consecutively:

  • File entries must not overlap.
    • If any entry’s data overlaps with the next, a BadZipFile error is raised and no changes are made.
  • There should be no extra bytes between entries (or between the last entry and the central directory):
    • Any extra bytes between a file entry and the next one are removed.
    • Any data before the first recorded entry is preserved unless it appears to be a sequence of valid and consecutive local file entries, in which case those entries are removed.

Check the doc in the source code of _ZipRepacker.repack() (which is internally called by ZipFile.repack()) for more details.

Supported Modes

There has been opinions that a repacking should support mode 'w' and 'x' (e. g. #51067 (comment)).

This is NOT introduced since such modes do not truncate the file at the end of writing, and won't really shrink the file size after a removal has been made. Although we do can change the behavior for the existing API, some further care has to be made because mode 'w' and 'x' may be used on an unseekable file and will be broken by such change. OTOH, mode 'a' is not expected to work with an unseekable file since an initial seek is made immediately when it is opened.



📚 Documentation preview 📚: https://cpython-previews--134627.org.readthedocs.build/

@bedevere-app
Copy link

bedevere-app bot commented May 24, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

sharktide

This comment was marked as off-topic.

@danny0838 danny0838 requested a review from sharktide May 24, 2025 17:29
Copy link
Contributor

@sharktide sharktide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably would be better to raise an attributeError instead of a valueError here since you are trying to access an attribute a closed zipfile doesn’t have

@danny0838
Copy link
Author

It probably would be better to raise an attributeError instead of a valueError here since you are trying to access an attribute a closed zipfile doesn’t have

This behavior simply resembles open() and write(), which raises a ValueError in various cases. Furthermore there has been a change from raising RuntimeError since Python 3.6:

Changed in version 3.6: Calling open() on a closed ZipFile will raise a ValueError. Previously, a RuntimeError was raised.

Changed in version 3.6: Calling write() on a ZipFile created with mode 'r' or a closed ZipFile will raise a ValueError. Previously, a RuntimeError was raised.

@danny0838 danny0838 requested a review from sharktide May 24, 2025 17:58
@danny0838
Copy link
Author

danny0838 commented May 24, 2025

Nicely inform @ubershmekel, @barneygale, @merwok, and @wimglenn about this PR. This should be more desirable and flexible than the previous PR, although cares must be taken as there might be a potential risk on the algorithm about reclaiming spaces.

The previous PR is kept open in case some folks are interested in it. Will close when either one is accepted.

danny0838 added 6 commits May 25, 2025 16:18
- Separate individual validation tests.
- Check underlying repacker not called in validation.
- Use `unlink` to prevent FileNotFoundError.
- Fix mode 'x' test.
- Set `_writing` to prevent `open('w').write()` during repacking.
- Move the protection logic to `ZipFile.repack()`.
sharktide

This comment was marked as resolved.

@merwok

This comment was marked as resolved.

@emmatyping
Copy link
Member

@sharktide I understand you are trying to be helpful, but please do not instruct contributors on process, especially if that process is not written down somewhere. There's no requirement for a short PR description, in fact, the long description here is a benefit in my opinion (I would like to review but there is a lot of context to consider!).

@sharktide
Copy link
Contributor

Sorry! I didn’t type that. I stepped away from the laptop and my friend who has a bad sense of humor but knows his way around GitHub typed that. Deleting those comments. I just found out from these emails!

@emmatyping emmatyping self-requested a review May 26, 2025 05:12
@emmatyping
Copy link
Member

@danny0838 thank you for sticking with this issue! Would love to see removal support for zipfiles.

I'm a bit concerned about repack being overly restrictive. Perhaps ZipFile could track the offsets of removed entries and use that information to allow for repack to handle arbitrary zips? In this scenario, repack would scan for local file header signatures, and check the offset against the central directory and list of deleted ZipInfos. If it is in neither, data is left as is and scanning continues. If it is in the deleted list, the file is dropped, if it is in the central directory, then it is kept.

Copy link
Contributor

@sharktide sharktide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps ZipFile could track the offsets of removed entries and use that information to allow for repack to handle arbitrary zips?

Maybe, but I think that would just add extra complexity and a whole new plethora of issues. LGTM so far!

@danny0838 Thanks for the PR!

@danny0838
Copy link
Author

danny0838 commented May 26, 2025

@emmatyping

@danny0838 thank you for sticking with this issue! Would love to see removal support for zipfiles.

I'm a bit concerned about repack being overly restrictive. Perhaps ZipFile could track the offsets of removed entries and use that information to allow for repack to handle arbitrary zips? In this scenario, repack would scan for local file header signatures, and check the offset against the central directory and list of deleted ZipInfos. If it is in neither, data is left as is and scanning continues. If it is in the deleted list, the file is dropped, if it is in the central directory, then it is kept.

Added support of repack(removed) to pass a sequence of removed ZipInfos, and reclaims space only for corresponding local file headers. So one can do something like:

with zipfile.ZipFile('archive.zip', 'a') as zh:
    zinfos = [zh.remove(n) for n in zh.namelist() if n.startswith('folder/')]
    zh.repack(zinfos)

Though the result should not change in most cases, except for being more performant by eliminating the need to scan file entries in the bytes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants