
DOC: NPZ files could mention more explicitly that reading is lazy/delayed #22435

Open

Description

@rem3-1415926

Describe the issue:

Accessing the object returned by np.load() on a .npz file raises a BadZipFile exception if a .npz file of the same name is written in the meantime. The error description is either a CRC mismatch (see example below) if the array names match, or "File name in directory [...] differ" if they don't.

Interestingly, the error does not occur if the first element of the loaded object is accessed before the np.savez_compressed() call that would otherwise break it (see the commented-out line in the example code). Accessing any other element does not have this protective effect, though it does no harm either.

Reproduce the code example:

import numpy as np

fn = "test.npz"
np.savez_compressed(fn, d1=np.array([1, 2]), d2=np.array([11, 22]))

backup = np.load(fn)  # lazy: arrays are only read from the zip on access

# Uncommenting the next line prevents the error below:
# backup[backup.files[0]]

# Overwrite the file while the lazily loaded object is still open
np.savez_compressed(fn, d1=np.array([33, 33]), d2=np.array([3, 3]))

backup[backup.files[0]]  # raises zipfile.BadZipFile

Error message:

File "/home/▓▓▓▓/py_venvs/default/lib/python3.8/site-packages/numpy/lib/npyio.py", line 241, in __getitem__
    magic = bytes.read(len(format.MAGIC_PREFIX))
  File "/usr/lib/python3.8/zipfile.py", line 940, in read
    data = self._read1(n)
  File "/usr/lib/python3.8/zipfile.py", line 1030, in _read1
    self._update_crc(data)
  File "/usr/lib/python3.8/zipfile.py", line 958, in _update_crc
    raise BadZipFile("Bad CRC-32 for file %r" % self.name)
zipfile.BadZipFile: Bad CRC-32 for file 'd1.npy'

# OR, with different array names on the second write:
  File "/home/▓▓▓▓/py_venvs/default/lib/python3.8/site-packages/numpy/lib/npyio.py", line 240, in __getitem__
    bytes = self.zip.open(key)
  File "/usr/lib/python3.8/zipfile.py", line 1556, in open
    raise BadZipFile(
zipfile.BadZipFile: File name in directory 'd1.npy' and header b'd3.npy' differ.

NumPy/Python version information:

Python 3.8.10 / NumPy 1.23.4

Context for the issue:

Happens when trying to back up previous data in order to restore it later, after the new data has already been written. There are several ways around it, such as the hack mentioned above, or immediately reading the arrays into a dict and then closing the loaded file object properly (which, if I understand correctly, you should do anyway).
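The dict-based workaround mentioned above could look like this (a sketch, not an official recommendation): copy every array eagerly while the file is open, and use the NpzFile object as a context manager so the underlying zip handle is closed. A later overwrite of the file then cannot corrupt the backup, because nothing is read lazily afterwards.

```python
import numpy as np

fn = "test.npz"
np.savez_compressed(fn, d1=np.array([1, 2]), d2=np.array([11, 22]))

# Eagerly copy all arrays into a plain dict, then close the file handle.
with np.load(fn) as npz:
    backup = {name: npz[name] for name in npz.files}

# Overwriting the file is now harmless to the backup.
np.savez_compressed(fn, d1=np.array([33, 33]), d2=np.array([3, 3]))

print(backup["d1"])  # still the original data: [1 2]
```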
