Zarr inodes

Following on from the discussion at today's working group meeting about reducing Zarr inode usage, and in relation to Experiment Proposal: Processing Global km-scale Hackathon Data.

If anyone has had success storing Zarr with shards or in a zip format, could you please respond in this thread with how you did it? I’ve had mixed success with this so far, so it would be very useful to know what others are doing.
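
For reference, a sharded store is one way to cut the file count. Below is a minimal sketch, assuming zarr-python >= 3 (where sharding is exposed through the shards= argument of zarr.create_array); the store path, shapes and chunk/shard sizes are only placeholders.

import numpy as np
import zarr

# Group 100x100 chunks into 1000x1000 shards, so each shard is written as a
# single file containing 100 chunks rather than 100 separate chunk files.
arr = zarr.create_array(
    store="example_sharded.zarr",
    shape=(2_000, 2_000),
    chunks=(100, 100),      # read/decompression granularity
    shards=(1_000, 1_000),  # on-disk file granularity
    dtype="float32",
)
arr[:] = np.random.default_rng(0).random((2_000, 2_000), dtype=np.float32)

With these numbers the store holds 4 shard files instead of 400 chunk files, so inode usage scales with the number of shards rather than the number of chunks.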

Hi Sam,

I have used code like this:

import os
import zipfile


def zip_zarr(zarr_filename, zip_filename):
    """Zip a zarr collection.

    Parameters
    ----------
    zarr_filename : str
        Path to (unzipped) zarr collection
    zip_filename : str
        Path to output zipped zarr collection
    """
    # Store without re-compression: Zarr chunks are typically already compressed.
    with zipfile.ZipFile(
        zip_filename, "w", compression=zipfile.ZIP_STORED, allowZip64=True
    ) as fh:
        for root, _, filenames in os.walk(zarr_filename):
            for each_filename in filenames:
                each_filename = os.path.join(root, each_filename)
                fh.write(each_filename, os.path.relpath(each_filename, zarr_filename))

to zip an existing Zarr directory. This, I believe, reduces the inode usage to 1.
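
For example (these paths are just placeholders):

zip_zarr("dataset.zarr", "dataset.zarr.zip")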

I can then re-open the zipped Zarr file with

import xarray as xr
import zarr

inputs = zarr.storage.ZipStore(zip_filename, mode='r')
ds = xr.open_zarr(inputs)

However, zipping the Zarr store up comes with a penalty when reading it back in: in my testing, reads took roughly twice as long as from the original, unzipped Zarr store.

GDAL can access data within a zip via /vsizip, and that performs better than a plain zip if the archive is SOZip-enabled (seek-optimized ZIP). I’m working on a GDAL backend for xarray (the biggest improvement being the ability to use the multidimensional API), so I will try this out at some point.
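
As a rough sketch of what that access path could look like, assuming GDAL >= 3.7 (which ships the sozip utility and the Zarr driver) and the osgeo Python bindings; the file names and the internal zip path are placeholders and depend on how the archive was built:

from osgeo import gdal

gdal.UseExceptions()

# A seek-optimized archive can be built beforehand with GDAL's sozip utility,
# e.g. `sozip -r dataset.zarr.zip dataset.zarr`.
# Open the Zarr store inside the zip through /vsizip, using the Zarr driver's
# multidimensional API.
ds = gdal.OpenEx('ZARR:"/vsizip/dataset.zarr.zip/dataset.zarr"', gdal.OF_MULTIDIM_RASTER)
root = ds.GetRootGroup()
print(root.GetMDArrayNames())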
