The latest release of zarr (2.14.0) includes experimental support for sharding, along with a number of other features from the v3 spec.
Previous zarr specs stored one chunk per storage object. This can be problematic for zarr stores with a large number of chunks due to design constraints of the underlying storage (e.g. inode limits). Sharding allows storing multiple chunks in one storage object.
Full zarr v3 spec here
Details on zarr sharding here
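For anyone who wants to experiment, a rough sketch of how the new API fits together is below. This is only a sketch based on the 2.14.0 release notes: it assumes the ZARR_V3_EXPERIMENTAL_API and ZARR_V3_SHARDING environment flags, the ShardingStorageTransformer class and the storage_transformers argument to zarr.create, and my understanding is that the flags are read when zarr is first imported. Everything here is experimental and may change.

```python
# Sketch only: experimental sharding in zarr-python 2.14 (names taken from
# the 2.14.0 release notes; the API is experimental and may change).
import os

# Set the experimental flags before anything (e.g. xarray) imports zarr;
# they appear to be read at import time.
os.environ["ZARR_V3_EXPERIMENTAL_API"] = "1"
os.environ["ZARR_V3_SHARDING"] = "1"

import zarr
from zarr._storage.v3_storage_transformers import ShardingStorageTransformer

# Pack a 10 x 10 block of chunks into each shard (i.e. each storage object).
sharding = ShardingStorageTransformer("indexed", chunks_per_shard=(10, 10))

z = zarr.create(
    shape=(2_000, 2_000),
    chunks=(100, 100),
    dtype="f4",
    zarr_version=3,                   # write a v3 hierarchy
    path="example_array",             # array path within the v3 hierarchy
    storage_transformers=[sharding],  # enable sharding for this array
    store="example_sharded_v3.zarr",  # placeholder path
)
z[:] = 1.0  # 20 x 20 = 400 chunks end up in just 2 x 2 = 4 shard objects
```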
Aidan (Aidan Heerdegen, ACCESS-NRI Release Team Lead)
In practice, are there recommended minimum sizes for storage objects in S3 buckets? Or, put another way, is there a maximum number of objects per bucket?
I can see the utility of reducing the inode count on lustre file systems, but I thought the current work-around for this was to compress the whole zarr directory tree? I suppose that gets problematic for extremely large datasets? Are there any other reasons to favour sharding over compressing the whole zarr directory structure?
It is great that this functionality now exists, but I can now see a similar issue arising for zarr that we've had for a long time with netCDF: aligning read/write chunk size to object size on disk. Not a deal breaker, but a subtlety that often needs to be taken into account.
My understanding is that there is no limit to the number of objects you can store in a bucket, but latency overheads still can be problematic for large numbers of small objects.
Yes, I've zipped some pretty large zarr stores (order 10s TB), but it doesn't feel like an ideal work-around (a rough sketch of the zip-up workflow follows this list):
- zarr.ZipStores are not safe to write to from multiple processes, and there's no way to update an existing zip file without unzipping and rezipping. Zipping/unzipping can take a really long time for large stores.
- I've hit hard limits on file size on some systems.
- Dividing a dataset across multiple ZipStores helps with the above, but this can be a pain to manage and I've experienced unexplained performance issues doing this (newer versions of dask may be better).
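For anyone unfamiliar with the work-around, the zip-up step looks roughly like the sketch below. It only assumes zarr-python 2's DirectoryStore, ZipStore and copy_store; the paths are placeholders.

```python
import zarr

# Copy an existing on-disk zarr store into a single zip file, so the whole
# store uses one inode instead of one per chunk.
src = zarr.DirectoryStore("dataset.zarr")          # placeholder path
dst = zarr.ZipStore("dataset.zarr.zip", mode="w")  # placeholder path
zarr.copy_store(src, dst)
dst.close()  # important: finalises the zip file

# Reading works directly from the zip (assuming a group at the store root),
# but the zip cannot be updated in place, which is the limitation above.
ds = zarr.open_group(zarr.ZipStore("dataset.zarr.zip", mode="r"), mode="r")
```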
Yes, though from my skim through the specs I'm not sure how this issue has changed for zarr with the introduction of sharding. My (ignorant) understanding is that read/write will still occur at the chunk level.
A bit late to this thread, but I have some huge zarr stores that I've had good success converting to sharded Zarr V3 stores to significantly reduce inode usage. However, since the hh5/xp65 analysis envs use V2, these V3 stores are not usable by the wider community.
I've gotten sharding to work really well with Zarr V3, but I have to use my own conda env to have this version of Zarr as the hh5/xp65 conda envs only have Zarr V2.18.3. Is there any reason why the Zarr package hasn't been updated to Zarr V3? It has some nice upgrades from V2.
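For concreteness, the sort of thing I mean is sketched below. It assumes zarr-python >= 3.0, where zarr.create_array takes a shards argument alongside chunks; the shapes and path are just placeholders.

```python
import zarr

# Zarr v3 sharding: each shard (one storage object) holds a 10 x 10 block
# of chunks, so the chunk shape must evenly divide the shard shape.
z = zarr.create_array(
    store="example_v3_sharded.zarr",  # placeholder path
    shape=(2_000, 2_000),
    shards=(1_000, 1_000),
    chunks=(100, 100),
    dtype="f4",
)
z[:] = 1.0

# The 20 x 20 = 400 chunks are packed into just 2 x 2 = 4 shard objects,
# drastically reducing the number of files/objects in the store.
```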
@dougiesquire did you ever manage to get this experimental support for sharding in V2 to work? I'm trying to use:
```python
import os
os.environ["ZARR_V3_SHARDING"] = "1"
from zarr._storage.v3_storage_transformers import ShardingStorageTransformer
```
but I can't get it to recognise that I've enabled "ZARR_V3_SHARDING".
I was looking into this. Intake-ESM is blocking the upgrade to Zarr v3.0. @CharlesTurner is the maintainer of the package. Charles, could you look into this and see if we can safely move to Zarr v3.0?
It would be great to upgrade to Zarr v3 but I understand if it takes a while to make intake-esm v2025.2.3 compatible. Let me know if there's anything I can do.
I suspect it can be done - looks like some major progress was made here, but it's just a case of me getting a few days to sit down and really wrap my head around the changes and how they affect intake-esm.
Tentatively, I'm going to say that unless I get some unexpected time to work on this or it turns out to be easier than I thought, I'll aim to get it cracked in spring.
It appears that some updates have been made to some of the relevant dependencies in the meantime - I'm guessing the zarr v3.1 release (I haven't checked yet) - and I think we'll be able to unpin zarr v2 in intake-esm sooner rather than later.
Sorry, I didn't get a notification that you replied to this thread; I've only just seen your two messages now. No problem on the length of time, I understand that dependencies can be problematic. Thanks for keeping this on your to-do list!
No worries - the fix has been merged into main on intake-esm now. Just waiting on one more PR to go in and I'll trigger a new release, at which point we'll be able to upgrade zarr.
Hopefully this should be within the next week or two.