Very different zarr file sizes from virtually identical write operations

dougrichardson · 6 July 2023 00:16

I have computed two nearly identical metrics from daily ERA5 temperature data, given by the below functions calc_cdd and calc_hdd. As you can see, the only difference is the direction of the inequality in .where() and the order of the subtraction.

def calc_cdd(T, comfort=24):
    return (T - comfort).where(T > comfort, 0)

def calc_hdd(T, comfort=18):
    return (comfort - T).where(T < comfort, 0)

I spin up a full node using PBSCluster, load my temperature data (~90 GB) with what I think is sensible chunking, apply the functions and a few other simple operations, and chunk again before writing to zarr:

import xarray as xr

T = xr.open_mfdataset(era_path+"2t/daily/*.nc", chunks={"time": "200MB"})

T = T - 273.15

cdd = calc_cdd(T)
cdd = cdd.rename({"t2m": "cdd"})

# Need to chunk again so that we have uniform chunk sizes
cdd = cdd.chunk({"time": "200MB"})

cdd.to_zarr(
    era_path + "/derived/cdd_24_era5_daily_1959-2022.zarr",
    mode="w",
    consolidated=True
)

The computation for calc_hdd takes a bit longer than for calc_cdd. More concerningly, the resultant zarr files have very different sizes, with the former being over double the size.

What are the possible reasons for this? It doesn’t necessarily matter for my workflow, but if there’s a way to keep them both as small as possible that would be convenient.

dougiesquire · 6 July 2023 01:51

Hi @dougrichardson. I expect the difference in size is just due to the compression applied by zarr. Are the uncompressed sizes the same if you read your zarr collections back in using xarray (e.g. see the size listed in the Array column and Bytes row of the DataArray html repr)?
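One way to get the on-disk (compressed) size for that comparison is to sum the file sizes under each store directory. This is a minimal sketch, assuming the stores are plain directories on a local filesystem; `store_size_bytes` is a hypothetical helper, not part of zarr or xarray:

```python
import os


def store_size_bytes(store_path):
    """Sum the sizes of all files under a directory-based zarr store."""
    total = 0
    for root, _dirs, files in os.walk(store_path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

Dividing the `nbytes` of the dataset opened with xarray by this number should give a rough compression ratio for each store.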

It’s possibly also worth checking that all the zarr chunks were successfully written. With consolidated=True it’s possible to create a zarr collection that appears fine when you first open it lazily (from the consolidated metadata), but actually has data missing (e.g. if the writing of the chunks failed at some point, or if some chunks were deleted after the collection was created). For peace of mind, you could make sure that there are no NaNs in your zarr collections.

dougrichardson · 6 July 2023 02:00

Thanks @dougiesquire. The uncompressed sizes are the same, and there are no NaNs, so compression seems to be the answer. It doesn’t make sense to me given the two operations are nearly the same, but also doesn’t really matter.

dougiesquire · 6 July 2023 02:27

File compression algorithms are complex, but in general they try to get rid of redundancy (I think zarr uses blosc by default, if you want to read about it). So if there are lots of the same or similar values in a file, it will be easier to get a high compression ratio.
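To illustrate the redundancy point, here is a small sketch using only the Python standard library. Note that it uses zlib as a stand-in compressor, not the blosc codec mentioned above, and the data are random numbers rather than ERA5 temperatures, so only the qualitative behaviour carries over:

```python
import random
import struct
import zlib


def compressed_size(values):
    """Pack values as float32 and return the zlib-compressed byte count."""
    raw = struct.pack(f"{len(values)}f", *values)
    return len(zlib.compress(raw))


rng = random.Random(0)
n = 100_000
noisy = [rng.uniform(0.0, 30.0) for _ in range(n)]

# Same underlying data, but with roughly 90% or 10% of values set to zero.
mostly_zero = [v if rng.random() < 0.1 else 0.0 for v in noisy]
few_zeros = [v if rng.random() < 0.9 else 0.0 for v in noisy]

size_mostly_zero = compressed_size(mostly_zero)
size_few_zeros = compressed_size(few_zeros)
print(size_mostly_zero, size_few_zeros)
```

Both arrays have the same uncompressed size, but the one with more zeros should compress to noticeably fewer bytes.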

Aidan · 6 July 2023 05:49

I might expect there to be a big difference if you were masking out large contiguous parts of the output and the sizes of the masked areas were markedly different in the two cases.

You’re masking to zero. Do the sizes of those masked areas differ a lot in the two cases?

dougrichardson · 7 July 2023 03:46

Hi @Aidan, I just checked this and there are many more zeros in the smaller zarr store.

Aidan · 7 July 2023 05:14

Right, so that is consistent: long runs of zeros carry essentially no information, so they compress extremely well.
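As a rough illustration of why the two metrics can end up with very different numbers of zeros, the sketch below applies scalar versions of calc_cdd and calc_hdd to synthetic temperatures. The Gaussian distribution (mean 10 °C, standard deviation 12 °C) is purely an assumption for illustration and is not derived from ERA5:

```python
import random


def calc_cdd(t, comfort=24):
    """Scalar version of the thread's calc_cdd: zero unless t exceeds comfort."""
    return t - comfort if t > comfort else 0.0


def calc_hdd(t, comfort=18):
    """Scalar version of the thread's calc_hdd: zero unless t is below comfort."""
    return comfort - t if t < comfort else 0.0


rng = random.Random(42)
# Assumed synthetic temperatures in deg C (not real ERA5 data).
temps = [rng.gauss(10.0, 12.0) for _ in range(50_000)]

cdd_zero_frac = sum(calc_cdd(t) == 0.0 for t in temps) / len(temps)
hdd_zero_frac = sum(calc_hdd(t) == 0.0 for t in temps) / len(temps)
print(f"cdd zeros: {cdd_zero_frac:.2f}, hdd zeros: {hdd_zero_frac:.2f}")
```

Because the two comfort thresholds sit at different points of the temperature distribution, the fraction of values masked to zero differs between the metrics even though the code is nearly identical.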
