Very different zarr file sizes from virtually identical write operations

I have computed two nearly identical metrics (cooling and heating degree days) from daily ERA5 temperature data, using the functions calc_cdd and calc_hdd below. As you can see, the only differences are the direction of the inequality in .where() and the order of the subtraction.

```python
def calc_cdd(T, comfort=24):
    return (T - comfort).where(T > comfort, 0)

def calc_hdd(T, comfort=18):
    return (comfort - T).where(T < comfort, 0)
```
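To make the behaviour concrete, here is a minimal illustration of the two functions on a toy DataArray (the values here are made up for demonstration):

```python
import numpy as np
import xarray as xr

def calc_cdd(T, comfort=24):
    # Degrees above the comfort threshold; 0 wherever T <= comfort
    return (T - comfort).where(T > comfort, 0)

def calc_hdd(T, comfort=18):
    # Degrees below the comfort threshold; 0 wherever T >= comfort
    return (comfort - T).where(T < comfort, 0)

T = xr.DataArray([10.0, 20.0, 30.0], dims="time")
print(calc_cdd(T).values)  # [0. 0. 6.]
print(calc_hdd(T).values)  # [8. 0. 0.]
```

Note that each function zeroes out a complementary part of the temperature range, so the two outputs can contain very different numbers of zeros.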

I spin up a full node using PBSCluster, load my temperature data (~90 GB) with what I think is sensible chunking, apply the functions and a few other simple operations, and rechunk before writing to zarr:

```python
T = xr.open_mfdataset(era_path + "2t/daily/*.nc", chunks={"time": "200MB"})

T = T - 273.15  # Kelvin to Celsius

cdd = calc_cdd(T)
cdd = cdd.rename({"t2m": "cdd"})

# Need to chunk again so that we have uniform chunk sizes
cdd = cdd.chunk({"time": "200MB"})

cdd.to_zarr(
    era_path + "/derived/cdd_24_era5_daily_1959-2022.zarr",
)
```

The computation for calc_hdd takes a bit longer than for calc_cdd. More concerningly, the resultant zarr stores have very different sizes, with the former over double the size of the latter.

What are the possible reasons for this? It doesn’t necessarily matter for my workflow, but if there’s a way to keep them both as small as possible that would be convenient.


Hi @dougrichardson. I expect the difference in size is just due to the compression applied by zarr. Are the uncompressed sizes the same if you read your zarr collections back in using xarray (e.g. see the size listed in the Array column and Bytes row of the DataArray html repr)?
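For reference, `.nbytes` reports the uncompressed footprint (shape × dtype size), so it is independent of how well the data compressed on disk. A toy illustration:

```python
import numpy as np
import xarray as xr

# nbytes depends only on shape and dtype, so two arrays of the same
# shape and dtype always report the same size, however differently
# they compress on disk
a = xr.DataArray(np.zeros(1000))        # highly compressible values
b = xr.DataArray(np.random.rand(1000))  # barely compressible values
print(a.nbytes, b.nbytes)  # 8000 8000
```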

It’s possibly also worth checking that all the zarr chunks were successfully written. With consolidated=True it’s possible to create a zarr collection that appears fine when you first open it lazily (from the consolidated metadata), but actually has data missing (e.g. if the writing of some chunks failed at some point, or if some chunks were deleted after the collection was created). For peace of mind, you could make sure that there are no NaNs in your zarr collections.
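A quick version of that check might look like the following (sketched here on a toy in-memory Dataset; in practice `ds` would come from `xr.open_zarr` on your store):

```python
import numpy as np
import xarray as xr

# Toy stand-in; in practice this would be the reopened zarr store,
# e.g. ds = xr.open_zarr(path)
ds = xr.Dataset({"cdd": ("time", np.array([0.0, 2.5, 0.0, 6.0]))})

# Missing chunks usually surface as NaNs on read, so counting them
# is a cheap sanity check that everything was written
n_missing = int(ds["cdd"].isnull().sum())
print(n_missing)  # 0
```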

Thanks @dougiesquire. The uncompressed sizes are the same, and there are no NaNs, so compression seems to be the answer. It doesn’t make sense to me given that the two operations are nearly the same, but it also doesn’t really matter.

File compression algorithms are complex, but in general they try to eliminate redundancy (I think zarr uses Blosc by default, if you want to read about it). So if a file contains many identical or similar values, a high compression ratio is easier to achieve.
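The effect is easy to demonstrate. Zarr’s default compressor is Blosc, but the principle is the same with any general-purpose codec; here is a sketch using zlib from the standard library on made-up arrays:

```python
import zlib
import numpy as np

# An array that is 99% zeros (like a degree-day field where most of
# the grid sits on the wrong side of the comfort threshold) vs. an
# array of incompressible random doubles
mostly_zeros = np.zeros(100_000)
mostly_zeros[::100] = np.random.rand(1000)
random_vals = np.random.rand(100_000)

for name, arr in [("mostly zeros", mostly_zeros), ("random", random_vals)]:
    raw = arr.tobytes()
    compressed = zlib.compress(raw)
    print(f"{name}: {len(raw) / len(compressed):.1f}x compression")
```

The mostly-zero array compresses by a large factor, while the random array barely shrinks at all, which is the size asymmetry seen between the two stores.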


I might expect there to be a big difference if you were masking out large contiguous parts of the output and the sizes of the masked areas were markedly different in the two cases.

You’re masking to zero; do the sizes of those masked areas differ a lot in the two cases?

Hi @Aidan, I just checked this and there are many more zeros in the smaller zarr store.
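One way to quantify that is to compare the fraction of exactly-zero values in each store (shown here on toy stand-ins; in practice `cdd` and `hdd` would be the reopened zarr stores):

```python
import numpy as np
import xarray as xr

# Toy stand-ins with made-up values; in practice these would come
# from xr.open_zarr on the two stores
cdd = xr.DataArray(np.array([0.0, 0.0, 0.0, 6.0]))
hdd = xr.DataArray(np.array([8.0, 0.0, 3.0, 1.0]))

# Fraction of exactly-zero values in each array
print(float((cdd == 0).mean()))  # 0.75
print(float((hdd == 0).mean()))  # 0.25
```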

Right, so that is consistent: large runs of zeros carry essentially no information and compress extremely well.