Also see my post from just last week here, which points to the following resources:
- Parallel computing with Dask - in particular see the Optimization tips at the bottom, which include suggestions on how to get dask to play nicely with operations such as `groupby` (for subtracting climatologies; see the sketch after this list).
- Best Practices — Dask documentation - general best practices for dask.
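For the climatology case, here is a minimal sketch of the `groupby` pattern, assuming a dataset with a `time` dimension (the file name and chunk size are hypothetical):

```python
import xarray as xr

# Hypothetical input file; substitute your own dataset.
ds = xr.open_dataset("sst.nc", chunks={"time": 120})

# Monthly climatology, then anomalies by subtracting it back out.
clim = ds.groupby("time.month").mean("time")
anom = ds.groupby("time.month") - clim
```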
A few other bits and pieces:
A key point here is that if `chunks` is not specified, no chunking will be done (if `open_mfdataset` is used, the chunk size will be the file size).
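For example, a sketch of passing `chunks` explicitly to take control of the chunking (the file pattern and chunk sizes are illustrative, not recommendations):

```python
import xarray as xr

# Without chunks=..., open_mfdataset makes each input file one dask chunk;
# passing chunks explicitly lets you pick sizes suited to your computation.
ds = xr.open_mfdataset(
    "model_output_*.nc",  # hypothetical file pattern
    combine="by_coords",
    chunks={"time": 100, "lat": 180, "lon": 360},  # illustrative sizes
)
print(ds.chunks)  # inspect the resulting dask chunking per dimension
```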
You can see the native chunking of variables in a netCDF file using `ncdump -hs <filename>`.
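If you prefer to stay in Python, the same on-disk chunk layout is usually exposed through each variable's `encoding` after opening the file with xarray (a sketch; the file name is hypothetical):

```python
import xarray as xr

ds = xr.open_dataset("sst.nc")  # hypothetical file
# For netCDF4/HDF5 files, the on-disk chunk layout is recorded in each
# variable's encoding (the counterpart of _ChunkSizes in ncdump -hs output).
for name, var in ds.data_vars.items():
    print(name, var.encoding.get("chunksizes"))
```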
I’ve found it useful and efficient to save intermediate results that are expensive to compute to `.zarr` stores, using a for loop over sections of the dataset (`.isel`) and appending to the store as described here.
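A minimal sketch of that append pattern, assuming a `time` dimension to loop over (the file names, block size, and the stand-in computation are placeholders for your own setup):

```python
import xarray as xr

ds = xr.open_mfdataset("model_output_*.nc")  # hypothetical input files

step = 100  # time slices per iteration; tune so each block fits in memory
for start in range(0, ds.sizes["time"], step):
    block = ds.isel(time=slice(start, start + step))
    # Stand-in for an expensive computation; substitute your own.
    result = block.mean(["lat", "lon"]).compute()
    if start == 0:
        result.to_zarr("intermediate.zarr", mode="w")           # create the store
    else:
        result.to_zarr("intermediate.zarr", append_dim="time")  # append along time
```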