Efficient daily resampling using xarray

Good day.
I often resample hourly netCDF data (e.g. ERA5) into daily means with xarray. However, doing this for a large 4D dataset (e.g. an ERA5 pressure-level file) is very slow. Does anyone have a more efficient method that still returns an xarray object? An example of the code is below:

import xarray as xr
tfile = '/g/data/rt52/era5/pressure-levels/reanalysis/t/2022/t_era5_oper_pl_20220301-20220331.nc'
ft=xr.open_dataset(tfile).resample(time='1D').mean(keep_attrs=True)

Hi @Michael_Barnes. I think you will have a better experience with coarsen. E.g.

ft=xr.open_dataset(tfile).coarsen(time=24).mean(keep_attrs=True)

will average blocks of 24 hours. Obviously, you’ll need to be a little careful that your “days” correspond to the period you want. E.g. if the first time in your dataset is 6am, then your resulting days will be averages over 6am-5am, whereas if the first time is 12pm, then your days will be averages over 12pm-11am.
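
If your file does start mid-day, one option is to trim to the first complete calendar day before coarsening. A rough sketch (untested on your file, reusing your tfile from above):

import pandas as pd
import xarray as xr

ds = xr.open_dataset(tfile)
# drop any partial first day so each 24-hour block runs midnight to midnight
start = pd.Timestamp(ds.time.values[0]).ceil("1D")
daily = ds.sel(time=slice(start, None)).coarsen(time=24, boundary="trim").mean(keep_attrs=True)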

Thanks @dougiesquire for the suggestion.
When I use coarsen, it doesn’t seem to be any faster, and it also eats up a lot of RAM, which the resample option does not.

Chipping in here since it’s something I’ve also struggled with, even for “medium”-sized data…

I’ve had moderate success using CDO rather than Python/xarray for daily means. An example I’ve used is something like this:

FILES=/g/data/rt52/era5/pressure-levels/reanalysis/v/
OUTDIR=/g/data/k10/cr7888/era5_daily_means/v250/

for year in {1959..2021}; 
do 
    for month in {01,02,03,04,05,06,07,08,09,10,11,12}; 
    do 
        # select 250 hPa, regrid to 1x1 deg, take daily mean (in that order)
        cdo -L -daymean -remapcon,r360x180 -sellevel,250 $FILES/$year/v_era5_oper_pl_${year}${month}01*.nc /scratch/k10/cr7888/v250/$year$month.nc
    done
    # concat monthly files into yearly
    cdo mergetime /scratch/k10/cr7888/v250/$year*.nc $OUTDIR/era5_v250_daily_mean_$year.nc
done
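
For anyone who prefers to stay in a Python script, the same operator chain can be shelled out to cdo; a minimal sketch for a single year/month, using the same paths and file pattern as the loop above:

import glob
import subprocess

year, month = "1959", "01"
# mirror the file pattern used in the shell loop above
infile = glob.glob(f"/g/data/rt52/era5/pressure-levels/reanalysis/v/{year}/v_era5_oper_pl_{year}{month}01*.nc")[0]
outfile = f"/scratch/k10/cr7888/v250/{year}{month}.nc"
# select 250 hPa, regrid to 1x1 deg, take daily mean (same operators as above)
subprocess.run(
    ["cdo", "-L", "-daymean", "-remapcon,r360x180", "-sellevel,250", infile, outfile],
    check=True,
)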

Not sure if/how CDO does parallelisation, but it’s possible the loops slow things down (if there are any CDO experts here, feel free to make suggestions since I’m a novice myself!)

The only other thing I can think of with Python/xarray is to experiment with chunk sizes, allocated compute/memory resources, etc., though in my experience that only makes small differences.

Hi @coreyrobinson. A good suggestion.
CDO does seem to be more efficient. I have tested the Python tools that wrap CDO (nctoolkit) and it does seem to be faster than xarray and comparable to the speed of command-line CDO. It also produces an xarray object, which is convenient for my application. Below is the code which I think solves the issue and is equivalent to the original.

import nctoolkit as nc
import xarray as xr
tfile = '/g/data/rt52/era5/pressure-levels/reanalysis/t/2022/t_era5_oper_pl_20220301-20220331.nc'
cdo_sub_command = "-daymean"
ds = nc.open_data(tfile)
ds.cdo_command(cdo_sub_command)
ft = ds.to_xarray()
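
If you want to sanity-check that this matches the original resample result, something along these lines should do it (I’m assuming the temperature variable in this file is named "t"; cdo may stamp the daily time coordinate differently, hence comparing values only, with a loose-ish tolerance):

import numpy as np

# reference daily means computed with the original xarray resample approach
ref = xr.open_dataset(tfile).resample(time="1D").mean(keep_attrs=True)
# compare values only; cdo and numpy may accumulate in different precisions
np.testing.assert_allclose(ft["t"].values, ref["t"].values, rtol=1e-5)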

@Michael_Barnes, it looks like you’re happy with the cdo solution, which is great. But, just to add to my previous answer, using dask and appropriate chunking will help here (sorry, I didn’t originally realize the size of your input data).

For example, your initial code took me approximately 12 mins to run using a “Large” ARE instance (7 cpus, 32GB mem). The following takes 1.5 mins:

import xarray as xr
from distributed import Client
Client()

tfile = "/g/data/rt52/era5/pressure-levels/reanalysis/t/2022/t_era5_oper_pl_20220301-20220331.nc"
hourly = xr.open_dataset(tfile, chunks={"time": 24, "level": 2})
daily = hourly.coarsen(time=24).mean(keep_attrs=True).compute()

Is this competitive with using cdo?

Note, using resample instead of coarsen above takes approximately 3 mins.
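
That is, swapping the coarsen line above for something like:

daily = hourly.resample(time="1D").mean(keep_attrs=True).compute()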

Before giving up on xarray altogether, make sure you’re using the latest version of dask, as there has been a major update that fixes a lot of these memory issues.

This is something @Scott looked at in the past, and he developed climtas for this purpose:

https://climtas.readthedocs.io/en/latest/
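
From memory, the documented usage is along these lines (check the docs above for the exact API; I’m assuming the temperature variable is called "t" here):

import climtas
import xarray as xr

hourly = xr.open_dataset(tfile, chunks={"time": 24})
# blocked_resample averages fixed-size blocks along a dimension, similar to coarsen
daily = climtas.blocked_resample(hourly["t"], time=24).mean()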

but definitely check the latest dask doesn’t solve the problem before trying climtas

Heh, we crossed quotes @dougiesquire

Is this with the latest dask version?

No, this is actually using an old version of dask. @Michael_Barnes’ original code doesn’t actually use dask at all. Just for fun, I’ll try updating dask and see if this makes any difference. I suspect not, as memory is actually kept under control in this workflow, but who knows??

Using dask version 2022.12.0 (instead of 2022.9.2) seems to slow things down marginally. Closer to 2 mins.

Oh right, well isn’t that the answer then? Without using dask and chunking, xarray will try to load all the data at once and blow out memory, possibly spilling to on-disk swap, which is horribly slow.

It might be interesting to know how many years you have to process before it starts to struggle with the memory over-production problem, i.e. where the cutover in speed between old and new dask lies.

I would expect that with the new dask default queuing, runtime would remain approximately linear with increasing size of the time dimension, but that the previous implementation would start to display rapidly diverging run times before failing to run at all.