Efficient daily resampling using xarray

Good day.
I often resample hourly netCDF data (e.g. ERA5) into daily means with xarray. However, doing this for a large 4D dataset (e.g. an ERA5 pressure-level file) is very slow. Does anyone have a more efficient method that still returns an xarray object? An example of the code is below:

import xarray as xr
tfile = '/g/data/rt52/era5/pressure-levels/reanalysis/t/2022/t_era5_oper_pl_20220301-20220331.nc'
ft=xr.open_dataset(tfile).resample(time='1D').mean(keep_attrs=True)

Hi @Michael_Barnes. I think you will have a better experience with coarsen. E.g.

ft=xr.open_dataset(tfile).coarsen(time=24).mean(keep_attrs=True)

will average blocks of 24 hours. Obviously, you’ll need to be a little careful that your “days” correspond to the period you want. E.g. if the first time in your dataset is 6am, then your resulting days will be averages over 6am-5am, whereas if the first time is 12pm, then your days will be averages over 12pm-11am.
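
If your file does start mid-day, one option is to trim to the first complete calendar day before coarsening. A rough sketch (untested on your file, reusing your tfile from above):

import pandas as pd
import xarray as xr

ds = xr.open_dataset(tfile)
# drop any partial first day so each 24-hour block runs midnight to midnight
start = pd.Timestamp(ds.time.values[0]).ceil("1D")
daily = ds.sel(time=slice(start, None)).coarsen(time=24, boundary="trim").mean(keep_attrs=True)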

Thanks @dougiesquire for the suggestion.
When I use coarsen, it doesn’t seem to be any faster, and it also eats up a lot of RAM, which the resample option does not.

Chipping in here since it’s something I’ve also struggled with, even for “medium”-sized data…

I’ve had moderate success using CDO rather than Python/xarray for daily means. An example I’ve used is something like this:

FILES=/g/data/rt52/era5/pressure-levels/reanalysis/v/
OUTDIR=/g/data/k10/cr7888/era5_daily_means/v250/

for year in {1959..2021}; 
do 
    for month in {01,02,03,04,05,06,07,08,09,10,11,12}; 
    do 
        # select 250 hPa, regrid to 1x1 deg, take daily mean (in that order)
        cdo -L -daymean -remapcon,r360x180 -sellevel,250 $FILES/$year/v_era5_oper_pl_${year}${month}01*.nc /scratch/k10/cr7888/v250/$year$month.nc
    done
    # concat monthly files into yearly
    cdo mergetime /scratch/k10/cr7888/v250/$year*.nc $OUTDIR/era5_v250_daily_mean_$year.nc
done
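
For anyone who prefers to stay in a Python script, the same operator chain can be shelled out to cdo; a minimal sketch for a single year/month, using the same paths and file pattern as the loop above:

import glob
import subprocess

year, month = "1959", "01"
# mirror the file pattern used in the shell loop above
infile = glob.glob(f"/g/data/rt52/era5/pressure-levels/reanalysis/v/{year}/v_era5_oper_pl_{year}{month}01*.nc")[0]
outfile = f"/scratch/k10/cr7888/v250/{year}{month}.nc"
# select 250 hPa, regrid to 1x1 deg, take daily mean (same operators as above)
subprocess.run(
    ["cdo", "-L", "-daymean", "-remapcon,r360x180", "-sellevel,250", infile, outfile],
    check=True,
)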

Not sure if/how CDO does parallelisation, but it’s possible the loops slow things down (if there are any CDO experts here, feel free to make suggestions since I’m a novice myself!)

The only other thing I can think of with Python/xarray is to experiment with chunk sizes, allocated compute/memory resources, etc., though in my experience that only makes small differences.

Hi @coreyrobinson. A good suggestion.
CDO does seem to be more efficient. I have tested the Python tools that wrap CDO (nctoolkit) and it does seem to be faster than xarray and comparable to the speed of command-line CDO. It also produces an xarray object, which is convenient for my application. Below is the code which I think solves the issue and is equivalent to the original.

import nctoolkit as nc
import xarray as xr
tfile = '/g/data/rt52/era5/pressure-levels/reanalysis/t/2022/t_era5_oper_pl_20220301-20220331.nc'
cdo_sub_command = "-daymean"
ds = nc.open_data(tfile)
ds.cdo_command(cdo_sub_command)
ft = ds.to_xarray()
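
If you want to sanity-check that this matches the original resample result, something along these lines should do it (I’m assuming the temperature variable in this file is named "t"; cdo may stamp the daily time coordinate differently, hence comparing values only, with a loose-ish tolerance):

import numpy as np

# reference daily means computed with the original xarray resample approach
ref = xr.open_dataset(tfile).resample(time="1D").mean(keep_attrs=True)
# compare values only; cdo and numpy may accumulate in different precisions
np.testing.assert_allclose(ft["t"].values, ref["t"].values, rtol=1e-5)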

@Michael_Barnes, it looks like you’re happy with the cdo solution, which is great. But, just to add to my previous answer, using dask and appropriate chunking will help here (sorry, I didn’t originally realize the size of your input data).

For example, your initial code took me approximately 12 mins to run using a “Large” ARE instance (7 cpus, 32GB mem). The following takes 1.5 mins:

import xarray as xr
from distributed import Client
Client()

tfile = "/g/data/rt52/era5/pressure-levels/reanalysis/t/2022/t_era5_oper_pl_20220301-20220331.nc"
hourly = xr.open_dataset(tfile, chunks={"time": 24, "level": 2})
daily = hourly.coarsen(time=24).mean(keep_attrs=True).compute()

Is this competitive with using cdo?

Note, using resample instead of coarsen above takes approximately 3 mins.
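
That is, swapping the coarsen line above for something like:

daily = hourly.resample(time="1D").mean(keep_attrs=True).compute()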

Before giving up on xarray altogether, make sure you’re using the latest version of dask, as there has been a major update that fixes a lot of these memory issues.

This is something @Scott looked at in the past, and he developed climtas for this purpose:

https://climtas.readthedocs.io/en/latest/
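
From memory, the documented usage is along these lines (check the docs above for the exact API; I’m assuming the temperature variable is called "t" here):

import climtas
import xarray as xr

hourly = xr.open_dataset(tfile, chunks={"time": 24})
# blocked_resample averages fixed-size blocks along a dimension, similar to coarsen
daily = climtas.blocked_resample(hourly["t"], time=24).mean()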

but definitely check the latest dask doesn’t solve the problem before trying climtas

Heh, we crossed quotes @dougiesquire

Is this with the latest dask version?

No, this is actually using an old version of dask. @Michael_Barnes’ original code doesn’t actually use dask at all. Just for fun, I’ll try updating dask and see if this makes any difference. I suspect not, as memory is actually kept under control in this workflow, but who knows??

Using dask version 2022.12.0 (instead of 2022.9.2) seems to slow things down marginally. Closer to 2 mins.

Oh right, well isn’t that the answer then? Without using dask and chunking, xarray will try to load all the data at once and blow out memory, possibly spilling to on-disk swap, which is horribly slow.

It might be interesting to know how many years you have to process before it starts to struggle with the memory over-production problem, i.e. where the cutover in speed between old and new dask lies.

I would expect that with the new dask default queuing, runtime would remain approximately linear with increasing size of the time dimension, but that the previous implementation would start to display rapidly diverging run times before failing to run at all.