I have several 25 GB files on disk, chunked in time as separate files. I want to read in a very skinny slice of this data, amounting to ~50 MB, and perform a Fourier transform in time. However, even when I explicitly call .rechunk or set chunks = {"time": None}, the Fourier transform fails because, as far as Dask is concerned, my data is still chunked in time, and FFTs require the time dimension to be in a single chunk.
As a workaround I’m doing the following:
1. Open a slice of the data with xr.open_mfdataset(..., chunks={"zi": 1, "xq": 500}).isel(yh=50)
2. Perform a simple interpolation along one spatial dimension with .interp(xq=xh)
3. Load the slice into memory with .load() (TAKES 12 MINS PER EXPERIMENT ON A WHOLE NODE)
4. Perform the remaining analysis (takes 0.3 s)
In other words, I'm doing a simple spatial interpolation and loading the result just before the FFT step. However, the load takes about 12 minutes, which seems stupidly slow for such a simple operation, followed by about 0.3 seconds to perform the FFT on the loaded data.
This is annoying because I need to perform this operation more than 100 times, and I know the current method is super inefficient.
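Put together, the workaround looks roughly like this (just a sketch: the file list, the variable name "u", and the use of numpy's FFT are stand-ins for the actual experiment code, not taken from the thread):

```python
import glob
import numpy as np
import xarray as xr

files = sorted(glob.glob("/path/to/experiment/output*.nc"))  # placeholder path

# Open the slice, interpolate, and (slowly) load it into memory
ds = xr.open_mfdataset(files, chunks={"zi": 1, "xq": 500}).isel(yh=50)
da = ds["u"].interp(xq=ds["xh"])  # simple interpolation onto the xh grid
da = da.load()                    # the slow step: ~12 min per experiment on a whole node

# Once the data is in memory, the FFT along time is fast (~0.3 s)
spectrum = np.fft.fft(da.values, axis=da.get_axis_num("time"))
```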
Does calling data.chunk(chunks={"time": -1}) produce an array with more than one chunk in the time dimension? Also, what are you using to do the fft? xrft?
Depending on how much data you’re dealing with, you may have more luck first saving an intermediate dataset with the chunking you need and then computing the FFTs on that.
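For what it's worth, a sketch of that intermediate-save approach (the zarr path and dataset names are placeholders, not from the original thread):

```python
# Save the slice with the chunking needed for the FFT, then reopen it
ds_slice = ds.chunk({"time": -1})
ds_slice.to_zarr("rechunked_slice.zarr", mode="w")

ds_single = xr.open_zarr("rechunked_slice.zarr")  # time is now in a single chunk on disk
```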
The only way I’ve been able to make the FFT work is using .load() - nothing else seems to work. Re-saving would also work, but I’d expect that re-saving to disk would take at least as long as loading the data into memory, right?
Using the .rechunk() method does work, BUT it can only be applied directly to dask arrays, not xarray objects. It seems that the chunks option which xarray uses to wrap the dask chunking simply doesn’t do what you expect. However, the shenanigans below do work: basically, I’m overwriting the xarray data with a manually rechunked dask array.
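Something along these lines (a sketch of the shenanigans, with "u" as a placeholder variable name):

```python
da = ds["u"]

# Rechunk the underlying dask array directly so time sits in a single chunk...
rechunked = da.data.rechunk({da.get_axis_num("time"): -1})

# ...then overwrite the xarray data with the manually rechunked dask array
ds["u"] = da.copy(data=rechunked)
```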
That works! Thanks so much - I had no idea about the difference. So the chunks argument will chunk each individual file as it’s loaded, but the .chunk() method is the equivalent of .rechunk() for dask arrays?
In the past it’s confused me that .rechunk() doesn’t work on xarray objects, but I’d forgotten that the .chunk() method is the xarray equivalent / wrapper.
Yup, it’s most efficient to use the chunks argument when opening the data, where possible (see here). But this chunking is applied before the combine step in open_mfdataset. In your case, this means chunk sizes in time will be truncated at the length of the time dimension in the files you’re combining.
I think it’s probably a good idea to do all the chunking you can using the chunks argument in open_mfdataset, and only chunk time using the .chunk() method, e.g.,
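Something like the following (a sketch, where files stands in for your list of paths):

```python
ds = xr.open_mfdataset(
    files,
    chunks={"zi": 1, "xq": 500},  # per-file chunking of the spatial dimensions
).chunk({"time": -1})             # then put time into a single chunk after combining
```

That way most of the chunking happens cheaply at open time, and only the time dimension is rechunked across files.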
Thanks Dougie! This makes a lot of sense. Going from 12-15 mins of processing time down to 30 s is a good effort.
Aidan (Aidan Heerdegen, ACCESS-NRI Release Team Lead):
This seems so useful it would be good to “surface” this somehow. Probably the best way would be a COSIMA Recipe @ashjbarnes, assuming this is COSIMA data?
I took the liberty of sprinkling some tag-magic to help with folks finding this.