Dask: removing time chunks for Fourier transforms

I’ve got a bit of a dask riddle:

I have several 25 GB files that are chunked in time as separate files on disk. I want to read in a very skinny slice of this data amounting to ~50 MB and perform a Fourier transform in time. However, even when I explicitly call .rechunk or set chunks = {"time": None}, the Fourier transform fails because as far as Dask is concerned my data is still chunked in time, and FFTs require time to be in a single chunk.
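For anyone unfamiliar with the constraint, here's a minimal, self-contained sketch (toy data, not mine) of why this bites: dask's FFT wrappers refuse to transform along an axis that is split across chunks, and only succeed once that axis sits in a single chunk.

import dask.array as da

# Toy array chunked along the axis we want to transform
x = da.random.random((8, 100), chunks=(8, 25))

# dask.array.fft.fft raises an error when the transform axis spans
# multiple chunks...
try:
    da.fft.fft(x, axis=-1)
except ValueError as err:
    print(err)

# ...but works once that axis is rechunked into a single chunk
y = da.fft.fft(x.rechunk({1: -1}), axis=-1)
print(y.chunks)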

As a workaround I’m doing the following:

  • Open a slice of the data with xr.open_mfdataset(..., chunks={"zi": 1, "xq": 500}).isel(yh=50)
  • Perform a simple interpolation along one spatial dimension with .interp(xq=xh)
  • Load the slice into memory with .load() (takes ~12 mins per experiment on a whole node)
  • Perform the remaining analysis (takes ~0.3 s)

The interpolation and load steps take about 12 minutes, which seems stupidly slow for such a simple operation, followed by about 0.3 seconds to perform the FFT on the loaded data.
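For reference, the workaround pipeline looks roughly like the sketch below (the file pattern and the dimension/coordinate names are placeholders, not the real experiment layout):

import xarray as xr

# Hypothetical file pattern standing in for the per-time-chunk 25 GB files
paths = "output*/ocean_daily.nc"

# 1. Open a skinny slice, chunked in space only
eslice = xr.open_mfdataset(paths, chunks={"zi": 1, "xq": 500}).isel(yh=50)

# 2. Simple interpolation along one spatial dimension
eslice = eslice.interp(xq=eslice["xh"])

# 3. Pull the slice into memory -- this is the ~12 minute step
eslice = eslice.load()

# 4. The remaining FFT-based analysis on the in-memory data (~0.3 s) goes here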

This is annoying because I need to perform this operation more than 100 times, and I know the current method is super inefficient.

Is there a better way to approach this?

Might be worth asking Paige Martin how she handled this in her model.


Does calling data.chunk(chunks={"time": -1}) produce an array with more than one chunk in the time dimension? Also, what are you using to do the FFT? xrft?

Depending on how much data you’re dealing with, you may have more luck first saving an intermediate dataset with the chunking you need and then computing the ffts on that.
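In case it helps, a hedged sketch of that intermediate-dataset route (the zarr path and file pattern are made up) might look like:

import xarray as xr

paths = "output*/ocean_daily.nc"  # hypothetical input files

# Write the skinny slice out once, with time in a single chunk
slice_ds = (
    xr.open_mfdataset(paths, chunks={"zi": 1, "xq": 500})
    .isel(yh=50)
    .chunk({"time": -1})
)
slice_ds.to_zarr("intermediate_slice.zarr", mode="w")

# Subsequent FFT passes read the already well-chunked copy
ds = xr.open_zarr("intermediate_slice.zarr")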


Hmm, doing the -1 chunking still leaves the spatial chunks there, as circled below.

The only way I've been able to make the FFT work is using the .load() command - nothing else seems to work. Re-saving would also work, but I'd expect that re-saving to disk would take at least as long as loading the data into memory, right?

Ok so for those following along at home:

Using the .rechunk() method does work, BUT this can only be applied directly to dask arrays, not xarray objects. It seems that the chunks option which xarray uses to wrap the dask chunking simply doesn't do what you expect. However, the shenanigans below do work: basically I'm overwriting the xarray data with a manually rechunked dask array.
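Roughly, the shenanigans look like the sketch below (the dataset name eslice and the variable name "u" are placeholders); the key move is rechunking the underlying dask array directly and putting it back onto the xarray object:

# "u" is a placeholder variable name; eslice is the dask-backed Dataset
v = eslice["u"]
rechunked = v.data.rechunk({v.get_axis_num("time"): -1})  # single chunk in time
eslice["u"] = (v.dims, rechunked)  # overwrite with the manually rechunked dask array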

Unless I’m missing something here maybe I’ll raise an issue on xarray to point this out?

The time chunking will need to be done after the open_mfdataset step because you have separate files in time. Does calling

eslice.chunk(chunks={"time": -1})

still not give you a single chunk in time?
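A quick way to check (assuming eslice is the dask-backed Dataset) is to inspect the chunk sizes on the result:

rechunked = eslice.chunk(chunks={"time": -1})
print(rechunked.chunksizes["time"])  # should be a single entry spanning all of time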


That works! Thanks so much - I had no idea about the difference. So the chunks argument will chunk each individual file as it's loaded, but the .chunk() method is xarray's equivalent of dask's .rechunk()?

In the past it's confused me that .rechunk() doesn't work on xarray objects, but I keep forgetting that the .chunk() method is the xarray equivalent / wrapper.

Yup, it's most efficient to use the chunks argument when opening the data, where possible (see here). But this chunking is applied before the combine step in open_mfdataset. In your case, this means chunk sizes in time will be truncated at the length of the time dimension in the files you're combining.

I think it’s probably a good idea to do all the chunking you can using the chunks argument in open_mfdataset, and only chunk time using the .chunk() method, e.g.,

eslice = xr.open_mfdataset(
    paths,
    chunks={"time": -1, "zi": 1, "xh": 500, etc},
    etc
).chunk(chunks={"time": -1})
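
The FFT step on top of that, assuming xrft (only suggested above, not confirmed) and a made-up variable name "u", would then look something like:

import xrft

# Transform along the now single-chunk time dimension; .compute() triggers
# the lazy dask computation
spectrum = xrft.fft(eslice["u"], dim="time").compute()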

Thanks Dougie! This makes a lot of sense. Going from 12-15 mins of processing time down to 30 s is a good effort.

This seems so useful it would be good to “surface” this somehow. Probably the best way would be a COSIMA Recipe @ashjbarnes, assuming this is COSIMA data?

I took the liberty of sprinkling some tag-magic to help with folks finding this.