# Dask: removing time chunks for Fourier transforms

I’ve got a bit of a dask riddle:

I have several 25 GB files on disk, chunked in time as separate files. I want to read in a very skinny slice of this data amounting to ~50 MB and perform a Fourier transform in time. However, even when I explicitly call `.rechunk` or set `chunks={"time": None}`, the Fourier transform fails because, as far as Dask is concerned, my data is still chunked in time, and FFTs require the transformed dimension to be a single chunk.
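For concreteness, the failing pattern looks roughly like this (a sketch: the file pattern and variable name are hypothetical, and `xrft` is assumed for the FFT, as floated later in the thread):

```python
import xarray as xr
import xrft  # assuming xrft for the FFT step

# Hypothetical file pattern and variable name, for illustration only.
ds = xr.open_mfdataset("experiment_*.nc", chunks={"time": None})
skinny = ds["u"].isel(yh=50)

# Each input file still contributes its own chunk along time, so this
# fails: the FFT needs the transformed dimension to live in a single chunk.
spectrum = xrft.fft(skinny, dim="time")
```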

As a workaround I’m doing the following:

• Open a slice of the data with `xr.open_mfdataset(..., chunks={"zi": 1, "xq": 500}).isel(yh=50)`
• Perform a simple interpolation along one spatial dimension with `.interp(xq=xh)`
• Load my slice into memory with `.load()` (TAKES 12 MINS PER EXPERIMENT ON A WHOLE NODE)
• Perform the remaining analysis (takes 0.3 s)

So I'm essentially doing a simple spatial interpolation just before the FFT step. However, the load takes about 12 minutes, which seems stupidly slow for such a simple operation, followed by about 0.3 seconds to perform the FFT on the loaded data.
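In code, the workaround looks roughly like this (a sketch: paths, the variable name, and the interpolation target are placeholders, and `xrft` is assumed for the FFT step):

```python
import xarray as xr
import xrft  # assuming xrft for the FFT step

# Sketch of the current workaround; paths and variable name are placeholders.
ds = xr.open_mfdataset("experiment_*.nc", chunks={"zi": 1, "xq": 500}).isel(yh=50)
u = ds["u"].interp(xq=ds["xh"])     # simple spatial interpolation
u = u.load()                        # ~12 minutes: pulls the slice into memory
spectrum = xrft.fft(u, dim="time")  # ~0.3 s once the data is in memory
```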

This is annoying because I need to perform this operation more than 100 times, and I know the current method is super inefficient.

Is there a better way to approach this?

Might be worth asking Paige Martin how she handled this in her model.


Does calling `data.chunk(chunks={"time": -1})` produce an array with more than one chunk in the time dimension? Also, what are you using to do the fft? `xrft`?

Depending on how much data you’re dealing with, you may have more luck first saving an intermediate dataset with the chunking you need and then computing the ffts on that.
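That approach could look something like this (a sketch, assuming your sliced data is a Dataset called `eslice`, as later in the thread, and a hypothetical zarr store path):

```python
import xarray as xr

# Save an intermediate copy with a single contiguous chunk in time,
# then reopen it for the FFT step. The store path is hypothetical.
eslice.chunk({"time": -1}).to_zarr("eslice_time_contiguous.zarr", mode="w")
eslice = xr.open_zarr("eslice_time_contiguous.zarr")
```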


Hmm, doing the -1 chunking still leaves the spatial chunks there, as shown by inspecting the chunk structure.
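(The original post included a screenshot of the chunk structure. One way to check it directly, as a sketch with the `eslice` name used elsewhere in the thread:)

```python
# For a Dataset, .chunks is a mapping from dimension name to chunk sizes;
# more than one entry under a dimension means it is still split into chunks.
print(eslice.chunk(chunks={"time": -1}).chunks)
```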

The only way I've been able to make the FFT work is using `.load()` - nothing else seems to work. Re-saving would also work, but I'd expect that re-saving to disk would take at least as long as loading the data into memory, right?

Ok so for those following along at home:

Using the `.rechunk()` method does work. BUT this can only be applied directly to dask arrays, not xarray objects. It seems that the `chunks` option which xarray uses to wrap the dask chunking simply doesn't do what you expect. However, the shenanigans sketched below do work. Basically I'm overwriting the xarray data with a manually rechunked dask array.
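Something along these lines (a minimal sketch of that approach, since the original code was posted as a screenshot; it assumes `eslice` is a DataArray with time as its first dimension):

```python
import xarray as xr

# Rechunk the underlying dask array directly, then wrap it back up.
# Assumes time is axis 0 of `eslice`; -1 means "one chunk along this axis".
rechunked = eslice.data.rechunk({0: -1})
eslice = xr.DataArray(
    rechunked, dims=eslice.dims, coords=eslice.coords, attrs=eslice.attrs
)
```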

Unless I'm missing something here, maybe I'll raise an issue on xarray to point this out?

The time chunking will need to be done after the `open_mfdataset` step because you have separate files in time. Does calling

```python
eslice.chunk(chunks={"time": -1})
```

still not give you a single chunk in time?


That works! Thanks so much - I had no idea about the difference. So the `chunks` argument will chunk each individual file as it's loaded, but the `.chunk()` method is the equivalent of `.rechunk()` for dask arrays?

In the past it's confused me that `.rechunk()` doesn't work on xarray objects, but I guess I'd forgotten that the `.chunk()` method is the xarray equivalent/wrapper.

Yup, it's most efficient to use the `chunks` argument when opening the data, where possible (see here). But this chunking is applied before the combine step in `open_mfdataset`. In your case, this means chunk sizes in time will be truncated at the length of the time dimension in the files you're combining (e.g. if each file holds 100 time steps, asking for a single time chunk at open time still gives one 100-step chunk per file).

I think it’s probably a good idea to do all the chunking you can using the `chunks` argument in `open_mfdataset`, and only chunk time using the `.chunk()` method, e.g.,

```python
eslice = xr.open_mfdataset(
    paths,
    chunks={"time": -1, "zi": 1, "xh": 500},  # ...plus any other spatial chunking
    # ...other open_mfdataset arguments
).chunk(chunks={"time": -1})
```

Thanks Dougie! This makes a lot of sense. 12-15 mins of processing time down to 30 s is a good effort.

This seems so useful it would be good to “surface” this somehow. Probably the best way would be a COSIMA Recipe, @ashjbarnes, assuming this is COSIMA data?

I took the liberty of sprinkling some tag-magic to help folks find this.