Intake vs mfdataset

JuliaN · 29 April 2025 23:33

I have been trying to use intake to do some analysis with @KZCurtin using the 1/20th degree PanAntarctic model. I don’t personally use intake - I find that using xr.open_mfdataset with a preprocessing function works smoothly 99.9% of the time - but intake seems a cool tool with potential.

Anyway, we were trying to do a temperature average of the first 500m in the Antarctic shelf, and using intake (with and without preprocessing, with the kwargs suggested in the cosima-recipes ) the kernel crashes. Doing the same thing with mfdataset and preprocessing works just fine.

I am not looking for a specific solution, I am personally very happy continuing to use mfdataset. But I did spend some time yesterday trying to make intake work. I am no dask wizard, and this all would perhaps be solved by doing some smart chunking magic, but I don’t think it is realistic to expect every intake user to be a dask expert, specially since I understand intake is aiming to provide a high-level way of opening datasets.

I have made a notebook showing this issue (Intake_vs_mfdatarray.ipynb · GitHub). I’ve used 28 cores on the normalbw queue.

clairecarouge · 30 April 2025 01:24

@JuliaN Thanks for sending the notebook to illustrate the issue. @dougiesquire will look into it and will let you know what he finds so others can find the information if they have the same issue.

clairecarouge · 30 April 2025 04:10

@dougiesquire See @CharlesTurner 's reply here, he says intake ESM has a problem in analysis3-25.04, I have no idea if this is linked to what is happening here, but it might be worth trying with a previous version first.

dougiesquire · 5 May 2025 02:05

Sorry for taking so long to get to this @JuliaN - I was sick last week.

I think this difference is due to the files for dates prior to 2000, which you are including when using Intake-ESM, but not when you open directly with xr.open_mfdataset. Are these files formatted differently that the later dates?

If I restrict the Intake-esm approach to files post 2000, I can run your analysis without issue:

esmds = intake.open_esm_datastore("~/catalog/panant.json", columns_with_iterables=["variable"])

shelf_mask = xr.open_dataset('/g/data/ik11/grids/Antarctic_slope_contour_1000m_MOM6_005deg.nc')['contour_masked_above']
area = esmds.search(variable='areacello', path=[".*output021.*"]).to_dask()['areacello']

def preprocess(ds):
    return ds.sel(yh=shelf_mask['yh'], z_l=slice(None,500))


ds = esmds.search(
    variable=["thetao", "volcello"],
    frequency="1mon",
    filename=".*ocean_month_z",
    start_date="^2.*" # 2000s+ - this will be easier/nicer when date-based searching is implemented
).to_dask(
    xarray_open_kwargs=dict(
        chunks="auto"
    ),
    preprocess=preprocess
)

temp = ds['thetao']
volc = ds['volcello']
thik = ds['volcello']/area

temp.isel(z_l=0, time=0).where(shelf_mask==0).plot()

JuliaN · 5 May 2025 03:30

Hi Dougie, thanks for testing this out! I am not aware of any format differences prior or post 2000s. I am only interested in plotting the last 10 years of experiment, hence why I restrict with mfdataset. I was not aware of the possibility of using start_date with intake! That’s super useful, thank you.

Is it possible that it now runs because it is opening 50% of the data? Another question, what does ^ in start_date mean? How would you open everything post 2003? If this is all work in progress no worries! Like I’m said, not really after solutions, just thought I’d flag something for potential improvements.

Minute-to-minute update: I have been running intake for +30min and its still not finished (mfdataset took 11min)

dougiesquire · 5 May 2025 04:11

Are you using my code above, specifically:

    xarray_open_kwargs=dict(
        chunks="auto"
    )

as you were doing with open_mfdataset?

dougiesquire · 5 May 2025 04:23

No, I think there is something more sinister going on. If I try to open years prior to 2000, even just using open_mfdataset, I get killed dask workers. I.e. changing in your notebook:

- paths = np.sort(glob.glob('/g/data/ol01/outputs/mom6-panan/panant-005-zstar-ACCESSyr2/outpu*/*.ocean_month_z.nc'))[-120:]
+ paths = np.sort(glob.glob('/g/data/ol01/outputs/mom6-panan/panant-005-zstar-ACCESSyr2/outpu*/*.ocean_month_z.nc'))[:-120]

This is a horrible hack using regex to only get start_dates that start with 2. The ^ matches on the start of the line. As I commented in the code snippet, there is work underway (by @CharlesTurner) to provide proper support for date-based searches.

JuliaN · 5 May 2025 04:49

Ok I was not using your kwargs, looks like switching what I was using from the cosima recipes to ‘auto’ chunks works! It takes the same time as mfdataset yeay! I’ll try to remember to pass this as kwargs - @KZCurtin this might be useful

I can’t find differences between the files in those two periods, sinister indeed! I also don’t know who ran those experiments so not sure who to ask.

adele-morrison · 5 May 2025 05:01

@dougiesquire should we update the COSIMA Tutorial to use chunks="auto" as the default suggestion if this is crucial? Or are there times when this wouldn’t be useful?

And @JuliaN is this preprocess method faster than just opening all the data with Intake and selecting the region after? Is so, should we also put this as the default suggestion in the COSIMA Tutorial? Most COSIMA people are opening high res data, so will have similar issues to this.

JuliaN · 5 May 2025 05:22

In this case, without preprocessing the kernel crashes. I think it would be super useful to add it to COSIMA, but potentially it can wait until start_time update is complete and incorporate that too?

adele-morrison · 5 May 2025 05:36

Cool, @JuliaN given you know more than me about all this, maybe do you want to open a COSIMA Recipe issue, linking to this post, to remind us to make these changes to the Tutorial at some stage?

jemmajeffree · 5 May 2025 05:58

Just to clarify, chunks='auto' uses the chunking from the netcdfs and the default chunks=None uses whole files as individual chunks?
Is it possible to change the intake default in xp65? Happy to do so if given a few pointers on where to make the modification

dougiesquire · 5 May 2025 06:02

To clarify, I added chunks="auto" to @JuliaN’s Intake-ESM example in this post because that is the chunking @JuliaN was using with open_mfdataset. @JuliaN’s original example used chunks="auto" for open_mfdataset, but did not specify chunking with Intake, which is not a like-for-like comparison between the two.

As always, the best chunking strategy depends on the analysis being done. That said, chunks="auto" is almost certainly a better starting point than the default NetCDF chunking on disk, so adding chunks="auto" to the COSIMA Tutorial sounds sensible to me. @JuliaN or @jemmajeffree, would you like to do this? UPDATE: see this.

dougiesquire · 5 May 2025 06:09

Not quite. From the docs:

chunks=“auto” will use dask auto chunking taking into account the engine preferred chunks.

chunks=None skips using dask, which is generally faster for small arrays.

chunks=-1 loads the data with dask using a single chunk for all arrays.

chunks={} loads the data with dask using the engine’s preferred chunk size, generally identical to the format’s chunk size. If not available, a single chunk for all arrays.

dougiesquire · 5 May 2025 06:23

Actually, I just remembered I tried in the past to change the default in Intake-ESM to chunks="auto" and it doesn’t work well in the general case because dask auto chunking does not work with object dtype (e.g. cftime arrays).

jemmajeffree · 5 May 2025 21:55

ohhh, yup.
I don’t like that we have example code that, on some occasions, doesn’t work at all unless you add chunking keywords. Seems like a lot of overhead/steep learning curve for someone trying to get started
Would chunks='auto', decode_times=False and then ds.decode_cf() work reliably? Or is that overcomplicating things?

dougiesquire · 6 May 2025 04:22

I think this is, to some extent, unavoidable. Optimum chunking is very dataset/analysis-dependent. No existing xarray default will suit everything.

For example, xarray.open_[mf]dataset defaults to chunks=None, which skips using dask array altogether. This obviously won’t suit large datasets.

As I said above, having chunks='auto' the default in Intake-ESM would definitely be better than the current default of chunks={}. But I don’t know how safe it is to assume that cftime arrays are the only object dtypes one might encounter.

CharlesTurner · 6 May 2025 23:49

Just flagging - the intake to dask efficiently notebook in cosma-recipes contains some info on chunking, & I’ve currently got a PR open which adds a bit more detail to that & makes it clearer chunking is involved.

The recommendation in that notebook is generally to start with chunks = 'auto' and go from there - but like Dougie says, it’s kinda fundamentally impossible to give a perfect default chunking.

In addition, from conda/analysis3-25.05 onwards, the access_intake_utils package can be used to inspect the disk chunking & adjust user defined chunks to respect disk chunking: see here. The tooling will work on a list of files & intake catalogues alike, & should work on an xarray dataset/dataarray if the internal file references are stored in the fashion it expects (not guaranteed though).

I’ll get that example folded into the PR because AFAIK xarray doesn’t expose any information about disk chunks, nor make it clear they are separate to dask chunks.

Topic		Replies	Views
ACCESS-NRI Intake catalogue for CM2-025 simulation Earth System help , cosima , model-evaluation , inscope	19	109	24 February 2025
Dask remove time chunks for Fourier transforms Technical dask , xarray , chunking	9	371	30 August 2023
Making Intake Datastore for panantarctic COSIMA help , inscope , mom6 , intake	18	161	7 May 2025
Xarray warnings while loading data using cosima cookbook Technical python , help , cosima	16	451	2 December 2024
New CMS blog - preprocess in open_mfdataset Technical python , xarray	3	232	17 November 2024

Intake vs mfdataset

Related topics