ACCESS-NRI Intake catalogue for CM2-025 simulation

Hi. I created an intake catalogue to read CM2-025 output following the instructions from the NRI quickstart guide. I believe it worked fine; the datastore is under /g/data/lg87/wgh581.

I have two questions about reading the data:

  1. When I load data, say 500 years of global ocean temperature, like this: cat.search(variable="temp_global_ave").to_dask().temp_global_ave, it sometimes runs fine, taking about 20-30 s, and sometimes it gets stuck loading the data. Restarting the kernel helps, but I’m not sure what’s different between when it works and when it doesn’t. Any ideas?

  2. I struggle to load data from the atmospheric model, e.g. latent heat flux: cat.search(variable="fld_s03i234").to_dask(). It gives me the following error message:

ValueError: Expected exactly one dataset. Received 12000 datasets. Please refine your search or use .to_dataset_dict().

I can add .to_dataset_dict(), but then it gets stuck loading the data. The 12000 files are half daily, half monthly output; I’m only interested in the monthly. I assume this has something to do with how the model output is saved?

Apologies if this or something similar was discussed at today’s COSIMA training session; unfortunately, I couldn’t make it.

Many thanks


Hi Wilma,

For your second question, it sounds like you are potentially overloading the interpreter with the daily files.

One way to resolve this would be to try:

>>> freqs = cat.unique().frequency
>>> print(freqs)
['1day', '1mon']  # Or similar
>>> cat.search(variable="fld_s03i234", frequency="1mon").to_dataset_dict()

Give that a try, and see if that resolves your issues.
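
If the refined search still returns more than one dataset, .to_dataset_dict() gives you a dict of xarray Datasets keyed by dataset ID, and you can pull out the one you want. A rough sketch (the exact key will depend on your datastore):

>>> dset_dict = cat.search(variable="fld_s03i234", frequency="1mon").to_dataset_dict()
>>> list(dset_dict)                      # see which dataset keys came back
>>> ds = dset_dict[list(dset_dict)[0]]   # grab the first (hopefully only) dataset
>>> lhf = ds["fld_s03i234"]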

You could also try refining the datastore search until you get a single dataset, by following a process something like

cat.search(variable="fld_s03i234").search(frequency="1mon").search(...)

If you look at the output each time, hopefully one of the fields will help you narrow down your search - playing around with this might help. Once you’re down to a single dataset, you can use .to_dask().
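
If it’s not obvious which field to refine on, you could also peek at the underlying dataframe of the sub-datastore. A sketch, assuming columns like file_id, frequency and realm exist (check sub.df.columns for the actual names in your datastore):

>>> sub = cat.search(variable="fld_s03i234")
>>> # column names here are a guess - check sub.df.columns for what your datastore has
>>> sub.df[["file_id", "frequency", "realm"]].drop_duplicates()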

Regarding the first question, when you run something like cat.search(variable="temp_global_ave").to_dask().temp_global_ave, is the process simply hanging, or are you getting errors?

I’m not a member of lg87 but I’ll join & investigate - if you are getting errors though, it might be faster if you’re able to copy/paste them in here.

Normally when I get errors with an impossibly large number of datasets, it’s because I’ve forgotten the columns_with_iterables=["variable"] argument to open_esm_datastore().

Running .keys() or .keys_info() on the search can also help. I think the dataset key should just be defined by filename.frequency?
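
For reference, something like this is what I mean (the JSON filename here is an assumption - use whatever build-esm-datastore actually wrote out for you):

>>> import intake
>>> cat = intake.open_esm_datastore(
...     "/g/data/lg87/wgh581/experiment_datastore.json",  # assumed filename
...     columns_with_iterables=["variable"],
... )
>>> cat.search(variable="fld_s03i234").keys_info()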

If @anton’s suggestion doesn’t work, it’s also possible that the datastore builder hasn’t correctly identified that these 12000 files comprise a single dataset. Is this output formatted differently from the other output? Could you provide the path to one of these files?

Thanks for the replies!

  • cat.search(variable="temp_global_ave").to_dask().temp_global_ave seems to be just hanging, no errors.
  • I have used the columns_with_iterables=["variable"]
  • What would be the aim of getting down to a single dataset? What information am I after? For context: I want to load one variable for many time steps, but my understanding is that I cannot pass the time range as a search argument (rough sketch of what I’m doing below this list).
  • I haven’t changed the format of the atmospheric output. (I have for the ocean but that seems to work.)
  • The model output is under /g/data/lg87/wgh581/cz861/history
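
For reference (a rough sketch - the time range here is just an example), this is roughly what I’m doing for the ocean variable:

>>> ds = cat.search(variable="temp_global_ave").to_dask()
>>> # select the time range after loading, via xarray, rather than in the catalogue search
>>> temp = ds["temp_global_ave"].sel(time=slice("0001-01-01", "0500-12-31"))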

What does:

cat.search(variable="fld_s03i234").keys()

return?

A single dataset in this case is all the files which it makes sense to concatenate together, e.g. a dataset in a catalogue could be all the files from a model component, on the same grid, with the same frequency, containing the same variables.

It sounds like there could be three datasets for the fld_s03i234 variable to capture 12hour/daily/monthly as separate data.

The .keys() call gives a long list of files, starting with daily - I assume it will also go through monthly, etc.

Below is the output of cat.search(variable="fld_s03i234", frequency="1mon"):

Looks like the builder hasn’t correctly identified that these files comprise a single dataset. @CharlesTurner is going to dig into this. Thanks for reporting @wghuneke!


Thanks for looking into it!

Curiously, I’m actually getting a slightly different datastore out - but it has similar issues, so we’ll see if fixing the issues with my reproduction fixes yours too.

I generated the datastore using

(venv) Singularity> build-esm-datastore --builder AccessCm2Builder --expt-dir /g/data/lg87/wgh581 --cat-dir ~/wilma --builder-kwargs ensemble=False
Generating esm-datastore for /g/data/lg87/wgh581
Building esm-datastore...
Sucessfully built esm-datastore!
Saving esm-datastore to /home/189/ct1163/wilma
/home/189/ct1163/access-nri-intake-catalog/bin/venv/lib/python3.11/site-packages/intake_esm/cat.py:187: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
  data = self.dict().copy()
Successfully wrote ESM catalog json file to: file:///home/189/ct1163/wilma/experiment_datastore.json
Hashing catalog to prevent unnecessary rebuilds.
This may take some time...
Catalog sucessfully hashed!
Datastore sucessfully written to /home/189/ct1163/wilma/experiment_datastore.json!
Please note that this has not added the datastore to the access-nri-intake catalog.
To add to catalog, please run 'scaffold-catalog-entry' for help on how to do so.
To open the datastore, run `intake.open_esm_datastore('/home/189/ct1163/wilma/experiment_datastore.json', columns_with_iterables=['variable'])` in a Python session.

N.B.: I found an issue in build-esm-datastore related to builder kwargs, so there’ll be a bugfix going in. Anyhow, after running that, I opened the datastore in a notebook:

Similarly, thousands of datasets - and in my case they seem to be differentiated on file_id:

>>> esm_ds.unique().file_id
['iceh_d_XXXX_XX',
 'iceh_m_XXXX_XX',
 'ocean_ym_0001_01',
 'ocean_ym_0001_07',
 'ocean_ym_0002_01',
 ...
 'ocean_ym_0499_01',
 'ocean_ym_0499_07',
 ...]

It looks an awful lot like these aren’t being redacted correctly (i.e. the numbers aren’t being changed from, e.g., 0001_01 to XXXX_XX), and this is causing them not to be correctly identified as the same dataset.
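
To illustrate what I mean (just the idea, not the actual builder code): the date part of the filename stem should be masked so that every file in the time series ends up with the same file_id, e.g.:

>>> import re
>>> # mask the YYYY_MM part so timestamped files group into one dataset
>>> re.sub(r"\d{4}_\d{2}", "XXXX_XX", "ocean_ym_0001_01")
'ocean_ym_XXXX_XX'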

I’m going to leave it here for now, but I’ll look into it first thing Monday - hopefully it’s not too complicated a fix.
