ACCESS-NRI Intake catalogue for CM2-025 simulation

Hi. I created an intake catalogue to read CM2-025 output, following the instructions from the NRI quickstart guide. I believe it worked fine; the datastore is under /g/data/lg87/wgh581.

I have two questions about reading the data:

  1. When I load in data, let's say 500 yr of global ocean temperature, like this: cat.search(variable="temp_global_ave").to_dask().temp_global_ave it sometimes runs, taking about 20-30 sec, and sometimes it gets stuck loading the data. Restarting the kernel helps, but I'm not sure what's different between when it works and when it doesn't. Any ideas?

  2. I struggle to load in data from the atmospheric model, e.g., latent heat flux: cat.search(variable="fld_s03i234").to_dask() It gives me the following error message:

ValueError: Expected exactly one dataset. Received 12000 datasets. Please refine your search or use .to_dataset_dict().

I can add .to_dataset_dict(), but then it gets stuck loading the data. The 12,000 files are half daily, half monthly output; I'm only interested in the monthly. I assume this has something to do with how the model output is saved?

Apologies if this or something similar was discussed at today’s COSIMA training session, unfortunately, I couldn’t make it.

Many thanks


Hi Wilma,

For your second question, it sounds like you are potentially overloading the interpreter with the daily files.

One way to resolve this would be to try:

>>> freqs = cat.unique().frequency
>>> print(freqs)
['1day', '1mon'] # Or similar
>>> cat.search(variable="fld_s03i234", frequency="1mon").to_dataset_dict()

Give that a try, and see if that resolves your issues.

You could also try refining the datastore search until you get a single dataset, by following a process something like

cat.search(variable="fld_s03i234").search(frequency="1mon").search(...)

If you look at the output each time, you can hopefully see which of the fields will help you narrow down your search - playing around with this might help. Once you're down to a single dataset, you can use .to_dask().
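
For example, something like this (a rough sketch only; which columns exist and what their values are will depend on the datastore):

# Sketch: narrow the search down to one dataset by checking which columns
# still vary across the matching entries.
subset = cat.search(variable="fld_s03i234")
print(subset.unique())     # unique values per column (frequency, realm, file_id, ...)
print(len(subset.keys()))  # how many datasets the current search would produce

subset = subset.search(frequency="1mon")
print(subset.keys())       # ideally a single key now, so .to_dask() will work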

Regarding the first question, when you run something like cat.search(variable="temp_global_ave").to_dask().temp_global_ave, is the process simply hanging, or are you getting errors?

I’m not a member of lg87 but I’ll join & investigate - if you are getting errors though, it might be faster if you’re able to copy/paste them in here.

Normally when I get errors like this, with an impossibly large number of datasets, it's because I've forgotten the columns_with_iterables=["variable"] argument to open_esm_datastore().

Running .keys() or .keys_info() on the search can help. I think it should just be defined by filename.frequency?
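
For reference, something like this (a sketch; the JSON path below is just a placeholder for wherever your datastore file actually lives):

import intake

# Open the datastore with the iterable "variable" column declared, then list
# the dataset keys the search would produce.
cat = intake.open_esm_datastore(
    "/g/data/lg87/wgh581/experiment_datastore.json",  # placeholder path
    columns_with_iterables=["variable"],
)
print(cat.search(variable="fld_s03i234").keys_info())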

If @anton's suggestion doesn't work, it's also possible that the datastore builder hasn't correctly identified that these 12000 files comprise a single dataset. Is this output formatted differently from your other output? Could you provide the path to one of these files?

Thanks for the replies!

  • cat.search(variable="temp_global_ave").to_dask().temp_global_ave seems to be just hanging, no errors.
  • I have used the columns_with_iterables=["variable"]
  • What would be the aim of going down to a single file? What information am I after? For context: I want to load one variable for many time steps (see the sketch after this list), but my understanding is that I cannot give the time info as an argument.
  • I haven’t changed the format of the atmospheric output. (I have for the ocean but that seems to work.)
  • The model output is under /g/data/lg87/wgh581/cz861/history
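
For concreteness, what I'm trying to do is roughly this (a sketch; my understanding is that the time selection happens after loading, via xarray, rather than as a search argument):

# Load the variable lazily, then subset the time range with xarray.
da = cat.search(variable="temp_global_ave").to_dask().temp_global_ave
da_sub = da.sel(time=slice("0001-01-01", "0500-12-31"))  # illustrative dates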

What does:

cat.search(variable="fld_s03i234").keys()

return?

A single dataset in this case is all the files which it makes sense to concatenate together, e.g. a dataset in a catalog could be all the files from a model component, on the same grid, with the same frequency, containing the same variables.

It sounds like there could be three datasets for the fld_s03i234 variable to capture the 12-hourly/daily/monthly output as separate data.
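
If the grouping were working as intended, a search on just that variable would come back with a handful of dataset keys - roughly one per frequency - rather than thousands. Something like this (illustrative sketch only; the exact key names depend on the datastore):

# With correct grouping, to_dataset_dict() returns a small dict.
dset_dict = cat.search(variable="fld_s03i234").to_dataset_dict()
print(list(dset_dict.keys()))  # e.g. three keys: 12-hourly, daily, monthly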

The .keys() call gives a long list of files, starting with daily - I assume it will also go through monthly etc.

Below is the output of cat.search(variable="fld_s03i234", frequency="1mon").

Looks like the builder hasn’t correctly identified that these files comprise a single dataset. @CharlesTurner is going to dig into this. Thanks for reporting @wghuneke!


Thanks for looking into it!

Curiously, I’m actually getting a slightly different datastore out - but it has similar issues, so we’ll see if fixing the issues with my reproduction fixes yours too.

I generated the datastore using

(venv) Singularity> build-esm-datastore --builder AccessCm2Builder --expt-dir /g/data/lg87/wgh581 --cat-dir ~/wilma --builder-kwargs ensemble=False
Generating esm-datastore for /g/data/lg87/wgh581
Building esm-datastore...
Sucessfully built esm-datastore!
Saving esm-datastore to /home/189/ct1163/wilma
/home/189/ct1163/access-nri-intake-catalog/bin/venv/lib/python3.11/site-packages/intake_esm/cat.py:187: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
  data = self.dict().copy()
Successfully wrote ESM catalog json file to: file:///home/189/ct1163/wilma/experiment_datastore.json
Hashing catalog to prevent unnecessary rebuilds.
This may take some time...
Catalog sucessfully hashed!
Datastore sucessfully written to /home/189/ct1163/wilma/experiment_datastore.json!
Please note that this has not added the datastore to the access-nri-intake catalog.
To add to catalog, please run 'scaffold-catalog-entry' for help on how to do so.
To open the datastore, run `intake.open_esm_datastore('/home/189/ct1163/wilma/experiment_datastore.json', columns_with_iterables=['variable'])` in a Python session.

N.B: I found an issue in build-esm-datastore related to builder kwargs, so there’ll be a bugfix going in. Anyhow, after running that, I opened the datastore in a notebook:

Similarly, thousands of datasets - and it seems in my case to be differentiated on file_id:

>>> esm_ds.unique().file_id
['iceh_d_XXXX_XX',
 'iceh_m_XXXX_XX',
 'ocean_ym_0001_01',
 'ocean_ym_0001_07',
 'ocean_ym_0002_01',
...
'ocean_ym_0499_01',
 'ocean_ym_0499_07',
 ...]

It looks an awful lot like these aren't being redacted correctly - i.e. numbers changed from, e.g., 0001_01 to XXXX_XX - and this is causing them not to be correctly identified as the same dataset.
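
To illustrate what I mean by "redacted" (a toy sketch only, not the builder's actual code):

import re

# Toy example: replace the date part of a filename with a placeholder, so
# every file in the time series ends up with the same file_id and gets
# grouped into a single dataset.
def toy_file_id(filename: str) -> str:
    return re.sub(r"\d{4}_\d{2}", "XXXX_XX", filename)

print(toy_file_id("ocean_ym_0001_01"))  # -> "ocean_ym_XXXX_XX"
print(toy_file_id("ocean_ym_0499_07"))  # -> "ocean_ym_XXXX_XX"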

I'm gonna leave it here for now, but I'll look into it first thing Monday - hopefully not too complicated a fix.


Okay, it looks like the issue is that the AccessCm2Builder builder doesn’t contain a pattern to match ocean_ym_YYYY_MM.

Is anyone able to confirm for me that this is a standard filename pattern that it ought to contain? If so I’ll add it to the builder - then @wghuneke I can paste a block of code in here that will rebuild your datastore with the correct dataset groupings.


Thank you for looking into this.

I changed the output for this specific model run - good to know that the file naming is relevant. It would be great if you could add the filename pattern to the code.

It is interesting though that I can load the ocean data (where the issue seems to be) with the catalogue, but not the atmospheric data.

I think that is due to the default nesting depth - builders will recurse down a certain number of subdirectories - currently 3.

Atmospheric output for your experiment here is in /g/data/lg87/wgh581/cz861/history/atm/netCDF/, whereas ocean data is in /g/data/lg87/wgh581/cz861/history/ocn/.

Based on where the data sits, I'm assuming the builder is trying to recurse down 4 levels for the atmosphere - but only 3 for the ocean.

I rebuilt the dataset as follows:

$ build-esm-datastore --builder AccessCm2Builder --expt-dir /g/data/lg87/wgh581/cz861 --cat-dir ~/wilma --builder-kwargs ensemble=False

which, after rebuilding, now includes the realm:

>>> import intake 
>>> intake.open_esm_datastore(
    '/home/189/ct1163/wilma/experiment_datastore.json',
    columns_with_iterables=['variable']
).unique().realm
['atmos', 'seaIce', 'ocean']

This is something I think could be better signposted, and perhaps made user-configurable.
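
With the rebuilt datastore, a search like the following should now collapse to a single dataset (an untested sketch):

import intake

cat = intake.open_esm_datastore(
    '/home/189/ct1163/wilma/experiment_datastore.json',
    columns_with_iterables=['variable']
)
# Monthly atmospheric latent heat flux as a single lazy xarray Dataset.
ds = cat.search(variable="fld_s03i234", realm="atmos", frequency="1mon").to_dask()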


Great, thanks!

I will try to build my own version. I did the first version on ARE using this builder:
from access_nri_intake.source.builders import AccessCm2Builder

Is it updated already?

No, unfortunately the builder that you can access that way won't update until we release a new version of the package. There are a few routes we can go from here, in order of decreasing difficulty (but potentially also decreasing utility, and probably more waiting):

  1. Update the builder yourself: you could clone the repo and add the necessary patterns to the builder to get it to work for your dataset. This might sound a bit scary (depending on your level of Python experience), but I can give you some steps to follow (roughly sketched after this list) so you can add the patterns you need. If you'd like to follow this route, once the datastore is built you'd be able to use it like any other datastore.
  • Pros: Fastest, most control
  • Cons: Hardest
  2. I can create a branch for you with what look (to me) like the correct patterns for your datastore, and build a datastore using those patterns - which you'd be able to use like any other datastore.
  • Pros: Much easier than 1
  • Cons: If there are issues with the datastore, iterating to a correct one will take more time - you'll have to tell me what's wrong and wait for me to fix it for you
  3. Wait for the next release - which, given we know about this issue, should cover your use case.
  • Pros: Easiest
  • Cons: Very slow compared to 1 and 2.
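
For route 1, the rough shape of the steps would be something like this (a sketch only, assuming a standard editable install; the exact pattern to add to the builders module is what I'd send you):

# Clone the catalogue package and install it in editable mode.
git clone https://github.com/ACCESS-NRI/access-nri-intake-catalog.git
cd access-nri-intake-catalog
pip install -e .

# Add a pattern matching ocean_ym_YYYY_MM to the AccessCm2Builder in
# access_nri_intake/source/builders.py, then rebuild the datastore:
build-esm-datastore --builder AccessCm2Builder --expt-dir /g/data/lg87/wgh581/cz861 --cat-dir <your output dir> --builder-kwargs ensemble=False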

Let me know what suits!

Thanks, I’ll DM you but have accepted the comment on nesting depth as the solution.
