Hi. I created an intake catalogue to read CM2-025 output following the instructions from the NRI quickstart guide. I believe it worked fine; the datastore is under /g/data/lg87/wgh581.
I have two questions when reading the data:
When I load data, say 500 years of global ocean temperature, like this: cat.search(variable="temp_global_ave").to_dask().temp_global_ave it sometimes runs, taking about 20-30 s, and sometimes it gets stuck loading the data. Restarting the kernel helps, but I’m not sure what’s different between when it works and when it doesn’t. Any ideas?
I struggle to load data from the atmospheric model, e.g., latent heat flux: cat.search(variable="fld_s03i234").to_dask() gives me the following error message:
ValueError: Expected exactly one dataset. Received 12000 datasets. Please refine your search or use .to_dataset_dict().
I can add the .to_dataset_dict(), but then it gets stuck loading the data. The 12000 files are half daily, half monthly output; I’m interested in the monthly only. I assume this has something to do with how the model output is saved?
Apologies if this or something similar was discussed at today’s COSIMA training session, unfortunately, I couldn’t make it.
If you look at the search output each time, you can hopefully see that one of the fields will help you narrow down your search - playing around with this might help. Once you’re down to a single dataset, you can use .to_dask().
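For example (an untested sketch - the datastore JSON path here is illustrative, so substitute the file your build actually produced), you can inspect the asset table to see which columns distinguish the datasets:

```python
import intake

# Illustrative path - substitute the JSON file your build produced.
cat = intake.open_esm_datastore(
    "/g/data/lg87/wgh581/experiment_datastore.json",
    columns_with_iterables=["variable"],
)

subset = cat.search(variable="fld_s03i234")

# Inspect the asset table and the unique values per column to find a
# field (e.g. frequency) that separates monthly from daily output.
print(subset.df.head())
print(subset.unique())
```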
Regarding the first question, when you run something like cat.search(variable="temp_global_ave").to_dask().temp_global_ave, is the process simply hanging, or are you getting errors?
I’m not a member of lg87 but I’ll join & investigate - if you are getting errors though, it might be faster if you’re able to copy/paste them in here.
Normally when I get errors with an impossibly large number of datasets, it’s because I’ve forgotten the columns_with_iterables=["variable"] argument to open_esm_datastore().
Running .keys() or .keys_info() on the search result can also help. I think the dataset key should just be defined by filename.frequency?
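Something along these lines (a sketch, assuming the datastore is already open as `cat`):

```python
subset = cat.search(variable="fld_s03i234")

# keys() lists one key per dataset; keys_info() returns a table of the
# attribute combinations behind each key, which shows why a search
# resolves to many datasets.
print(subset.keys())
print(subset.keys_info())
```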
If @anton’s suggestion doesn’t work, it’s also possible that the datastore builder hasn’t correctly identified that these 12000 files comprise a single dataset. Are these output formatted in a different way than other output? Could you provide the path to one of these files?
cat.search(variable="temp_global_ave").to_dask().temp_global_ave seems to be just hanging, no errors.
I have used the columns_with_iterables=["variable"] argument.
What would be the aim of going down to a single dataset? What information am I after? For context: I want to load one variable for many time steps, but my understanding is that I cannot pass the time info as a search argument.
I haven’t changed the format of the atmospheric output. (I have for the ocean but that seems to work.)
The model output is under /g/data/lg87/wgh581/cz861/history
A single dataset in this case is all the files that it makes sense to concatenate together. E.g., a dataset in a catalog could be all the files from a model component, on the same grid, with the same frequency, containing the same variables.
It sounds like there could be three datasets for the fld_s03i234 variable, capturing the 12-hourly/daily/monthly output as separate data.
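If the frequency column has been populated correctly, something like this should isolate the monthly dataset (a sketch - the exact frequency label, e.g. "1mon", and the time range are guesses, so check the unique values in your datastore):

```python
# Narrow the search to monthly output only; check subset.unique() if
# "1mon" isn't the label your datastore actually uses.
monthly = cat.search(variable="fld_s03i234", frequency="1mon")

# Once the search resolves to a single dataset, to_dask() works.
ds = monthly.to_dask()

# Time selection happens lazily in xarray after loading; the time range
# here is just an example.
lhf = ds["fld_s03i234"].sel(time=slice("0001-01", "0100-12"))
```

That also covers the earlier question about time: you don’t pass a time range to the search; you slice the lazily loaded xarray object afterwards.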
Looks like the builder hasn’t correctly identified that these files comprise a single dataset. @CharlesTurner is going to dig into this. Thanks for reporting @wghuneke!
Curiously, I’m actually getting a slightly different datastore out - but it has similar issues, so we’ll see if fixing the issues with my reproduction fixes yours too.
I generated the datastore using:
(venv) Singularity> build-esm-datastore --builder AccessCm2Builder --expt-dir /g/data/lg87/wgh581 --cat-dir ~/wilma --builder-kwargs ensemble=False
Generating esm-datastore for /g/data/lg87/wgh581
Building esm-datastore...
Sucessfully built esm-datastore!
Saving esm-datastore to /home/189/ct1163/wilma
/home/189/ct1163/access-nri-intake-catalog/bin/venv/lib/python3.11/site-packages/intake_esm/cat.py:187: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
data = self.dict().copy()
Successfully wrote ESM catalog json file to: file:///home/189/ct1163/wilma/experiment_datastore.json
Hashing catalog to prevent unnecessary rebuilds.
This may take some time...
Catalog sucessfully hashed!
Datastore sucessfully written to /home/189/ct1163/wilma/experiment_datastore.json!
Please note that this has not added the datastore to the access-nri-intake catalog.
To add to catalog, please run 'scaffold-catalog-entry' for help on how to do so.
To open the datastore, run `intake.open_esm_datastore('/home/189/ct1163/wilma/experiment_datastore.json', columns_with_iterables=['variable'])` in a Python session.
N.B.: I found an issue in build-esm-datastore related to builder kwargs, so there’ll be a bugfix going in. Anyhow, after running that, I opened the datastore in a notebook.
It looks an awful lot like the date stamps in the filenames aren’t being redacted correctly - i.e., numbers changed from, e.g., 0001_01 to XXXX_XX - and this is causing the files not to be correctly identified as the same dataset.
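For illustration (a toy sketch of the intended behaviour, not the actual builder code from access-nri-intake), the grouping relies on something like:

```python
import re

# Toy sketch only - the real logic lives in the access-nri-intake builder.
def redact_timestamp(filename: str) -> str:
    # Mask the date stamp (e.g. 0001_01 -> XXXX_XX) so that files
    # differing only by time collapse to the same pattern and are
    # grouped into a single dataset.
    return re.sub(r"\d{4}_\d{2}", "XXXX_XX", filename)

# Hypothetical filename for illustration:
print(redact_timestamp("atmos_monthly.0001_01.nc"))  # atmos_monthly.XXXX_XX.nc
```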
I’m going to leave it here for now, but I’ll look into it first thing Monday - hopefully it’s not too complicated a fix.