Making an Intake datastore for a pan-Antarctic simulation

Hello, I'm a new Honours student at the Australian National University, and I'm trying to build an intake datastore for a new pan-Antarctic simulation. I have been trying to follow the Make_Your_Own_Intake_Datastore.ipynb tutorial in the COSIMA/cosima-recipes repository on GitHub, but none of the builder options have been working for my data (found here: /g/data/ik11/outputs/mom6-panan/panant-01-zstar-ACCESSyr2/). Any help would be much appreciated! Thanks.

2 Likes

Hi Katja

@CharlesTurner and @marc.white have done most of the work to make a builder for these configurations, but it is not yet available in the hh5 conda analysis environment.

If panant-01-zstar-v13 is ok for you, then you can open it separately like this:

from intake import open_esm_datastore

datastore = open_esm_datastore(
    '/g/data/xp65/public/apps/access-nri-intake-catalog/v2024-12-10/source/panant-01-zstar-v13.json',
    columns_with_iterables=[
        "variable",
        "variable_long_name",
        "variable_standard_name",
        "variable_cell_methods",
        "variable_units",
    ],
)
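As an aside, the `columns_with_iterables` argument tells intake-esm that those columns hold lists rather than scalars (a single netCDF file typically contains many variables), so `search` does membership tests on them rather than plain equality. A toy pandas illustration of that behaviour, with made-up file names:

```python
import pandas as pd

# Each row describes one file; the "variable" cell is a list,
# mirroring how the datastore CSV stores iterable columns
df = pd.DataFrame({
    "path": ["ocean_month.nc", "ocean_static.nc"],
    "variable": [["thetao", "so"], ["areacello", "deptho"]],
})

# Plain equality can't match inside list cells; search() effectively
# does a membership test on columns declared in columns_with_iterables
hits = df[df["variable"].map(lambda vs: "areacello" in vs)]
print(hits["path"].tolist())  # ['ocean_static.nc']
```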

Otherwise, if you need panant-01-zstar-ACCESSyr2, you could try the pre-release of the new builders. Be warned that there may still be issues with the pre-release environment, though. To do this, in ARE set Module directories to /g/data/xp65/public/modules and Modules to conda/analysis3-25.01, and then make your datastore. I would then switch back to conda/analysis3-24.04 in /g/data/hh5 for analysis, as there are still some issues handling netCDF files in the pre-release environment to be resolved.

Edit: see below - Charles’ method is better :slight_smile:

1 Like

I’ll take a look at this when Gadi comes back online - I’m assuming the standard Mom6Builder that Marc built should work for this.

@KZCurtin, as of right now you will need to install the access-nri-intake-catalog package directly into a virtual environment in order to get access to the features that are currently unreleased (but tested).

To do so (NB: I’m writing this during Gadi downtime so I’m going from memory; these steps might be slightly wrong - I’ll double check once it’s back up):

  1. SSH into gadi, load the conda/analysis3 environment, and create a new virtual env:
user@local_machine $ ssh gadi 
user@gadi $ mkdir catalog_dir && cd catalog_dir # Change this to whatever you like/seems sensible to you 
user@gadi $ module load conda/analysis3
user@gadi $ python -m venv venv 
user@gadi $ source venv/bin/activate
user@gadi (venv) $ pip install git+https://github.com/ACCESS-NRI/access-nri-intake-catalog#egg=access_nri_intake
user@gadi (venv) $ build-esm-datastore --builder Mom6Builder --expt-dir  /g/data/ik11/outputs/mom6-panan/panant-01-zstar-ACCESSyr2/ --cat-dir .

All being well, this should generate an esm-datastore locally, which you can then open from a regular ARE instance with no more faffing around with virtual environments:

intake.open_esm_datastore(
    "$YOUR_HOME_DIR/catalog_dir/experiment_datastore.json",
    columns_with_iterables=[
        "variable",
        "variable_standard_name",
        "variable_cell_methods",
        "variable_units",
    ],
)

I’ll come back and double check/update this when Gadi is back up - hopefully later today or early tomorrow.

1 Like

Just flagging that I’ve opened an issue over at COSIMA recipes to incorporate the solution we find here into the existing COSIMA Make_Your_Own_Intake_Datastore tutorial in case anyone has time to update that.

1 Like

I’m planning to do that as soon as we get this feature into release - I’ll cross link the issues so it doesn’t get lost - thanks for opening it!

2 Likes

@KZCurtin - I’ve just tested the solution I posted above and it should work out of the box for you. The output of build-esm-datastore will additionally give you some instructions on how to open the built datastore, e.g. for me:

$ build-esm-datastore --builder Mom6Builder --expt-dir  /g/data/ik11/outputs/mom6-panan/panant-01-zstar-ACCESSyr2/ --cat-dir .
...
Sucessfully built esm-datastore!
Saving esm-datastore to /home/189/ct1163/catalog_dir
/home/189/ct1163/catalog_dir/venv/lib/python3.11/site-packages/intake_esm/cat.py:186: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
  data = self.dict().copy()
Successfully wrote ESM catalog json file to: file:///home/189/ct1163/catalog_dir/experiment_datastore.json
Hashing catalog to prevent unnecessary rebuilds.
This may take some time...
Catalog sucessfully hashed!
Datastore sucessfully written to /home/189/ct1163/catalog_dir/experiment_datastore.json!
Please note that this has not added the datastore to the access-nri-intake catalog.
To add to catalog, please run 'scaffold-catalog-entry' for help on how to do so.
To open the datastore, run `intake.open_esm_datastore('/home/189/ct1163/catalog_dir/experiment_datastore.json', columns_with_iterables=['variable'])` in a Python session.

I’ve opened the datastore in a Jupyter notebook too & it all looks correct. Let me know if you have any issues!

2 Likes

Great, thank you all so much for the help!

1 Like

This has worked for me, thank you so much Charles!

One question though: as new years are run of the experiment, do we need to do the build again? Or does it update automatically?

1 Like

Unfortunately, the esm-datastores are static files, so once they’re built, they don’t change. With that in mind though, we made the build-esm-datastore tool able to account for this - if you run the same command again, build-esm-datastore will:

  1. Check the files listed in the datastore against the files in the experiment directory - missing files, extra files, mismatched paths, broken specs, etc. It should be very comprehensive.
  2. Check the hashes (a short fingerprint of each file’s contents) to detect whether any file itself has changed since the last build.
  3. If all checks pass, the datastore doesn’t need rebuilding, and the tool will confirm this.
  4. If any check fails, the datastore will be rebuilt.
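The check logic amounts to roughly the following - a simplified sketch for illustration only (the function names here are made up, and the real implementation in access-nri-intake does considerably more):

```python
import hashlib
from pathlib import Path


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks, so large netCDF files don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_rebuild(expt_dir: Path, recorded: dict[str, str]) -> bool:
    """Compare files on disk against a {path: hash} record from the last build."""
    on_disk = {str(p) for p in expt_dir.rglob("*.nc")}
    if on_disk != set(recorded):
        return True  # missing or extra files: rebuild
    # Same file list - rebuild only if any file's contents have changed
    return any(file_hash(Path(p)) != h for p, h in recorded.items())
```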

Since we’re still in pre-release, this isn’t all documented nicely yet, but running (in this instance)

$ build-esm-datastore --builder Mom6Builder --expt-dir  /g/data/ik11/outputs/mom6-panan/panant-01-zstar-ACCESSyr2/ --cat-dir .
Datastore found in current directory, verifying datastore integrity...
Parsing experiment dir...
Experiment directory and datastore do not match (missing files from datastore). Datastore regeneration required...
Building esm-datastore...
...
Sucessfully built esm-datastore!
Saving esm-datastore to /home/189/ct1163/catalog_dir
/home/189/ct1163/catalog_dir/venv/lib/python3.11/site-packages/intake_esm/cat.py:186: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
  data = self.dict().copy()
Successfully wrote ESM catalog json file to: file:///home/189/ct1163/catalog_dir/experiment_datastore.json
Hashing catalog to prevent unnecessary rebuilds.
This may take some time...
Catalog sucessfully hashed!
Datastore sucessfully written to /home/189/ct1163/catalog_dir/experiment_datastore.json!
Please note that this has not added the datastore to the access-nri-intake catalog.
To add to catalog, please run 'scaffold-catalog-entry' for help on how to do so.
To open the datastore, run `intake.open_esm_datastore('/home/189/ct1163/catalog_dir/experiment_datastore.json', columns_with_iterables=['variable'])` in a Python session.

I guess the model run has been extended over the last couple of days? If not I’ll need to double check the tool (like I said, still pre-release :sweat_smile: ).

Good luck, and if you have any issues/queries, please ask!

1 Like

Yep, that’s perfect! We are extending that run, so I’ll keep in mind updating the datastore as I go along. Thank you!

@JuliaN, if you’re running with Payu, you could add a userscript to automatically update the datastore at the end of each run - see the docs here.

(Note, future ACCESS-NRI configuration releases will include this by default)
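For reference, payu userscripts live in the experiment’s config.yaml. Something along these lines should do it, though the script path here is hypothetical and the exact hook is worth checking against the payu documentation:

```yaml
# config.yaml: run a script after each run's output has been archived
userscripts:
    archive: ./scripts/rebuild_datastore.sh   # e.g. re-runs build-esm-datastore
```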

2 Likes

@KZCurtin, @JuliaN are you able to mark working solutions with the little checkbox so future users can find them more easily? Thanks!

I think only @KZCurtin can do it since she’s the one posting :slight_smile:

Done now!

1 Like

Hey I have a follow up question!

I’ve tried to add a second experiment to my datastore… and I’ve overwritten my original one. Is this because there’s something to fix in the metadata so that the builder knows it’s a different experiment? Or do I need to do a new build with a complete list of experiments every time?

I also think I might have broken something because now I can’t open any variable.

import intake

catalog = intake.open_esm_datastore('/home/561/jn8053/catalog_dir/experiment_datastore.json', columns_with_iterables=['variable'])

catalog.search(variable = 'areacello', path = "*.output021*.").to_dask()

ESMDataSourceError: Failed to load dataset with key='XXXXXXXX_ocean_static.fx'

The experiment path is /g/data/g40/akm157/model_output/mom6-panan/panant-01-zstar-ssp126-MW-only and I know for sure that areacello is there.

Hey Julia,

As of right now, the build-esm-datastore tool can only handle a single experiment, so you’ll need a new esm-datastore for each experiment. Originally, we envisioned this tool working much like the “build your own datastore” notebook that Anton wrote, with the datastore living alongside the data itself - so as of right now, build-esm-datastore won’t let you configure the datastore name, hence the annoying overwrite of the previous datastore.

Hindsight is 20/20 though, and it’s probably more helpful for users to have something like this:

$ ls ~/esm_datastores
experiment_1.json
experiment_1.csv.gz
experiment_2.json
experiment_2.csv.gz
...


When it comes to having multiple experiments in a single datastore, this is unfortunately not something that can be done easily - an ESM datastore is designed to be specific to a single experiment. I think you would need something called a dataframe-catalog in order to concatenate multiple experiment datastores together - this is how the intake.cat.access_nri catalog works.

@dougiesquire might be able to weigh in here - we could probably build a tool to concatenate these together so users can make their own mini catalogues.
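In the meantime, a rough way to fake a mini catalogue yourself is to stack the .csv halves of several datastores with pandas. This is only a sketch (the file names are made up, and the real dataframe-catalog machinery also handles metadata, iterable columns, and so on):

```python
import pandas as pd


def combine_datastores(csvs: dict[str, str]) -> pd.DataFrame:
    """Stack several experiment datastore CSVs into one table,
    tagging each row with the experiment it came from."""
    frames = []
    for name, csv in csvs.items():
        df = pd.read_csv(csv)
        df.insert(0, "experiment", name)  # remember the source experiment
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```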

Either way:

  • I’ve opened a ticket to let users specify the datastore names, so I’ll let you know when that feature is available.
  • I’ve requested to join the g40 group, so I’ll see if I can work out what’s wrong with the datastore.

The ability to specify datastore names (and so put multiple datastores in the same directory) has now been implemented. To pick it up, I’d recommend deleting the venv you created by following the steps above and then running through the steps again - the fresh install should grab the update.

Hopefully the quickstart guide will be enough to get you going (and please do try that first - it helps us know whether it’s useful!), but if not:

user@local_machine $ ssh gadi 
user@gadi $ cd catalog_dir
user@gadi $ rm -fr venv
user@gadi $ module load conda/analysis3
user@gadi $ python -m venv venv 
user@gadi $ source venv/bin/activate
user@gadi (venv) $ pip install git+https://github.com/ACCESS-NRI/access-nri-intake-catalog#egg=access_nri_intake
user@gadi (venv) $ build-esm-datastore --builder Mom6Builder --expt-dir  /g/data/ik11/outputs/mom6-panan/panant-01-zstar-ACCESSyr2/ --cat-dir . --datastore-name panant_01_zstar_access_yr2
user@gadi (venv) $ build-esm-datastore --builder Mom6Builder --expt-dir  /g/data/g40/akm157/model_output/mom6-panan/panant-01-zstar-ssp126-MW-only --cat-dir . --datastore-name panant_01_zstar_ssp126

That should get you there.