Combining intake datastores and missing files from custom experiment datastore

Hi Intake Team,
I have a couple of questions about intake datastores. I have run a regional OM3 model using the ACCESS-OM3 infrastructure, which automatically made intake_esm_ds.json files at the end of the experiment. I have two issues which I was hoping you may be able to help with:

  1. I ran the model in three different folders (slight parameter changes in each case, but the same diagnostics and file structures). This means I have three datastores. Is it possible with the current infrastructure to merge the three datastores in the notebook where I am analysing them? Alternatively, should I remake a single datastore by putting all the output in one folder and using the script at the end of the COSIMA Cookbook documentation page "Building Intake-ESM datastores of ACCESS model output"? Can this script pick up output that sits behind symbolic links?
  2. Some of the diagnostics didn’t get picked up by the esm datastore making script. I think this is because they were not named with the access-om3.mom6.XX.nc convention, instead they are called something like ocean_daily_z.nc. Do you have any suggestions that might allow these files to be added to the datastore?

Thank you!
Claire

Hi Claire,

  1. Currently, it’s not straightforward to merge datastores, at least not to my knowledge, although I don’t think there’s a good reason for this - I’m pretty sure the functionality exists, it’s just not that well exposed. I’ll do some digging and get you a snippet of code that should let you combine the three, because I’m pretty sure it should be easy. It strikes me that this is probably something we’d like to expose utility tools for too - could you open an issue on the catalog repo and I’ll get to it as soon as I get the chance?

  2. This is an irritating and somewhat fundamental issue with the way our builders currently scan directories to build esm-datastores. @marc.white has done some fantastic work to fix this that we’re just putting the final touches on, but for now if you can point me to the directories where these files live I can update the filename conventions in the builder so that it’ll pick up these files.
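
As a rough illustration of why a name-based scan skips these files, here is a minimal sketch. The regular expression is a made-up stand-in for the access-om3.mom6.XX.nc convention mentioned above, not the builder's actual pattern, and `split_by_convention` and the example filename `access-om3.mom6.example.nc` are hypothetical.

```python
import re

# ASSUMPTION: an illustrative pattern only. The real builder's filename
# matching may be quite different from this.
ASSUMED_PATTERN = re.compile(r"^access-om3\.mom6\..+\.nc$")


def split_by_convention(filenames):
    """Split filenames into those that match the assumed pattern and those that do not."""
    matched = [name for name in filenames if ASSUMED_PATTERN.match(name)]
    skipped = [name for name in filenames if not ASSUMED_PATTERN.match(name)]
    return matched, skipped


matched, skipped = split_by_convention(
    [
        "access-om3.mom6.example.nc",  # hypothetical name following the convention
        "ocean_daily_z.nc",            # name reported as missing from the datastore
    ]
)
print("matched:", matched)
print("skipped:", skipped)
```

Under this assumed pattern, a file such as ocean_daily_z.nc never matches, so a scan that only keeps matching names would leave it out until the pattern is widened.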

Cheers, Charles

Okay, this code should work to combine your datastores:

from intake_esm.core import esm_datastore
import pandas as pd

def combine_datastores(*esm_datastores) -> esm_datastore:
    """
    Takes a bunch of esm_datastores and returns a combined esm_datastore.

    Still very much a first draft but should work for consistently shaped
    datastores.

    Usage: combine_datastores(esm_datastore_1, esm_datastore_2, ... , esm_datastore_n)

    To get your datastores:

        import intake
        esm_ds1 = intake.open_esm_datastore(
            "/path/to/datastore1.json",
            columns_with_iterables=['variable'],  # Probably
        )
    """
    # Reuse the catalogue description from the first datastore; this assumes
    # all of the datastores share the same columns and aggregation settings.
    esmcat_dict = esm_datastores[0].esmcat.dict()
    # Stack the per-datastore tables of files into a single table.
    df_list = [datastore.df for datastore in esm_datastores]
    df = pd.concat(df_list, ignore_index=True)

    return esm_datastore({'esmcat': esmcat_dict, 'df': df})
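
Since the three runs share the same diagnostics and file structure, it may help to record which run each row came from before concatenating. The following is a minimal sketch using only pandas on toy tables; `tag_and_concat`, the `experiment` column name, and the example paths are hypothetical, and I haven't checked whether intake-esm needs the extra column declared in the catalogue description.

```python
import pandas as pd


def tag_and_concat(frames_by_label, column="experiment"):
    """Concatenate tables, adding a column that records each row's source label."""
    tagged = [
        frame.assign(**{column: label})
        for label, frame in frames_by_label.items()
    ]
    return pd.concat(tagged, ignore_index=True)


# Toy stand-ins for the `.df` attribute of two datastores (made-up paths).
run_a = pd.DataFrame({"path": ["/runA/output.nc"], "variable": [("temp",)]})
run_b = pd.DataFrame({"path": ["/runB/output.nc"], "variable": [("temp",)]})

combined = tag_and_concat({"run_a": run_a, "run_b": run_b})
print(combined)
```

With real datastores, the dictionary values would be something like `esm_ds1.df`, and the resulting table could be passed on in place of the plain `pd.concat` result.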

Once you have a combined datastore (called combined_datastore here), you can save it with

combined_datastore.serialize(
    name="experiment_datastore",  # saves to `experiment_datastore.json`
    directory="/path/to/a/dir",  # puts it in directory `/path/to/a/dir`
    catalog_type='file',
)

Let me know if you have any issues & I’ll do my best to help!

Cheers, Charles

Awesome, thanks so much @CharlesTurner, that script combined the datastores!

On the files that don’t make it in - I’ve moved all the output files to /g/data/ol01/cy8964/access-om3/archive/8km_jra_ryf_obc_Charrassin/. The ones that don’t make it in are ocean_month_z.nc and ocean_month.nc.

Thanks!