Intake-ESM and 3hr data

I am trying to identify which CMIP6 models at NCI have 3hr data, without manually trawling through them all.

Something like this:

import intake
cmip6 = intake.open_esm_datastore("/g/data/dk92/catalog/v2/esm/cmip6-oi10/catalog.json")
values_dict = cmip6.unique()

models_list = values_dict.source_id

but take a subset for testing

small_list = ["EC-Earth3", "CESM2", "GFDL-ESM2M", "GFDL-ESM4", "MIROC6", "NorESM2-MM", "CMCC-ESM2"]

variables of interest

three_hr_data = ["huss", "tas", "uas", "vas", "ps", "pr"]

testing with one model

cat_subset = cmip6.search(
    source_id=["EC-Earth3"],
    experiment_id=["ssp370"],
    table_id="3hr",
    variable_id="huss",
    grid_label=["gn", "gr"],
)

cat_subset

cmip6-oi10 catalog with 5 dataset(s) from 430 asset(s):
etc

dset_dict = cat_subset.to_dataset_dict(
    xarray_open_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
)

Which then fails with this error:

ESMDataSourceError: Failed to load dataset with key='f.ScenarioMIP.EC-Earth-Consortium.EC-Earth3.ssp370.r6i1p1f1.3hrPt.atmos.3hr.huss.gr.v20200201'
You can use cat['f.ScenarioMIP.EC-Earth-Consortium.EC-Earth3.ssp370.r6i1p1f1.3hrPt.atmos.3hr.huss.gr.v20200201'].df to inspect the assets/files for this key.

Why does it say ‘3hrPt.atmos.3hr’? That is not what I was expecting. The actual path is:
/g/data/oi10/replicas/CMIP6/ScenarioMIP/EC-Earth-Consortium/EC-Earth3/ssp370/r1i1p1f1/3hr/huss/gr/v20200310

Can you suggest how to get this to work in a useful way? At the end of the day, I want a list of paths similar to the above, for all the models with 3hr data in the three_hr_data list. Many thanks.

FYI - we’re working on this over here, for those interested: Using Intake-ESM · Issue #5 · shared-climate-data-problems/CMIP-data-problems · GitHub

@rb4844 did you include gdata/oi10 in the list of storage paths when you started your job (presumably on the ARE)? My guess is that’s why you can’t open the data.

Regarding the 3hrPt in the key name (which I assume is what you were surprised to see), this is the entry in the “frequency” column for these assets. If you think this might be wrong, I’d suggest reaching out to the NCI helpdesk since NCI created and manage these datastores.
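To make the key easier to read, here is a minimal sketch that splits it into named fields. The field names and their order are my assumption, inferred from the key in the error message; the authoritative list is the groupby_attrs entry in the catalogue's JSON file, so check against that. Only the "frequency" position is confirmed above.

```python
# ASSUMED field order, inferred from the key in the error message.
# Check it against "groupby_attrs" in the catalogue JSON before relying on it.
ASSUMED_FIELDS = (
    "file_type", "activity_id", "institution_id", "source_id",
    "experiment_id", "member_id", "frequency", "realm",
    "table_id", "variable_id", "grid_label", "version",
)


def parse_key(key, fields=ASSUMED_FIELDS, sep="."):
    """Split a dataset key into a dict of {field name: value}."""
    parts = key.split(sep)
    if len(parts) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(parts)}")
    return dict(zip(fields, parts))


key = ("f.ScenarioMIP.EC-Earth-Consortium.EC-Earth3.ssp370."
       "r6i1p1f1.3hrPt.atmos.3hr.huss.gr.v20200201")
info = parse_key(key)
print(info["frequency"], info["table_id"])  # 3hrPt 3hr
```

Read this way, ‘3hrPt.atmos.3hr’ is three separate fields (frequency, realm, table), not a single path component.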

Hi Dougie,

Yes, I have gdata/oi10. The environment on ARE is loaded from /g/data/hh5/public/modules, which I assume is fine since import intake works.

Storage is

gdata/dk7+gdata/dk92+gdata/p73+gdata/rn45+gdata/hh5+gdata/xv83+gdata/oi10+gdata/r87+gdata/fs38+gdata/rr3+gdata/cj37+gdata/rt52+gdata/lp01
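For anyone who would rather check this than eyeball it, here is a small sketch that compares the /g/data projects a set of paths needs against a storage string like the one above. The function name is made up for illustration, and it assumes every path of interest has the form /g/data/<project>/...

```python
def missing_gdata_projects(paths, storage):
    """Return /g/data projects used by `paths` that are not in `storage`."""
    # "gdata/oi10+gdata/dk92" -> {"oi10", "dk92"}
    mounted = {
        flag.split("/", 1)[1]
        for flag in storage.split("+")
        if flag.startswith("gdata/")
    }
    needed = set()
    for path in paths:
        parts = path.split("/")  # "/g/data/oi10/x" -> ["", "g", "data", "oi10", "x"]
        if len(parts) > 3 and parts[1:3] == ["g", "data"]:
            needed.add(parts[3])
    return sorted(needed - mounted)


storage = (
    "gdata/dk7+gdata/dk92+gdata/p73+gdata/rn45+gdata/hh5+gdata/xv83+gdata/oi10"
    "+gdata/r87+gdata/fs38+gdata/rr3+gdata/cj37+gdata/rt52+gdata/lp01"
)
paths = [
    "/g/data/dk92/catalog/v2/esm/cmip6-oi10/catalog.json",
    "/g/data/oi10/replicas/CMIP6/ScenarioMIP/EC-Earth-Consortium/EC-Earth3"
    "/ssp370/r1i1p1f1/3hr/huss/gr/v20200310",
]
print(missing_gdata_projects(paths, storage))  # []
```

An empty list means both the catalogue (dk92) and the data (oi10) are covered by the storage flags.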

consolidated is not a valid kwarg for xarray.open_dataset with netcdf files (it is an option for Zarr stores). Removing that, I can open the data using your code:

import intake
from distributed import Client

client = Client(threads_per_worker=1)

cmip6 = intake.open_esm_datastore(
    "/g/data/dk92/catalog/v2/esm/cmip6-oi10/catalog.json"
)

cat_subset = cmip6.search(
    source_id=['EC-Earth3'],
    experiment_id=["ssp370"],
    table_id="3hr",
    variable_id="huss",
    grid_label=["gn", 'gr'],
)

dset_dict = cat_subset.to_dataset_dict(
    xarray_open_kwargs={"decode_times": True, "use_cftime": True}
)
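If the end goal is a list of directories rather than opened datasets, one option is to skip to_dataset_dict and work from the catalogue table (cat_subset.df) instead. Below is a rough sketch under some assumptions: the catalogue has "path" and "source_id" columns, and the table has been turned into plain records with cat_subset.df.to_dict("records"). The records here are stand-ins with invented file names so the example runs anywhere.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def dataset_dirs_by_model(records, path_key="path", model_key="source_id"):
    """Group the unique parent directories of catalogue files by model."""
    dirs = defaultdict(set)
    for rec in records:
        dirs[rec[model_key]].add(str(PurePosixPath(rec[path_key]).parent))
    return {model: sorted(paths) for model, paths in sorted(dirs.items())}


# Stand-in records; the file names are invented for illustration.
# With the real catalogue: records = cat_subset.df.to_dict("records")
base = ("/g/data/oi10/replicas/CMIP6/ScenarioMIP/EC-Earth-Consortium"
        "/EC-Earth3/ssp370/r1i1p1f1/3hr/huss/gr/v20200310")
records = [
    {"source_id": "EC-Earth3", "path": base + "/file_a.nc"},
    {"source_id": "EC-Earth3", "path": base + "/file_b.nc"},
]

for model, paths in dataset_dirs_by_model(records).items():
    for path in paths:
        print(model, path)
```

Passing lists to the search, e.g. source_id=small_list and variable_id=three_hr_data, should then cover several models and variables in one go.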

Awesome, that worked now. Now to figure out what the result means. … Many thanks

Any ideas @rb4844 why this wasn’t a solution? Using Intake-ESM · Issue #5 · shared-climate-data-problems/CMIP-data-problems · GitHub

Not really. There must be some error in my original, longer version of the code that I can’t see (not the first time that’s happened, lol). Using the simpler version does work, so next step is to develop that to loop over more models. Thanks for looking at this though, appreciated.
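A rough sketch of that next step, for what it's worth: rather than looping over models one at a time, search the catalogue once for table_id="3hr" and then work out which models have every variable of interest. It assumes "source_id" and "variable_id" columns, and the records below are invented placeholders (not real catalogue contents) standing in for cat.df.to_dict("records").

```python
from collections import defaultdict


def models_with_all_variables(records, variables):
    """Return the models that have every variable in `variables`."""
    found = defaultdict(set)
    for rec in records:
        found[rec["source_id"]].add(rec["variable_id"])
    wanted = set(variables)
    return sorted(model for model, have in found.items() if wanted <= have)


# Invented placeholder records, NOT what the catalogue actually contains.
records = [
    {"source_id": "EC-Earth3", "variable_id": "huss"},
    {"source_id": "EC-Earth3", "variable_id": "tas"},
    {"source_id": "EC-Earth3", "variable_id": "pr"},
    {"source_id": "MIROC6", "variable_id": "huss"},
]

print(models_with_all_variables(records, ["huss", "tas"]))  # ['EC-Earth3']
```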

no worries - always good to get a solution!