Using intake catalog to pull CMIP models that match multiple criteria

Hello,

I am trying to use @Paola-CMS's intake catalogue to pull multiple CMIP6 variables that match multiple criteria. My code is:

import intake
cat = intake.cat.nci
cmip6 = cat['esgf'].cmip6
# Let's select a subset passing the search() method some constraints
subset1 = cmip6.search(activity_id='DAMIP', experiment_id='hist-nat', table_id='Omon', variable_id='thetao')
subset2 = cmip6.search(activity_id='DAMIP', experiment_id='hist-aer', table_id='Omon', variable_id='thetao')
subset3 = cmip6.search(activity_id='DAMIP', experiment_id='hist-GHG', table_id='Omon', variable_id='thetao')
subset4 = cmip6.search(activity_id='CMIP', experiment_id='historical', table_id='Omon', variable_id='thetao')

Basically, I only want the models/ensemble members that match all of the above criteria (i.e. subset1 & subset2 & subset3 & subset4). In the historical activity, for example, I don’t want all the instances with thetao, but rather the ones that have the same ensemble member and model as the matched hist-nat, hist-aer and hist-GHG runs. Is there any hint how this can be done? Thank you!

@taimoorsohail this is already possible in intake:

First, simplify your query by using lists wherever you need more than one value for a query facet, as for experiment_id and activity_id (the latter is probably redundant).

Then use require_all_on=['source_id'], since in this case you want to select the models that have all the selected experiments.

subset = cmip6.search(require_all_on=['source_id'], experiment_id=['hist-nat', 'historical', …], …)
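To make the effect of require_all_on concrete, here is a minimal pure-Python sketch of the idea, not the intake-esm implementation. The rows, model names and the require_all helper are all made up for illustration. It groups toy catalogue rows by a key and keeps only the groups that cover every requested experiment.

```python
from collections import defaultdict

# Hypothetical catalogue rows: (source_id, member_id, experiment_id).
# Model names and members are invented for illustration only.
rows = [
    ("MODEL-A", "r1i1p1f1", "historical"),
    ("MODEL-A", "r1i1p1f1", "hist-nat"),
    ("MODEL-A", "r1i1p1f1", "hist-aer"),
    ("MODEL-A", "r1i1p1f1", "hist-GHG"),
    ("MODEL-A", "r2i1p1f1", "historical"),
    ("MODEL-B", "r1i1p1f1", "historical"),
    ("MODEL-B", "r1i1p1f1", "hist-nat"),
]

wanted = {"historical", "hist-nat", "hist-aer", "hist-GHG"}


def require_all(rows, wanted, key):
    """Return the group keys whose rows cover every experiment in `wanted`."""
    found = defaultdict(set)
    for row in rows:
        found[key(row)].add(row[2])
    return sorted(k for k, exps in found.items() if wanted <= exps)


# Grouping on the model only: MODEL-A qualifies, MODEL-B does not.
by_model = require_all(rows, wanted, key=lambda r: r[0])
# Grouping on model and member: only MODEL-A r1i1p1f1 qualifies.
by_member = require_all(rows, wanted, key=lambda r: (r[0], r[1]))

print(by_model)
print(by_member)
```

Note that grouping on the model alone still keeps MODEL-A's r2i1p1f1 historical run, even though that member has no DAMIP counterpart. If the catalogue exposes a member column (an assumption here; the column name may differ), adding it to require_all_on would be the analogue of the second grouping.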

Thanks @Paola-CMS ! This is exactly what I wanted, and it almost works - but I am having a strange bug. If I follow your instructions and pull all models (and their associated ensemble members) which have both thetao and tas it works and gives me a lot of datasets:

import intake

cat = intake.cat.nci
cmip6 = cat['esgf'].cmip6
subset = cmip6.search(require_all_on=['source_id'],
                      experiment_id=['hist-nat', 'hist-aer', 'hist-GHG', 'historical'],
                      variable_id=['thetao', 'tas'])

But this includes daily and hourly fields. I just want monthly. So when I add table_id to this:

import intake

cat = intake.cat.nci
cmip6 = cat['esgf'].cmip6
subset = cmip6.search(require_all_on=['source_id'],
                      experiment_id=['hist-nat', 'hist-aer', 'hist-GHG', 'historical'],
                      table_id=['Omon', 'Amon'], variable_id=['thetao', 'tas'])
subset

I get the error:

ValueError: Length of values (0) does not match length of index (12)

I know for a fact that there are models and ensemble members which match my search criteria (I have manually found them in fs38 and oi10) so not sure what this error is!

That is weird. For the moment you could bypass it by running the search without table_id and then adding it afterwards to filter the results further:

subset = …
subset2 = subset.search(table_id=['Omon', 'Amon'])

I now also know why the other approach doesn’t work.
When you add the Omon and Amon constraints to the first query, you create a condition that is impossible to satisfy: the search requires every possible combination of your constraints, so it will return only models that have all of the following files:

{('hist-nat', 'tas', 'Amon'), ('hist-aer', 'thetao', 'Amon'), ('hist-nat', 'thetao', 'Amon'), ('hist-GHG', 'thetao', 'Amon'), ('hist-nat', 'tas', 'Omon'), ('hist-aer', 'thetao', 'Omon'), ('historical', 'thetao', 'Amon'), ('hist-nat', 'thetao', 'Omon'), ('hist-aer', 'tas', 'Omon'), ('hist-aer', 'tas', 'Amon'), ('historical', 'tas', 'Amon'), ('historical', 'thetao', 'Omon'), ('hist-GHG', 'tas', 'Amon'), ('historical', 'tas', 'Omon'), ('hist-GHG', 'tas', 'Omon'), ('hist-GHG', 'thetao', 'Omon')}

As it cannot find tas in Omon or thetao in Amon, it always returns zero results. The workaround I proposed before avoids this issue.
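The combinations listed above are simply the Cartesian product of the three constraint lists. A short sketch with itertools.product reproduces them and counts how many can never exist. The published mapping below encodes only what is stated above (thetao lives in Omon, tas in Amon); it is written for illustration and is not read from the catalogue.

```python
from itertools import product

experiments = ["hist-nat", "hist-aer", "hist-GHG", "historical"]
variables = ["thetao", "tas"]
tables = ["Omon", "Amon"]

# Every (experiment, variable, table) combination the combined query demands.
required = set(product(experiments, variables, tables))
print(len(required))  # 4 * 2 * 2 = 16

# Illustrative mapping: the monthly table each variable is published in.
published = {"thetao": "Omon", "tas": "Amon"}

# Combinations that pair a variable with the wrong table can never be found.
impossible = {combo for combo in required if published[combo[1]] != combo[2]}
print(len(impossible))  # 8
```

Because half of the required combinations cannot exist for any model, requiring all of them leaves nothing, which is consistent with the empty result behind the ValueError.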

Hi @Paola-CMS I’m having an error when trying to use the intake module:

import intake

cat = intake.cat.nci
# CMIP6 is included in `esgf` which is itself a catalogue
# so we are using list() again to see the sub-catalogues
cmip6 = cat['esgf'].cmip6

Throws up the error:

TypeError: __init__() missing 1 required positional argument: 'esmcol_obj'

This is new - previously the module was working fine…

Hi @taimoorsohail, Paola replied to the same issue on Slack for someone else, and the answer is to use the conda/analysis3-22.10 module. There was an update to intake, and now the catalogue is only compatible with the latest intake version.

Thanks @clairecarouge ! Unfortunately even with conda/analysis3-22.10 I get the error TypeError: __init__() missing 1 required positional argument: 'obj'

I reproduced your error @taimoorsohail

Weirdly it works with conda/analysis3-22.07, for me anyway

Any ideas @Paola-CMS?

Unfortunately, a change in the new intake-esm version makes the new catalogue incompatible with older versions.
So if you use the unstable environment, you can only load the catalogue by opening the new catalogue file directly:
cat = intake.open_catalog('/g/data/hh5/public/apps/nci-intake-catalogue/catalogue_new.yaml')

With an older conda env, or an intake-esm version <2022.09, use:
cat = intake.open_catalog('/g/data/hh5/public/apps/nci-intake-catalogue/catalogue.yaml')
or
cat = intake.cat.nci

NB: the last option might change soon and be reversed, so that cat = intake.cat.nci will work only with new intake-esm versions after our next update.
For issues like this it might be better to put something in the helpdesk, or on github, as currently it’s impossible for me to check the forum with the same frequency.
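For anyone who switches between environments, the choice of catalogue file described above could be wrapped in a small helper. This is only a sketch: catalogue_for is a hypothetical name, the two paths are the ones quoted above, and it assumes the intake-esm version string starts with integer year and month components (for example "2022.9.18"). The mapping would need updating if the default catalogue changes as mentioned.

```python
# Catalogue paths as quoted in this thread.
OLD_CATALOGUE = "/g/data/hh5/public/apps/nci-intake-catalogue/catalogue.yaml"
NEW_CATALOGUE = "/g/data/hh5/public/apps/nci-intake-catalogue/catalogue_new.yaml"


def catalogue_for(intake_esm_version):
    """Pick the catalogue file for a calendar-style version string.

    Assumes the first two dot-separated parts are integers (year, month).
    """
    year, month = (int(part) for part in intake_esm_version.split(".")[:2])
    return NEW_CATALOGUE if (year, month) >= (2022, 9) else OLD_CATALOGUE


# Intended use on a system where the catalogue files exist:
#   import intake, intake_esm
#   cat = intake.open_catalog(catalogue_for(intake_esm.__version__))
print(catalogue_for("2022.9.18"))
print(catalogue_for("2021.8.17"))
```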