Issues using CLeF on Gadi to find CMIP6 runs

Description of request:

I’m new to the NCI systems and want to find CMIP6 runs stored on Gadi. Moreover, I’m not a Python user (my workflow typically uses CDO/NCO to preprocess data, then R for any fancy analysis and statistics-based visualisation, and then NCL for any climate/geophysical visualisation). So I’m trying to use CLeF to find runs, as opposed to the ACCESS-NRI Catalogue. However, I’m having issues executing some commands.

Environment:

I’m running on the Gadi login node for now, and following the instructions from this blog: Using CleF - Climate Finder to discover ESGF data at NCI — CLEX CMS Blog. That includes loading the analysis3 module:
module use /g/data3/hh5/public/modules
module load conda/analysis3

What I executed and what actually resulted:

  1. As a first attempt, I want to find all CMIP6 runs containing daily precip for experiment ssp585:
    clef cmip6 -v pr --frequency day -e ssp585
    That works fine.

  2. Then I want to narrow it down to only the first runs for each model. So I try using a wildcard like this:
    clef cmip6 -v pr --frequency day -e ssp585 -vl 'r1*'
    That gives an error:
    ERROR: No matches found on ESGF, check at https://esgf.nci.org.au/search/esgf-nci?query=&type=File&distrib=True&replica=False&latest=True&project=CMIP6&experiment_id=ssp585&frequency=day&variable_id=pr&variant_label=r1*

  3. Another thing I want to do is find runs containing both daily precip and daily temp for ssp585. So I try this:
    clef cmip6 -v pr -v tas --frequency day -e ssp585 --and variable_id
    But this gives an error that there are too many results, which is surprising since I thought searching on both precip and temp would, if anything, narrow down the results:
    ERROR: Too many results (19211), try limiting your search https://esgf.nci.org.au/search/esgf-nci?query=&type=File&distrib=True&replica=False&latest=True&project=CMIP6&experiment_id=ssp585&frequency=day&variable_id=pr&variable_id=tas

  4. Also, adding the --local option to the above command gives this error:
    /g/data/hh5/public/apps/miniconda3/envs/analysis3-24.07/lib/python3.10/site-packages/clef/code.py:176: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.

    df = pd.read_sql(r.selectable, con=session.connection())
    ERROR: Query must be a string unless using sqlalchemy.
    ERROR: 'NoneType' object has no attribute 'itertuples'

  5. Finally, I noticed there are discrepancies in search results (when using simple commands that do at least work), depending on whether I load the analysis3 or analysis3-unstable module. So which one should I use?

Expected results:

I want to find all CMIP6 runs containing both daily precip and temp for experiment ssp585, and only the first run of each model (i.e. r1*). So, naively, I guess I want to execute something like this:

clef --local cmip6 -v pr -v tas --frequency day -e ssp585 -vl 'r1*' --and variable_id

…but obviously I’m doing a few things wrong, so any advice would be appreciated, thanks.

Additional info:

More generally, based on my NCI experience so far, I’m wondering if I should finally give in and learn Python, especially since it seems better supported, and the recent ACCESS-NRI training day showcased what looked like very useful analysis tools (GitHub - ACCESS-NRI/training-day-2024-find-analyse-data: ACCESS NRI Workshop 2024: Find Datasets and Handle Large Model Output). But that depends on my supervisor being willing to allow me the time to do so, and on me setting aside years of code I’ve written in CDO/NCO/R/NCL. Furthermore, recent posts in the following topic leave me unsure of the best analysis tools to use: Analysing CMIP6 models in gadi using Python?.

Hi @pardeeppall, I’ve reached out to the CLeF developers for input, but I do have the following comments in the interim.

  1. My first thought is that the variant label flag (-vl) does not support wildcard entries as you expect. Have you tried iterating through the desired variant labels to see if anything is returned at all? If wildcards are documented to work, then I would suggest reaching out to the Clef developers and raising a bug report. Alternatively, you may need to work through the different variant labels available and filter through them yourself outside of Clef (see the rough sketch after this list).

  2. The good news is that this is actually returning something, but there is an upper limit on how many results can be returned. It seems to me that the simplest approach would be to petition the Clef developers to increase the result limit; see raise search limit · Issue #56 · coecms/clef · GitHub for an earlier instance where this has occurred. I sense this will become an increasing issue as data volumes grow with each CMIP phase.

    This appears to be a hardcoded value: clef/clef/cli.py at d7dd2983641e52a176e772a36372248ebb5812fe · coecms/clef · GitHub

  3. This one looks like a version issue; it might be time to update the pandas library in hh5. The warning can be ignored, but the ERROR points to an API change.

  4. It depends: analysis3-unstable is just that, unstable. The packages and config can change on a dime; however, many of the latest versions and packages are placed there before they make it into the official release. I would suggest checking at the outset whether using one or the other addresses any of your issues. Anything more detailed will have to come from the hh5 maintainers themselves, I’m afraid.
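
A rough, untested sketch of what I mean by iterating through the labels (the labels listed are just examples, extend them as needed):

    # Query a handful of explicit variant labels one at a time, since -vl
    # appears to take exact values rather than wildcards.
    for vl in r1i1p1f1 r1i1p1f2 r1i1p1f3; do
        echo "### ${vl}"
        clef cmip6 -v pr --frequency day -e ssp585 -vl "${vl}"
    done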

Regarding your final question, you will find that a lot of tools are moving towards Python, even NCL with its “pivot to Python”. That being said, I am a fan of using the right tool for the job: become familiar with whatever tools help to get the job done, and avoid being overly prescriptive about programming languages. You needn’t disregard the years of CDO/NCO/R/NCL experience; it will be invaluable regardless!

I will let you know when we hear back from the Clef devs.

Cheers, Ben

@pardeeppall, some comments from the Clef people:

If you search for clef in the forum there should be a post (in answer to Will Hobbs) about it that should answer some of these questions. The limits on the results it gets for the remote searches are imposed by ESGF. For example, for the “temp and precip” question, you first need to search for both variables remotely; the “AND” filter is then applied by the tool to the results of that search, as ESGF queries don’t include such a filter (which is also why the raw remote count is larger than either single-variable search).

If I am reading this correctly, you may need to make multiple searches and handle the filtering on your end to achieve what you are looking for.
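
If it helps, here is a very rough sketch of that kind of two-step search and manual intersection. It assumes each output line contains a dot-separated CMIP6 dataset identifier with the model and member as components; the parsing will need adjusting to whatever Clef actually prints:

    # Run the two single-variable searches separately and save the output.
    clef cmip6 -v pr  --frequency day -e ssp585 > pr_datasets.txt
    clef cmip6 -v tas --frequency day -e ssp585 > tas_datasets.txt

    # Reduce each line to a "<model> <member>" key, assuming identifiers of the form
    # CMIP6.<activity>.<institution>.<model>.<experiment>.<member>. ...
    key () { sed -n 's/.*CMIP6\.[^.]*\.[^.]*\.\([^.]*\)\.[^.]*\.\([^.]*\)\..*/\1 \2/p' "$1" | sort -u; }

    # Keep only the model/member combinations that appear in both lists.
    comm -12 <(key pr_datasets.txt) <(key tas_datasets.txt)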

I assume the reply in question is this one?


Hi @pardeeppall,

Unfortunately the CleF software is not an ACCESS-NRI supported product so your issue falls out of scope for our support model. However, I will leave this issue open and on the Hive, as there is no doubt expertise in the community that you could draw on to address your issue.

Cheers, Ben


Hi Ben,

Thanks very much for your efforts in looking into this, and very sorry for my late reply. Fair enough that the CleF software isn’t an ACCESS-NRI supported product, but thanks for keeping the issue open on the Hive. At this point I may well pivot to using the ACCESS-NRI catalogue. Nevertheless, for completeness, I’ll try answering some of your questions:

  1. “… have you tried iterating through the desired variant labels to see if anything is returned at all? If wildcards are documented to work, then I would suggest reaching out to the Clef developers and raising a bug report”

I’ve tried explicit variant labels, and that works. For example: clef cmip6 -v pr --frequency day -e ssp585 -vl r1i1p1f1. However, wildcards are indeed not documented, so maybe this functionality simply doesn’t exist.

  2. “The good news is that this is actually returning something, but there is an upper limit on how many results can be returned.”

It’s true that something is returned. But my puzzlement stems from the fact that when I search for just precip (clef cmip6 -v pr --frequency day -e ssp585) I get just a few hundred results. So if I search for runs with both precip and temp, surely I’d get, at most, an equal number of results? Instead CLeF tells me there are 19211 results, which is way more than a few hundred. I don’t really understand the response to Will Hobbs’ query, sorry. The --remote flag does at least work, so I could search for precip and temp individually and then filter manually, as I think you’re suggesting.

  3. “This one looks like a version issue; it might be time to update the pandas library in hh5. The warning can be ignored, but the ERROR points to an API change.”

Again, fair enough that this isn’t an ACCESS-NRI supported issue. But it would be nice if it ever does get fixed, because then I could also use the --stats flag, which outputs very useful and nicely formatted summary info.

Thanks again,
Pardeep

It’s more that your first query needs to be more specific to get results from the ESGF. Or you can use the --local flag, and then it won’t be limited, but it won’t make a comparison with what’s available on the ESGF nodes.


Hi Paola,

Do you mean that the following query needs to be more specific?
clef cmip6 -v pr --frequency day -e ssp585

Also, when I add the --local flag I get the following error:

clef --local cmip6 -v pr --frequency day -e ssp585
/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/clef/code.py:176: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.

    df = pd.read_sql(r.selectable, con=session.connection())
    ERROR: Query must be a string unless using sqlalchemy.

About the pandas issue, the answer should have been in the tagged post; briefly, just use an older version of the environment to use --local (see the sketch below).
And yes, be more specific, maybe using the full member name: r1* will match r10 etc. There are more than a hundred models, and some have a lot of members, which will have daily pr and tas.
There’s a lot of documentation, and the GitHub will have this issue flagged too. I simply do not have the time to answer questions anymore, as my contract is ending soon and I have more important work to complete. I also do not monitor the NRI forum regularly.
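
For example, something along these lines (the version tag is only illustrative, check module avail conda/analysis3 on Gadi for the environments that actually exist in hh5):

    # Illustrative only: load an older analysis3 environment with an earlier pandas,
    # per the advice above, so that the --local query runs again.
    module use /g/data3/hh5/public/modules
    module unload conda/analysis3          # drop the current environment if loaded
    module load conda/analysis3-23.10      # hypothetical older version tag
    clef --local cmip6 -v pr --frequency day -e ssp585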

OK, understood, and thanks for the advice. I’ll see how I get on.

Aloha @pardeeppall,

Noting your workflow isn’t Python-based, can I ask how you parallelize your NCO/CDO operations? Via MPI or similar?

If there were Python command-line catalog utilities that could interrogate the ACCESS-NRI Intake catalog and output JSON files from serialized Python dictionaries, would that be a useful input into your R workflow?
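
To make that concrete, the sort of handoff I’m imagining looks roughly like this (nri-catalog-search is an imaginary utility name; the point is the JSON bridge into R via jsonlite):

    # Hypothetical: a command-line wrapper around the Intake catalog dumps matches as JSON...
    nri-catalog-search --variable pr --frequency 1day --experiment ssp585 > matches.json
    # ...which an R workflow can then read directly.
    Rscript -e 'x <- jsonlite::fromJSON("matches.json"); str(x)'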

And thanks to @Paola-CMS and @Scott and others for all their CLEX work. I wonder if the elephant in the room is that CLeF and other capabilities provided by CLEX staff are not supported going forward and are EOL?

It’s a known elephant, at least for some of the other capabilities, and some plans are being worked through. AFAIK there are no plans for CLeF though.


I’ve never had to use parallelization in my work to date (make of that what you will). That’s partly because I can usually filter out a lot of data in the preprocessing stage using CDO/NCO. That said, I believe CDO uses OpenMP for parallelism, if desired.
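
For what it’s worth, the sort of thing I mean is just this (filenames made up, and as I understand it only some operators are multi-threaded):

    # Ask CDO for 8 OpenMP threads via -P; the benefit depends on the operator.
    cdo -P 8 yearmean pr_day_ssp585_in.nc pr_year_ssp585_out.nc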

In any case, I decided to pivot to using the ACCESS-NRI Intake catalogue (and thus also to learning some Python), as it appears well documented and supported. I might make use of parallelization functionality as well.


This would be a straightforward feature, I think (except for reporting on CMIP6 data that is not replicated on Gadi). It might be worth raising an issue on the intake-catalog GitHub.


Thanks @anton - I see the need for this and also muse about the community-wide impact of the potential loss of what the CLEX team delivered. But it’s not needed for my workflow right now, and I already have an issue on the intake-catalog GitHub that includes wondering aloud whether there should be an open, collaborative repo for developing Python utilities to help support the catalog.