Analysing CMIP6 models on gadi using Python?

Hi everyone,

Not sure if this discussion belongs here, but I wanted to check what others do to analyse CMIP6 outputs housed on gadi (i.e. in fs38 and oi10). I’ve chatted to a few people working with CMIP6 and I have noticed that people (including myself) have built up their own bespoke code bases over time which iron out the (many) wrinkles associated with the grids, forcing fields, bugs in variables etc and are tailored to their specific scientific question. Some people use Python, others use CDO. This seems inefficient as there is a lot of repetition. I’m sure there’s a solution out there. I have tried the xmip code but I couldn’t even do the tutorial without it breaking.

Anyway, I am looking for tips/feedback/discussion about a unified code base that pulls models and calculates simple diagnostics (e.g. OHC, sea level etc.) on gadi. Something like a COSIMA cookbook but for CMIP6 analysis on gadi. If someone could point me to the right resource that would be awesome too!

Hi @taimoorsohail,

That’s exactly the sort of thing the Model Evaluation and Diagnostics Team at ACCESS-NRI will be working on.

I am working on gathering user experiences / cases to improve the workflow and make it more efficient. Your experience would be very valuable.

Here is the range of things that we will cover:

  • Access to datasets (inputs, outputs, observational)
  • Tools and analysis environments.
  • Model Validation (against observations, historic records), recipes…
  • Diagnostics
  • Inter-model comparisons
  • Routine diagnostics on ACCESS model outputs…
  • etc.

I’d be interested in discussing your workflow and seeing whether we can identify specific blockages / bottlenecks that we could help with. If you can think of people who might be interested, please tag them and invite them to the discussion.

We (ACCESS-NRI) are here to help!


If you have code that you are happy to share, I’m happy to have a look.

We can help with cleaning things up, optimising, and writing documentation.
We can also help with fixing bugs…

Could you tell us where the xmip tutorial fails?
That’s the only code I know of that provides functions to correct issues with the original CMIP6 data, so that doing analysis across models is then possible.
We were planning to add it to the hh5 conda environments.
My first few tests in combination with our intake catalogue were successful, but I haven’t spent much time on it yet.

Creating a COSIMA Cookbook-like tool for CMIP is difficult because COSIMA data is consistent (it all comes from the same model), whereas CMIP6 data, as you discovered, is not.
ESMValTool is an effort to provide a diagnostic and evaluation tool. One of the issues with ESMValTool is that it has very specific requirements for how the data is stored. There are attempts underway to make it work with the NCI collection, but I’m not sure how much progress has been made. Also, I wouldn’t be surprised if it breaks when there are inconsistencies/errors in the data.

Finally, there are some data collections, such as the one in project “lp01” from Francois Delage (BoM), where CMIP5/6 monthly data has already been post-processed. He might have added daily data too, but I’m not sure.

Responding here at @aidanheerdegen’s prompting.

I worked a lot (and it was a lot of work!) with CMIP5 data, but I haven’t done a whole lot with CMIP6. To be honest, the tools mentioned in this thread (e.g. xmip, ESMValTool) have changed things a lot and many of my perspectives are out of date. That said, there are a lot of issues which persist, some of which @Paola-CMS has mentioned.

It’s pretty common for modelling groups to not follow the strict CMIP definitions/standards, so a lot of time is spent tracing/correcting exceptions. A ‘cookbook’ style tool would be great, but it would need:
a) the capacity for community users to flag issues and add corrections to the code base
b) a user community that doesn’t assume the output is correct (i.e. no ‘black boxing’).

This latter point is, in my mind, the most important.

Thanks @willrhobbs! Did you just end up rolling your own scripts and tools?

Something like cf_xarray should make creating relatively data-agnostic scripts possible.

Add in datatree and it should be possible to do some quite powerful general analyses

Did you just end up rolling your own scripts and tools?

I did, mostly using a combination of NCL and @Paola-CMS’s Python search tools to find the data. It was what was available at the time, but it’s really not a very time-effective way of doing things, at least for the community.

Despite my previous post, I do think there probably should be some library for dealing with the CMIP data, I’m just aware that, as Paola stated, exception-handling is a problem.

No matter how well-structured the code is (and some of the new python-based tools are great), they won’t spot when the variable name is sea ice thickness, but it’s actually sea ice volume… (and yes, I’ve seen this recently, presented at an international conference :worried:).

My point is that it’s not just a coding/engineering problem, it requires some scientific auditing as well.
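
To make the auditing point a little more concrete, here is a minimal sketch of an automated plausibility check. It would catch the crudest cases only (e.g. a "thickness" field whose values are far too large to be a thickness) and is no substitute for someone actually looking at the data. The bounds in `PLAUSIBLE_RANGE` and the function name `audit_values` are assumptions made up for illustration, not an agreed standard.

```python
import math

# Illustrative plausibility bounds keyed by CMIP6 variable name.
# These numbers are assumptions for the sketch, not a community standard.
PLAUSIBLE_RANGE = {
    "sithick": (0.0, 50.0),   # sea ice thickness, m
    "thetao": (-5.0, 45.0),   # sea water potential temperature, degC
}


def audit_values(variable_id, values):
    """Return a list of warning strings for a flat sequence of values.

    NaN entries are ignored so that masked (land) points do not raise warnings.
    An empty list means the automated check found nothing suspicious.
    """
    if variable_id not in PLAUSIBLE_RANGE:
        return [f"{variable_id}: no plausibility range defined, check by hand"]
    lower, upper = PLAUSIBLE_RANGE[variable_id]
    finite = [v for v in values if not math.isnan(v)]
    if not finite:
        return [f"{variable_id}: no finite values found"]
    warnings = []
    if min(finite) < lower:
        warnings.append(f"{variable_id}: minimum {min(finite):g} is below {lower:g}")
    if max(finite) > upper:
        warnings.append(f"{variable_id}: maximum {max(finite):g} is above {upper:g}")
    return warnings


# Synthetic values only: the second list mimics a volume stored as thickness.
print(audit_values("sithick", [0.0, 1.2, 2.5, float("nan")]))
print(audit_values("sithick", [0.0, 3.0e9]))
```

A check like this only flags values for a human to look at; deciding whether the variable is mislabelled still needs the scientific judgement described above.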

Hi everyone, thanks for your feedback and links to post processing software that could be used. I agree 100% with @willrhobbs - the issue is mainly auditing the different models to make sure they have reasonable outputs. My workflow is:

  1. Use clef to find relevant datasets (by the way, this isn’t very reliable as the ESGF versions often don’t show up in the clef search if the server is offline/some other issue)
  2. Request a data download from NCI
  3. Sort through the ensemble members downloaded to make sure they don’t contain junk variables (this takes a while, especially when you have many ensemble members and modelling groups)
  4. Write a Python script that reads the relevant data and outputs processed data in the form of netCDF files. This script is submitted as a batch job for large (4D) datasets/long time series (if analysing piControl)
  5. Analyse output files, find a bug in the variable/grid/ensemble member, return to step 3 :sweat_smile:
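
Part of step 3 can probably be scripted. The sketch below assumes the downloaded files follow the standard CMIP6 file-name pattern (`variable_table_source_experiment_member_grid[_timerange].nc`) and reports which ensemble members are missing a required variable. The helper names and the example file names are hypothetical, and it only checks that files are present, not that their contents are sane.

```python
from collections import defaultdict
from pathlib import PurePosixPath

FACET_KEYS = [
    "variable_id", "table_id", "source_id",
    "experiment_id", "member_id", "grid_label",
]


def parse_cmip6_filename(path):
    """Split a CMIP6 file name into its underscore-separated facets."""
    name = PurePosixPath(path).name
    stem = name[:-3] if name.endswith(".nc") else name
    parts = stem.split("_")
    # Time-varying files have 7 parts; fixed fields have no time range (6 parts).
    if len(parts) not in (6, 7):
        raise ValueError(f"unexpected CMIP6 file name: {path}")
    facets = dict(zip(FACET_KEYS, parts))
    facets["time_range"] = parts[6] if len(parts) == 7 else None
    return facets


def find_incomplete_members(paths, required):
    """Map (source, experiment, member) to the sorted list of missing variables."""
    found = defaultdict(set)
    for path in paths:
        facets = parse_cmip6_filename(path)
        key = (facets["source_id"], facets["experiment_id"], facets["member_id"])
        found[key].add(facets["variable_id"])
    needed = set(required)
    return {
        key: sorted(needed - have)
        for key, have in found.items()
        if needed - have
    }


# Illustrative file names only, not a listing of what is actually on disk.
paths = [
    "thetao_Omon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412.nc",
    "so_Omon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412.nc",
    "thetao_Omon_ACCESS-CM2_historical_r2i1p1f1_gn_185001-201412.nc",
]
print(find_incomplete_members(paths, ["thetao", "so"]))
```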

A lot of the work is in ‘scientific auditing’ as Will says, to track down issues in the grid setup, variables, etc. I’m convinced there’s a better way than all of us doing this at the same time in parallel. For example, Damien Irving has a helpful table with issues he has found here: ocean-analysis/ at master · DamienIrving/ocean-analysis · GitHub

If we could do something ‘crowdsourced’ like that, it would be very helpful for steps 1-3 and 5 above. I’m sure xMIP/datatree/cf_xarray can sort out step #4.
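
As a rough idea of what a ‘crowdsourced’ issue list could look like in code, here is a minimal sketch of a registry that an analysis script could query before using a dataset. Everything in it is hypothetical: the `KnownIssue` fields, the `issues_for` helper, and the placeholder entries ("MODEL-A", "MODEL-B") are made up and do not describe real models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KnownIssue:
    """One community-reported problem with a CMIP dataset."""
    source_id: str
    variable_id: str
    description: str
    member_id: Optional[str] = None  # None means the issue affects all members
    reported_by: str = "unknown"


def issues_for(registry, source_id, variable_id, member_id):
    """Return every registered issue that applies to the requested dataset."""
    return [
        issue
        for issue in registry
        if issue.source_id == source_id
        and issue.variable_id == variable_id
        and issue.member_id in (None, member_id)
    ]


# Placeholder entries for illustration only; these are not real reports.
REGISTRY = [
    KnownIssue("MODEL-A", "sithick", "values look like volume, not thickness",
               reported_by="user1"),
    KnownIssue("MODEL-B", "thetao", "suspect data in part of the record",
               member_id="r3i1p1f1", reported_by="user2"),
]

for issue in issues_for(REGISTRY, "MODEL-A", "sithick", "r1i1p1f1"):
    print(f"WARNING {issue.source_id}/{issue.variable_id}: {issue.description}")
```

In practice the entries would live in a shared, version-controlled text file (CSV/YAML) so that anyone can add a correction through a pull request; the lookup logic would stay about this simple.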

Tagging @saurabh_rathore as I think he is doing similar things currently.

@taimoorsohail I think that combining our intake catalogue with xmip (or a similar tool, if one exists) and building on xmip might be the better way to go. That way you leverage a much larger community, at least for the difficult first couple of steps of cleaning up the data.
When we started looking at solutions for CMIP5, we tried to involve the community and to store warnings about “faulty” data in clef, so that at least you would get notified when something was wrong, but there wasn’t enough community participation.
We might also be able to introduce something similar in intake, but we need users to contribute the issues.
About step 1 being unreliable: if you try to compare what you have locally to what’s online and the online system isn’t working, there’s not much we can do. That’s an ESGF issue.


Thanks @Paola-CMS! I will combine the intake catalogue with xmip and see how that works. Will get back with updates on whether everything is working.

One thing that I think might still be missing from these resources is some code to calculate simple post-processing variables. For example, it would be very useful to calculate the ocean heat content (OHC) in the CMIP6/5 models and I know this is something people incidentally need for their work without wanting to get stuck into the entire catalogue and all its issues (e.g. if they just want to quickly compare their own model with existing climate models). Another example is sea level (steric/dynamic) calculations @saurabh_rathore.

Does a code base with examples like these exist? If not, would it be worth creating (keeping in mind the fact that the models all have different quirks)? I am happy to provide an example script for calculating OHC using the intake catalog + xmip, for example. This will also avoid the problem of ‘black-boxing’ if we just put the actual post-processed variables up on NCI somewhere.
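
To give a flavour of the kind of example meant here, below is a minimal OHC sketch using plain NumPy on synthetic arrays. It does not use the intake catalogue or xmip. It assumes `thetao` (potential temperature in degC, NaN on land) and `volcello` (cell volume in m3) are already loaded on the same grid, and it uses constant reference values for density and heat capacity, which are assumptions (individual models use their own values). OHC here is relative to 0 degC.

```python
import numpy as np

RHO0 = 1035.0  # reference seawater density, kg m-3 (assumed; models differ)
CP = 3992.0    # specific heat capacity of seawater, J kg-1 K-1 (assumed)


def ocean_heat_content(thetao, volcello, rho0=RHO0, cp=CP):
    """Return the global ocean heat content time series in joules.

    thetao   : array (time, lev, y, x), potential temperature in degC, NaN on land
    volcello : array (lev, y, x), grid-cell volume in m3
    """
    thetao = np.asarray(thetao, dtype=float)
    volcello = np.asarray(volcello, dtype=float)
    # Heat content of each cell; volcello broadcasts along the time axis.
    heat = rho0 * cp * thetao * volcello
    # Sum over depth and both horizontal axes, skipping NaN (land) cells.
    return np.nansum(heat, axis=(1, 2, 3))


# Synthetic test data: 2 time steps, 3 levels, a 2x2 grid with one land column.
thetao = np.full((2, 3, 2, 2), 10.0)
thetao[:, :, 0, 0] = np.nan
volcello = np.full((3, 2, 2), 1.0e9)

ohc = ocean_heat_content(thetao, volcello)
print(ohc)
```

For real model output the model-specific quirks still apply: time-varying cell volume, different vertical coordinates, and Kelvin versus degC all need checking per model before the numbers can be compared.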

@Paola-CMS could you let me know if/when you get to adding xmip to the conda environment so I can start on this? Thank you :slight_smile:

@taimoorsohail I will.
I will also share the notebook I started in order to create a “local” demo along the lines of the Pangeo tutorial, as we don’t necessarily have the same data. You could contribute to that, as you might already know which models/experiments have issues.

@taimoorsohail xmip is in conda/analysis3-unstable and I modified the original tutorial so it works with our intake on Gadi:

Feel free to contribute more examples. I mostly made sure that it loaded the data; the examples need double-checking to make sure they’re still relevant.