Analysing CMIP6 models on Gadi using Python?

Hi everyone,

Not sure if this discussion belongs here, but I wanted to check what others do to analyse CMIP6 outputs housed on Gadi (i.e. in fs38 and oi10). I’ve chatted to a few people working with CMIP6, and I’ve noticed that people (including myself) have built up their own bespoke code bases over time, which iron out the (many) wrinkles associated with the grids, forcing fields, bugs in variables, etc., and are tailored to their specific scientific questions. Some people use Python, others use CDO. This seems inefficient, as there is a lot of repetition, and I’m sure there’s a solution out there. I have tried the xmip code, but I couldn’t even get through the tutorial without it breaking.

Anyway, I am looking for tips/feedback/discussion about a unified code base that pulls in models and calculates simple diagnostics (e.g. OHC, sea level, etc.) on Gadi. Something like the COSIMA cookbook, but for CMIP6 analysis. If someone could point me to the right resource, that would be awesome too!

Hi @taimoorsohail,

That’s exactly the sort of thing the Model Evaluation and Diagnostics Team at ACCESS-NRI will be working on.

I am working on gathering user experiences / cases to improve the workflow and make it more efficient. Your experience would be very valuable.

Here is the range of things that we will cover:

  • Access to datasets (inputs, outputs, observational)
  • Tools and analysis environments
  • Model validation (against observations, historical records), recipes…
  • Diagnostics
  • Inter-model comparisons
  • Routine diagnostics on ACCESS model outputs…
  • etc.

I’d be interested in having a discussion about your workflow to see if we can identify specific blockages/bottlenecks that we could help with. If you can think of people who would be interested, please tag them and invite them to the discussion.

We (ACCESS-NRI) are here to help!

Romain

If you have code that you are happy to share, we’d be happy to have a look.

We can help clean things up, optimise, and write documentation.
We can also help fix bugs…

Could you tell us where the xmip tutorial fails?
That’s the only code I know of that provides functions to correct issues with the original CMIP6 data, so that doing analysis across models is then possible.
We were planning to add it to the hh5 conda environments.
My first few tests in combination with our intake catalogue were successful, but I haven’t spent much time on it yet.
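
For reference, the basic pattern looks roughly like this (a sketch only; the file path is illustrative):

```python
import xarray as xr
from xmip.preprocessing import combined_preprocessing

# Illustrative path -- any CMIP6 output under oi10/fs38 would do
files = "/g/data/oi10/replicas/CMIP6/.../thetao/gn/latest/*.nc"

ds = xr.open_mfdataset(files, use_cftime=True)

# combined_preprocessing renames dimensions/coordinates to a common
# convention (x, y, lev, lon, lat, ...) and fixes common metadata quirks
ds = combined_preprocessing(ds)
```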

Creating a COSIMA-cookbook-like tool for CMIP is difficult: COSIMA data is consistent, as it all comes from the same model, whereas CMIP6 data, as you discovered, is not.
ESMValTool is an effort to provide a diagnostics and evaluation tool. One of the issues with ESMValTool is that it has very specific requirements for how the data is stored. There are attempts underway to make it work with the NCI collection, but I’m not sure how much progress has been made. I also wouldn’t be surprised if it breaks when there are inconsistencies/errors in the data.

Finally, there are some data collections, such as the one in project “lp01” from Francois Delage (BoM), where CMIP5/6 monthly data has already been post-processed. He might have added daily data too, but I’m not sure.

Responding here at @Aidan’s prompting.

I worked a lot (and it was a lot of work!) with CMIP5 data, but I haven’t done a whole lot with CMIP6. To be honest, the tools mentioned in this thread (e.g. xmip, ESMValTool) have changed things a lot, and many of my perspectives are out of date. That said, there are a lot of issues which persist, some of which @Paola-CMS has mentioned.

It’s pretty common for modelling groups not to follow the strict CMIP definitions/standards, so a lot of time is spent tracing/correcting exceptions. A ‘cookbook’-style tool would be great, but it would need:
a) the capacity for community users to flag issues and add corrections to the code base
b) a user community that doesn’t assume the output is correct (i.e. no ‘black boxing’).

This latter point is, in my mind, the most important.

Thanks @willrhobbs! Did you just end up rolling your own scripts and tools?

Something like cf_xarray should make creating relatively data-agnostic scripts possible.

Add in datatree and it should be possible to do some quite powerful general analyses.
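
For instance, something along these lines (a rough sketch; the file paths are placeholders, and I’m assuming the standalone datatree package):

```python
import xarray as xr
import cf_xarray  # noqa: F401  (importing registers the .cf accessor)
from datatree import DataTree

# Placeholder paths -- one file per model
ds1 = xr.open_dataset("model_a_tos.nc")
ds2 = xr.open_dataset("model_b_tos.nc")

# Index by CF metadata instead of model-specific names: this finds the
# right variable/axis regardless of what each model calls them
sst = ds1.cf["sea_surface_temperature"]
time_mean = sst.cf.mean("T")

# Group models into one tree and apply the same reduction to every node
tree = DataTree.from_dict({"model_a": ds1, "model_b": ds2})
means = tree.map_over_subtree(lambda node: node.cf.mean("T"))
```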

Did you just end up rolling your own scripts and tools?

I did, mostly using a combination of NCL and @Paola-CMS’s Python search tools to find the data. It was what was available at the time, but it’s really not a very time-effective way of doing things, at least for the community.

Despite my previous post, I do think there probably should be some library for dealing with the CMIP data; I’m just aware that, as Paola stated, exception-handling is a problem.

No matter how well-structured the code is (and some of the new Python-based tools are great), it won’t spot when the variable name says sea ice thickness but it’s actually sea ice volume… (and yes, I’ve seen this recently, presented at an international conference :worried:).

My point is that it’s not just a coding/engineering problem; it requires some scientific auditing as well.

Hi everyone, thanks for your feedback and the links to post-processing software that could be used. I agree 100% with @willrhobbs - the issue is mainly auditing the different models to make sure they have reasonable outputs. My workflow is:

  1. Use clef to find relevant datasets (by the way, this isn’t very reliable, as ESGF versions often don’t show up in the clef search if the server is offline or has some other issue)
  2. Request a data download from NCI
  3. Sort through the ensemble members downloaded to make sure they don’t contain junk variables (this takes a while, especially when you have many ensemble members and modelling groups)
  4. Write a Python script that reads the relevant data and outputs processed data in the form of netCDF files. This script is submitted as a batch job for large (4D) datasets/long time series (e.g. if analysing piControl); a stripped-down sketch of this step is below the list
  5. Analyse the output files, find a bug in the variable/grid/ensemble member, return to step 3 :sweat_smile:
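
For step 4, my scripts are usually some variation on the following (a minimal sketch; paths, variable names and chunk sizes are illustrative):

```python
import xarray as xr

# Illustrative path -- in practice this comes from the clef search in step 1
files = "/g/data/oi10/replicas/CMIP6/.../thetao/gn/latest/*.nc"

# Chunking keeps memory manageable for 4D fields inside a batch job
ds = xr.open_mfdataset(files, chunks={"time": 12}, use_cftime=True)

# Example diagnostic: annual means of the full 4D field
annual = ds["thetao"].groupby("time.year").mean("time")

annual.to_netcdf("thetao_annual_means.nc")
```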

A lot of the work is in ‘scientific auditing’, as Will says, to track down issues in the grid setup, variables, etc. I’m convinced there’s a better way than all of us doing this at the same time in parallel. For example, Damien Irving has a helpful table of issues he has found here: ocean-analysis/cmip6_notes.md at master · DamienIrving/ocean-analysis · GitHub

If we could do something ‘crowdsourced’ like that, it would be very helpful for steps 1-3 and 5 above. I’m sure xmip/datatree/cf_xarray can sort out step 4.

Tagging @saurabh_rathore as I think he is doing similar things currently.

@taimoorsohail I think that combining our intake catalogue with xmip (or a similar tool, if one exists) and building on xmip might be the better way to go. That way you leverage a much larger community, at least for the difficult first couple of steps of cleaning up the data.
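
Roughly, the combination I mean is (a sketch; the catalogue path and search terms are illustrative):

```python
import intake
from xmip.preprocessing import combined_preprocessing

# Illustrative path to a CMIP6 intake-esm catalogue on Gadi -- check the
# documentation for the current location
cat = intake.open_esm_datastore("/g/data/.../cmip6.json")

subset = cat.search(
    experiment_id="historical",
    variable_id="thetao",
    table_id="Omon",
)

# xmip's combined_preprocessing is applied to each dataset as it loads,
# standardising dimension/coordinate names across models
dset_dict = subset.to_dataset_dict(preprocess=combined_preprocessing)
```
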
When we started looking at solutions for CMIP5, we tried to involve the community and to store warnings about “faulty” data in clef, so that at least you would get notified when something was wrong, but there wasn’t enough community participation.
We might also be able to introduce something similar in intake, but we need users to contribute the issues.
About step 1 being unreliable: if you try to compare what you have locally to what’s online and the online system isn’t working, there’s not much we can do. That’s an ESGF issue.


Thanks @Paola-CMS! I will combine the intake catalogue with xmip and see how that works. Will get back with updates on whether everything is working.

One thing that I think might still be missing from these resources is some code to calculate simple post-processed variables. For example, it would be very useful to calculate the ocean heat content (OHC) in the CMIP6/5 models, and I know this is something people incidentally need for their work without wanting to get stuck into the entire catalogue and all its issues (e.g. if they just want to quickly compare their own model with existing climate models). Another example is sea level (steric/dynamic) calculations @saurabh_rathore.

Does a code base with examples like these exist? If not, would it be worth creating one (keeping in mind that the models all have different quirks)? I am happy to provide an example script for calculating OHC using the intake catalogue + xmip, for example. This would also avoid the problem of ‘black-boxing’ if we just put the actual post-processed variables up on NCI somewhere.
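
For concreteness, the core of an OHC script would be something like this (a sketch only; it assumes the model provides volcello, uses nominal constants, and computes heat content relative to 0 °C):

```python
import xarray as xr

RHO0 = 1035.0  # nominal seawater density (kg m-3)
CP = 3992.0    # nominal specific heat capacity (J kg-1 K-1)

# thetao (potential temperature, degC) and volcello (cell volume, m3) are
# standard CMIP6 variable names; the file names here are placeholders
thetao = xr.open_mfdataset("thetao_Omon_*.nc", use_cftime=True)["thetao"]
vol = xr.open_dataset("volcello_Ofx.nc")["volcello"]

# Global OHC time series (J), relative to 0 degC; the dimension names
# assume the data has been standardised by xmip (x, y, lev)
ohc = (RHO0 * CP * thetao * vol).sum(dim=["x", "y", "lev"])
```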

@Paola-CMS could you let me know if/when you get around to adding xmip to the conda environment, so I can start on this? Thank you :slight_smile:

@taimoorsohail I will.
I will also share the notebook I started, to create a “local” demo like the Pangeo tutorial, since we don’t necessarily have the same data. You could contribute to that, as you might already know which models/experiments have issues.


@taimoorsohail xmip is in conda/analysis3-unstable and I modified the original tutorial so it works with our intake on Gadi.

Feel free to contribute more examples. I mostly made sure that it loaded the data; the examples need double-checking to make sure they’re still relevant.
