CMIP7 Evaluation Group - Summary

Summary of CMIP7 evaluation discussions at and since the CMIP7 Hackathon in March 2024

Extensive notetaking occurred during the Hackathon, with separate notes posted on the Hive/Google Docs for each breakout group (atmosphere, land, ocean/ice, coupled/general diagnostics), a summary of all breakouts and the main workshop sessions, the ACCESS-NRI summaries on GitHub, and some additional links to existing variable/diagnostic lists on the Hive/Google Sheets (e.g. here and here). The first major outstanding issue raised across these groups was the need for the community to create and/or compile existing lists into a single centralised, detailed list of the diagnostics and metrics required for sanity checking and evaluation, along with the necessary output variables and the relevant timescales. Some key points for this list are as follows:

  • Sharing a single list across the community reduces the risk of unnecessary duplication, as several diagnostics overlap between different interest groups.
  • A hierarchy for the list was mentioned several times, both by priority (urgent, essential, low) and by type (‘sanity check’ vs ‘systematic evaluation’). The latter distinction was deemed particularly important, as failure of a routine sanity check can preclude further systematic evaluation, thus saving resources.
    → The land breakout group broke this down further into: a) initial model build (getting it to run); b) model running (sanity checking for obvious errors and physical impossibilities); c) trusted model (testing whether forcings are set up correctly).
    → What happens when a sanity check is failed? Does it give a warning or kill the run?
  • Timeframe is another important aspect for inclusion, as some checks can be done quickly with routine run-time diagnostics, some require a timeseries to derive a mean or range of values, and others require much longer, more established runs (e.g. ENSO, ocean climatology).
  • Within the centralised list, a list of reference thresholds and datasets against which raw model output will be compared is also required, clearly describing the necessary observation-based references and previous model configurations.
  • An additional column is needed within the list to record the most useful tool for evaluating each metric.
  • A further column within the list should detail which experiments are required (e.g. control, historical, 4xCO2). An illustrative entry is sketched below.
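
To make the intended structure concrete, the sketch below shows one form a single entry in such a centralised list could take. It is purely illustrative: the field names, example metric and placeholder values are assumptions, not an agreed schema.

```python
# Illustrative sketch only: field names and values are placeholders, not an agreed schema.
# One possible entry in the centralised diagnostics/metrics list, capturing the columns
# discussed above (priority, check type, variables, timescale, reference, tool, experiments).
example_entry = {
    "metric": "global_mean_sst",                # hypothetical metric name
    "priority": "urgent",                       # urgent / essential / low
    "check_type": "sanity_check",               # sanity_check / systematic_evaluation
    "variables": ["tos"],                       # output variables needed for the metric
    "timescale": "monthly means, first decade", # how much of a run is needed before checking
    "reference": {"dataset": "HadISST",         # observation-based reference (example only)
                  "acceptable_range": None},    # thresholds still to be agreed
    "tool": "ESMValTool",                       # preferred evaluation tool for this metric
    "experiments": ["piControl", "historical"], # experiments in which the check applies
}

REQUIRED_COLUMNS = ("metric", "priority", "check_type", "variables",
                    "timescale", "reference", "tool", "experiments")

def missing_columns(entry):
    """Return any required columns that are absent from a list entry."""
    return [column for column in REQUIRED_COLUMNS if column not in entry]

print(missing_columns(example_entry))  # -> [] when an entry carries every agreed column
```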

An additional conversation is required among the community about exactly when and how reference datasets will be used to determine whether something is ‘wrong’; several workshop attendees highlighted, for example, that reanalysis data is not necessarily ‘truth’ and that it is important to consider internal variability. Some checks are already built into the model components, such as defined ranges (e.g. in MOM for global ocean temperature) that raise warnings or exceptions when these bounds are exceeded. Further points:

  • A common theme throughout the community is a preference for a Southern Hemisphere focus, with heavier weighting of priority regions such as the Australian continent and the Southern Hemisphere oceans. When considering bounds or ranges for ‘acceptable’ values of model metrics, more stringent thresholds can be applied in these regions while relaxing standards elsewhere, focusing model development on the realism of climate features such as the IOD, ENSO, SAM, Antarctic Bottom Water formation, the Antarctic Circumpolar Current, Antarctic sea ice (particularly in summer), and variability around the Australian continent.
  • It was highlighted that for the initial stages of model development, evaluation and testing must balance simplicity and effectiveness: relatively easy assessments such as simple indices and yes/no checks (e.g. is summer sea ice present/above a threshold value, y/n?) will be most beneficial (see the sketch below).
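
As a concrete illustration of the run-time bound checks and simple yes/no tests mentioned above, the sketch below shows a minimal way such a check could be expressed. The variable names, bounds and the warn-versus-kill behaviour are placeholders; the actual thresholds and failure behaviour would need to be agreed by the community.

```python
# Minimal sketch of a yes/no sanity check: compare a scalar diagnostic against an agreed
# acceptable range and warn (or abort) when it falls outside. All numbers below are
# illustrative placeholders, not agreed thresholds.
import warnings

def check_in_range(name, value, lower, upper, fatal=False):
    """Return True if value lies within [lower, upper]; otherwise warn or raise."""
    ok = lower <= value <= upper
    if not ok:
        message = f"Sanity check failed: {name} = {value} outside [{lower}, {upper}]"
        if fatal:
            raise RuntimeError(message)  # a check that kills the run
        warnings.warn(message)           # a check that only issues a warning
    return ok

# Hypothetical usage with placeholder numbers:
check_in_range("global mean ocean temperature (degC)", value=3.6, lower=2.0, upper=6.0)
check_in_range("summer Antarctic sea-ice area (million km2)", value=1.8, lower=1.0, upper=6.0)
```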

In tandem with the compilation of a single, centralised list of diagnostics/metrics required for evaluation, the community can move towards answering questions around which tools are needed for which metrics (which can be a column in the centralised list), which checks are routine run-time checks, and so on.

Tracking and documentation were a strong focus of the community. Detailed documentation of everything (recipes, tests, parameterisations, figures) was mentioned multiple times, as was peer review of recipes and output files to ensure robustness. Some form of tool will be required to track progress and results and to stimulate discussion, perhaps a separate GitHub project for CMIP7-specific ACCESS output evaluation.

Following the workshop discussions, additional discussions are required around timelines for sanity checking vs evaluation, and the establishment of workflows, specifically what from the centralised list needs to be routinely checked through the run-time diagnostics tool.

Limited discussion has been held so far about capability (i.e. who will do what).

  • It was suggested during the workshop that metrics be automated where possible, with the output made available for community viewing (GitHub/Hive).
  • Another option raised was that separate workshops be held for special interest groups (e.g. Australian climate, ENSO, sea ice, etc.).
  • Need to decide on a timeline between the completion of a run for testing and the evaluation/feedback phase (i.e. how long does the community have to assess each run?)
  • Concerns have been raised about data storage, in that sanity-checking data can pile up quickly and can’t all be stored; there has to be some way of choosing what to save and what to discard.
  • Maintaining momentum and engagement across groups:
    → Monthly ‘model evaluation’ meetings encompassing all working/interest groups to discuss new/existing recipes/ideas, testing and quality updates
    → Meeting frequency to be increased as model runs become available

Tools and technology

The ACCESS-NRI summary on GitHub provides an excellent perspective. Some additional points:

  • The community currently uses a wide range of tools (e.g. Python, NCL, Ferret, ILAMB, Benchcab, Xarray, Dask).
  • Large swathes of the community continue to experience challenges adopting tools such as ESMValTool. Suggestions from the workshop and subsequent discussions include:
    → Additional training material or walkthrough examples, as well as more widely advertising existing training material and drop-in sessions.
    → Training will be most effective if users explicitly state what they need, so that it can focus on the areas of greatest value for current needs.
    → Some users suggested the provision of blank recipe templates and example recipes, as well as standard recipes (e.g. Taylor diagram for mean) where users need only change the variable / simulation / region of interest.
    → Assistance with porting individuals’ evaluation scripts into ESMValTool.
    → Ability to use non-CMORised data
    → Enabling flags for comparison with subsets of CMIP5 or CMIP6 models (so users don’t have to know specific models to call when running scripts).
  • There continues to be much interest in the ability to run ESMValTool directly from Jupyter notebooks (see the sketch after this list).
  • Workshop attendees expressed concern about the compatibility of the ERA5 dataset with ESMValTool, as well as how to use ESMValTool as a command line tool.
  • Difficulty was encountered when attempting to add observational datasets to Gadi.
  • Placement and/or storage of data from initial runs (requires a shared space on NCI)
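
On the notebook point above, the sketch below shows what running a recipe interactively could look like, assuming ESMValCore's experimental Python API (get_recipe / Recipe.run); that API is flagged as experimental upstream, so names and behaviour may differ between versions.

```python
# Sketch only: assumes ESMValCore's experimental notebook API, which may change between
# versions; recipe_python.yml is one of the example recipes shipped with ESMValTool.
from esmvalcore.experimental import get_recipe

recipe = get_recipe("examples/recipe_python.yml")  # load the recipe definition
output = recipe.run()                              # run it inside the notebook session
print(output)                                      # summary of the plots and files produced
```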

Hackathon v2.0

A second Hackathon has been proposed, potentially to be held alongside the ACCESS Community Workshop in Canberra in September 2024.

In addition to community-wide discussions of the above points, some high priority computing tasks to undertake during a second Hackathon could be:

  • Renaming of recipes to make them easier to find
  • Testing and debugging existing ESMValTool recipes
  • Documentation of existing ESMValTool recipes (currently very poor or non-existent in many cases)
  • Conversion of personal scripts to ESMValTool recipes
  • Porting or linking existing COSIMA recipes to the workflow
  • Creation of new recipes
  • Checking PCMDI examples for new recipe ideas
  • Correcting non-SI units in ESMValTool scripts

Hopefully this summary will be of use in guiding planning towards the establishment of workflows and in ensuring that our resources are aimed at priority areas.