Join in! Investigating analysis-ready data (ARD) strategies to increase impact of ocean and climate model archives at NCI

:rocket: vn19 :rocket:


The purpose of this NCI project is to provide resources to develop and test Analysis-Ready Data (ARD) workflows for climate and ocean modelling on High-Performance Computing (HPC) systems at NCI. This project, supported by the CSIRO, aims to bring together members of the COSIMA community, the Australian Climate Service*, and other interested parties to explore and develop ARD workflows.

*NB: recent ARD discussions have already spawned ACS efforts on the Coupled Coastal Hazard Prediction System (CCHaPS) SCHISM-WWMIII model used in ACS WP3

Project Goals:

The ARD project focuses on:

  • Developing and testing ARD workflows using dedicated storage and compute resources at NCI
  • Crowdsourcing use-cases and solutions from various organisations / projects including COSIMA and the Australian Climate Service (ACS)
  • Fostering community learning and collaboration
  • Exploring the value of using these approaches more systematically for a range of projects and organisations.

Stretch Goals

  • Develop a small, simple package focused on NCI HPC python workflows
  • Enable new scientific publications ( example: Chapman et al. 2024 (submitted), Extreme Ocean Conditions in a Western Boundary Current through 'Oceanic Blocking' )
  • Publishing data science workflows

Resources

The project has been allocated the following resources by NCI:

  • 100 kSU of compute
  • 5 TB gdata storage
  • 50 TB scratch storage

These resources, while modest, provide a foundation for trialling real-life workflows and developing solutions.

Code of Conduct

To ensure a productive and collaborative environment, those involved will be guided by the following principles:

  • Be welcoming and kind to each other
  • Open source only (MIT license)
  • A safe space to discuss the data science that enables science
  • Be generous with knowledge and resources
  • Follow research integrity guidelines ( Australian Code for the Responsible Conduct of Research 2018 )
    • Authorship criteria
      • acquisition of research data where the acquisition has required significant intellectual judgement, planning, design, or input
      • analysis or interpretation of research data
    • Communicate early and often about planned publications
    • As a project evolves, it is important to continue to discuss authorship, especially if new people become involved in the research and make a significant intellectual or scholarly contribution
  • Resources are limited:
    • vn19 is a sandbox - back up anything important elsewhere and don't rely completely on vn19 for your actual deadlines and deliverables
    • vn19 is ultimately supported by the CSIRO share at NCI and this may influence future priorities for use of resources
    • Resourcing decisions (compute and storage) will be brought openly to the community but the Project lead reserves the right to be a benevolent dictator to maintain institutional support for the project.

Get Involved

We encourage community participation. Here's how you can get involved:

  1. Say hello and present your use-case or problem here on the ACCESS-Hive Forum
  2. [If you are ready] write an issue in any open GitHub repository and make it known here on the ACCESS-Hive Forum
    1. an open COSIMA repo
    2. best-practice example workflow for loading ensemble ACCESS-ESM1.5 data · Issue #8 · shared-climate-data-problems/CMIP-data-problems · GitHub ( via @jemmajeffree's issue Using intake-esm to load an ensemble: real-world problem that might lead to a tutorial/example · Issue #444 · COSIMA/cosima-recipes · GitHub; a short intake-esm sketch follows this list )
    3. any other open repo
  3. Join the vn19 NCI project ( Log in | MyNCI )
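
For the ensemble-loading use-case in 2.2 above, here is a minimal, untested sketch of the kind of intake-esm pattern being discussed; the datastore path and the search column names are assumptions, not the real catalogue:

```python
# Minimal sketch only: the datastore JSON path and the search columns
# ("variable", "member") are assumptions and will differ for your catalogue.
import intake

# Open an intake-esm datastore describing ACCESS-ESM1.5 ensemble members
col = intake.open_esm_datastore("/path/to/access-esm1-5-datastore.json")  # hypothetical path

# Subset to one variable across several ensemble members
subset = col.search(variable="tos", member=["r1i1p1f1", "r2i1p1f1"])

# Lazily load the matching files as a dict of xarray Datasets
dsets = subset.to_dataset_dict()
```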

Thanks Thomas

A couple of ideas for what this could look at to make data analysis-ready:

  • Analysis-ready data is more important at higher resolutions and with larger volumes of data. For COSIMA, are there common 3D+daily ocean fields that folks struggle to analyse due to slow processing?
  • Can we spend time making analysis more agnostic of the data product used? Whilst lots of data products define their own names for variables, compliance with CF standards is reasonably common, so can CF compliance be leveraged to access data through its CF attributes? (a minimal sketch follows this list)
  • It's very common to need to find grids (and their attributes) when doing analysis; are there ways to attach these grids automatically to data products? (e.g. Encoding grid information · Issue #112 · ACCESS-NRI/access-nri-intake-catalog · GitHub)
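
For the CF-attributes idea, something like this untested sketch with the cf_xarray accessor; the file path, chunk sizes and standard_name are placeholders, not a recommendation:

```python
# Minimal sketch: select a variable by CF standard_name rather than its
# product-specific name. File path, chunks and standard_name are assumptions.
import xarray as xr
import cf_xarray  # noqa: F401  (registers the .cf accessor on xarray objects)

# Open daily 3D output lazily with dask chunks
ds = xr.open_mfdataset(
    "/path/to/output/ocean_daily_3d_*.nc",  # hypothetical path
    chunks={"time": 30},
    parallel=True,
)

# The same analysis code works whether the product calls this "temp", "thetao", ...
theta = ds.cf["sea_water_potential_temperature"]
print(theta.dims, theta.chunks)
```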

3D+daily ocean fields are indeed a place where I've been forced to think about making ARD collections to get results. In my case that's recently been the BRAN2020 (1993-2023) reanalysis.

Can I echo your question @anton: what 3D+daily ocean fields do current COSIMA staff and students care about? @adele157 @PSpence @edoddridge et al?

Also noting that as work spins up "to identify the mechanisms linking modes of climate variability to weather regimes" the need to move to higher frequency data is likely.

Even in an ocean that is "slow" compared to the atmosphere, monthly output may no longer be "enough" for applications? And if, in terms of those mechanisms, you care about anything below the ocean surface then 3D+daily ocean fields may become higher priority in the near future?

Hey @sb4233, thanks for jumping aboard! When possible, are you able to open an issue somewhere public ( perhaps here: Issues · COSIMA/cosima-recipes · GitHub ) to flesh out the details of your specific OM2 use-case that you're trying to make "better-faster-stronger"? ( i.e.: How to efficiently chunk data for faster processing and plotting? - #6 by sb4233 )
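
For context, the kind of detail worth spelling out there looks roughly like this untested rechunking sketch; the dimension and variable names are ACCESS-OM2 style guesses, and the paths and chunk sizes are placeholders rather than recommendations:

```python
# Minimal sketch: rewrite daily 3D output with chunks suited to long time series
# at a point/region. Dimension/variable names, paths and chunk sizes are
# assumptions; tune them (or use the rechunker package) for real data volumes.
import xarray as xr

ds = xr.open_mfdataset(
    "/path/to/output/ocean_daily_3d_temp_*.nc",  # hypothetical path
    chunks={"time": 30},
    parallel=True,
)

# Contiguous in time, small in space: good for time-series/extremes analysis
ard = ds["temp"].chunk({"time": -1, "st_ocean": 1, "yt_ocean": 100, "xt_ocean": 100})
ard.to_dataset(name="temp").to_zarr("/scratch/vn19/placeholder/temp_daily.zarr", mode="w")
```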

Thanks!


One candidate for 3D daily (or even sub-daily) fields is volume and tracer transport and decomposing these into different eddy/mean components. @adele and @claireyung have both done some very impressive work in this space.


Thanks Ed! It would be helpful to have some processing of data for eddy-mean decompositions - that could make analysis much easier, especially for the 1/10th degree and higher resolution models. I had a think about this but I'm struggling with the balance between usefulness to a large group of people vs being specific enough that it improves efficiency.

For example: to compute eddy-mean transport across an isobath or SSH contour binned in density space, you need a huge amount of 3D daily data. Usually for this type of computation I ran lots of parallel jobs on gadi (one for each month, e.g. cosima-recipes/Tutorials/Submitting_analysis_jobs_to_gadi.ipynb at main · COSIMA/cosima-recipes · GitHub), saved the output and then combined it later (I suppose this splitting up is a project-specific analysis-ready workflow). Since you need to choose your variable, contour and density bins, most of the data is very project-specific.
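
The split-then-combine pattern is roughly this (an untested sketch; paths, variable names and the placeholder diagnostic are illustrative only, since the real density-binned transport step is project-specific):

```python
# Minimal sketch of split-per-month processing: each PBS job handles one month
# and writes a small intermediate file; a later step concatenates them.
# Paths, variable names and the placeholder diagnostic are illustrative only.
import sys
import xarray as xr

year, month = int(sys.argv[1]), int(sys.argv[2])  # passed in by the submission script

ds = xr.open_mfdataset(f"/path/to/output/ocean_daily_3d_{year}_{month:02d}.nc")

# Stand-in for the real (project-specific) calculation, e.g. density-binned
# transport across a chosen contour; here just a zonal mean for illustration.
diag = ds["temp"].mean("xt_ocean")

diag.to_netcdf(f"/scratch/vn19/placeholder/diag_{year}_{month:02d}.nc")

# Later, combine the monthly pieces:
# combined = xr.open_mfdataset("/scratch/vn19/placeholder/diag_*.nc", combine="by_coords")
```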

Potentially, rechunking and saving some data only at particular latitude bands might help Antarctic (or other regional) transport calculations by reducing the memory overhead (I have no idea of the amount of speed-up you get from this, though). Doing some general conversions of thermodynamic variables, e.g. practical salinity to absolute salinity or daily T and S to different pot_rho's, might skip a few steps for people doing density binning, but it requires a lot of space to save.
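
Those conversions would look something like this untested gsw (TEOS-10) sketch; variable and coordinate names are ACCESS-OM2 style guesses, and treating depth as pressure is a shortcut a real workflow would replace:

```python
# Minimal sketch: practical salinity -> absolute salinity -> potential density
# anomaly (sigma0). Names and the depth-as-pressure shortcut are assumptions;
# a real workflow would use proper pressure and the 2D geolon_t/geolat_t coords.
import gsw
import xarray as xr

ds = xr.open_mfdataset("/path/to/output/ocean_daily_3d_*.nc", chunks={"time": 30})

p = ds["st_ocean"]  # rough: treat depth (m) as pressure (dbar)

# xr.apply_ufunc broadcasts the 1D coords against the 4D fields and stays lazy
SA = xr.apply_ufunc(gsw.SA_from_SP, ds["salt"], p, ds["xt_ocean"], ds["yt_ocean"],
                    dask="parallelized", output_dtypes=[float])
CT = xr.apply_ufunc(gsw.CT_from_pt, SA, ds["temp"],  # MOM5 "temp" is potential temperature
                    dask="parallelized", output_dtypes=[float])
sigma0 = xr.apply_ufunc(gsw.sigma0, SA, CT,
                        dask="parallelized", output_dtypes=[float])
```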

Maybe computing EKE from daily 3D velocities for, say, each month/year/decade (noting you pick up different processes in each time period) of the RYF/IAF is a good candidate for something people use that is fairly general?
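
As an untested sketch of that EKE idea, with the mean defined per calendar month (variable names, dimension names and paths are placeholders):

```python
# Minimal sketch: monthly eddy kinetic energy from daily velocities, where the
# "mean" flow is the within-month average. Names and paths are assumptions.
import xarray as xr

ds = xr.open_mfdataset("/path/to/output/ocean_daily_3d_uv_*.nc", chunks={"time": 30})


def monthly_eke(month_ds):
    """0.5 * ((u - <u>)^2 + (v - <v>)^2), time-averaged within the month."""
    up = month_ds["u"] - month_ds["u"].mean("time")
    vp = month_ds["v"] - month_ds["v"].mean("time")
    return (0.5 * (up ** 2 + vp ** 2)).mean("time")


eke = ds.resample(time="1MS").map(monthly_eke)  # one EKE field per month
eke.to_netcdf("/scratch/vn19/placeholder/eke_monthly.nc")
```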


Thanks @claireyung. Could you flesh out the detail here ( assume the reader knows nothing ) either in an issue ( COSIMA repo or somewhere else public ) or just here in a reply?