Join in! Investigating analysis-ready data (ARD) strategies to increase impact of ocean and climate model archives at NCI

:rocket: vn19 :rocket:

This NCI project provides resources to develop and test analysis-ready data (ARD) workflows for climate and ocean modelling on high-performance computing (HPC) systems at NCI. Supported by the CSIRO, it aims to bring together members of the COSIMA community, the Australian Climate Service*, and other interested parties to explore and develop these workflows.

*NB: recent ARD discussions have already spawned ACS efforts on the Coupled Coastal Hazard Prediction System (CCHaPS), a SCHISM-WWMIII model used in ACS WP3

Project Goals:

The ARD project focuses on:

  • Developing and testing ARD workflows using dedicated storage and compute resources at NCI
  • Crowdsourcing use-cases and solutions from various organisations and projects, including COSIMA and the Australian Climate Service (ACS)
  • Fostering community learning and collaboration
  • Exploring the value of using these approaches more systematically for a range of projects and organisations.

Stretch Goals

  • Develop a small, simple package focused on NCI HPC python workflows
  • Enable new scientific publications ( example: Chapman et al 2024 (submitted), Extreme Ocean Conditions in a Western Boundary Current through ‘Oceanic Blocking’ )
  • Publish data science workflows

Resources

The project has been allocated the following resources by NCI:

  • 100 kSU of compute
  • 5 TB gdata storage
  • 50 TB scratch storage

These resources, while modest, provide a foundation for trialling real-world workflows and developing solutions.

Code of Conduct

To ensure a productive and collaborative environment, those involved will be guided by the following principles:

  • Be welcoming and kind to each other
  • Open source only (MIT license)
  • A safe space to discuss the data science that enables science
  • Be generous with knowledge and resources
  • Follow research integrity guidelines ( Australian Code for the Responsible Conduct of Research 2018 )
    • Authorship criteria
      • acquisition of research data where the acquisition has required significant intellectual judgement, planning, design, or input
      • analysis or interpretation of research data
    • Communicate early and often about planned publications
    • As a project evolves, it is important to continue to discuss authorship, especially if new people become involved in the research and make a significant intellectual or scholarly contribution
  • Resources are limited:
    • vn19 is a sandbox: back up anything important elsewhere, and don’t rely completely on vn19 for your actual deadlines and deliverables
    • vn19 is ultimately supported by the CSIRO share at NCI and this may influence future priorities for use of resources
    • Resourcing decisions (compute and storage) will be brought openly to the community but the Project lead reserves the right to be a benevolent dictator to maintain institutional support for the project.

Get Involved

We encourage community participation. Here’s how you can get involved:

  1. Say hello and present your use-case or problem here on the ACCESS-Hive Forum
  2. [If you are ready] write an issue in any open GitHub repository and make it known here on the ACCESS-Hive Forum
    1. an open COSIMA repo
    2. “best-practice example workflow for loading ensemble ACCESS-ESM1.5 data” ( Issue #8 · shared-climate-data-problems/CMIP-data-problems · GitHub; via @jemmajeffree’s issue “Using intake-esm to load an ensemble: real-world problem that might lead to a tutorial/example”, Issue #444 · COSIMA/cosima-recipes · GitHub )
    3. any other open repo
  3. Join the vn19 NCI project ( via MyNCI )

Thanks, Thomas.

A couple of ideas for what this could look at to make data analysis-ready:

  • Analysis-ready data is more important at higher resolutions and with larger volumes of data. For COSIMA, are there common 3D+daily ocean fields that folks struggle to analyse due to slow processing? (A rechunking sketch follows this list.)
  • Can we spend time to make analysis more agnostic of the data product used? Whilst lots of data products define their own names for variables, compliance with CF standards is reasonably common, so can CF compliance be leveraged to access data through its CF attributes? (See the cf_xarray sketch after this list.)
  • It’s very common to need to find grids (and their attributes) when doing analysis; are there ways to attach these grids automatically to data products? (e.g. Encoding grid information · Issue #112 · ACCESS-NRI/access-nri-intake-catalog · GitHub)
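
To make the first idea concrete, here is a minimal rechunking sketch in the xarray/dask/Zarr stack that most of these workflows sit on. Everything specific in it is an illustrative assumption, not a vn19 prescription: the paths, the variable name `temp`, the ACCESS-OM2/MOM5-style dimension names, and the chunk sizes all need tuning to the actual field and access pattern.

```python
# A minimal sketch (not a vn19 prescription): rewrite a 3D+daily field
# once into an analysis-ready Zarr store, chunked for the intended analysis.
# Paths, the variable name ("temp"), the dimension names (ACCESS-OM2-style)
# and the chunk sizes are illustrative assumptions only.
import xarray as xr

# Lazily open the raw model output (many netCDF files, dask-backed).
ds = xr.open_mfdataset(
    "/g/data/<project>/raw/ocean_daily_3d_*.nc",  # hypothetical path
    combine="by_coords",
    chunks={},  # one dask chunk per file; the real chunking happens below
)

# Rechunk for the access pattern: long time series at a point favour large
# time chunks; full-depth or basin-wide snapshots favour large spatial chunks.
ard = ds[["temp"]].chunk(
    {"time": 365, "st_ocean": -1, "yt_ocean": 300, "xt_ocean": 300}
)

# Pay the processing cost once; downstream analyses read this copy instead.
ard.to_zarr("/scratch/<project>/ard/temp_daily.zarr", mode="w")
```

For stores too large to rewrite in one pass, something like the rechunker package can do the same job within a fixed memory budget.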
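On the second idea, one existing tool that leverages CF attributes is the cf_xarray package, which resolves variables and axes by their CF metadata rather than by product-specific names. A minimal sketch, assuming a hypothetical file path and a product that carries CF-compliant standard_name attributes:

```python
# A minimal sketch of product-agnostic access via CF attributes, using
# the cf_xarray package; the file path is a hypothetical placeholder.
import cf_xarray  # noqa: F401 -- importing registers the .cf accessor
import xarray as xr

ds = xr.open_dataset("/g/data/<project>/some_product.nc")  # hypothetical path

# The same line works whether the product names the variable "temp",
# "thetao" or "TEMP", as long as its standard_name attribute is set.
sst = ds.cf["sea_surface_temperature"]

# Coordinates can be addressed by CF axis name too, e.g. a time mean:
sst_time_mean = sst.cf.mean("T")
```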

3D+daily ocean fields are indeed a place where I’ve been forced to think about making ARD collections to get results. In my case that’s recently been the BRAN2020 (1993-2023) reanalysis.

Can I echo your question @anton: what 3D+daily ocean fields do current COSIMA staff and students care about? @adele157 @PSpence @edoddridge et al?

Also noting that as work spins up “to identify the mechanisms linking modes of climate variability to weather regimes” the need to move to higher frequency data is likely.

Even in an ocean that is “slow” compared to the atmosphere, monthly output may no longer be “enough” for applications? And if, in terms of those mechanisms, you care about anything below the ocean surface then 3d+daily ocean fields may become higher priority in the near future?

Hey @sb4233, thanks for jumping aboard! When possible, are you able to open an issue somewhere public ( perhaps here: Issues · COSIMA/cosima-recipes · GitHub ) to flesh out the details of your specific OM2 use-case that you’re trying to make “better-faster-stronger”? ( i.e.: How to efficiently chunk data for faster processing and plotting? - #6 by sb4233 )

Thanks!


One candidate for 3D daily (or even sub-daily) fields is volume and tracer transport, and decomposing these into different eddy/mean components (a minimal sketch of the computation follows). @adele and @claireyung have both done some very impressive work in this space.
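
For readers following along, here is a minimal sketch of what such a decomposition involves computationally. This is just the standard Reynolds-style split, not the specific method used in the work mentioned above, and the Zarr path, variable names and dimension name are placeholder assumptions:

```python
# A minimal Reynolds-style eddy/mean decomposition of meridional tracer
# transport from daily v and temp fields. The Zarr path and the variable/
# dimension names are illustrative assumptions, not a specific model layout.
import xarray as xr

ds = xr.open_zarr("/scratch/<project>/ard/vt_daily.zarr")  # hypothetical ARD store
v, T = ds["v"], ds["temp"]

# Time-mean and eddy (deviation-from-the-mean) components.
v_mean, T_mean = v.mean("time"), T.mean("time")
v_eddy, T_eddy = v - v_mean, T - T_mean

# Decompose the time-mean transport: <vT> = <v><T> + <v'T'>.
mean_transport = v_mean * T_mean
eddy_transport = (v_eddy * T_eddy).mean("time")
```

The eddy term is exactly why daily (or sub-daily) 3D fields matter here: it cannot be recovered from monthly-mean output.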
