Investigating analysis-ready data (ARD) strategies to increase impact of ocean and climate model archives at NCI
The purpose of this NCI project is to provide resources to develop and test Analysis-Ready Data (ARD) workflows for climate and ocean modelling on High-Performance Computing (HPC) systems at NCI. This project, supported by the CSIRO, aims to bring together members of the COSIMA community, the Australian Climate Service*, and other interested parties to explore and develop ARD workflows.
*NB: recent ARD discussions have already spawned ACS efforts on the Coupled Coastal Hazard Prediction System (CCHaPS), a SCHISM-WWMIII model used in ACS WP3.
Project Goals:
The ARD project focuses on:
Developing and testing ARD workflows using dedicated storage and compute resources at NCI
Crowdsourcing use cases and solutions from various organisations and projects, including COSIMA and the Australian Climate Service (ACS)
Fostering community learning and collaboration
Exploring the value of using these approaches more systematically for a range of projects and organisations.
Stretch Goals
Develop a small, simple package focused on NCI HPC Python workflows
Enable new scientific publications (example: Chapman et al. 2024 (submitted), Extreme Ocean Conditions in a Western Boundary Current through "Oceanic Blocking")
Publish data science workflows
Resources
The project has been allocated the following resources by NCI:
100 kSU of compute
5 TB gdata storage
50 TB scratch storage
These resources, while modest, provide a foundation for trialing real-life workflows and developing solutions.
Code of Conduct
To ensure a productive and collaborative environment, those involved will be guided by the following principles:
Be welcoming and kind to each other
Open source only (MIT license)
A safe space to discuss the data science that enables science
Authorship should reflect a significant intellectual or scholarly contribution, for example:
acquisition of research data where the acquisition has required significant intellectual judgement, planning, design, or input
analysis or interpretation of research data
Communicate early and often about planned publications
As a project evolves, it is important to continue to discuss authorship, especially if new people become involved in the research and make a significant intellectual or scholarly contribution
Resources are limited:
vn19 is a sandbox - back up anything important elsewhere and don't rely solely on vn19 for your actual deadlines and deliverables
vn19 is ultimately supported by the CSIRO share at NCI and this may influence future priorities for use of resources
Resourcing decisions (compute and storage) will be brought openly to the community but the Project lead reserves the right to be a benevolent dictator to maintain institutional support for the project.
Get Involved
We encourage community participation. Here's how you can get involved:
A couple of ideas for what this could look like, to make data analysis-ready:
Analysis-ready data is more important at higher resolutions and with larger volumes of data. For COSIMA, are there common 3D+daily ocean fields that folks struggle to analyse due to slow processing?
Can we spend time making analysis more agnostic of the data product used? While many data products define their own variable names, compliance with the CF conventions is reasonably common, so can that CF compliance be leveraged to access data through its CF attributes?
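As a rough sketch of that idea, the snippet below uses the cf_xarray package to look up variables and coordinates via their CF attributes (standard_name, axis) rather than product-specific names. The file path and standard names here are placeholders; the same pattern should apply to any reasonably CF-compliant product.

```python
import xarray as xr
import cf_xarray  # noqa: F401 -- registers the .cf accessor on xarray objects

# Hypothetical path: any CF-compliant ocean product could be dropped in here.
ds = xr.open_mfdataset("/g/data/<project>/<product>/ocean_daily_3d_*.nc",
                       parallel=True)

# Select variables by CF standard_name instead of product-specific names,
# so the same analysis code runs against different model/reanalysis outputs.
temp = ds.cf["sea_water_potential_temperature"]
salt = ds.cf["sea_water_salinity"]

# Coordinates can likewise be addressed by CF axis/role ("T", "X", "Y", "Z",
# "latitude", "longitude", ...) without knowing what the product calls them.
time_name = temp.cf["T"].name
temp_time_mean = temp.mean(time_name)
```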
3D+daily ocean fields are indeed a place where I've been forced to think about making ARD collections to get results. In my case that's recently been the BRAN2020 (1993-2023) reanalysis.
Can I echo your question @anton: what 3D+daily ocean fields do current COSIMA staff and students care about? @adele157 @PSpence @edoddridge et al.?
Also noting that as work spins up "to identify the mechanisms linking modes of climate variability to weather regimes", the need to move to higher-frequency data is likely.
Even in an ocean that is "slow" compared to the atmosphere, monthly output may no longer be "enough" for applications? And if, in terms of those mechanisms, you care about anything below the ocean surface, then 3D+daily ocean fields may become a higher priority in the near future?
One candidate for 3D daily (or even sub-daily) fields is volume and tracer transport and decomposing these into different eddy/mean components. @adele and @claireyung have both done some very impressive work in this space.
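For anyone new to this, here is a minimal sketch of what an eddy/mean (Reynolds) decomposition of a tracer transport looks like in xarray; the path and variable names (v, temp) are placeholders, and the real COSIMA workflows (e.g. binning along contours in density space) are considerably more involved.

```python
import xarray as xr

# Placeholder path and variable names for daily 3D meridional velocity and
# temperature; swap in the relevant model output.
ds = xr.open_mfdataset("/g/data/<project>/ocean_daily_3d_*.nc", parallel=True)
v, theta = ds["v"], ds["temp"]

# Decompose the time-mean transport into mean-flow and eddy contributions:
#   mean(v*theta) = mean(v)*mean(theta) + mean(v'*theta')
v_bar, theta_bar = v.mean("time"), theta.mean("time")
v_prime, theta_prime = v - v_bar, theta - theta_bar

mean_transport = v_bar * theta_bar                      # mean-flow term
eddy_transport = (v_prime * theta_prime).mean("time")   # eddy covariance term

# Sanity check: the two terms should sum to the total time-mean transport.
total_transport = (v * theta).mean("time")
```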
Thanks Ed! It would be helpful to have some processing of data for eddy-mean decompositions - that could make analysis much easier, especially for the 1/10th-degree and higher-resolution models. I had a think about this, but I'm struggling with the balance between being useful to a large group of people and being specific enough that it actually improves efficiency.
For example: to compute eddy-mean transport across an isobath or SSH contour, binned in density space, you need a huge amount of 3D daily data. Usually for this type of computation I ran lots of parallel jobs on gadi (one per month, e.g. following cosima-recipes/Tutorials/Submitting_analysis_jobs_to_gadi.ipynb in the COSIMA/cosima-recipes repository on GitHub), saved the output, and then combined it later (I suppose this splitting up is a project-specific analysis-ready workflow). Since you need to choose your variable, contour, and density bins, most of the data is very project-specific.
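A bare-bones sketch of that "one job per month, combine later" pattern is below; the paths, variable names, and compute_transport function are all placeholders for the project-specific pieces, and the actual PBS submission loop is covered in the cosima-recipes tutorial linked above.

```python
import sys
import xarray as xr

def compute_transport(ds):
    # Placeholder for the project-specific step (e.g. binning transport
    # across a chosen contour into density classes).
    return ds["ty_trans"].sum("xt_ocean")

if __name__ == "__main__":
    # Month label (e.g. "1995-07") passed in by the PBS submission loop,
    # one job per month.
    month = sys.argv[1]
    ds = xr.open_mfdataset(f"/g/data/<project>/ocean_daily_3d_*_{month}.nc")
    compute_transport(ds).to_netcdf(
        f"/scratch/<project>/ard/transport_{month}.nc")

# A single small follow-up job then stitches the monthly pieces together:
#   combined = xr.open_mfdataset("/scratch/<project>/ard/transport_*.nc",
#                                combine="by_coords")
```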
Potentially, rechunking and saving some data only at particular latitude bands might help Antarctic (or other regional) transport calculations by reducing the memory overhead (I have no idea how much speed-up this gives, though). Doing some general conversions of thermodynamic variables, e.g. practical salinity to absolute salinity, or daily T and S to different pot_rho fields, might also skip a few steps for people doing density binning, but it requires a lot of space to save.
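To make that concrete, here is a rough sketch of what such a pre-processed collection could look like, using the TEOS-10 gsw package for the salinity/density conversions. The MOM5-style variable and coordinate names (temp, salt, xt_ocean, yt_ocean, st_ocean) and all paths are placeholders, and it assumes temp is potential temperature; adjust for the model in question.

```python
import xarray as xr
import gsw  # TEOS-10 conversions (GSW-Python)

# Placeholder path and MOM5-style names.
ds = xr.open_mfdataset("/g/data/<project>/ocean_daily_3d_ts_*.nc", parallel=True)

# Subset to a latitude band first (e.g. for Antarctic transport work) so the
# rechunked output stays a manageable size.
band = ds.sel(yt_ocean=slice(-80, -55))

# Pressure from depth, then SA / CT / potential density via gsw, wrapped in
# apply_ufunc so the computation stays lazy and dask-friendly.
p = xr.apply_ufunc(gsw.p_from_z, -band["st_ocean"], band["yt_ocean"],
                   dask="parallelized", output_dtypes=[float])
sa = xr.apply_ufunc(gsw.SA_from_SP, band["salt"], p,
                    band["xt_ocean"], band["yt_ocean"],
                    dask="parallelized", output_dtypes=[float])
ct = xr.apply_ufunc(gsw.CT_from_pt, sa, band["temp"],  # assumes temp is potential temperature
                    dask="parallelized", output_dtypes=[float])
sigma2 = xr.apply_ufunc(gsw.sigma2, sa, ct,
                        dask="parallelized", output_dtypes=[float])

# Rechunk so each horizontal tile holds the full time series (convenient for
# density binning and time-series work), then write once for everyone to reuse.
out = xr.Dataset({"sa": sa, "ct": ct, "sigma2": sigma2})
out = out.chunk({"time": -1, "st_ocean": -1, "yt_ocean": 50, "xt_ocean": 50})
out.to_zarr("/scratch/<project>/ard/ts_sigma2_band.zarr", mode="w")
```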
Maybe computing EKE from daily 3D velocities of the RYF/IAF runs for, say, each month/year/decade (noting you pick up different processes in each averaging period) is a good candidate for something fairly general that people would use?
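As a minimal illustration of the EKE idea (variable names and the path are placeholders; the averaging period that defines the "mean" flow is exactly the month/year/decade choice mentioned above, with an annual mean used here):

```python
import xarray as xr

# Placeholder path and variable names for daily 3D velocities.
ds = xr.open_mfdataset("/g/data/<project>/iaf/ocean_daily_3d_uv_*.nc",
                       parallel=True)
u, v = ds["u"], ds["v"]

# "Mean" flow = annual mean; the daily anomaly about it is the "eddy" part.
# Swap "time.year" for a monthly or decadal grouping to pick up different processes.
gb_u, gb_v = u.groupby("time.year"), v.groupby("time.year")
u_prime = gb_u - gb_u.mean("time")
v_prime = gb_v - gb_v.mean("time")

# Eddy kinetic energy for each year of the record.
eke = (0.5 * (u_prime**2 + v_prime**2)).groupby("time.year").mean("time")
```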
Thanks @claireyung. Could you flesh out the detail here (assume the reader knows nothing), either in an issue (in the COSIMA repo or somewhere else public) or just here in a reply?