Investigating analysis-ready data (ARD) strategies to increase impact of ocean and climate model archives at NCI
The purpose of this NCI project is to provide resources to develop and test Analysis-Ready Data (ARD) workflows for climate and ocean modelling on High-Performance Computing (HPC) systems at NCI. This project, supported by the CSIRO, aims to bring together members of the COSIMA community, the Australian Climate Service*, and other interested parties to explore and develop ARD workflows.
*NB: recent ARD discussions have already spawned ACS efforts on the Coupled Coastal Hazard Prediction System (CCHaPS), a SCHISM-WWMIII model used in ACS WP3.
Project Goals:
The ARD project focuses on:
Developing and testing ARD workflows using dedicated storage and compute resources at NCI
Crowdsourcing use cases and solutions from various organisations and projects, including COSIMA and the Australian Climate Service (ACS)
Fostering community learning and collaboration
Exploring the value of using these approaches more systematically for a range of projects and organisations.
Stretch Goals
Develop a small, simple package focused on NCI HPC Python workflows
Enable new scientific publications (example: Chapman et al. 2024 (submitted), Extreme Ocean Conditions in a Western Boundary Current through "Oceanic Blocking")
Publish data science workflows
Resources
The project has been allocated the following resources by NCI:
100 kSU of compute
5 TB gdata storage
50 TB scratch storage
These resources, while modest, provide a foundation for trialing real-life workflows and developing solutions.
Code of Conduct
To ensure a productive and collaborative environment, those involved will be guided by the following principles:
Be welcoming and kind to each other
Open source only (MIT license)
A safe space to discuss the data science that enables science
Authorship should reflect a significant intellectual or scholarly contribution, for example:
acquisition of research data where the acquisition has required significant intellectual judgement, planning, design, or input
analysis or interpretation of research data
Communicate early and often about planned publications
As a project evolves, it is important to continue to discuss authorship, especially if new people become involved in the research and make a significant intellectual or scholarly contribution
Resources are limited:
vn19 is a sandbox - back up anything important elsewhere and don't rely solely on vn19 for your actual deadlines and deliverables
vn19 is ultimately supported by the CSIRO share at NCI and this may influence future priorities for use of resources
Resourcing decisions (compute and storage) will be brought openly to the community but the Project lead reserves the right to be a benevolent dictator to maintain institutional support for the project.
Get Involved
We encourage community participation. Here's how you can get involved:
A couple of ideas for what this could look like, to make data analysis-ready:
Analysis-ready data is more important at higher resolutions and with larger volumes of data. For COSIMA, are there common 3D+daily ocean fields that folks struggle to analyse due to slow processing?
Can we spend time making analysis more agnostic of the data product used? While many data products define their own variable names, compliance with the CF conventions is reasonably common, so can that CF compliance be leveraged to access data through its CF attributes?
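As a rough sketch of that idea, the snippet below uses the cf_xarray package to look up variables and coordinates via their CF attributes (standard_name, axis) rather than product-specific names. The file path and standard names here are placeholders; the same pattern should apply to any reasonably CF-compliant product.

```python
import xarray as xr
import cf_xarray  # noqa: F401 -- registers the .cf accessor on xarray objects

# Hypothetical path: any CF-compliant ocean product could be dropped in here.
ds = xr.open_mfdataset("/g/data/<project>/<product>/ocean_daily_3d_*.nc",
                       parallel=True)

# Select variables by CF standard_name instead of product-specific names,
# so the same analysis code runs against different model/reanalysis outputs.
temp = ds.cf["sea_water_potential_temperature"]
salt = ds.cf["sea_water_salinity"]

# Coordinates can likewise be addressed by CF axis/role ("T", "X", "Y", "Z",
# "latitude", "longitude", ...) without knowing what the product calls them.
time_name = temp.cf["T"].name
temp_time_mean = temp.mean(time_name)
```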
3D+daily ocean fields are indeed a place where I've been forced to think about making ARD collections to get results. In my case that's recently been the BRAN2020 (1993-2023) reanalysis.
Can I echo your question @anton: what 3D+daily ocean fields do current COSIMA staff and students care about? @adele157 @PSpence @edoddridge et al.?
Also noting that as work spins up "to identify the mechanisms linking modes of climate variability to weather regimes", the need to move to higher-frequency data is likely.
Even in an ocean that is "slow" compared to the atmosphere, monthly output may no longer be "enough" for applications? And if, in terms of those mechanisms, you care about anything below the ocean surface, then 3D+daily ocean fields may become a higher priority in the near future?
One candidate for 3D daily (or even sub-daily) fields is volume and tracer transport and decomposing these into different eddy/mean components. @adele and @claireyung have both done some very impressive work in this space.
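For anyone new to this, here is a minimal sketch of what an eddy/mean (Reynolds) decomposition of a tracer transport looks like in xarray; the path and variable names (v, temp) are placeholders, and the real COSIMA workflows (e.g. binning along contours in density space) are considerably more involved.

```python
import xarray as xr

# Placeholder path and variable names for daily 3D meridional velocity and
# temperature; swap in the relevant model output.
ds = xr.open_mfdataset("/g/data/<project>/ocean_daily_3d_*.nc", parallel=True)
v, theta = ds["v"], ds["temp"]

# Decompose the time-mean transport into mean-flow and eddy contributions:
#   mean(v*theta) = mean(v)*mean(theta) + mean(v'*theta')
v_bar, theta_bar = v.mean("time"), theta.mean("time")
v_prime, theta_prime = v - v_bar, theta - theta_bar

mean_transport = v_bar * theta_bar                      # mean-flow term
eddy_transport = (v_prime * theta_prime).mean("time")   # eddy covariance term

# Sanity check: the two terms should sum to the total time-mean transport.
total_transport = (v * theta).mean("time")
```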
Thanks Ed! It would be helpful to have some processing of data for eddy-mean decompositions - that could make analysis much easier, especially for the 1/10th-degree and higher-resolution models. I had a think about this, but I'm struggling with the balance between being useful to a large group of people and being specific enough that it actually improves efficiency.
For example: to compute eddy-mean transport across an isobath or SSH contour, binned in density space, you need a huge amount of 3D daily data. Usually for this type of computation I ran lots of parallel jobs on gadi (one per month, e.g. following cosima-recipes/Tutorials/Submitting_analysis_jobs_to_gadi.ipynb in the COSIMA/cosima-recipes repository on GitHub), saved the output, and then combined it later (I suppose this splitting up is a project-specific analysis-ready workflow). Since you need to choose your variable, contour, and density bins, most of the data is very project-specific.
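A bare-bones sketch of that "one job per month, combine later" pattern is below; the paths, variable names, and compute_transport function are all placeholders for the project-specific pieces, and the actual PBS submission loop is covered in the cosima-recipes tutorial linked above.

```python
import sys
import xarray as xr

def compute_transport(ds):
    # Placeholder for the project-specific step (e.g. binning transport
    # across a chosen contour into density classes).
    return ds["ty_trans"].sum("xt_ocean")

if __name__ == "__main__":
    # Month label (e.g. "1995-07") passed in by the PBS submission loop,
    # one job per month.
    month = sys.argv[1]
    ds = xr.open_mfdataset(f"/g/data/<project>/ocean_daily_3d_*_{month}.nc")
    compute_transport(ds).to_netcdf(
        f"/scratch/<project>/ard/transport_{month}.nc")

# A single small follow-up job then stitches the monthly pieces together:
#   combined = xr.open_mfdataset("/scratch/<project>/ard/transport_*.nc",
#                                combine="by_coords")
```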
Potentially, rechunking and saving some data only at particular latitude bands might help Antarctic (or other regional) transport calculations by reducing the memory overhead (I have no idea how much speed-up this gives, though). Doing some general conversions of thermodynamic variables, e.g. practical salinity to absolute salinity, or daily T and S to different pot_rho fields, might also skip a few steps for people doing density binning, but it requires a lot of space to save.
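To make that concrete, here is a rough sketch of what such a pre-processed collection could look like, using the TEOS-10 gsw package for the salinity/density conversions. The MOM5-style variable and coordinate names (temp, salt, xt_ocean, yt_ocean, st_ocean) and all paths are placeholders, and it assumes temp is potential temperature; adjust for the model in question.

```python
import xarray as xr
import gsw  # TEOS-10 conversions (GSW-Python)

# Placeholder path and MOM5-style names.
ds = xr.open_mfdataset("/g/data/<project>/ocean_daily_3d_ts_*.nc", parallel=True)

# Subset to a latitude band first (e.g. for Antarctic transport work) so the
# rechunked output stays a manageable size.
band = ds.sel(yt_ocean=slice(-80, -55))

# Pressure from depth, then SA / CT / potential density via gsw, wrapped in
# apply_ufunc so the computation stays lazy and dask-friendly.
p = xr.apply_ufunc(gsw.p_from_z, -band["st_ocean"], band["yt_ocean"],
                   dask="parallelized", output_dtypes=[float])
sa = xr.apply_ufunc(gsw.SA_from_SP, band["salt"], p,
                    band["xt_ocean"], band["yt_ocean"],
                    dask="parallelized", output_dtypes=[float])
ct = xr.apply_ufunc(gsw.CT_from_pt, sa, band["temp"],  # assumes temp is potential temperature
                    dask="parallelized", output_dtypes=[float])
sigma2 = xr.apply_ufunc(gsw.sigma2, sa, ct,
                        dask="parallelized", output_dtypes=[float])

# Rechunk so each horizontal tile holds the full time series (convenient for
# density binning and time-series work), then write once for everyone to reuse.
out = xr.Dataset({"sa": sa, "ct": ct, "sigma2": sigma2})
out = out.chunk({"time": -1, "st_ocean": -1, "yt_ocean": 50, "xt_ocean": 50})
out.to_zarr("/scratch/<project>/ard/ts_sigma2_band.zarr", mode="w")
```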
Maybe computing EKE from daily 3D velocities of the RYF/IAF runs for, say, each month/year/decade (noting you pick up different processes in each averaging period) is a good candidate for something fairly general that people would use?
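As a minimal illustration of the EKE idea (variable names and the path are placeholders; the averaging period that defines the "mean" flow is exactly the month/year/decade choice mentioned above, with an annual mean used here):

```python
import xarray as xr

# Placeholder path and variable names for daily 3D velocities.
ds = xr.open_mfdataset("/g/data/<project>/iaf/ocean_daily_3d_uv_*.nc",
                       parallel=True)
u, v = ds["u"], ds["v"]

# "Mean" flow = annual mean; the daily anomaly about it is the "eddy" part.
# Swap "time.year" for a monthly or decadal grouping to pick up different processes.
gb_u, gb_v = u.groupby("time.year"), v.groupby("time.year")
u_prime = gb_u - gb_u.mean("time")
v_prime = gb_v - gb_v.mean("time")

# Eddy kinetic energy for each year of the record.
eke = (0.5 * (u_prime**2 + v_prime**2)).groupby("time.year").mean("time")
```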
Thanks @claireyung. Could you flesh out the detail here (assume the reader knows nothing), either in an issue (in the COSIMA repo or somewhere else public) or just here in a reply?