Experiment Proposal: Processing Global km-scale Hackathon Data

Experiment title :bell:: Global km-scale Hackathon Data

Summary :bell::

In May 2025, 21st Century Weather, together with NCI and ACCESS-NRI, took part in the Global km-scale Hackathon (km-Scale Hackathon | HK25 homepage). Several simulations covering 2020-03-01 to 2021-03-01 were run by teams around the world. We transferred three models to Gadi: ICON Global (2.5 km), UM Global (5 km and 10 km), and SCREAM Global (3.25 km, 128 levels), totalling ~100 TB in Zarr format.
The data is located in project qx55, where it will also be stored long-term. However, before it is ready for long-term storage, significant processing is needed to reduce inode usage (better chunking/sharding) and storage usage (higher compression). This processing requires SUs, plus some storage as a staging area.
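
To illustrate why chunking and sharding matter for inode usage, here is a minimal sketch that counts how many stored objects (roughly, files) a single Zarr array needs under different layouts. The array shape, chunk shapes and shard shape are hypothetical values chosen for illustration (hourly steps over one year on a grid of 12 * 4^10 cells); they are not the actual layouts of the ICON, UM or SCREAM datasets, and the count ignores the small metadata files.

```python
import math


def chunk_count(shape, chunks):
    """Number of blocks needed to tile an array of `shape` with blocks of `chunks`."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


def object_count(shape, chunks, shard=None):
    """Stored objects for one array: one per chunk, or one per shard when
    chunks are packed into shards (as in Zarr v3 sharding)."""
    unit = shard if shard is not None else chunks
    return chunk_count(shape, unit)


# Hypothetical array (assumption): 365 days of hourly output, 12 * 4**10 cells.
shape = (8760, 12 * 4**10)

# Hypothetical layouts (assumptions), all as (time, cell).
small_chunks = (1, 262144)      # one time step per chunk
large_chunks = (24, 262144)     # one day per chunk
shard = (240, 12 * 4**10)       # 10 days x all cells packed into one shard

print("small chunks:", object_count(shape, small_chunks))
print("large chunks:", object_count(shape, large_chunks))
print("large chunks + shards:", object_count(shape, large_chunks, shard))
```

Multiplying such counts by the number of variables and models gives a rough inode estimate for a candidate layout before any data is rewritten.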

Scientific motivation:

There were many different science goals for this data; see km-Scale Hackathon | HK25 homepage.
For this ML group, the motivation is that the data was very hard to obtain: the transfer took a month of continuous work. The data is global and of very high resolution, and could be used to train an AI emulator for downscaling. Several interesting data-driven models have also been trained on datasets with this type of grid and show good performance.

Experiment Name :bell::
People :bell:: Sam Green @sam.green
Model: UM, ICON, SCREAM
Configuration:
Initial conditions:
Run plan:
Simulation details: km-Scale Hackathon | HK25 homepage
Total KSUs required :bell:: 50 kSU
Total storage required :bell:: 50TB
Storage lifetime :bell:: Most likely this quarter only; there is a small chance it will extend into the next quarter.
Long term data plan :bell:: qx55 with further application to NCI or ACCESS-NRI to host qx55
Outputs: UM/ICON/SCREAM Zarr datasets with optimised storage/inode usage.
Restarts:
Related articles:

Analysis:

The GitHub repository digital-earths-global-hackathon/tools (pre-processing and preparation of data/simulations) has notebooks showing how to analyse the data.
The Digital Earths Global Hackathon Data Catalog is a catalogue containing all datasets that were available during the Hackathon.

Conclusion:

The Hackathon was very successful and has produced large datasets that benefit the Australian community for simulation analysis and for training ML models. If other models/datasets in the Digital Earths Global Hackathon Data Catalog may also be useful to this ML community, we can transfer and store them too.

Hi Sam, the dataset is definitely valuable. The co-chairs have approved your request and mentioned that, if possible, you could make it more generally available, for example on the THREDDS server.