Experiment Proposal: Processing Global km-scale Hackathon Data

Experiment title :bell:: Global km-scale Hackathon Data

Summary :bell::

In May 2025, 21st Century Weather, together with NCI and ACCESS-NRI, took part in the Global km-Scale Hackathon (km-Scale Hackathon | HK25 homepage). Several simulations covering 2020-03-01 to 2021-03-01 were run by different global teams. We transferred three models to Gadi: ICON Global (2.5 km), UM Global (5 km & 10 km), and SCREAM Global (3.25 km / 128 levels), totalling ~100 TB in Zarr format.
The data is located in project qx55, where it will be stored long-term. Before it is ready for long-term storage, however, significant processing needs to be carried out to reduce inode usage (better chunking/sharding) and storage usage (higher compression). SUs are needed for this work, and some storage is required as a staging ground.
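To illustrate why rechunking reduces inode usage: in a Zarr (v2) store, each chunk is written as a separate file, so the file count is the product over dimensions of ceil(dim_size / chunk_size). A minimal sketch of that arithmetic, using hypothetical array shapes and chunk sizes (not the actual HK25 layouts):

```python
import math

# Hypothetical example (not the actual HK25 array shapes): a year of hourly
# global fields stored as one Zarr array.
shape = (8760, 3600, 7200)  # (time, lat, lon)

def n_chunk_files(shape, chunks):
    """Each Zarr v2 chunk is one file on disk, i.e. one inode."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

small = n_chunk_files(shape, (1, 900, 900))     # per-timestep, fine chunks
large = n_chunk_files(shape, (24, 1800, 1800))  # daily, coarser chunks

print(small, large, small // large)  # 280320 2920 96
```

Coarsening the chunks by a modest factor in each dimension cuts the file count roughly a hundredfold here; Zarr v3 sharding goes further by packing many chunks into a single file.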

Scientific motivation:

There were many different science goals for this data; see the km-Scale Hackathon | HK25 homepage.
For this ML group, the motivation is that the data was very hard to obtain: transferring it took a month of continuous transfers. It is very high resolution and global, and could be used to train an AI emulator for downscaling. Several interesting data-driven models have also been trained on datasets with this type of grid and show good performance.

Experiment Name :bell::
People :bell:: Sam Green @sam.green
Model: UM, ICON, SCREAM
Configuration:
Initial conditions:
Run plan:
Simulation details: km-Scale Hackathon | HK25 homepage
Total KSUs required :bell:: 50kSU
Total storage required :bell:: 50TB
Storage lifetime :bell:: High chance just this quarter; Low chance the next quarter too.
Long term data plan :bell:: qx55 with further application to NCI or ACCESS-NRI to host qx55
Outputs: UM/ICON/SCREAM Zarr datasets with optimised storage/inode usage.
Restarts:
Related articles:

Analysis:

GitHub - digital-earths-global-hackathon/tools: Pre-processing and preparation of data/simulations. Contains notebooks showing how to analyse the data.
Digital Earths Global Hackathon Data Catalog is a catalogue containing all datasets that were available during the Hackathon.

Conclusion:

The Hackathon was very successful and has produced large datasets that benefit the Australian community for both simulation analysis and training ML models. If there are other models/datasets in the Digital Earths Global Hackathon Data Catalog that may also be useful to this ML community, we can transfer and store them too.


Hi Sam, the dataset is definitely valuable. The co-chairs have approved your request and mentioned that, if possible, you could make it more generally available, for example on the THREDDS server.
