Utilising GitHub effectively for experiments and configurations

payu is the run-tool used for the access-om2, access-esm and access-om3 models.

payu uses git to track changes to experiment configurations and the ACCESS-NRI GitHub to distribute model configurations.

Typically a user might clone a configuration to Gadi, modify it as required and run their own experiment. Ideally they would also git push their experiment control repository back to GitHub, where it is discoverable by others.

Discovery is good for open science: it allows others to build on what was done before, and it is also a crucial part of accountability and replicability. Access to the details of how the experiment was run is also a vital part of experiment provenance.

Some questions:

Have you put your experiments on GitHub?

If not, what prevents you from putting your experiment on GitHub?

What would be required to make this easier?

Do you see value in making the effort to do this? Is there a good way for others to find your experiment repositories?

Hi Aidan,
I wanted to get back to you about this (hectic last few weeks). My thoughts about GitHub for storing experiments:

  • I initially tried to share some experiment configurations with another user using GitHub. I quickly ran into the file size limit. Properly sharing all the inputs needed for a complete experiment package requires several GB.
  • I tried also using GitHub Large File Storage to store the “big” files. This also didn’t work because the “Large” file limit is only 1 GB, and if you hit that limit your repository gets frozen and it’s frustrating.
  • For sharing a complete experiment package (especially at the end when publishing results), I intend to use Zenodo to publish and share input configurations. I have used Zenodo before and it’s great, but once you put something there it’s meant to be there forever, so I don’t use it for things that are halfway done. Granted, you can update a Zenodo repo to a later version, but still I prefer not to “publish” in that way until I’m more or less dealing with a final version of sorts.
  • My understanding of payu is that it is designed to store just the “run” directory on GitHub. I could potentially adopt an approach like that. But, this approach still relies on saving a lot of the really important stuff on a gdata directory.
  • Case in point here: I initialised my first ACCESS-ESM1.5 experiments from the publicly available GitHub repository for the pre-industrial configuration. That worked great in early 2023. But in early 2024, when I instructed another user to follow the same procedure, the experiment didn’t work anymore. Something had changed on the gdata repo… I believe it was the atmosphere restart file, and it crashed. So we reverted to using my “old” files that I had saved on gdata, and that worked again. I’m sure this is something that ACCESS-NRI is trying to correct with its new release… I just mention this as an example of GitHub being insufficient to properly save and share a full experiment configuration.
  • In general, I still experience some friction using GitHub. I think it’s awesome for sharing code, and I’m really trying to use it in that fashion. (I have shared examples of such on the Hive.) I also think that other researchers out there experience similar friction with basic things like: (a) how to create a repository, (b) how to branch a repository, (c) when to branch, fork or start a new repository.
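
The restart-file example above is the kind of silent input change that a checksum record kept alongside the control files could catch. A minimal sketch in Python follows, assuming the inputs are ordinary files readable from where the script runs. The function names and the manifest file are hypothetical and are not payu's own mechanism.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(input_paths, manifest_path):
    """Record a checksum for every input file in a small JSON manifest."""
    manifest = {str(p): sha256_of(p) for p in input_paths}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


def changed_inputs(manifest_path):
    """Return the recorded paths whose file is now missing or has changed."""
    manifest = json.loads(Path(manifest_path).read_text())
    changed = []
    for name, recorded in manifest.items():
        path = Path(name)
        if not path.is_file() or sha256_of(path) != recorded:
            changed.append(name)
    return sorted(changed)
```

The manifest is a small text file, so it can be committed with the run directory even when the inputs themselves are far too large for GitHub. Calling `changed_inputs` before starting a run would then flag a restart file that no longer matches what the original experiment used.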

So, the upshot is that I will try to put my run directories into GitHub repositories so that my control files, config files, etc. are stored properly for sharing with others. But it would make a big difference if we could actually store large files in the GitHub Large File Storage. I’m curious to know how ACCESS-NRI deals with the large file problem.
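
Before pushing a run directory, it may help to check that nothing in it is too large for GitHub to accept. The sketch below is a rough illustration in Python, assuming GitHub's 100 MiB per-file limit for ordinary (non-LFS) files. The function name is hypothetical, and symlinks are skipped on the assumption that large inputs are linked in from gdata rather than copied.

```python
from pathlib import Path

# Assumed limit: GitHub rejects ordinary files larger than 100 MiB.
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(directory, limit_bytes=GITHUB_FILE_LIMIT_BYTES):
    """Return sorted (relative path, size) pairs for files above limit_bytes."""
    root = Path(directory)
    found = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        # Ignore git's own metadata and anything that is only a symlink.
        if ".git" in rel.parts or path.is_symlink():
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                found.append((rel.as_posix(), size))
    return sorted(found)
```

Running `oversized_files(".")` from the top of a run directory lists the files that would need to live elsewhere, such as gdata or Zenodo, with only a reference to them kept in the repository.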

Cheers, Dave
