Utilising GitHub effectively for experiments and configurations

payu is the run-tool used for the access-om2, access-esm and access-om3 models.

payu uses git to track changes to experiment configurations and the ACCESS-NRI GitHub organisation to distribute model configurations.

Typically a user might clone a configuration to gadi, modify as required and run their own experiment. Ideally they would also git push their experiment control repository back to GitHub where it is discoverable by others.

Discovery is good for open science: it allows others to build on what was done before, and it is also a crucial part of accountability and replicability. Access to the details of how the experiment was run is also a vital part of experiment provenance.

Some questions:

Have you put your experiments on GitHub?

If not, what prevents you from putting your experiment on GitHub?

What would be required to make this easier?

Do you see value in making the effort to do this? Is there a good way for others to find your experiment repositories?

Hi Aidan,
I wanted to get back to you about this (hectic last few weeks). My thoughts about GitHub for storing experiments:

  • I initially tried to share some experiment configurations with another user using GitHub. I quickly ran into the filesize limit. To properly share all the inputs needed for a complete experiment package requires several GB.
  • I tried also using GitHub Large File Storage to store the “big” files. This also didn’t work because the “Large” file limit is only 1 GB, and if you hit that limit your repository gets frozen and it’s frustrating.
  • For sharing a complete experiment package (especially at the end when publishing results), I intend to use Zenodo to publish and share input configurations. I have used Zenodo before and it’s great, but once you put something there it’s meant to be there forever, so I don’t use it for things that are halfway done. Granted, you can update a Zenodo repo to a later version, but still I prefer not to “publish” in that way until I’m more or less dealing with a final version of sorts.
  • My understanding of payu is that it is designed to store just the “run” directory on GitHub. I could potentially adopt an approach like that. But, this approach still relies on saving a lot of the really important stuff on a gdata directory.
  • Case in point here: I initialised my first ACCESS-ESM1.5 experiments from the publicly available github repository for the pre-industrial configuration. That worked great in early 2023. But, in early 2024, when I instructed another user to follow the same procedure, the experiment didn’t work anymore. Something had changed on the gdata repo… I believe it was the atmosphere restart file, and it crashed. So we reverted to using my “old” files that I had saved on gdata, and that worked again. I’m sure this is something that ACCESS-NRI is trying to correct with its new release… I just mention this as an example of GitHub being insufficient to properly save and share a full experiment configuration.
  • In general, I still experience some friction using GitHub. I think it’s awesome for sharing code, and I’m really trying to use it in that fashion. (I have shared examples of such on the Hive.) I also think that other researchers out there experience similar friction with basic things like: (a) How to create a repository, (b) how to branch a repository, (c) when should I branch / fork / start a new repository?

So, the upshot is that I will try to put my run directories into GitHub repositories so that my control files, config files etc. are stored properly for sharing with others. But it would make a big difference if we could actually store large files in the GitHub Large File Storage. I’m curious to know how ACCESS-NRI deals with the large file problem.

Cheers, Dave


Thanks @dkhutch that is very helpful and informative.

Yes, that is correct. payu is designed to only save the text-based configuration files in the git repo (the experiment configuration repository).
As you outline, GitHub has limits on the size of files that can be uploaded to a repository, even with LFS. git is designed to work with text-based files that can be diff’ed; that is what it works best with, not binary files.

Agreed. It is necessary but insufficient. We don’t have a good way of portably wrapping and referring to binary input files. What we do have is a complete history of all the files used in an experiment through the automatically generated and tracked manifests.
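As a rough illustration of how recorded hashes can catch an input that has silently changed (like the restart file case above), here is a minimal Python sketch. It assumes you have already pulled a mapping of file path to MD5 hash out of a manifest; that mapping, and the function names `md5_of` and `changed_files`, are hypothetical and not part of payu.

```python
import hashlib
from pathlib import Path


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to cope with large inputs."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(recorded: dict[str, str]) -> list[str]:
    """List files that are missing or whose current MD5 differs from the recorded one.

    `recorded` maps a file path to the hash stored when the experiment was run
    (assumed to have been extracted from a manifest beforehand).
    """
    changed = []
    for path, expected in recorded.items():
        if not Path(path).is_file() or md5_of(path) != expected:
            changed.append(path)
    return changed
```

An empty list from `changed_files` would mean every recorded input is still present and unchanged on disk.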

We would like to have a better solution for sharing experiment artefacts (inputs etc). If you are sharing within the same system, e.g. gadi, it should already be straightforward, but we could create some tooling to make it even easier, e.g. to copy all the files from a manifest into another location, or to make a tarfile of all those files.
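To make the tarfile idea concrete, here is a minimal Python sketch, not existing payu tooling. It assumes each manifest entry carries a `fullpath:` line giving the file's location on disk; that layout and the function names `fullpaths_from_manifest` and `archive_files` are assumptions for illustration.

```python
import tarfile
from pathlib import Path


def fullpaths_from_manifest(text: str) -> list[str]:
    """Collect every `fullpath:` value from manifest text (the layout is an assumption)."""
    paths = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("fullpath:"):
            # Split on the first colon only, so paths containing colons survive.
            paths.append(stripped.split(":", 1)[1].strip())
    return paths


def archive_files(paths: list[str], archive: str) -> list[str]:
    """Add each existing file to a tar archive and return the paths that were missing."""
    missing = []
    with tarfile.open(archive, "w") as tar:
        for path in paths:
            if Path(path).is_file():
                tar.add(path)
            else:
                missing.append(path)
    return missing
```

Returning the missing paths rather than raising means one absent file does not abort the whole archive, and the caller can see what could not be packaged.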

These are great questions. We’ll try and make a topic to address some of these questions and add to it if there are other follow up questions.

Hi @Aidan,
It’s good to know about the hash system that payu auto-generates. Here’s a scenario for you:
Let’s say I have a collection of 4 different experiments which are all closely related but with minor perturbations made between each. E.g. I have a Miocene topography, and I’m running with 4 different levels of CO2. If I want to “preserve” these on GitHub, I’m guessing the easiest way is to simply make 4 different GitHub repositories, for these slight variations in forcing? Maybe it’s possible to have a single repo with 4 different branches so they are “grouped” together… but I feel like trying to do that will create more headaches than it’s worth for an amateur GitHub user like myself.

Should I just make 4 different repos? And then worry about merging and grouping stuff later?

Cheers, Dave

Well, I have good news: payu now explicitly supports branching, which makes just such a scenario relatively easy.

I wrote a tutorial for ACCESS-OM2 but it covers your use case:

In particular the section B. Linked Experiments describes exactly what you want to do: create a series of linked experiments all branching from the same point in the experiment (same commit).

You’ll need to use the version of payu in vk83 to access these features. But all are welcome, come and join!
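For a feel of what "all branching from the same commit" looks like for the four CO2 levels, here is a small Python sketch that only builds the plain git commands and does not run them. It uses ordinary `git clone` and `git checkout -b` as a stand-in; payu's own branching support may do more than this, so follow the tutorial for real runs. The repository URL, commit hash, branch names, and the convention of one directory per branch are all hypothetical.

```python
def linked_experiment_commands(
    repo_url: str, base: str, branches: list[str]
) -> list[list[str]]:
    """Build git commands: one clone per experiment, each on its own branch from `base`."""
    commands = []
    for branch in branches:
        directory = branch  # hypothetical convention: directory named after the branch
        commands.append(["git", "clone", repo_url, directory])
        # Create the experiment branch starting from the shared base commit.
        commands.append(["git", "-C", directory, "checkout", "-b", branch, base])
    return commands


if __name__ == "__main__":
    # Hypothetical repository, base commit, and branch names for four CO2 levels.
    plan = linked_experiment_commands(
        "git@github.com:example-user/miocene-esm.git",
        "a1b2c3d",
        ["miocene-co2-280", "miocene-co2-400", "miocene-co2-560", "miocene-co2-840"],
    )
    for argv in plan:
        print(" ".join(argv))
```

Because every branch starts from the same commit, comparing any two branches shows only the changes made for that perturbation.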

Ok I just had a quick read of the linked tutorial. It seems that if you want to run branched experiments simultaneously, then you need to clone them into separate control directories. I think I’ll avoid that. Would rather just make separate repos for my purposes.

I read this part:

Things to avoid

Don’t clone the access-om2-configs repo without renaming it.
Don’t then checkout an experiment branch and payu run.

Ahaha I’ve been doing this for ages (with the ESM pre-industrial configuration). But never mind, I will try to give my repos appropriate names from now on.

There is nothing to stop you keeping related experiments in the same repo, just make cloned copies in multiple directories. It’s effectively the same approach you’re using, but having related experiments in the same repo makes it easier to compare and see the commonalities and differences.

Note that this advice is specifically for the ACCESS-NRI model config repos which have names like access-om2-configs and access-esm1.5-configs.

So you don’t want to clone that repo and just create a branch to run from, as the experiment will be called access-esm1.5-configs-<branchname>-<jobid> which is not good.

So I don’t think that applies to your current use case.
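To show why the directory name matters, here is a tiny Python sketch that composes a name following the pattern quoted above. It is only an illustration of that pattern, not payu's actual code; the function name, paths, branch, and job ID are hypothetical.

```python
from pathlib import Path


def experiment_name(control_dir: str, branch: str, job_id: str) -> str:
    """Compose a name as <control directory name>-<branchname>-<jobid>."""
    return f"{Path(control_dir).name}-{branch}-{job_id}"


if __name__ == "__main__":
    # Cloned without renaming: the generic repo name ends up in the experiment name.
    print(experiment_name("/home/user/access-esm1.5-configs", "co2-560", "12345"))
    # Cloned into a descriptive directory: the name says what the experiment is.
    print(experiment_name("/home/user/miocene-esm", "co2-560", "12345"))
```

The first call prints `access-esm1.5-configs-co2-560-12345` and the second prints `miocene-esm-co2-560-12345`.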

Ok thanks Aidan. I will try.
