Running Model Experiments with payu and git

Introduction

This material is designed for a 3 hour (2x1.5 hr sessions) payu tutorial as part of the ACCESS Workshop 2024.

Goals

  • Learn how to obtain and run an experiment with an ACCESS supported model configuration using the payu scientific workflow management tool.
  • Become familiar with creating new experiments by altering existing configurations.
  • Learn how to curate a multi-experiment git repository.
  • Develop skills in sharing configurations (and experiments) with GitHub.
  • Understand experiment provenance, and how payu and git together enable detailed experiment provenance.

Requirements

Terminology

Climate modelling is complicated. We need a shared vocabulary to enable a shared understanding. In this training these are some of the terms we’ll use and what they mean:

Term                 Meaning
model                A combination of model components, compiled and deployed by ACCESS-NRI
model component      One discrete component of a multi-component system, e.g. the MOM5 ocean model
model configuration  A git repository containing the configuration files required to run a model
experiment           A specific realisation (series of runs) of a model configuration

For example, the ACCESS-ESM1.5 model is versioned, built and deployed from the ACCESS-ESM1.5 repository. It has four major model components (see the Hive Docs for details), and two supported model configurations.

Workflow management

Payu is a workflow management tool for running numerical models in supercomputing environments, and it is the tool used to run the ACCESS models covered in this tutorial.

Payu is written in Python. See the documentation or the GitHub repository for more information.

The latest version is 1.1.5, and that is the minimum version required for this tutorial. ACCESS-NRI provides supported conda environments for payu, which also contain other dependencies and tools required to run ACCESS-NRI supported models. These can be accessed via the module system:

module use /g/data/vk83/modules
module load payu/1.1.5

This environment must be loaded for all the subsequent steps in this tutorial.

Details of the latest version of payu are available in the release notes which are updated when new supported versions of payu are released.

Experiment provenance

Detailed provenance allows researchers to better understand, reproduce, and validate their simulation experiments.

Experiment provenance for computer simulations involves documenting the entire lifecycle of an experiment to ensure transparency, reproducibility, and validation.

Some key components:

  • Model Information: Details about the model and model components used, source code versions, parameters, and configuration.
  • Input Data: Information on the data fed into the simulation, such as initial conditions, forcing data, and preprocessing steps.
  • Execution Environment: Specifications of the hardware and software environment, including operating systems, libraries, and dependencies.
  • Model Runs: Records of each run, including timestamps, input parameters, and any variations between runs.
  • Output Data: Documentation of the results generated by the simulation, including formats, storage locations, and any post-processing steps.
  • Authorship and Contributions: Information about the individuals who conducted the simulation, their roles, and contributions.

git and payu

Git is a distributed version control system that is used “under the hood” by payu to track experiment provenance.

Git was designed to enable fast and efficient version control of computer source code, which is typically a directory tree containing text files.

A payu model configuration is a collection of configuration files that control how a model and its components are run, i.e. a directory tree containing text files, which makes it a good fit for git.

Some key features of git that are particularly relevant for use with payu and experiment provenance:

  • Commit History: Track changes with detailed commit messages, timestamps, and author information.
  • Security: cryptographic methods used to ensure integrity of the version history.
  • Branching and Merging: Easily create, manage, and merge branches for parallel development.
  • Distributed Version Control: Every user has a complete copy of the repository, including its full history.
  • Collaboration: Supports multiple users working on the same project simultaneously.

Payu utilises manifests (formatted text files) to uniquely identify all executables, model inputs and restart files. Payu adds these manifests to the files that are tracked by git.
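
As an illustration only (the field names and layout below are indicative, not exact payu output), a manifest entry pairs the path a file has within the work directory with its full path on disk and content hashes:

work/INPUT/example_input.nc:        # hypothetical input file
  fullpath: /g/data/example_project/inputs/example_input.nc
  hashes:
    binhash: 1a2b3c...
    md5: 4d5e6f...

Because the manifests are tracked by git alongside the configuration, any change to an executable, input or restart file shows up in the experiment history.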

The combination of payu and git to track and version experimental configurations satisfies many provenance requirements: authorship, model runs, inputs and model information.

Payu uses git branching to support multiple experiments in the same repository.

Support for distributed version control and collaboration allows researchers to easily and effectively share their work. This is good for researchers and good for science.

Running a (long) experiment

The first step of the tutorial is to start a standard run of a model configuration so that it has time to finish.

ssh into gadi. Make sure you’ve loaded payu/1.1.5 (see above).

Create a directory for all the training material in your home directory and change into it:

mkdir ~/payu-training/
cd ~/payu-training/

Clone experiment

Exercise: Clone a released configuration to a new experiment directory.

Choose either 1 deg ACCESS-OM2 RYF or pre-industrial ACCESS-ESM1.5 and clone to a new experiment directory and branch called control (see the ACCESS-Hive Docs for OM2 or the same for ESM1.5 for instructions on how to do this).

Answer

ACCESS-OM2:

payu clone -b control -B release-1deg_jra55_ryf https://github.com/ACCESS-NRI/access-om2-configs 1deg_jra55_ryf-training

ACCESS-ESM1.5:

payu clone -b control -B release-preindustrial+concentrations https://github.com/ACCESS-NRI/access-esm1.5-configs preindustrial+concentrations-training

Run experiment

Exercise: Change project used to run the model to nf33 and do a single run

Changing the project code used to run a model is covered in the Hive Docs for ESM1.5 and OM2 (the procedure is essentially identical for both).
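
The change itself is a single top-level entry in config.yaml. A minimal sketch of the relevant line:

# PBS project code used to submit (and charge) the run
project: nf33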

Running a model is also covered in Hive Docs:

:bulb: Tip
It is always a good idea to run payu setup after cloning a new configuration or making substantial changes. This runs just the setup phase, which tests access to all the paths required to set up a model run, and updates the manifests which payu uses to determine which storage mounts need to be included in a PBS submission. However, you then need to either run payu sweep before running, or use the -f option when running.

Answer
payu setup
payu sweep
payu run

This will take over an hour, so we will move on and revisit it later in the tutorial.

Run another (short) experiment

While the long experiment is running, we will use the time to clone a new experiment and change the model run length to a shorter run time. We will also look into configuring syncing and restart pruning with payu.

Exercise: Clone a new experiment

First, create a new clone of the same experiment as above with a different directory and branch name. This must be a separate clone into a different directory because we’ll be running multiple experiments simultaneously, and only one experiment can be run at a time in a given control directory.

  1. Change to the training directory ~/payu-training.
  2. Create a new clone of the experiment with a different directory and branch name.
  3. Change to the new experiment control directory.

Solution

  1. Change to the training directory:

cd ~/payu-training

  2. Create the new clone. This could look something like the following:

ACCESS-OM2

payu clone -b sync-and-restart-pruning-expt -B release-1deg_jra55_ryf https://github.com/ACCESS-NRI/access-om2-configs 1deg_jra55_ryf-training-2

ACCESS-ESM1.5

payu clone -b sync-and-restart-pruning-expt -B release-preindustrial+concentrations https://github.com/ACCESS-NRI/access-esm1.5-configs preindustrial+concentrations-training-2

Where -b is the new branch name, and the name at the end of the command is the new directory name.

  3. Using the above solution examples:
cd 1deg_jra55_ryf-training-2

or

cd preindustrial+concentrations-training-2

Editing config.yaml file (optional)

In this section, we will modify the config.yaml file in the control directory. This payu configuration file controls the general model configuration. Editing this file can be done via your favourite editor, for example, vim or vscode. If you are new to editing files using the command-line, an option could be to use Nano, as it keeps a menu of possible command options at the bottom of the editor. This is an optional section, feel free to skip if you are comfortable editing files from the terminal on gadi.

Exercise (Optional): Using Nano to edit `config.yaml` files
  1. To open config.yaml in Nano, run the following command:
nano config.yaml
  2. Navigate through the file using arrow keys, and start typing to insert text.
  3. To close the editor, press Ctrl + X. When there are changes, Nano will prompt you if you want to save or discard changes. To save the changes, press Y and Enter to confirm and save.

Exercise: Configure run length

Changing the run length requires opening configuration files in a text editor, making changes and saving those changes.

ACCESS-OM2

Using the Hive Docs guide on how to change run length for ACCESS-OM2, change the run length to 1 month.

Solution

The run length is controlled by the restart_period field in the &date_manager_nml section of the accessom2.nml file:

&date_manager_nml
    forcing_start_date = '1958-01-01T00:00:00'
    forcing_end_date = '2019-01-01T00:00:00'
    ! Runtime for a single segment/job/submit, format is years, months, seconds,
    ! two of which must be zero.
    restart_period = 5, 0, 0
&end
  1. Open accessom2.nml in a text editor. If using Nano, this will be:
nano accessom2.nml
  2. Change the run length to 1 month:
restart_period = 0, 1, 0
  3. Close and save the file.

ACCESS-ESM1.5

Using the Hive Docs guide on how to change run length for ACCESS-ESM1.5 and running less than a year, change the run length to 1 month.

Solution

The length of an ACCESS-ESM1.5 run is controlled by the runtime settings in the config.yaml file:

    runtime:
        years: 1
        months: 0
        days: 0

The run length for ACCESS-ESM1.5 experiments should usually be left at 1 year to avoid errors. However, for this exercise we use a shorter run length so the model does not take as long to run. This requires an additional change to the sea ice model configuration so that restart files are produced at a monthly frequency.

  1. Open config.yaml in a text editor. If using Nano,
nano config.yaml
  2. Change the run length to 1 month:
    runtime:
        years: 0
        months: 1
        days: 0
  3. Close and save the file.
  4. Open the ice/cice_in.nml configuration file.
  5. Change the dumpfreq = 'y' setting to dumpfreq = 'm' (see the sketch after this list).
  6. Close and save the file.
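
After the edit, the relevant line in ice/cice_in.nml should look like the following (only the single setting is shown, not the surrounding namelist group):

dumpfreq = 'm'   ! write sea-ice restart (dump) files monthly rather than yearly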

Exercise: Commit changes

When making changes to a model configuration, it is always a good idea to git commit them with an informative message explaining why the change was made.

Make a git commit with the message “Reduced run time to 1 month for testing”

Answer
git commit -a -m 'Reduced run time to 1 month for testing'

Configuring Restart Pruning

Restart files are written for every run so that subsequent runs can start from a previously saved model state. These restart files can occupy a significant amount of disk space. By default, payu keeps the restart files for every fifth run and “prunes” (deletes) the rest. For example, if a model had run 11 times, the restarts in the archive directory would be:

restart000
restart005
restart010
More detail

Intermediate restarts

Intermediate restarts are retained, and are only deleted after the next permanently archived restart file has been produced.

So when the model has been run 15 times, the restarts in the archive directory would be:

restart000
restart005
restart010
restart011
restart012
restart013
restart014

After the 16th model run, these intermediate restarts are deleted because the next permanently archived checkpoint, restart015, has been produced.

restart000
restart005
restart010
restart015

restart_freq

The rate at which restart files are pruned is controlled by restart_freq in config.yaml.

This can be either an integer or a date-based frequency. For example, to save all restart files, the setting in config.yaml would be:

restart_freq: 1

Using a date-based restart frequency is useful because it makes restart pruning independent of the model run length. If the model run length is modified during the course of an experiment, the frequency with which restarts are pruned is unaffected.
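
For example, using the same date-offset style as the exercise below (MS for month-start, YS for year-start), keeping one restart per model year would look like:

# Permanently keep the first restart of each year, whatever the run length
restart_freq: 1YS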

This is covered in the Hive Docs for ESM and OM2.

Exercise: Set restart pruning frequency

We will set a frequency small enough that we can see changes to the archive over a short run.

Edit config.yaml to change restart_freq so that the first restart of every 2-month period is kept permanently.

Solution

The config.yaml should contain:

restart_freq: 2MS

:bulb: Hint: To always keep the last N restarts, in addition to the permanently saved restarts determined by restart_freq, you can set restart_history in config.yaml. So restart_history: 5 keeps the last 5 restarts.
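
Combining both options, a sketch of the relevant config.yaml lines might look like:

restart_freq: 2MS      # permanently keep the first restart of every 2nd month
restart_history: 5     # additionally keep the 5 most recent restarts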

Configuring sync

By default, the payu laboratory and archive directories are in /scratch storage. The scratch filesystem is temporary storage where files not accessed for 100 days are automatically removed. For this reason, model outputs often need to be moved to /g/data/ for long-term storage.

Payu has built-in syncing support which uses rsync commands under the hood and runs in a separate PBS job. If automatic syncing is enabled, this job is submitted after the collation PBS job (if collation is enabled), or otherwise after the payu archive step in the run PBS job. For both the ACCESS-OM2 and ACCESS-ESM1.5 configurations, it is submitted after the collation has completed.

Sync options

As there are several configuration options for syncing, it has its own subsection in config.yaml under sync. The main options are:

  • enable (Default: False) - Controls whether or not a sync job is submitted automatically
  • path - Destination path to sync archive outputs to. NOTE: This must be a unique absolute path for your experiment, otherwise, outputs will be overwritten.
  • restarts (Default: False) - Sync permanently archived restarts determined by restart_freq.

Sometimes it’s useful to remove files from the local archive after they have been successfully synced, in order to save space. There are two levels of options:

  • remove_local_files (Default: False) - This deletes files after they have been synced, but will leave behind empty directories and any files that were excluded from the sync commands.
  • remove_local_dirs (Default: False) - This removes output and restart directories.

Neither of the above options will delete files or directories from the latest output. If restarts have been synced, the last permanently saved restart (determined by restart_freq) and any subsequent restarts will also not be deleted.

Because sync runs as a separate PBS job, it has several configurable PBS settings. For example, queue controls which PBS queue it runs on (copyq by default). If there is additional post-processing to run before syncing to a remote archive, there is a sync user-script option. This runs a script or command at the start of the sync PBS job, before any rsync commands are run.
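
As a sketch only (copyq is already the default, so this is purely illustrative), overriding the queue used for the sync job would look something like:

sync:
    enable: true
    queue: copyq   # PBS queue used for the standalone sync job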

A full list of sync options can be found under Post-processing in the payu documentation.

Exercise: Set sync parameters

Enable sync in config.yaml and set the remote archive directory to

/scratch/nf33/<replace-with-user-id>/tmp/test-sync-experiment-archive

where <replace-with-user-id> is your NCI username. Restarts should also be synced.

Solution

The sync subsection in config.yaml should look similar to the following:

sync:
    enable: true
    restarts: true
    path: /scratch/nf33/<replace-with-user-id>/tmp/test-sync-experiment-archive

Run experiment

To obtain several output and restart directories, we will need to run the model a number of times.

Exercise: Run experiment 6 times

See the ESM and OM2 Hive docs for information on how to run the models multiple times.

Answer
payu setup
payu sweep
payu run -n 6

  • payu setup checks that restart_freq is set to a valid value.
  • payu sweep removes the work directory generated by payu setup.
  • The -n flag sets the number of runs to be performed.

:exclamation: Note: The above will run 6 model execution jobs in 6 separate PBS job submissions for both ACCESS-OM2 and ACCESS-ESM1.5 configurations. The number of runs per submission can be modified by setting runspersub in config.yaml, which defines the maximum number of runs for each payu submission.
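
For example, a sketch of grouping runs into fewer submissions (the value 3 is arbitrary):

# In config.yaml: perform up to 3 runs per PBS submission,
# so payu run -n 6 is split across 2 submissions instead of 6
runspersub: 3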

PBS error and output log files

The control directory contains all the PBS logs for each job. After 6 runs of the ACCESS-ESM1.5 pre-industrial configuration, the control directory looks like:

$ ls # ls ~/payu-training/preindustrial+concentrations-training-2
archive        pre-industria_c.e123895132  pre-industria_c.o123897744  pre-industrial.o123894927   pre-industria_s.e123897850  README.md                        UM_conversion_job.sh.o123895276
atmosphere     pre-industria_c.e123895450  pre-industria_c.o123898409  pre-industrial.o123895133   pre-industria_s.e123898819  scripts                          UM_conversion_job.sh.o123896556
config.yaml    pre-industria_c.e123896742  pre-industria_c.o123899069  pre-industrial.o123895452   pre-industria_s.e123899300  testing                          UM_conversion_job.sh.o123896984
coupler        pre-industria_c.e123897744  pre-industrial.e123894927   pre-industrial.o123896743   pre-industria_s.o123895313  UM_conversion_job.sh.e123895276  UM_conversion_job.sh.o123897849
ice            pre-industria_c.e123898409  pre-industrial.e123895133   pre-industrial.o123897745   pre-industria_s.o123896557  UM_conversion_job.sh.e123896556  UM_conversion_job.sh.o123898817
LICENSE        pre-industria_c.e123899069  pre-industrial.e123895452   pre-industrial.o123898410   pre-industria_s.o123896987  UM_conversion_job.sh.e123896984  UM_conversion_job.sh.o123899298
manifests      pre-industria_c.o123895132  pre-industrial.e123896743   pre-industria_s.e123895313  pre-industria_s.o123897850  UM_conversion_job.sh.e123897849
metadata.yaml  pre-industria_c.o123895450  pre-industrial.e123897745   pre-industria_s.e123896557  pre-industria_s.o123898819  UM_conversion_job.sh.e123898817
ocean          pre-industria_c.o123896742  pre-industrial.e123898410   pre-industria_s.e123896987  pre-industria_s.o123899300  UM_conversion_job.sh.e123899298

Each PBS job has standard output and error logs, which are written to <jobname>.o<job-ID> and <jobname>.e<job-ID> respectively.

Using the above example, there are 4 types of jobs run:

  • pre-industrial: The main model execution job where init, setup, run, and archive stages run. At the end of this job, it submits the collate job.
  • pre-industria_c: Runs the collation. Once the collate stage has run, it submits the postscript and sync jobs. Note: _c marks payu collate job logs.
  • pre-industria_s: Runs the syncing to remote archive. Note: _s marks payu sync job logs.
  • UM_conversion_job.sh: This is the postscript job - a user-defined post-processing script.

Exercise: Find the Service Units and Walltime for a single run

Monitor your run: see Hive Docs for ESM and OM2 on how to do this.

When your first job completes, examine the PBS output log file and find the Service Units and Walltime used.

:bulb: Hint: the commands cat and less are useful ways to view a text file

Answer

For example (your PBS job IDs will be different):

cat pre-industrial.o123894927
======================================================================================
                  Resource Usage on 2024-08-29 10:54:22:
   Job Id:             123894927.gadi-pbs
   Project:            tm70
   Exit Status:        0
   Service Units:      100.05
   NCPUs Requested:    384                    NCPUs Used: 384
                                           CPU Time Used: 44:51:57
   Memory Requested:   1.5TB                 Memory Used: 152.35GB
   Walltime requested: 02:30:00            Walltime Used: 00:07:49
   JobFS requested:    800.0MB                JobFS used: 8.16MB
======================================================================================

So service units = 100 and Walltime used is 7m49s.

Local Archive

After the 6 sequential runs, we expect the archive to look like the following:

$ ls archive/ # List directories under the archive symlink in the control directory
metadata.yaml  output000  output001  output002  output003  output004  output005  pbs_logs  restart000  restart002  restart004  restart005

There should be 6 output directories, with restarts pruned at 2-month intervals. Note that the intermediate restart restart005 will be kept until there is a restart with a date-time later than the beginning of the 2nd month after restart004.

Remote Archive

After 6 sequential runs with syncing enabled, the remote archive should contain the following:

$ ls /scratch/nf33/<replace-with-user-id>/tmp/test-sync-experiment-archive/
git-runlog  metadata.yaml  output000  output001  output002  output003  output004  pbs_logs

:exclamation: Note: If using ACCESS-OM2 configuration, the latest output, output005, will also be synced.

Exercise: Confirm local and remote archive contain correct files

Confirm your model run has completed. List the files in your local and remote archive and check they contain the correct files.

Postscript and Sync

The ACCESS-ESM1.5 experiment runs have an additional PBS post-processing job. The PBS logs for these jobs start with UM_conversion_job.sh. This job converts atmospheric outputs to NetCDF format. It is set in config.yaml as:

postscript: -v PAYU_CURRENT_OUTPUT_DIR,PROJECT  -lstorage=${PBS_NCI_STORAGE} ./scripts/NetCDF-conversion/UM_conversion_job.sh

The postscript job is submitted at the same time as the sync job. Currently, when a postscript is configured, the latest outputs and restarts are not automatically synced, because there is no guarantee that the postscript job will have finished before the sync job starts. The sync job will, however, sync every outputNNN directory where NNN is less than the current run counter, i.e. all previous outputs.

A future improvement to payu's syncing support could be to add dependency logic so that the sync job waits for the postscript job to finish, allowing the latest output to be synced automatically. So keep an eye out for future payu releases and updates!

In the meantime, payu sync can be run manually at the end of an experiment to sync the final outputs and restart files to a remote archive.

Exercise: Manually run payu sync

With sync configured, you can manually submit sync jobs using the payu sync command. Using payu sync will sync all output directories.

In this exercise, we will modify the sync subsection in config.yaml. Wait until all jobs from the previous exercise have completed before continuing.

Set remove_local_dirs to true to enable deletion of synced output and restart directories from the local archive, then run payu sweep to copy log files to the archive, followed by payu sync.

Answer

The sync section in config.yaml should now look something like:

sync:
  enable: true
  path: /scratch/nf33/<your-user-id>/tmp/test-sync-experiment-archive
  restarts: true
  remove_local_dirs: true

Then run:

payu sweep
payu sync

Once the sync job has been completed, check the remote archive. It should now contain all the outputs and all the permanently saved restart directories:

$ ls /scratch/nf33/<your-user-id>/tmp/test-sync-experiment-archive/
git-runlog  metadata.yaml  output000  output001  output002  output003  output004  output005  pbs_logs  restart000  restart002  restart004

:exclamation: Note: restart005 is not synced as it is an intermediate restart directory

Check the local archive directory. Every output other than the latest should be deleted, and every restart other than the last permanently saved restart and any subsequent intermediate restarts should also be deleted.

$ ls archive/ 
metadata.yaml  output005  restart004  restart005

Exercise: Sync the entire archive

To sync all restarts, you can add the --sync-restarts flag to payu sync. This is particularly useful when an experiment is finished, or temporarily halted for some time, to make sure all outputs and restarts have been copied to non-ephemeral storage.

  1. Wait until all jobs from the previous exercise have completed.
  2. Run payu sweep to move all log files to the archive directory.
  3. Run the payu sync --sync-restarts command.
  4. When the job has completed (when the log files for the sync job have been created), check the remote archive. We expect it to contain all outputs and restarts:
$ ls /scratch/nf33/<replace-with-user-id>/tmp/test-sync-experiment-archive/
git-runlog  metadata.yaml  output000  output001  output002  output003  output004  output005  pbs_logs  restart000  restart002  restart004  restart005

Collaboration with GitHub

Why collaborate?

Just some examples of what collaboration can do:

  • Save time and resources by avoiding wasteful duplication or known pitfalls or errors
  • Assist inexperienced researchers to become productive faster
  • Bring new skills and perspectives from other disciplines

What is GitHub?

From wikipedia:

GitHub is a developer platform that allows developers to create, store, manage and share their code. It uses Git software, providing the distributed version control of Git plus access control, bug tracking, software feature requests, task management, continuous integration, and wikis for every project.

Why GitHub?

  • World’s largest source code host (>100 million developers, >420 million repositories)
  • Free for open source projects
  • Built in support for automated CI/CD (GitHub Actions)
  • Easy cloning: just fork a repo
  • Visibility and documentation of issues
  • Visibility of fixes and support for code reviews via pull-requests

Introducing gh

This tutorial uses the GitHub command-line interface (CLI) client gh to interact with GitHub.

GitHub is a web site, and it is possible to use the web interface to do things like create a repository, but for this purpose it is simpler to provide commands that can be used directly on gadi.

Authorise with GitHub

gh is included in the payu modules supported by ACCESS-NRI. As long as the payu command is available, gh should be too.
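
To confirm gh is available once the payu module is loaded, check its version:

gh --version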

The first step is to authorise with GitHub:

gh auth login

This will prompt for a series of responses. Select the responses shown below:

? What account do you want to log into? GitHub.com
? What is your preferred protocol for Git operations on this host? HTTPS
? Authenticate Git with your GitHub credentials? Yes
? How would you like to authenticate GitHub CLI? Login with a web browser

! First copy your one-time code: XXXX-XXXX
Press Enter to open github.com in your browser... 

At this point you will get an error opening a browser on gadi:

! Failed opening a web browser at https://github.com/login/device
  exec: "xdg-open,x-www-browser,www-browser,wslview": executable file not found in $PATH
  Please try entering the URL in your browser manually

So open https://github.com/login/device in your browser, authenticate with GitHub if you’re not already logged in, then copy the one-time code from your terminal window and paste it in. Authentication should then complete:

✓ Authentication complete.
- gh config set -h github.com git_protocol https
✓ Configured git protocol
! Authentication credentials saved in plain text
✓ Logged in as xxxxxxxxxxx

To check the authentication status, use gh auth status:

$ gh auth status
github.com
  ✓ Logged in to github.com account xxxxxxxxx (/home/XXX/xxxXXX/.config/gh/hosts.yml)
  - Active account: true
  - Git operations protocol: https
  - Token: gho_************************************
  - Token scopes: 'gist', 'read:org', 'repo'

Perturbation

In this section we will clone from a pre-existing control experiment, and create related perturbation experiments from the same control directory.

Branches

There is now support in payu to run multiple related experiments from the same control directory, though only one experiment can be running at any one time. To distinguish between branches, the work and archive directory names combine the control directory name, the branch name and the first 8 characters of the experiment UUID. We will refer to this as the experiment name.

To change between branches in a control directory, use payu checkout. This is a wrapper around git checkout which also sets up the archive and work directory symlinks. To create and check out a new branch, use the -b command-line flag.

By default, payu checkout -b uses the current branch as a base. To start from an earlier commit or branch instead, add it to the end of the command. For example,

payu checkout -b <new-branch-name> <base-commit-or-branch-name>

Similarly to payu clone, use --restart/-r to specify the restart path to start the model run from. This adds the restart option to config.yaml, which is used as the starting point of a run. This option has no effect if there are existing restart directories, so it does not have to be removed for subsequent runs.
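
For reference, the line added to config.yaml looks something like the following (the path shown is illustrative only):

# Starting point for the first run of this experiment
restart: /path/to/existing/experiment/archive/restart020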

For more information, run payu checkout --help or see payu documentation on Metadata and Related Experiments.

Exercise: Clone pre-existing control experiment

There are control experiments available for ACCESS-ESM1.5 and ACCESS-OM2.

For this exercise clone one of these repositories into the ~/payu-training directory.

Answer
cd ~/payu-training 

ACCESS-ESM1.5

gh repo clone git@github.com:ACCESS-Community-Hub/access-esm1.5-preindustrial-concentrations-example
cd access-esm1.5-preindustrial-concentrations-example

ACCESS-OM2

gh repo clone ACCESS-Community-Hub/access-om2-1deg_jra55_ryf-example
cd access-om2-1deg_jra55_ryf-example

Exercise: Create first perturbation experiment

To run a perturbation experiment from an existing experiment, restart files from that experiment are required.

Restarts for the ACCESS-OM2 and ACCESS-ESM1.5 experiments have been copied to

/g/data/nf33/public/training-day-2024/payu-training/experiments/
$ ls -gh /g/data/nf33/public/training-day-2024/payu-training/experiments/
total 8.0K
drwxr-s---+ 9 nf33 4.0K Sep  2 00:48 1deg_jra55_ryf-control-d0683f7e
drwxr-s---+ 8 nf33 4.0K Sep  2 00:59 20240827-release-preindustrial+concentrations-run-0225dcf2

The available restarts dictate where the experiment can be branched from. Examine which restarts are available and choose the second-to-last restart, so the experiment is as equilibrated as possible while there are still control outputs available to compare against our perturbation. Determine the commit hash corresponding to the end of that run.

Check out a new experiment, perturb1, using the restart path and commit hash determined above. Modify a model parameter and change the run length to 1 month, then git commit the changes with an informative commit message.

Answer

ACCESS-ESM1.5

Examine the directory for available restarts:

$ ls -gh /g/data/nf33/public/training-day-2024/payu-training/experiments/20240827-release-preindustrial+concentrations-run-0225dcf2/
total 36K
drwxr-s---+ 2 nf33 4.0K Aug 30 21:18 error_logs
-rw-r--r--+ 1 nf33 2.1K Sep  2 01:00 metadata.yaml
drwx--S---+ 2 nf33  12K Sep  1 22:33 pbs_logs
drwx--S---+ 6 nf33 4.0K Aug 27 11:18 restart000
drwx--S---+ 6 nf33 4.0K Aug 28 00:42 restart010
drwx--S---+ 6 nf33 4.0K Aug 30 11:14 restart020
drwx--S---+ 6 nf33 4.0K Sep  2 00:59 restart030

restart020 is the second to last.

Examine the git log, either in the local repo, or on GitHub

Run 20 is commit 0f2e2bb

payu checkout -r /g/data/nf33/public/training-day-2024/payu-training/experiments/20240827-release-preindustrial+concentrations-run-0225dcf2/restart020 -b perturb1 0f2e2bb

ACCESS-OM2

Examine the directory for available restarts:

$ ls -lg /g/data/nf33/public/training-day-2024/payu-training/experiments/1deg_jra55_ryf-control-d0683f7e/
total 32
drwxr-s---+ 7 nf33 4096 Aug 31 11:09 git-runlog
-rw-r-----+ 1 nf33 2254 Sep  1 22:36 metadata.yaml
drwx--S---+ 2 nf33 4096 Sep  2 00:45 pbs_logs
drwx--S---+ 5 nf33 4096 Aug 30 13:34 restart000
drwx--S---+ 5 nf33 4096 Aug 30 18:18 restart004
drwx--S---+ 5 nf33 4096 Aug 30 23:01 restart008
drwx--S---+ 5 nf33 4096 Aug 31 03:46 restart012
drwx--S---+ 5 nf33 4096 Aug 31 08:30 restart016

restart012 is the second to last.

Examine the git log, either in the local repo, or on GitHub

Run 12 is commit 4242995

payu checkout -r /g/data/nf33/public/training-day-2024/payu-training/experiments/1deg_jra55_ryf-control-d0683f7e/restart012 -b perturb1 4242995

Change the model run length as before (config.yaml for ACCESS-ESM1.5, or accessom2.nml for ACCESS-OM2).

git commit -a -m 'Modified xx parameter and set run length to one month'

This branch is now all set up to run a perturbation experiment.

List the branch you are currently on:

payu branch

Display the archive symlink and experiment name:

ls -l archive

Display the git history:

git log

You should see a new commit recording the new experiment UUID (added when payu checkout was used), and the previous commit should be the last commit of the previous run.

To see the new metadata.yaml fields, including the experiment name and UUID:

cat metadata.yaml

Run

Now run the perturbation experiment for one month:

payu setup
payu run -f

:bulb: Hint: a separate payu sweep isn’t necessary if the -f option is used with payu run. This automatically removes (sweeps) any existing work directory.

Exercise: Create second perturbation experiment

Once the first perturbation experiment has completed, create a second perturbation, ideally related to the first in a meaningful way, e.g. the opposite sign of change, or a parameter that is orthogonal but physically related.

Repeat steps above:

  1. Check out a new experiment, perturb2 (see the sketch after this list). Make sure to check out from the same base commit and restart as perturb1.
  2. Modify a model parameter.
  3. git commit the change.
  4. Run.
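
For example, for the ACCESS-ESM1.5 case the checkout mirrors the perturb1 command, reusing the same restart path and commit hash from the previous answer:

payu checkout -r /g/data/nf33/public/training-day-2024/payu-training/experiments/20240827-release-preindustrial+concentrations-run-0225dcf2/restart020 -b perturb2 0f2e2bb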

Examining branches

Once the second perturbation experiment has completed you should have two experiment branches.

As long as an experiment isn’t running, and any associated post-processing or syncing has completed, it is safe to switch between experiments.

Exercise: list available experiments (branches)

Answer
payu branch

Exercise: checkout perturb1

Check out the first perturbation experiment. Note that the symlink to the archive directory also changes. Payu does this automatically, and this is one of the reasons why it is generally better to use payu checkout to switch between branches rather than using git directly.

Answer
payu checkout perturb1

Push to repo

Now you can create a repository from your perturbation experiment control directory using

gh repo create

Follow the prompts and enter the information requested. The repository name will default to the directory name of your control directory. The repository owner should be the GitHub username you used to authenticate. Choose public visibility. The remote name is just an alias for your repository that git uses when doing a push or pull.

? What would you like to do? Push an existing local repository to GitHub
? Path to local repository .
? Repository name XXXXX
? Repository owner xxxxxxxxx
? Description A nice description of the purpose of the repository
? Visibility Public
✓ Created repository xxxxxxxxx/XXXXX on GitHub
  https://github.com/xxxxxxxxx/XXXXX
? Add a remote? Yes
? What should the new remote be called? myrepo
✓ Added remote git@github.com:xxxxxxxxx/XXXXX.git
? Would you like to push commits from the current branch to "myrepo"? Yes

Push branches to GitHub

If you have other experiment branches you wish to push to the same repo then use:

git push myrepo --all

Compare with control (optional)

If you want to compare your perturbation experiment with the control outputs, they are available here:

/scratch/nf33/public/training-day-2024/payu-training/experiments/

Fork an experiment repo

Exercise: collaborate with a colleague

  1. Find a collaborator in the room.
  2. Fork their perturbation experiment repo.
  3. Clone the fork to gadi (I recommend using gh).
  4. List the available branches.
  5. Choose a branch and check it out using payu at a specific commit with a restart path.
  6. Modify the perturbation and run a single month.
  7. Push your branch back to your fork.
  8. Add each other’s fork as a git remote and check out their experiment.

Long experiment run finished

Once the long experiment has finished running, change into the control directory and examine the outputs in archive/output000.

Note the different output layouts of the model components:

Ice

  • Output is stored in ice/OUTPUT directory
  • Diagnostic files contain multiple variables

Ocean

  • Outputs are (mostly) split into one diagnostic variable per file

Atmosphere