21st March 2025 - How to run an ACCESS-OM2 or OM3 model

When?

11am Friday 21st March 2025, here.

What is this?

A tutorial on how to run the ACCESS-OM2 model for the first time (from rest, a restart or a perturbation!).

Writing credits: @cbull @helen @Aidan. Testing: @NoahDay.

Prerequisites:

Further information

Logging in to NCI

You need to join an NCI project (join your own project – ask a supervisor if you are unsure, then join vk83 and qv56)

ssh -X <your-NCI-username>@gadi.nci.org.au
  • Replace <your-NCI-username> with your NCI username.
  • -X enables graphical forwarding (i.e. windows and plots can be displayed on your local machine)
File system :: Description
/home :: Backed up. 10 GiB fixed quota per user.
/scratch :: Not backed up; temporary files; an auto-purge policy is applied.
/g/data :: Not backed up; long-term, large data files.
/apps :: Read only; centrally installed software applications and their module files.
$PBS_JOBFS :: Not backed up; local to the node; for I/O-intensive data.
massdata :: Backed up; for archiving large data files (accessed with the mdss commands below).
○ man mdss :: read the manual for all mdss commands
○ mdss dmls -l :: list files with their status: on disk cache (REG), on tape (OFF), or both (DUL)
○ mdss put/get :: put files into, or retrieve files from, the mass-data store
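For example, a minimal sketch of archiving a file to massdata and retrieving it later (the tar filename is a placeholder for your own archive file):

mdss put outputs.tar        # copy a file into your project's mass-data store
mdss dmls -l                # list files with their disk/tape status (REG/OFF/DUL)
mdss get outputs.tar        # retrieve the file again later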

Experiment manager: Payu

Payu is an experiment manager and model running tool. It is the tool used to run the ACCESS models covered in this tutorial.

Payu is written in Python. See the documentation or the GitHub repository for more information.

The latest version is 1.1.6, which is the minimum required for this tutorial. ACCESS-NRI provides supported conda environments for payu, which also contain other dependencies and tools required to run ACCESS-NRI supported models. These can be accessed via the module system:

module use /g/data/vk83/modules
module load payu
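To check that the module loaded correctly, you can then run, for example:

module list      # should show the payu module that was just loaded
which payu       # should resolve to the ACCESS-NRI conda environment under /g/data/vk83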

Running OM2

Okay, here’s where the magic happens! Below, we make an experiment directory (via mkdir and payu clone ...), then run the 1deg_jra55_ryf configuration using payu run:

mkdir -p ~/access-om2
cd ~/access-om2
payu clone --new-branch expt --branch release-1deg_jra55_ryf https://github.com/ACCESS-NRI/access-om2-configs 1deg_jra55_ryf
#do a `git branch -vv`, and you'll note this has made a new local branch 'expt'
cd 1deg_jra55_ryf
payu run

What if I want to extend an existing experiment (e.g. a perturbation from a control experiment)?

Use the clone command with the --restart option:

payu clone --new-branch <my-branch> --branch <control-branch> --restart <folder-path> <repo-URL> <my-folder>

where

  • <my-branch> :: name of your experiment branch
  • <control-branch> :: branch of control experiment in repo
  • <folder-path> :: path to the restart files on filesystem
  • <repo-URL> :: control experiment git repo (GitHub or local clone)
  • <my-folder> :: path of new experiment control directory

Optionally, add --start-point <GIT_REF> to branch your experiment
from a specific git commit or branch that corresponds to the specified
restart folder. This means your experiment starts with exactly the same model configuration as when the restart was created.
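As a hedged illustration only (the branch names, restart path and target folder below are placeholders; substitute the details of your own control experiment):

payu clone --new-branch my_perturb \
           --branch release-1deg_jra55_ryf \
           --restart /scratch/<project>/<user>/access-om2/archive/<control-experiment>/restart004 \
           --start-point <GIT_REF> \
           https://github.com/ACCESS-NRI/access-om2-configs 1deg_jra55_ryf_perturb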

Then change the experiment configuration as required, commit the changes using git commit, and then payu run.

Note: this process is very similar for ACCESS-ESM and ACCESS-CM; for example, see the ESM instructions here.

Monitoring the run: PBS

Additional information here. And run man qstat.

Three particularly important commands:

  1. qsub job.sh - Submit the job defined in the submission script job.sh
  2. qstat -swx - Show the status of your jobs
  3. qdel <jobid> - Delete the job with the given job ID

Here’s an example of how to use qstat.
$ qstat -swx

The -swx options break down as:

-s: show an extra line with the scheduler's status comment for each job

-w: wide format - display the output in wider columns so fields are not truncated

-x: also include finished (historical) jobs, not just queued and running ones

The screenshot below shows the qstat output for job 12345678.gadi-pbs:

  1. User aaa777 submitted the job
  2. To the normal-exec queue
  3. They requested 48 cores and 190 GiB memory
  4. It requested 2:00 hours of walltime and has been running for 0:35:21
  5. The line at the bottom indicates when the job started, which Gadi node it is running on (2697), and the space reserved on jobfs

Understanding a PBS script

PBS directive :: Description
#PBS -P :: Project for job debiting, /scratch project folder access and data ownership
#PBS -q :: Queue to submit the job to
#PBS -l ncpus= :: Number of CPU cores requested
#PBS -l storage=<scratch/prj1+gdata/prj2+massdata/prj3> :: Storage that needs to be available inside the job; massdata is only available in copyq jobs
#PBS -l ngpus= :: Number of GPUs; ncpus must be 12 x ngpus and the job must be submitted to the gpuvolta queue
#PBS -l walltime=hh:mm:ss :: Maximum walltime the job can run for
#PBS -l mem=<10GB> :: Memory allocation
#PBS -l jobfs=<40GB> :: Disk allocation on the compute/copyq node(s)
#PBS -l software=<app1,app2> :: Licences required
#PBS -l wd :: Start the job in the directory from which it was submitted
#PBS -W depend=beforeok:<jobid1,jobid2> :: Set dependencies between this and other jobs
#PBS -a :: Time after which the job is eligible for execution
#PBS -M <email@example.com,email2@anu.edu.au> :: List of recipients to whom email about the job is sent
#PBS -m :: Email events: a for abort, b for begin, e for end, n for none
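To put these directives in context, here is a minimal sketch of a submission script (the ab12 project code, the storage list, the resource requests and the my_program executable are all placeholders); it would be submitted with qsub job.sh:

#!/bin/bash
#PBS -P ab12
#PBS -q normal
#PBS -l ncpus=48
#PBS -l mem=190GB
#PBS -l walltime=02:00:00
#PBS -l jobfs=10GB
#PBS -l storage=gdata/vk83+scratch/ab12
#PBS -l wd

./my_program    # hypothetical executable run inside the job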

Model log files

While the model is running, payu saves the model standard output and error streams in the access-om2.out and access-om2.err files inside the control directory, respectively.
You can examine the contents of these files to check on the status of a run as it progresses (or after a failed run has completed).

At the end of a successful run these log files are archived to the archive directory and will no longer be found in the control directory. If they remain in the control directory after the PBS job for a run has completed it means the run has failed.

If the model crashes then, most of the time, the errors will be detailed in these files (located in your control directory):
access-om2.err
access-om2.out

but if you need more information then check these files:
work/ocean/log/*
work/ICE/log/*
work/atmosphere/log/*
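For example, to skim the end of the error stream or search the component logs for error messages (a generic sketch, run from the control directory):

tail -n 50 access-om2.err                  # last lines of the model's error stream
grep -ri error work/*/log/ | head -n 20    # search all component log files for errors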

Model Live Diagnostics

ACCESS-NRI developed the Model Live Diagnostics framework to check, monitor, visualise, and evaluate model behaviour and progress of ACCESS models currently running on Gadi.
For complete documentation on how to use this framework, check the Model Diagnostics documentation.

ACCESS-OM2 outputs

When your experiment has finished, a new folder will be created to store the outputs and linked into your control directory as `archive`.

ls archive

will show you these folders:

metadata.yaml  output000  pbs_logs  restart000

The folder with the outputs most useful for science applications is output000, which contains a large number of netCDF files in subfolders.
To see the netCDF files from the ice and ocean models (respectively), run:

 ls archive/output000/ice/OUTPUT/
 ls archive/output000/ocean/
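If you want a quick look inside one of these files, ncdump prints the header; the module name below is an assumption (any NCI netCDF module providing ncdump will do) and the filename is a placeholder for a real file from the listings above:

module load netcdf
ncdump -h archive/output000/ocean/<some-output-file>.nc | less   # inspect dimensions, variables and metadata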

What if I want Payu to copy my output to another location?

By default, payu laboratory and archive directories are in /scratch storage. The scratch filesystem is temporary storage where files not accessed for 100 days will be automatically removed. For this reason, it is often required for model outputs to be moved to /g/data/ for long-term data storage.

Payu has some built-in syncing support: it uses rsync commands under the hood and runs in a separate PBS job.

The sync subsection in config.yaml should look similar to the following:

sync:
    enable: true
    restarts: true
    path: /scratch/nf33/<replace-with-user-id>/tmp/test-sync-experiment-archive

(More details here)
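With syncing configured as above, recent payu versions also provide a sync subcommand that can be run manually from the control directory:

payu sync    # submits a PBS job that rsyncs the archive to the configured path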

Edit ACCESS-OM2 configuration

When editing your configuration, it is good practice to set runlog: true in config.yaml, so that your changes are automatically committed to git each time you run your experiment:
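runlog: true    # commit configuration changes automatically each time the experiment is run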

Queue settings

These are set in config.yaml. The default settings are:

queue: normal
walltime: 3:00:00
jobname: 1deg_jra55_ryf
mem: 1000GB

These set which queue the job is submitted to, how much walltime it needs, the name of the job (which will appear when running qstat) and how much memory the run needs. Ask for the least amount of resources needed to do the job: asking for too much will result in longer queue times, while asking for too little will slow down your job or crash it.

Run consecutive years/restarts

Once your model run has finished, you can continue from the point where it stopped using:

payu sweep
payu run

These commands will clean away the old run, set it up for a new run and then resubmit your job. The outputs from the new run will then be stored in a new folder:

 ls archive/output001/ice/OUTPUT/
 ls archive/output001/ocean/

This method can be cumbersome if you are running long simulations, so see below for ways to run longer simulations and automate the restarts.
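One common way to automate the resubmission (assuming payu's -n/--nruns option) is to queue several run segments back-to-back, for example:

payu sweep
payu run -n 5    # queue 5 consecutive run segments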

Change run length

The steps needed to change the length of a model run are different in ACCESS-ESM than in ACCESS-OM2. In this tutorial we focus on changing the run length for ACCESS-OM2;
instructions on how to change the run length in ACCESS-ESM can be found here.

One of the most common changes is to adjust the duration of the model run.
For example, when debugging changes to a model, it is common to reduce the run length to minimise resource consumption and return faster feedback on changes.

The run length is controlled by the restart_period field in the &date_manager_nml section of the ~/access-om2/1deg_jra55_ryf/accessom2.nml file:

&date_manager_nml
    forcing_start_date = '1958-01-01T00:00:00'
    forcing_end_date = '2019-01-01T00:00:00'
    ! Runtime for a single segment/job/submit, format is years, months, seconds,
    ! two of which must be zero.
    restart_period = 5, 0, 0
&end

The format is restart_period = <number_of_years>, <number_of_months>, <number_of_seconds>, and, as the comment in the file notes, two of the three values must be zero.

For example, to make each run segment one month long, change restart_period to:

restart_period = 0, 1, 0

Troubleshooting: Error and output files

Troubleshooting: Payu

If payu doesn’t run correctly for some reason, a good first step is to run the following command from within the control directory:

payu setup

The output from this command will look something like this:

laboratory path: /scratch/$project/$user/access-om2
binary path: /scratch/$project/$user/access-om2/bin
input path: /scratch/$project/$user/access-om2/input
work path: /scratch/$project/$user/access-om2/work
archive path: /scratch/$project/$user/access-om2/archive
/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/payu/metadata.py:189: MetadataWarning: No pre-existing archive found. Generating a new uuid
Updated metadata. Experiment UUID: 2c91324a-c432-48ff-bbd8-71ed65163d7a
payu: Found modules in /opt/Modules/v4.3.0
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
Setting up atmosphere
Setting up ocean
Setting up ice
Setting up access-om2
Checking exe, input and restart manifests
Writing manifests/input.yaml
Writing manifests/restart.yaml

This command will:

  • create the laboratory and work directories based on the experiment configuration
  • generate manifests
  • report useful information to the user, such as the location of the laboratory where the work and archive directories are located

If you run payu setup, make sure you run payu sweep before starting your experiment using payu run.

Recap for intake-datastore

Check in with @CharlesTurner and @anton - was creating intake catalog going to be automatic? Is there anything we should say about this?