How to check programmatically if an ACCESS-ESM experiment is still running

I’m running a few experiments using ACCESS-ESM with payu and I wrote a script that checks their status (whether they ran all the steps correctly, whether the collation step was successful, whether the data was synced correctly, etc.). I’d also like to know if the model is still running, so I can distinguish between an experiment that is running fine but not yet complete and one that crashed midway for some reason.

Is there a way of checking if the experiment is running? Maybe a particular file that is only created once the model stops?

Excellent question!

Separately, any chance I could borrow your script to check which aspects of a bunch of ESM1-5 runs failed? I hacked something together to see which runs produced output files, and I have an inbox full of NCI emails telling me somethingerother_c ran out of walltime, but it’d be nice to have something more systematic.

Sure! I copied it to /scratch/public/ec0044/verify-experiments.

A lot of the logic is not super general and will only work with my particular setup, but you can take it as a base to modify for your use case.

I’ve got different experiments in an experiments folder. They all branch from a spinup experiment, which also has a “base” experiment holding all the configuration. Each ensemble member branches from a different year and is named “m{year}”. Something like this:

experiments/
├── exp1
│   └── m0101
├── exp2
│   ├── m0101
│   └── m0110
├── exp3
│   └── m0101
└── spinup
    ├── base
    ├── m0101
    ├── m0110
    ├── m0120
    ├── m0130
...

My sync directory is then in analysis/data/raw_data/experiments/{experiment}/{member}. My spinup is one year long (so I need to check that there is only one output00* directory) and my experiments are two years long (so I check for two).

The rest is checking that the archive folder has an atmosphere/netCDF folder (if not, then the collate step failed) and then checking that all folders that exist in the experiment folder are also in the sync folder.

I don’t know if it’s super robust, but it checks for all the problems I’ve had so far and seems to be working.
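Stripped down, the checks look roughly like this (a sketch in Python; the paths, folder names and expected output counts are specific to my setup, so treat them as placeholders):

from pathlib import Path

# Paths and folder names below are assumptions based on my layout; adjust for yours.
EXPERIMENTS = Path("experiments")
SYNC_ROOT = Path("analysis/data/raw_data/experiments")

def check_member(experiment, member, expected_outputs):
    """Return a list of problems found for one ensemble member."""
    problems = []
    archive = EXPERIMENTS / experiment / member / "archive"
    sync = SYNC_ROOT / experiment / member

    # 1. Right number of archived outputs? (1 for spinup, 2 for my experiments)
    outputs = sorted(archive.glob("output0*"))
    if len(outputs) != expected_outputs:
        problems.append(f"expected {expected_outputs} outputs, found {len(outputs)}")

    # 2. Collation: each output should contain an atmosphere/netCDF folder.
    for output in outputs:
        if not (output / "atmosphere" / "netCDF").is_dir():
            problems.append(f"collate failed in {output.name}")

    # 3. Sync: every archived output folder should also exist in the sync directory.
    for output in outputs:
        if not (sync / output.name).is_dir():
            problems.append(f"{output.name} missing from sync directory")

    return problems

# Spinup members are one year long, experiments two years long
print(check_member("spinup", "m0101", expected_outputs=1))
print(check_member("exp2", "m0110", expected_outputs=2))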

Hi Elio,
I haven’t looked at your script, but when I’m running something new that may or may not crash, I tend to hover in the run folder using watch ls.
This runs the ls command every 2 seconds, so you can see new files get created, etc. Anyway, when the model is in the middle of a run, you should see the files

access.out
access.err

And if the model crashes, you will see the file

job.yaml

The job.yaml eventually gets put into the output??? folder if the run completes successfully, but I look for the creation of that file to know when a run has crashed prematurely.

Hi Elio - normally we just look at qstat to see whether it’s still running, and then check that the ‘work’ directory has been archived.

See Run ACCESS-ESM - ACCESS-Hive Docs for details.

There is not currently an automated way provided by ACCESS-NRI to determine if a run has failed, but there is some work going on to better automate running ensembles of experiments. I’ve made an issue here to discuss this suggestion - but it won’t get implemented for a while.
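If you want to fold the qstat check into a script, something like this could work as a starting point (just a sketch - it assumes the PBS job name payu submits with matches the experiment name, and since qstat truncates displayed names it only does a prefix match):

import getpass
import subprocess

def pbs_job_running(jobname):
    """Rough check for a queued/running PBS job belonging to the current user.
    Assumes the submitted job name matches the experiment name; qstat truncates
    the displayed name, so only a short prefix is compared."""
    result = subprocess.run(
        ["qstat", "-u", getpass.getuser()],
        capture_output=True, text=True, check=False,
    )
    return any(jobname[:8] in line for line in result.stdout.splitlines())

if pbs_job_running("exp2_m0110"):
    print("still queued or running - skip the completeness checks")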

Ah, nice. So the logic would be

  • if access.out exists, then the model is running
  • if job.yaml exists in the root of the experiment, then the model is not running and has crashed.
  • if job.yaml exists in the output folder, then the model is not running and finished correctly

@anton Yeah, I’m checking qstat manually, but the issue is that if my verification script doesn’t take into account that a model is still running, it might tell me that an experiment is in a wrong state simply because it hasn’t finished yet.

Yep, that’s kind of the idea. Just to be precise:

  • If access.out exists and job.yaml doesn’t exist, then the model is running
  • If access.out exists and job.yaml exists, then the model has crashed
  • If both access.out and job.yaml have been moved to the output folder, then everything finished successfully.
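In script form, that heuristic could look something like this (a sketch; it assumes access.out and job.yaml sit in the control directory while the run is active and end up under archive/output??? on success, as described above):

from pathlib import Path

def run_status(control_dir):
    """Classify an experiment as running, crashed or finished using the
    access.out / job.yaml heuristic above. Assumes 'archive' is reachable
    from the control directory (e.g. via payu's archive symlink)."""
    access_out = control_dir / "access.out"
    job_yaml = control_dir / "job.yaml"

    if access_out.exists() and not job_yaml.exists():
        return "running"
    if access_out.exists() and job_yaml.exists():
        return "crashed"
    # Neither file left in the control directory: see if job.yaml ended up
    # in the most recent archived output, which indicates a clean finish.
    outputs = sorted((control_dir / "archive").glob("output*"))
    if outputs and (outputs[-1] / "job.yaml").exists():
        return "finished"
    return "unknown"

print(run_status(Path("experiments/exp2/m0110")))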

Hi, this is related to an open issue in payu to make it easier to check the status of jobs in a payu run.

At the moment, job.yaml is written to the control directory (the root experiment directory) once the model has been run. If the model exited with an error, it is also copied to an error_logs directory in archive before the job exits. Otherwise, payu moves the job.yaml to the work directory, and then during the archive step it is moved along with the outputs in the work directory to the archive directory. The PAYU_JOB_STATUS field in job.yaml also holds the exit code of the model run command.
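So as well as checking where job.yaml ends up, a script can read the exit code out of it, along these lines (a sketch; needs PyYAML, and the path shown is just an example):

from pathlib import Path
import yaml  # PyYAML

def payu_exit_code(job_yaml_path):
    """Read the model run's exit code from a job.yaml via its PAYU_JOB_STATUS field."""
    with open(job_yaml_path) as f:
        return yaml.safe_load(f).get("PAYU_JOB_STATUS")

# job.yaml might be in the control directory, in archive/error_logs, or inside
# an archived output directory, depending on how the run ended.
code = payu_exit_code(Path("experiments/exp2/m0110/job.yaml"))
if code not in (None, 0):
    print(f"model run exited with code {code}")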

In my current work, I was changing job.yaml to be written right at the end of payu run (after archive, before the job exits) to try to get some updated PBS resource stats for the current job from the scheduler, and I may also change some of the fields. So it was good to see this forum post - I’ll either keep the old job.yaml file for backwards compatibility or make sure there is some other way to see where the payu job is up to.


Amazing. Some kind of supported “official” way of monitoring the status of a run would be fantastic.

In the meantime, I’ll use @dkhutch’s method, thanks!
