How to check programmatically if an ACCESS-ESM experiment is still running

I’m running a few experiments using ACCESS-ESM with payu and I wrote a script that checks their status (whether they ran all the steps correctly, whether the collation step was successful, whether the data was synced correctly, etc.). I’d also like to know if the model is still running, so I can distinguish between an experiment that is running fine but not yet complete and one that crashed midway for some reason.

Is there a way of checking if the experiment is running? Maybe a particular file that is only created once the model stops?

Excellent question!

Separately, any chance I could borrow your script to check which aspects of a bunch of ESM1-5 runs failed? I hacked something together to see which runs produced output files, and I have an inbox full of NCI emails telling me somethingerother_c ran out of walltime, but it’d be nice to have something more systematic.

Sure! I copied it to /scratch/public/ec0044/verify-experiments.

A lot of the logic is not super general and will only work with my particular setup, but you can take it as a base to modify for your use case.

I’ve got different experiments in an experiments folder. They all branch from a spinup experiment, which also has a “base” experiment holding all the configuration. Each ensemble member branches from a different year and is named “m{year}”. Something like this:

experiments/
├── exp1
│   └── m0101
├── exp2
│   ├── m0101
│   └── m0110
├── exp3
│   └── m0101
└── spinup
    ├── base
    ├── m0101
    ├── m0110
    ├── m0120
    ├── m0130
...

My sync directory is then in analysis/data/raw_data/experiments/{experiment}/{member}. My spinup is one year long (so I need to check that there is only one output00* directory) and my experiments are two years long (so I check for two).

The rest is checking that the archive folder has an atmosphere/netCDF folder (if not, then the collate step failed) and then checking that all folders that exist in the experiment folder are also in the sync folder.

I don’t know if it’s super robust, but it checks for all the problems I’ve had so far and seems to be working.
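Stripped down, the checks look roughly like this (a sketch in Python; the paths, folder names and expected output counts are specific to my setup, so treat them as placeholders):

from pathlib import Path

# Paths and folder names below are assumptions based on my layout; adjust for yours.
EXPERIMENTS = Path("experiments")
SYNC_ROOT = Path("analysis/data/raw_data/experiments")

def check_member(experiment, member, expected_outputs):
    """Return a list of problems found for one ensemble member."""
    problems = []
    archive = EXPERIMENTS / experiment / member / "archive"
    sync = SYNC_ROOT / experiment / member

    # 1. Right number of archived outputs? (1 for spinup, 2 for my experiments)
    outputs = sorted(archive.glob("output0*"))
    if len(outputs) != expected_outputs:
        problems.append(f"expected {expected_outputs} outputs, found {len(outputs)}")

    # 2. Collation: each output should contain an atmosphere/netCDF folder.
    for output in outputs:
        if not (output / "atmosphere" / "netCDF").is_dir():
            problems.append(f"collate failed in {output.name}")

    # 3. Sync: every archived output folder should also exist in the sync directory.
    for output in outputs:
        if not (sync / output.name).is_dir():
            problems.append(f"{output.name} missing from sync directory")

    return problems

# Spinup members are one year long, experiments two years long
print(check_member("spinup", "m0101", expected_outputs=1))
print(check_member("exp2", "m0110", expected_outputs=2))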

Hi Elio,
I haven’t looked at your script, but when I’m running something new that may or may not crash, I tend to hover in the run folder using watch ls.
This runs the ls command every 2 seconds, so you can see new files get created, etc. Anyway, when the model is in the middle of a run, you should see the files

access.out
access.err

And if the model crashes, you will see the file

job.yaml

The job.yaml eventually gets put into the output??? folder if the run completes successfully, but I look for the creation of that file to know when a run has crashed prematurely.

Hi Elio - normally we just look at qstat to see whether it’s still running, and then check that the ‘work’ directory has been archived.

See Run ACCESS-ESM - ACCESS-Hive Docs for details.

There is not currently an automated way provided by ACCESS-NRI to determine if a run has failed, but there is some work going on to better automate running ensembles of experiments. I’ve made an issue here to discuss this suggestion - but it won’t get implemented for a while.
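If you want to fold the qstat check into a script, something like this could work as a starting point (just a sketch - it assumes the PBS job name payu submits with matches the experiment name, and since qstat truncates displayed names it only does a prefix match):

import getpass
import subprocess

def pbs_job_running(jobname):
    """Rough check for a queued/running PBS job belonging to the current user.
    Assumes the submitted job name matches the experiment name; qstat truncates
    the displayed name, so only a short prefix is compared."""
    result = subprocess.run(
        ["qstat", "-u", getpass.getuser()],
        capture_output=True, text=True, check=False,
    )
    return any(jobname[:8] in line for line in result.stdout.splitlines())

if pbs_job_running("exp2_m0110"):
    print("still queued or running - skip the completeness checks")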

Ah, nice. So the logic would be

  • if access.out exists, then the model is running
  • if job.yaml exists in the root of the experiment, then the model is not running and has crashed.
  • if job.yaml exists in the output folder, then the model is not running and finished correctly

@anton Yeah, I’m checking qstat manually, but the issue is that if my verification script doesn’t take into account that a model is still running, it might tell me that an experiment is in a wrong state simply because it hasn’t finished yet.

Yep, that’s kind of the idea. Just to be precise:

  • If access.out exists and job.yaml doesn’t exist, then the model is running
  • If access.out exists and job.yaml exists, then the model has crashed
  • If both access.out and job.yaml have been moved to the output folder, then everything finished successfully.
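In script form, that heuristic could look something like this (a sketch; it assumes access.out and job.yaml sit in the control directory while the run is active and end up under archive/output??? on success, as described above):

from pathlib import Path

def run_status(control_dir):
    """Classify an experiment as running, crashed or finished using the
    access.out / job.yaml heuristic above. Assumes 'archive' is reachable
    from the control directory (e.g. via payu's archive symlink)."""
    access_out = control_dir / "access.out"
    job_yaml = control_dir / "job.yaml"

    if access_out.exists() and not job_yaml.exists():
        return "running"
    if access_out.exists() and job_yaml.exists():
        return "crashed"
    # Neither file left in the control directory: see if job.yaml ended up
    # in the most recent archived output, which indicates a clean finish.
    outputs = sorted((control_dir / "archive").glob("output*"))
    if outputs and (outputs[-1] / "job.yaml").exists():
        return "finished"
    return "unknown"

print(run_status(Path("experiments/exp2/m0110")))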

Hi, this is related to an open issue in payu to make it easier to check the status of jobs in a payu run.

At the moment, job.yaml is written to the control directory (the root experiment directory) once the model has been run. If the model exited with an error, it is also copied to an error_logs directory in archive before the job exits. Otherwise, payu moves the job.yaml to the work directory, and then during the archive step it is moved along with the outputs in the work directory to the archive directory. The PAYU_JOB_STATUS field in job.yaml also holds the exit code of the model run command.
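So as well as checking where job.yaml ends up, a script can read the exit code out of it, along these lines (a sketch; needs PyYAML, and the path shown is just an example):

from pathlib import Path
import yaml  # PyYAML

def payu_exit_code(job_yaml_path):
    """Read the model run's exit code from a job.yaml via its PAYU_JOB_STATUS field."""
    with open(job_yaml_path) as f:
        return yaml.safe_load(f).get("PAYU_JOB_STATUS")

# job.yaml might be in the control directory, in archive/error_logs, or inside
# an archived output directory, depending on how the run ended.
code = payu_exit_code(Path("experiments/exp2/m0110/job.yaml"))
if code not in (None, 0):
    print(f"model run exited with code {code}")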

In my current work, I was changing job.yaml to be written right at the end of payu run (after archive, before the job exits) to try to get some updated PBS resource stats for the current job from the scheduler, and I may also change some of the fields. So it was good to see this forum post - I’ll either keep the old job.yaml file for backwards compatibility or make sure there is some other way to see where the payu job is up to.


Amazing. Some kind of supported “official” way of monitoring the status of a run would be fantastic.

In the meantime, I’ll use @dkhutch’s method, thanks!
