Automatic resubmission in payu ACCESS-ESM1.5

Hi @Aidan,

Several of us, including @Dietmar_Dommenget, @HIMADRI_SAINI, @gpontes and myself, have found that our simulations with ACCESS-ESM1.5 crash from time to time for no apparent reason. Usually all that is required is to sweep and resubmit, but doing this manually after every crash takes time and slows us down.

Aidan mentioned in the working group meeting that the sweep and resubmit can be done automatically. I think we would find this very useful. Could you please help?

Thanks,
David

Background

The ACCESS-OM2 high resolution (0.1°) models often have issues with transient unreproducible errors.

For this reason a work-around was developed: when a run fails, the logs are checked for error conditions that match one of the known transient errors. If a matching error is found, the job is automatically resubmitted. As a safety mechanism, a job will only be resubmitted a set number of times in a row, in case the error is in fact reproducible and requires manual remediation.

Adding this to your experiment

To add this functionality to your experiment, copy the code below and run it in your experiment directory:

mkdir tools
wget -P tools https://raw.githubusercontent.com/ACCESS-NRI/access-om2-configs/release-01deg_jra55_ryf/tools/resub.sh
cat<< 'EOF' >> config.yaml
userscripts:
    error: tools/resub.sh
    run: rm -f resubmit.count
EOF
sed -i 's/access-om2.err/access.err/' tools/resub.sh
git add tools config.yaml
git commit -m 'Added automated resubmission'

Notes:

  1. Assumes userscripts isn’t already defined in your config.yaml. If it is then you may need to manually add the userscripts commands.
  2. Assumes model type is access, which is why the sed command above changes access-om2.err to access.err in resub.sh.
  3. The resub.sh defines some common transient errors. If the transient error that is affecting your runs isn’t listed you will need to add it. See the technical detail below.
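Regarding note 1, a quick way to check whether userscripts is already defined before running the snippet is sketched below. The function name has_userscripts is hypothetical, and the check assumes the key is written at the top level of config.yaml (starting in the first column). Appending a second userscripts block to a file that already has one would create a duplicate key, so in that case the error and run entries should be merged by hand.

```shell
# Illustrative helper; has_userscripts is a hypothetical name.
# Succeeds (exit status 0) if the given file already contains a
# top-level "userscripts:" key, i.e. one starting in the first column.
# Defaults to config.yaml in the current directory.
has_userscripts() {
    grep -q '^userscripts:' "${1:-config.yaml}"
}

# Example use, run from the experiment directory:
#   has_userscripts && echo "userscripts already defined, edit config.yaml by hand"
```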

Technical detail

The payu userscripts functionality is used to call a script (resub.sh in the tools sub-directory) when an error condition is returned from the model run (i.e. it crashed).

The number of times the model has been resubmitted is tracked in the resubmit.count file. When the model runs correctly the run userscript is invoked, which deletes the resubmit.count file, so the resubmission limit only applies to consecutive failures since the last successful run.

Currently the maximum number of resubmissions is set to 2:

This means the model will run and fail at most 3 times in total before payu halts. This was set for the relatively expensive high resolution OM2 model. If you have a cheaper model and find you need a higher value, then increase it.
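The counting logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of resub.sh: the names MAX_RESUBMISSIONS, COUNT_FILE and should_resubmit are hypothetical, and only the resubmit.count file name and the limit of 2 come from the description above.

```shell
# Illustrative sketch only; not the actual resub.sh.
# MAX_RESUBMISSIONS, COUNT_FILE and should_resubmit are hypothetical names.
MAX_RESUBMISSIONS=2
COUNT_FILE=resubmit.count

# Call after a failed run.
# Returns 0 and increments the counter if another resubmission is allowed;
# returns 1 without touching the counter once the limit has been reached.
should_resubmit() {
    count=0
    if [ -f "$COUNT_FILE" ]; then
        count=$(cat "$COUNT_FILE")
    fi
    if [ "$count" -ge "$MAX_RESUBMISSIONS" ]; then
        return 1
    fi
    echo $((count + 1)) > "$COUNT_FILE"
    return 0
}
```

With a limit of 2, the first and second consecutive failures are resubmitted and the third is not. Deleting resubmit.count after a successful run, as the run userscript does, starts the count again from zero.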

All resubmissions (or failed resubmissions) are logged in resubmit.log, so you can check how often resubmission was required.

There are four error messages that are currently searched for to determine if resubmission is appropriate:

If your error condition isn’t listed, copy it from the error log and add it to the script, making sure to put it in quotes. It cannot be a multiline string.
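As a rough illustration of why each message should be a single quoted line, here is a minimal sketch of this kind of check. It is not the code in resub.sh: the function name is_transient_error is hypothetical and the messages are placeholders, not the real list. It assumes the log is searched with grep, which matches one line at a time, so a message that spans several lines in the log could never match.

```shell
# Illustrative sketch only; not the actual resub.sh.
# is_transient_error is a hypothetical name and the messages are placeholders.
# Returns 0 if the log file given as $1 contains any of the listed messages.
is_transient_error() {
    logfile=$1
    for msg in \
        "placeholder transient error one" \
        "placeholder transient error two"
    do
        # -F: match the message as a fixed string, not a regular expression
        # -q: print nothing, only set the exit status
        if grep -qF -- "$msg" "$logfile"; then
            return 0
        fi
    done
    return 1
}
```

Quoting each message keeps embedded spaces and special characters intact, and matching with -F avoids characters such as . or * being interpreted as a pattern.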

Call to action

There is an open issue to add this functionality directly within payu.

This is something ACCESS-NRI can prioritise if the problem is affecting a lot of users, so please do reply to this topic if you’re also having similar trouble. Even better, if you do make the changes above, feel free to attach your resubmit.log files or summarise how often resubmission is required.

Also reply and let us know if there are other error conditions you had to add to get reliable resubmission after transient errors.

Thanks Aidan. But doesn’t the code snippet below overwrite config.yaml, whereas you would actually want to append using cat<< 'EOF' >> config.yaml?


Ooops! Yes you’re correct. I’ve fixed the original code snippet, thanks for spotting my mistake.


I have marked your post as the solution. It is possible that I will encounter different error codes that need to be incorporated. If so, I’ll report back.
