Several of us, including @Dietmar_Dommenget, @HIMADRI_SAINI, @gpontes and myself, have found that our simulations with ACCESS-ESM1.5 crash from time to time for no apparent reason. Usually all that is required is to sweep and resubmit, but doing this manually after every crash takes time and slows us down.
Aidan mentioned in the working group meeting that the sweep and resubmit can be done automatically. I think we would find this very useful. Could you please help?
Thanks,
David
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
Background
The ACCESS-OM2 high resolution (0.1°) models often have issues with transient unreproducible errors.
For this reason a work-around was developed: when a run fails, the logs are checked for error conditions that match one of the known transient errors. If a matching error is found, the job is automatically resubmitted. As a safety mechanism, a job will only be resubmitted a set number of times in a row, in case the error is in fact reproducible and requires manual remediation.
Adding this to your experiment
To add this functionality to your experiment copy the code below and run it in your experiment directory:
Assumes userscripts isn’t already defined in your config.yaml. If it is then you may need to manually add the userscripts commands.
Assumes model type is access
The resub.sh script defines some common transient errors. If the transient error affecting your runs isn’t listed, you will need to add it. See the technical detail below.
Technical detail
The payu userscripts functionality is used to call a script (resub.sh in the tools sub-directory) when an error condition is returned from the model run (i.e. it crashed).
The number of times the model has been resubmitted is tracked in the resubmit.count file. When the model runs correctly the run userscript is invoked, which deletes the resubmit.count file, so the resubmission limit applies only to consecutive failures since the last successful run.
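To make the wiring described above concrete, here is a minimal sketch of how the two userscripts could be added to config.yaml. It is not the original snippet from this post: the function name add_userscripts is made up for illustration, and the exact commands are assumptions based on the description (an error userscript pointing at tools/resub.sh, and a run userscript that removes resubmit.count). It refuses to touch a config that already has a userscripts section.

```shell
#!/bin/bash
# Sketch only: add_userscripts is a hypothetical helper, not part of payu.
# It appends a userscripts section to a payu config file unless one exists.
add_userscripts() {
    local config="${1:-config.yaml}"
    # Do not append a second userscripts key; that needs a manual merge.
    if grep -q '^userscripts:' "$config" 2>/dev/null; then
        echo "userscripts already defined in $config; edit it by hand" >&2
        return 1
    fi
    # error: called when the model run returns an error (assumed path).
    # run:   called after a successful run; clears the resubmit counter.
    cat >> "$config" <<'EOF'
userscripts:
    error: tools/resub.sh
    run: rm -f resubmit.count
EOF
}
```

Calling add_userscripts with no argument would act on config.yaml in the current directory. If it reports that userscripts is already defined, the two entries would need to be merged into the existing section manually.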
Currently the maximum number of resubmissions is set to 2:
This means the model will run and fail at most 3 times in total (the original run plus 2 resubmissions) before payu halts. This was set for the relatively expensive high resolution OM2 model. If you have a cheaper model and find you need a higher value, set it accordingly.
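The counting logic can be sketched as follows. This is an illustration rather than the contents of resub.sh: the variable names MAX_RESUBMISSIONS and COUNT_FILE and the three function names are assumptions. Only the resubmit.count file name and the limit of 2 come from the description above.

```shell
#!/bin/bash
# Sketch only: names and structure are assumptions, not the real resub.sh.
MAX_RESUBMISSIONS="${MAX_RESUBMISSIONS:-2}"
COUNT_FILE="${COUNT_FILE:-resubmit.count}"

# Print the number of resubmissions so far (0 if the file is absent).
read_count() {
    if [ -f "$COUNT_FILE" ]; then
        cat "$COUNT_FILE"
    else
        echo 0
    fi
}

# Return 0 and record one more resubmission if still under the limit;
# return 1 and leave the counter unchanged once the limit is reached.
try_resubmit() {
    local count
    count=$(read_count)
    if [ "$count" -ge "$MAX_RESUBMISSIONS" ]; then
        return 1
    fi
    echo $((count + 1)) > "$COUNT_FILE"
    return 0
}

# What the run userscript does after a successful run.
reset_count() {
    rm -f "$COUNT_FILE"
}
```

With the limit at 2, try_resubmit succeeds after the first and second failures and refuses after the third, which gives the total of 3 failed runs. A real script would follow a successful try_resubmit with the sweep and resubmit step.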
All resubmissions (or failed resubmissions) are logged in resubmit.log, so you can check how often resubmission was required.
There are four error messages that are currently searched for to determine if resubmission is appropriate:
If your error condition isn’t listed you will need to copy it from the error log and add it to the script, making sure to put it in quotes. It cannot be a multiline string.
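As a rough illustration of how such a check might look, the sketch below searches a log file for a set of quoted messages. The function name is_transient_error and both messages are hypothetical placeholders; they are not the messages shipped in resub.sh, so substitute the text from your own error log.

```shell
#!/bin/bash
# Sketch only: the two messages are hypothetical placeholders.
# Return 0 if the log file given as $1 contains any of the messages.
is_transient_error() {
    # -q: quiet, only the exit status matters
    # -F: treat each message as a fixed string, not a regular expression
    # To cover another error, add one more quoted -e line.
    grep -qF \
        -e "placeholder transient error one" \
        -e "placeholder transient error two" \
        -- "$1"
}
```

In this sketch each message has to fit on one line because grep matches one line of the log at a time, which is consistent with the single-line restriction mentioned above. Quoting keeps spaces and shell special characters in the message intact.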
Call to action
There is an open issue to add this functionality directly within payu.
This is something ACCESS-NRI can prioritise if the problem is affecting a lot of users, so please do reply to this topic if you’re also having similar trouble. Even better, if you do make the changes above, feel free to attach your resubmit.log files or summarise how often resubmission is required.
Also reply and let us know if there are other error conditions you had to add to get reliable resubmission after transient errors.
I have marked your post as the solution. It is possible that I will encounter different error codes that need to be incorporated. If so, I’ll report back.