I have an ACCESS-CM2 run that stopped, I think because it ran out of walltime.
It looks to me like the cycle (1Y) didn’t finish in the allocated time, but came very close. I requested 7H30M; should I just bump this up to 8H to ensure that 1-year cycles finish in time? It is odd that other cycles have completed fine but not this one…
And now to restart this job I am a little unclear on the best practice.
Since it didn’t crash, but was stopped for exceeding its walltime, can I just update the walltime for the job and use rose suite-run --restart, or do I need to use reload? I am also not sure whether I need to shut down the suite before doing anything.
This happens sometimes; I think it should finish in 8 hours. There might be another way to do it, but I would increase the walltime in rose edit, then run rose suite-run --reload and re-submit the task in the graphical interface: gcylc <SUITE-ID> → coupled → Trigger (run now).
I’m assuming Sebastian is using the Broadwell nodes, which are slower than the Cascade Lake nodes. Martin, have you been able to figure out why the Broadwell nodes are so slow these days? 5.5 hours of walltime used to be enough for a one-year simulation, but now it’s taking more like 7.5 hours (so only about 3 model years per day). I personally don’t want to change to Cascade Lake, as I found that I couldn’t get the same results, I think due to a change in compiler.
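As a rough check of the throughput figures quoted above, here is a minimal sketch of the arithmetic. It assumes jobs run back to back with no queue time, which is an idealisation; the function name is only illustrative.

```python
def model_years_per_day(hours_per_model_year: float) -> float:
    """Upper bound on throughput, ignoring queue time between cycles."""
    return 24.0 / hours_per_model_year


# Walltimes per model year quoted in the post above.
for hours in (5.5, 7.5):
    rate = model_years_per_day(hours)
    print(f"{hours} h per model year -> {rate:.1f} model years per day")
```

With 7.5 hours per model year this gives a little over 3 model years per day, in line with the estimate in the post.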
Thanks for that, Zoe. It has restarted and will hopefully run to completion.
Yeah this is running on the Broadwell nodes
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
There is very little reason not to pad this walltime a bit to make sure your simulation runs.
The queuing system uses the requested walltime to optimise throughput. In some cases it may decide to run your job while it waits to free up enough resources for another job. Asking for a longer walltime than strictly necessary might stop the scheduler from packing jobs as efficiently, or might slightly increase your queue time, but I doubt it makes any real difference.
So … better to have a slightly less than optimal PBS config than to waste SUs and CPU cycles, and take the resulting hit to productivity, by having a run fail to finish.
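One way to pick a padded value is sketched below. The 10% margin and the helper name are assumptions for illustration, not a site recommendation, and the HH:MM:SS output follows the PBS walltime format; the suite may expect the limit in a different notation.

```python
from datetime import timedelta


def padded_walltime(observed: timedelta, margin_percent: int = 10) -> str:
    """Pad an observed runtime by a percentage and format it as HH:MM:SS.

    The default margin is an arbitrary illustrative choice.
    """
    seconds = int(observed.total_seconds())
    # Integer ceiling division avoids floating-point rounding surprises.
    padded = -(-seconds * (100 + margin_percent) // 100)
    # Round up to a whole minute.
    minutes = -(-padded // 60)
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}:00"


# The 7H30M request from the original post, padded by 10%.
print(padded_walltime(timedelta(hours=7, minutes=30)))
```

For the 7H30M request discussed in this thread, a 10% margin comes out at 8 hours 15 minutes, slightly above the 8 hours suggested earlier.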
Hi, I just want to update this as I have had another walltime issue, which happened overnight. See the screenshot below. The netcdf conversion for one year did not finish because it ran out of walltime. The error message in job.out for this is Exit Status: 271 (Linux Signal 15 SIGTERM Termination)
I am just wondering what to do now for the netcdf conversion. Is there a way to restart this part of the job, or what should I do? @Aidan ?
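The exit status quoted above can be decoded with a short sketch. It assumes the convention that a status above 256 means the job was killed by a signal whose number is the status minus 256, which is consistent with the job.out message (271 - 256 = 15); the function name is hypothetical.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a job exit status, assuming values above 256 encode a signal."""
    if status > 256:
        signum = status - 256
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {status}"


print(describe_exit_status(271))
```

A SIGTERM here is what you would expect when the scheduler stops a job at its walltime limit, rather than the conversion itself crashing.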
You can click on the netcdf_conversion task and submit it again. It will start from the point it finished in the last submission. That might be enough to finish this specific job.
If you want to increase the walltime, this can be done near line 514 of suite.rc. After modifying you can run rose suite-run --reload and re-submit the task. If I need to do this, I tend to wait for the current coupled job to finish and pause the future jobs because I don’t know if reloading the suite changes any currently running jobs.
Thanks for that, Zoe.
I forgot you can resubmit jobs in the suite GUI. I have done that, and if it doesn’t finish I will try the other way.
Cheers,
Sebastian