ACCESS-CM2 walltime runout

Hi,

I have an ACCESS-CM2 run that stopped, I think because it ran out of walltime.
It looks to me like the cycle (1Y) didn’t finish in the allocated time, but came very close. I requested 7H30M. Should I just bump this up to 8H to ensure that 1-year cycles finish in time? It is odd that other cycles have completed fine but not this one…

Here are the details from the job.out file.

Atm_Step: Timestep   183888   Model time:    957-12-29 00:00:00
EG_SISL_Resetcon: calculate reference profile

======================================================================================
                  Resource Usage on 2023-05-04 02:01:51:
   Job Id:             82496044.gadi-pbs
   Project:            e14
   Exit Status:        271 (Linux Signal 15 SIGTERM Termination)
   Service Units:      6578.54
   NCPUs Requested:    700                    NCPUs Used: 700
                                           CPU Time Used: 5026:50:43
   Memory Requested:   1.37TB                Memory Used: 470.34GB
   Walltime requested: 07:30:00            Walltime Used: 07:31:06
   JobFS requested:    16.0GB                 JobFS used: 183.62MB
======================================================================================
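For anyone wanting to check this kind of failure automatically, here is a minimal Python sketch that reads the two walltime fields and the exit status from a PBS summary like the one above. The function names (`parse_hms`, `walltime_summary`) are made up for illustration, and the regular expressions assume the summary is laid out exactly as in this job.out; other PBS installations may format it differently.

```python
import re
from datetime import timedelta


def parse_hms(text):
    """Convert an H:MM:SS string, as printed in the summary, to a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def walltime_summary(job_out_text):
    """Extract exit status and walltimes from a PBS resource usage summary.

    Hypothetical helper: assumes the field labels shown in the job.out above.
    """
    requested = re.search(r"Walltime requested:\s*(\d+:\d\d:\d\d)", job_out_text)
    used = re.search(r"Walltime Used:\s*(\d+:\d\d:\d\d)", job_out_text)
    status = re.search(r"Exit Status:\s*(\d+)", job_out_text)
    if not (requested and used and status):
        raise ValueError("no PBS resource usage summary found")
    requested_td = parse_hms(requested.group(1))
    used_td = parse_hms(used.group(1))
    return {
        "exit_status": int(status.group(1)),
        "requested": requested_td,
        "used": used_td,
        # Used >= requested suggests the job was stopped at the walltime limit.
        "hit_walltime": used_td >= requested_td,
    }


# The relevant lines copied from the job.out above.
sample = """
   Exit Status:        271 (Linux Signal 15 SIGTERM Termination)
   Walltime requested: 07:30:00            Walltime Used: 07:31:06
"""
summary = walltime_summary(sample)
print(summary["hit_walltime"], summary["used"] - summary["requested"])
```

Here the used walltime exceeds the request by a little over a minute, which together with the SIGTERM exit status points to the job being stopped at the limit rather than the model crashing.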

And now to restart this job I am a little unclear on the best practice.

Since it didn’t crash, but was stopped for exceeding its requested walltime, can I just update the walltime for the job and use rose suite-run --restart, or do I need to use reload? I am also not sure whether I need to shut down the suite before doing anything.

Thanks in advance

Hi Sebastian,

This happens sometimes. I think it should finish in 8 hours. There might be another way to do it, but I would increase the walltime in rose edit, then run rose suite-run --reload and re-submit the task in the graphical interface: gcylc <SUITE-ID> > coupled > Trigger (run now).

Hope that helps,
Zoe

@MartinDix @Arnold_Sullivan late last year we had some emails about this long CM2 run time.

I’m assuming Sebastian is using the Broadwell nodes which are slower than the Cascade Lake nodes. Martin, have you been able to figure out why the Broadwell nodes are so slow these days? 5.5 hours of wall time used to be enough for a one-year simulation, but now it’s taking more like 7.5 hours (so only about 3 model years per day :frowning:). I personally don’t want to change to Cascade Lake as I found that I couldn’t get the same results due to a change in compiler (I think?).
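As a quick check of the throughput figures quoted here, the arithmetic can be sketched in Python. The function name is made up for illustration, and the calculation assumes one-year cycles run back to back with no queue or post-processing time in between, so real throughput will be somewhat lower.

```python
def model_years_per_day(walltime_hours_per_model_year):
    """Upper bound on throughput: model years simulated per wall-clock day.

    Assumes cycles run back to back with no queue time (an idealisation).
    """
    return 24.0 / walltime_hours_per_model_year


for hours in (5.5, 7.5):
    print(f"{hours} h per model year: {model_years_per_day(hours):.2f} model years/day")
```

At 7.5 hours per model year this gives 3.2 model years per day, in line with the "about 3" quoted above, compared with roughly 4.4 at 5.5 hours.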

Thanks for that, Zoe. It has restarted and will hopefully run to completion this time.
Yeah, this is running on the Broadwell nodes.

There is very little reason not to pad this walltime a bit to make sure your simulation runs.

The requested walltime is used by the queuing system to optimise throughput. In some cases a shorter request means the scheduler will run your job sooner, fitting it in while it waits to free up enough resources for another job. Asking for a longer walltime than strictly necessary might mean the scheduler can’t pack jobs quite as efficiently, or might slightly increase your queue time, but I doubt it makes any noticeable difference.

So … it is better to have a slightly less than optimal PBS config than to waste SUs and CPU cycles, and take the resulting hit to productivity, by having a run fail to finish.
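To put a number on the cost of a failed run, here is a rough Python sketch using the figures from the job summary earlier in the thread. The function name is made up for illustration. The charge rate of 1.25 SU per CPU-hour is an assumption: it is not stated in the job output, but it reproduces the reported 6578.54 service units from the 700 CPUs and the walltime used.

```python
def service_units(ncpus, walltime_hours, charge_rate):
    """Estimate SUs as CPUs x walltime (hours) x queue charge rate."""
    return ncpus * walltime_hours * charge_rate


# Values from the failed job: 700 CPUs, 07:31:06 walltime used.
used_hours = 7 + 31 / 60 + 6 / 3600
# charge_rate=1.25 is an assumed value, chosen because it matches the summary.
failed_run_cost = service_units(700, used_hours, charge_rate=1.25)
print(f"SUs spent on the run that did not finish: {failed_run_cost:.2f}")
```

Because the reported service units match the walltime used rather than the walltime requested, this suggests a padded request costs nothing extra in SUs when the job finishes early, whereas a run that is stopped at the limit spends its SUs without completing the cycle.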

Hi, I just want to update this as I have had another walltime issue, which happened overnight. See the screenshot below. The netCDF conversion for one year did not finish in time because it ran out of walltime. The error message in job.out for this is:
Exit Status: 271 (Linux Signal 15 SIGTERM Termination)

I am just wondering what to do now for the netCDF conversion. Is there a way to restart this part of the job, or what should I do?
@Aidan ?

Hey Sebastian,

You can click on the netcdf_conversion task and submit it again. It will pick up from the point the last submission reached. That might be enough to finish this specific job.

If you want to increase the walltime, this can be done near line 514 of suite.rc. After modifying you can run rose suite-run --reload and re-submit the task. If I need to do this, I tend to wait for the current coupled job to finish and pause the future jobs because I don’t know if reloading the suite changes any currently running jobs.

Hope that helps :slight_smile:
Zoe

Thanks for that Zoe.
I forgot you can resubmit jobs in the suite GUI. I have done that, and if it doesn’t finish I will try the other way.
Cheers,
Sebastian

Resubmitting in the GUI worked, and it ran fine. Thanks for that, Zoe.
