Thanks a lot for providing this estimate. It sounds like you could potentially increase the number of years per job. E.g. 5 or 10 years per submission might be feasible. I don’t know whether that would help to get more years through the queue.
I asked similar questions during a recent meeting with @atteggiani . I think he is working on these things. Would be great to keep in touch with progress on this.
This would be expected to increase performance since you spend less time queuing. For example, with ACCESS-OM2-1 we typically run 10 years in a single job submission. The walltime limit depends on the number of processors (see the last column in Queue Limits - NCI Help - Opus - NCI Confluence). Keep in mind, though, that if you make your individual submissions really long and you get a random error in the middle, you have to start the whole run again. You may also want to think about how your output files are saved (i.e. if you’re saving only one file per submission, they might get big).
@rmholmes Running 10 years in a single job should not take longer than 9 hrs for the N48 coupled suite. This estimated time is far less than the default wall time limit for 480 CPUs. We generally run 10-15 years in a single job for the N48 atmosphere-only version and want to follow the same in the coupled N48 suite. We believe this will enhance the effective computational performance of the model.
FYI we added the runspersub option to payu for models which either couldn’t, or didn’t want, to run for longer, but weren’t effectively using the max walltime available.
Does the N48 model still use UMUI to run?
@aidanheerdegen The coupled N48 suite is based on rose/cylc.
This depends on the model/scheduler you are using:
For payu, see @aidanheerdegen’s answer above about the runspersub option.
In the UMUI, this can be done in Input/Output Control & Resources → Job submission, resources and re-submission pattern.
At the bottom of the first pane there is a NEXT button; click on it and a new pane will open up.
You can “set the target run length for each job in the sequence” in the respective section.
Note that you might need to update your Job time limit (seconds per resubmission) and Job Memory limit in the first pane.
In the cylc editor (rose edit -C <path_to_the_suite_folder>), under suite conf → Run Initialisation and Cycling, you can change the Cycling frequency (e.g. P10Y for 10-year resubmission chunks).
Here too, note that you might need to change the Wallclock time as well (hh:mm:ss format).
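As a rough aid for picking those two limits, here is a minimal Python sketch that scales a per-year runtime to a longer chunk and prints it both in seconds (for the UMUI field) and in hh:mm:ss (for the cylc field). The 9 hours per 10 years figure is the upper estimate quoted earlier in this thread; the 20% safety margin and the function names are assumptions made purely for illustration.

```python
def walltime_seconds(years_per_job, seconds_per_year, margin_percent=20):
    """Estimated walltime for one job in whole seconds, padded by a safety margin.

    Integer arithmetic only, rounded up to the next second.
    """
    padded = years_per_job * seconds_per_year * (100 + margin_percent)
    return -(-padded // 100)  # ceiling division


def to_hhmmss(seconds):
    """Format a number of seconds as hh:mm:ss."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Upper estimate from this thread: 10 model years in at most 9 hours.
seconds_per_year = 9 * 3600 // 10

limit = walltime_seconds(10, seconds_per_year)
print(limit, to_hhmmss(limit))
```

Whatever value comes out still has to fit within the queue walltime limit for the number of CPUs you request.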
Hope this helps.
@atteggiani I can change the run length per job in UMUIX and rose/cylc using the options that you mentioned above. But the rose/cylc suite crashes if the run length in a single queued job is longer than about 2.5 years. @holger told me about a critical number for the suite, related to model timesteps and other factors. The suite doesn’t run if that number is exceeded, so the run length per job is kept to 2 years.
@holger could you please remind me of the critical value and how you got it?
If I remember correctly, OASIS reads the length of the run in seconds from its configuration file as an 8-digit integer, and 99,999,999 seconds is a bit over 3 years. If you tried to run the job for longer (at least with umuix), the namcouple file was written with a 9-digit number, but only the first 8 digits were read in, so OASIS believed the job was finished after only a tenth of the intended run.
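To make the arithmetic concrete, here is a small Python sketch of the truncation as described above. It only mimics the recalled behaviour (keep the first 8 digits of the run length in seconds); it is not OASIS code, and the function names and the 365-day year are assumptions, so the exact limit will differ for other calendars.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365  # assumption: the suite may use a different calendar


def seconds_as_read(run_seconds, width=8):
    """Keep only the first `width` digits, mimicking a fixed-width integer read."""
    return int(str(run_seconds)[:width])


def max_years(days_per_year=DAYS_PER_YEAR, width=8):
    """Longest run (in years) whose length in seconds still fits in `width` digits."""
    return (10**width - 1) / (days_per_year * SECONDS_PER_DAY)


print(f"limit: {max_years():.2f} years")
for years in (2, 3, 4, 10):
    written = years * DAYS_PER_YEAR * SECONDS_PER_DAY
    read = seconds_as_read(written)
    print(f"{years:>2} yr: written {written:>11,} s, read {read:>10,} s")
```

With these assumptions, 2- and 3-year runs are read back unchanged, while a 4-year run (126,144,000 s) is read as 12,614,400 s, i.e. a tenth of the intended length.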