I’m running test-case simulations for the Lismore flood setup using ACCESS-rAM3, and I’m encountering an issue.
The OAS and RAS suites appear to start normally, but the jobs seem to sit for an extended period (up to ~1 day) without progressing, and then eventually fail or get terminated. In some cases, Cylc reports a submission timeout (P1D), and the suite shuts down automatically.
What I’ve checked so far:
- Persistent session is active and accessible (SSH into the session host works)
- Cylc suite starts and registers correctly
- No obvious errors in initial suite startup logs
- Jobs sometimes appear to be submitted but do not transition to running
Has anyone experienced similar behaviour with OAS/RAS suites?
Any suggestions on what to check (queue settings, resource requests, or Cylc configuration) would be really helpful.
`qstat -f` shows a comment field that may explain why a job is not running, for example that your project is over quota.
Also, there is system maintenance tomorrow. If a job’s walltime is long enough to overlap the maintenance period, it won’t be started until after the maintenance.
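As a rough sketch of that check (not an official NCI tool), the helper below filters `qstat -f` output down to the job state, comment and exit status. The function name `pbs_job_summary` is made up for illustration, and it assumes PBS wraps long values onto continuation lines that start with a tab.

```shell
# Hypothetical helper: read `qstat -f` output on stdin and keep only the
# fields that usually explain why a job is (not) running.
pbs_job_summary() {
  awk '
    # Assumption: wrapped values continue on lines starting with a tab.
    /^\t/ { sub(/^\t/, ""); buf = buf $0; next }
    { if (buf != "") print buf; buf = $0 }
    END { if (buf != "") print buf }
  ' | sed -nE 's/^ *((job_state|comment|Exit_status) = .*)$/\1/p'
}

# Usage on Gadi (replace <jobid> with your own job ID):
#   qstat -f  <jobid> | pbs_job_summary   # queued or running job
#   qstat -fx <jobid> | pbs_job_summary   # job that has already finished
```

The `-x` form matters because plain `qstat -f` only knows about jobs that are still queued or running.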
Thanks for your reply. I did check the status of the tasks submitted in both suites, and qstat reports that the jobs have finished. See attached.
[mm6452@gadi-cpu-bdw-0015 ~]$ qstat -f 167665179.gadi-pbs
qstat: 167665179.gadi-pbs Job has finished, use -x or -H to obtain historical job information
[1]+ Done rose suite-gcontrol --name=u-dk517
[mm6452@gadi-cpu-bdw-0015 ~]$ rose suite-gcontrol --name=u-bu503 &
[1] 131174
[mm6452@gadi-cpu-bdw-0015 ~]$ qstat -f 167665043.gadi-pbs
qstat: 167665043.gadi-pbs Job has finished, use -x or -H to obtain historical job information
[1]+ Done rose suite-gcontrol --name=u-bu503
But I am wondering why they are still shown as submitted in the GUI for both suites. Any idea why? Also, plenty of tasks are still in the waiting state.
Hi @Scott, just a bit more background info here too. Moulik had this issue for jobs submitted after the Gadi glitch last week. We wondered if it was something to do with the system failure (there were reports of long queue times), so we killed the suites and resubmitted them on Friday.
It could be that they won’t start now because of the maintenance shutdown tomorrow, but I don’t think that’s the root cause of what’s going on.
I don’t have permission to check the status of that project.
Let me see if I can find someone who has if69 permissions to check that there are still resources and disk space on it.
If you right-click on a task in the GUI there should be a ‘poll’ option that checks the job’s current status in PBS and updates it in the GUI. Otherwise the GUI only gets updated when the job actually starts.
Sorry, I just re-read your post. It seems like for some reason your jobs are trying to write to your home directory, but they need to be writing to scratch. I haven’t got Gadi open at the moment. @Paul.Gregory or @bethanwhite, can you please tell @Moulik how to fix his output directory (if that is it) …
Hi @cbengel - thanks for the replies. Please take your time - we can have a look when you are back at Gadi. Otherwise I am in contact with @bethanwhite and hope to get some help soon. No worries!
Hi @cbengel, does a successful OAS run look like this?
[mm6452@gadi-cpu-bdw-0015 u-dk517]$ cd ..
[mm6452@gadi-cpu-bdw-0015 roses]$ cd ..
[mm6452@gadi-cpu-bdw-0015 ~]$ cylc cat-log u-dk517
2026-05-04T04:49:05Z INFO - Suite server: url=https://gadi-cpu-bdw-0015.gadi.nci.org.au:43046/ pid=180337
2026-05-04T04:49:05Z INFO - Run: (re)start=0 log=1
2026-05-04T04:49:05Z INFO - Cylc version: 7.9.9
2026-05-04T04:49:05Z INFO - Run mode: live
2026-05-04T04:49:05Z INFO - Initial point: 1
2026-05-04T04:49:05Z INFO - Final point: 1
2026-05-04T04:49:05Z INFO - Cold Start 1
2026-05-04T04:49:07Z INFO - [ostia_netcdf_to_pp.1] -submit-num=01, owner@host=gadi.nci.org.au
2026-05-04T04:49:08Z INFO - [client-command] get_latest_state mm6452@gadi-cpu-bdw-0015.gadi.nci.org.au:cylc-gui 0906f91f-d888-4b97-8f0a-d62fa5e958e1
2026-05-04T04:49:09Z INFO - [ostia_netcdf_to_pp.1] status=ready: (internal)submitted at 2026-05-04T04:49:09Z for job(01)
2026-05-04T04:49:09Z INFO - [ostia_netcdf_to_pp.1] -health check settings: submission timeout=P1D
2026-05-04T04:49:39Z INFO - Command succeeded: poll_tasks([u'ostia_netcdf_to_pp.1'], poll_succ=False)
2026-05-04T04:49:39Z INFO - Processing 1 queued command(s)
poll_tasks([u'ostia_netcdf_to_pp.1'], poll_succ=False)
2026-05-04T04:49:41Z INFO - [ostia_netcdf_to_pp.1] status=submitted: (polled)submitted at 2026-05-04T04:49:09Z for job(01)
2026-05-04T04:49:56Z INFO - [ostia_netcdf_to_pp.1] status=submitted: (received)started at 2026-05-04T04:49:55Z for job(01)
2026-05-04T04:49:56Z INFO - [ostia_netcdf_to_pp.1] -health check settings: execution timeout=PT11M, polling intervals=2*PT2M,PT7M,…
2026-05-04T04:50:58Z INFO - [ostia_netcdf_to_pp.1] status=running: (received)succeeded at 2026-05-04T04:50:57Z for job(01)
2026-05-04T04:51:03Z INFO - [ancil_sst_seaice.1] -submit-num=01, owner@host=gadi.nci.org.au
2026-05-04T04:51:06Z INFO - [ancil_sst_seaice.1] status=ready: (internal)submitted at 2026-05-04T04:51:06Z for job(01)
2026-05-04T04:51:06Z INFO - [ancil_sst_seaice.1] -health check settings: submission timeout=P1D
2026-05-04T04:51:50Z INFO - Command succeeded: poll_tasks([u'ancil_sst_seaice.1'], poll_succ=False)
2026-05-04T04:51:50Z INFO - Processing 1 queued command(s)
poll_tasks([u'ancil_sst_seaice.1'], poll_succ=False)
2026-05-04T04:51:52Z INFO - [ancil_sst_seaice.1] status=submitted: (polled)submitted at 2026-05-04T04:51:06Z for job(01)
2026-05-04T04:52:04Z INFO - [ancil_sst_seaice.1] status=submitted: (received)started at 2026-05-04T04:52:02Z for job(01)
2026-05-04T04:52:04Z INFO - [ancil_sst_seaice.1] -health check settings: execution timeout=PT20M, polling intervals=PT11M,PT2M,PT7M,…
2026-05-04T04:53:30Z INFO - [ancil_sst_seaice.1] status=running: (received)succeeded at 2026-05-04T04:53:30Z for job(01)
2026-05-04T04:53:32Z INFO - Command succeeded: poll_tasks([u'ancil_sst_seaice.1'], poll_succ=False)
2026-05-04T04:53:32Z INFO - Processing 1 queued command(s)
poll_tasks([u'ancil_sst_seaice.1'], poll_succ=False)
2026-05-04T04:53:42Z INFO - Suite shutting down - AUTOMATIC
2026-05-04T04:53:48Z INFO - DONE
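To compare a run like this against one that stalls, it can help to reduce the suite log to one line per task event. The sketch below does that with `sed`. The function name `cylc_task_events` is hypothetical, and the pattern is based only on the Cylc 7.9.9 log lines shown above, so other versions may need adjusting.

```shell
# Hypothetical helper: read a Cylc 7 suite log on stdin and print
# "<task> <event> <timestamp>" for each (internal)/(received) task message.
# Polled messages are skipped so that each event is listed only once.
cylc_task_events() {
  sed -nE 's/^.*\[([^]]+)\] status=[a-z]+: \((internal|received)\)([a-z]+) at ([^ ]+) for job\([0-9]+\).*$/\1 \3 \4/p'
}

# Usage:
#   cylc cat-log u-dk517 | cylc_task_events
```

In this output, a task that is stuck in the queue would show a `submitted` line with no matching `started` line after it.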
@Moulik For any job starting today, my first suspicion would be problems due to Gadi going down. Please check that your persistent session has been restarted by running `persistent-sessions list`. If it isn’t listed, you need to start it again using the same name you used last time.
Do that first then we can move on to other suggestions …