I have had a few CM2 suites hang/not succeed today and I can’t figure out why. This is the GUI interface for one of my simulations that doesn’t want to run (u-dl669).
Nothing is showing up as red/failed in the GUI, but I also don’t have any active jobs running for this suite, and if I look at some of my output logs, several tasks do show up as failed. For example, this is the job.err file for the fcm_make_um task.
I’m not sure if it’s related to this recent post. I’ve tried stopping the suite, restarting the suite, reloading the suite, and deleting my local copy and re-checking out the suite, and I get the same result each time. I definitely don’t have the same folders in my ~/cylc-run/u-dl669/share/ directory that my successful runs have, but I’m unsure why.
Does anyone have any suggestions? I would be happy to start over (since nothing has actually run) and I have tried the HARD restart option, but that also didn’t work for me.
I tried this (rose suite-clean and then rose suite-run) and the exact same thing has happened again. No errors are thrown in the GUI, the jobs show up as green and ‘submitted’, but there are no jobs running on Gadi.
@Paul.Gregory my disk quota was exceeded yesterday and so I had wondered if this could be a problem. But then I removed data and copied a new suite (once my disk quota was under the limit) and the exact same thing happened, so I had ruled it out on this basis… Maybe I shouldn’t have? If that’s the problem, are these particular suites doomed to fail or is there a way to recover them so to speak?
Assuming qstat shows the job in the queue, qstat -f will show you more info. Under “comment” it will tell you why your job is being held. It could even be a group quota issue.
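To illustrate Jhan’s suggestion, here is a sketch of pulling the hold reason out of qstat -f output. The job ID and sample output below are illustrative only (the comment text matches the one quoted later in this thread); on Gadi you would run qstat against a real job ID.

```shell
# In practice:  qstat -u $USER          # list your jobs (state H = held)
#               qstat -f <jobid>        # full details, including "comment"

# Illustrative 'qstat -f' output for a held job:
sample='Job Id: 12345678.gadi-pbs
    Job_Name = fcm_make_um
    job_state = H
    comment = Not Running: Not enough free nodes available'

# Extract just the scheduler comment explaining the hold
comment=$(printf '%s\n' "$sample" | grep -i 'comment' | sed 's/^ *//')
echo "$comment"
```

The same grep/sed filter works on real `qstat -f <jobid>` output piped in directly.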
Thanks @Jhan. The issue was that I didn’t even have any jobs submitted/in queue even though the GUI was telling me that I did.
I’m trying to rerun now (using rose suite-run --name dl669_try) and can see that some jobs are being held with the comment “Not enough free nodes available”. I’ll just wait and see if this reattempt works when the jobs make it through the queue. Thanks!
I don’t think the home disk quota was the problem. I think the problem is related to the project I was trying to run the simulations on. All my runs that have successfully got past the build stage have been run on e14. All my runs that are ‘held’, including a new suite that I created this afternoon, have been run on ol01. If I change the project from ol01 to e14, then the simulation will at least run the build process. Is there any particular reason for this? ol01 still has compute time and doesn’t seem to be over the disk quota on scratch or gdata either.
clairecarouge (Claire Carouge, ACCESS-NRI Land Modelling Team Lead)
I’ve tried adding storage flags for gdata/ol01 and scratch/ol01, and two particular suites (u-dl669 and u-dl670) always seem to stall.
However, I’ve been able to run a newer suite on ol01, which seems to be working. I guess the u-dl669 suite was unhappy because it was copied while my home disk quota was exceeded. I don’t really understand why u-dl670 also has this problem, as I copied that suite once my disk quota was cleaned up and under the limit. I also don’t know why deleting and reinstalling the suites doesn’t work now that my quota is under the limit, but in any case I can run newer suites on ol01 if I explicitly include the storage flags in the suite.rc file, so I guess this can be closed off?
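For anyone hitting the same stall: the storage flags I mean are PBS directives in the suite. This is only an illustrative sketch (the exact section names depend on the suite, and the `-l storage` list should name every /g/data and /scratch project your tasks actually touch):

```
# suite.rc -- illustrative fragment, not from the actual u-dl669 suite
[runtime]
    [[root]]
        [[[directives]]]
            # PBS on Gadi must be told which filesystems a job will use;
            # without this, tasks can hang or fail to see their files
            -P = ol01
            -l storage = gdata/ol01+scratch/ol01
```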
Learnings: don’t check out or copy new suites if your disk quota is close to its limit.