I have had a few CM2 suites hang/not succeed today and I can’t figure out why. This is the GUI interface for one of my simulations that doesn’t want to run (u-dl669).
Nothing is showing up as red/failed in the GUI, but I also don’t have any active jobs running for this suite, and if I look at some of my output logs, several tasks do show up as failed. For example, this is the job.err file for the fcm_make_um task.
I’m not sure if it’s related to this recent post. I’ve tried stopping the suite, restarting the suite, reloading the suite, and deleting my local copy and re-checking out the suite, and I get the same result each time. I definitely don’t have the same folders in my ~/cylc-run/u-dl669/share/ directory that my successful runs have, but I’m unsure why.
Does anyone have any suggestions? I would be happy to start over (since nothing has actually run) and I have tried the HARD restart option, but that also didn’t work for me.
I tried this (rose suite-clean and then rose suite-run) and the exact same thing has happened again. No errors are thrown in the GUI, the jobs show up as green and ‘submitted’, but there are no jobs running on Gadi.
@Paul.Gregory my disk quota was exceeded yesterday and so I had wondered if this could be a problem. But then I removed data and copied a new suite (once my disk quota was under the limit) and the exact same thing happened, so I had ruled it out on this basis… Maybe I shouldn’t have? If that’s the problem, are these particular suites doomed to fail or is there a way to recover them so to speak?
Assuming qstat shows the job in the queue, qstat -f will show you more info. Under “comment” it will tell you why your job is being held. It could even be a group quota issue.
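To illustrate Jhan’s suggestion, here is a sketch of pulling the hold reason out of qstat -f output. The job ID and sample output below are illustrative only (the comment text matches the one quoted later in this thread); on Gadi you would run qstat against a real job ID.

```shell
# In practice:  qstat -u $USER          # list your jobs (state H = held)
#               qstat -f <jobid>        # full details, including "comment"

# Illustrative 'qstat -f' output for a held job:
sample='Job Id: 12345678.gadi-pbs
    Job_Name = fcm_make_um
    job_state = H
    comment = Not Running: Not enough free nodes available'

# Extract just the scheduler comment explaining the hold
comment=$(printf '%s\n' "$sample" | grep -i 'comment' | sed 's/^ *//')
echo "$comment"
```

The same grep/sed filter works on real `qstat -f <jobid>` output piped in directly.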
Thanks @Jhan. The issue was that I didn’t even have any jobs submitted/in queue even though the GUI was telling me that I did.
I’m trying to rerun now (using rose suite-run --name dl669_try) and can see that some jobs are being held with the comment “Not enough free nodes available”. I’ll just wait and see if this reattempt works when the jobs make it through the queue. Thanks!
I don’t think the home disk quota was the problem. I think the problem is related to the project I was trying to run the simulations on. All my runs that have successfully got past the build stage have been run on e14. All my runs that are ‘held’, including a new suite that I created this afternoon, have been run on ol01. If I change the project from ol01 to e14, then the simulation will at least run the build process. Is there any particular reason for this? ol01 still has compute time and doesn’t seem to be over the disk quota on scratch or gdata either.
clairecarouge (Claire Carouge, ACCESS-NRI Land Modelling Team Lead)
I’ve tried adding storage flags for gdata/ol01 and scratch/ol01, and two particular suites (u-dl669 and u-dl670) always seem to stall.
However, I’ve been able to run a newer suite on ol01, which seems to be working. I guess the u-dl669 suite was unhappy because it was copied while my home disk quota was exceeded. I don’t really understand why u-dl670 also has this problem, as I copied that suite once my disk quota was cleaned up and under the limit. I also don’t know why deleting and reinstalling the suites doesn’t work now that my quota is under the limit, but in any case I can run newer suites on ol01 if I explicitly include the storage flags in the suite.rc file, so I guess this can be closed off?
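For anyone hitting the same stall: the storage flags I mean are PBS directives in the suite. This is only an illustrative sketch (the exact section names depend on the suite, and the `-l storage` list should name every /g/data and /scratch project your tasks actually touch):

```
# suite.rc -- illustrative fragment, not from the actual u-dl669 suite
[runtime]
    [[root]]
        [[[directives]]]
            # PBS on Gadi must be told which filesystems a job will use;
            # without this, tasks can hang or fail to see their files
            -P = ol01
            -l storage = gdata/ol01+scratch/ol01
```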
Learnings: don’t check out or copy new suites if your disk quota is close to its limit.