Wall time exceeded error in ESM1.5

Hi,

I am running into the “PBS: job killed: walltime xxxx exceeded limit yyyy” error in two of my ESM1.5 simulations. I had these errors a long time ago and never found out whether other people had them too (@dkhutch @gpontes @YanxuanD ?). Now they’re back and I can’t find a fix. Could someone please help?

The simulations are in /home/561/hs2549/49ka_rivrunoff_cice and /home/561/hs2549/access-esm1.5/lgm.

Thanks a lot,

Himadri

Hi @HIMADRI_SAINI,
I’ve recently had all my jobs time out for unknown reasons, which turned out to be caused by Gadi system issues (these happen a lot). So the first thing to check is: did the timeouts recur when you tried a sweep and re-ran on another day? (I’m guessing you already tried this, but I’m double-checking to be sure.)

Hi @HIMADRI_SAINI and @dkhutch, we’ve also been experiencing walltime issues today and it is looking like a system issue. @manodeep has filed a ticket with NCI to report the issues.

Yes, I tried the sweep and re-ran it with different restart years, but got the same result. The run continues until the walltime defined in my config file and then gets killed once that time is reached.

Thanks, Spencer. Just FYI, I first ran into this issue a few days ago and haven’t been able to fix it since then.

NCI help has replied to our question on this matter. Their systems team has noticed the problem, which is due to increased stress on the filesystems, in particular scratch, and has been investigating. They haven’t been able to pin down the cause yet and are still looking into it.

The load might be fine now but since the cause hasn’t been discovered yet, a slowdown might reoccur at any time.

I have also had a timeout occur today… Hence I will check again on Monday to see if it’s running better.

I am still seeing a 2x slowdown in performance and, as Claire said, NCI support know about the issue and are monitoring and investigating it.

@HIMADRI_SAINI I increased my wall time limit to 3 hours to deal with this, which could be worth trying in the interim. It seems to have helped for my runs.
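
As a rough way to pick an interim limit, here is a minimal Python sketch that scales an existing walltime by the observed slowdown plus a safety margin. The function name `pad_walltime`, the `margin` parameter, and the example starting value of 01:15:00 are all hypothetical and not taken from ESM1.5 or any config in this thread; the result still has to be copied into the walltime setting of your config file by hand.

```python
import math


def pad_walltime(walltime: str, slowdown: float, margin: float = 1.25) -> str:
    """Scale an HH:MM:SS walltime by a slowdown factor and a safety margin.

    All names and default values here are illustrative assumptions.
    """
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds

    # Apply the slowdown and margin, then round up to a whole minute.
    padded = math.ceil(total_seconds * slowdown * margin / 60) * 60

    h, remainder = divmod(padded, 3600)
    m, s = divmod(remainder, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


if __name__ == "__main__":
    # Hypothetical starting limit of 01:15:00 with the 2x slowdown reported above.
    print(pad_walltime("01:15:00", slowdown=2.0))
```

This only estimates a number. If the slowdown gets worse than the factor you assumed, the job will still be killed at the new limit.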

Thanks, David. I have just set it for 3 hours now.

Hi @HIMADRI_SAINI, how did your jobs go? It looks like Gadi was having some problems last week, leading to everyone’s jobs taking longer than normal.

Hi @edoyango, sorry, I forgot to report back on this.

I increased the walltime to 3 hours as suggested by @dkhutch and the run has continued without being killed since.
