MOM6 crashes after initialization

I have been running mom6-panan-005 fine for 3 months now without a problem, submitting each month individually. Yesterday, I submitted another 3 months with payu run -n 3. The first month (i.e. April 1991) ran fine, but then MOM6 crashed in the second month right after initialization:

Exiting coupler_init at 20230209 135016.510

*** longjmp causes uninitialized stack frame ***: /scratch/e14/cs6673/mom6/work/panan_005deg_jra55_ryf/symmetric_FMS2-e7d09b7 terminated

This error is in the first 20,000 lines of mom6.err

My directory is here /home/142/cs6673/payu/panan_005deg_jra55_ryf

That’s a new one to me! Just to get the dumb questions out of the way: was your job terminated by PBS for running out of walltime or memory? The PBS outputs mom6_panan-005.{e,o}* are only readable by you by default.

Yes, it has indeed exceeded its walltime, which I set to 3:30. Normally it takes about 3 hours for 1 month. It shouldn’t take 3.5 hours to initialize, though, or whatever it’s trying to do.

I seem to recall payu taking quite a while to generate the manifests for the restart files, which you wouldn’t have seen on the first segment. Not sure what the right advice is here; maybe @aekiss would have a suggestion from experience with ACCESS-OM2-01?

Did you try to resubmit and did you get the same error? Often there’s little glitches on the NCI side that resolve themselves if you just try again.

Not yet, but can do it now

Yes, try running it again. Hopefully this is a transient NCI glitch.

I haven’t seen longjmp causes uninitialized stack frame since 2018 in a test version of ACCESS-OM2-01 on Raijin…

I guess I was suggesting that it’s only a weird error because the model was killed after running out of walltime (which was indeed the case). I suspect the root cause is that running from a restart is taking longer than the original cold-start segment, which could be a few things:

  • slow restart input/initialisation?
  • payu calculating a restart manifest, which takes a while?

Maybe. 135G of restarts is significant, also there are only 24 of them, so you’re not getting the full benefit of parallelising across the 48 cores.

To test how long it is taking you can just try

time payu setup

before your next run. This will compute the hashes for the restart manifest and give you an idea if that is a significant overhead.

I did have a notion to compute those as a post-processing job, or have a mode where this calculation was done outside the main PBS job, like a pre-processing step, but it wasn’t straightforward and didn’t seem necessary based on the use cases I’d tested on up till then which weren’t as taxing as this example.

For reference, this is the issue @aekiss was referring to.


Interesting, timing:

$ time parallel md5sum {} ::: $(find archive/restart002/ -iname "*.nc")
real	2m5.028s
user	6m40.891s
sys	4m7.886s

but for a single 7G file:

$ time md5sum archive/restart002/
9ca2d6455361062694d35609a05eec0d  archive/restart002/

real	0m16.347s
user	0m10.153s
sys	0m6.012s

I guess it is saturating the file IO, limiting the speed at which the files can be read in. So parallelisation is of some value, but not a universal panacea.

It is faster than serial, though not by as much as I would have thought:

$ time md5sum $(find archive/restart002/ -iname "*.nc")                                                                           

real    5m15.216s
user    3m19.317s
sys     1m54.614s

and using 4 processes is actually faster than 24:

$ time parallel -j 4 md5sum {} ::: $(find archive/restart002/ -iname "*.nc")                                                      

real    1m38.735s
user    3m27.332s
sys     2m26.171s
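To find the sweet spot more systematically, something like the following sweep could help (a sketch only: the dummy files, sizes, and process counts are made up; on a real run you would point find at archive/restartNNN/ instead and use larger -P values):

```shell
#!/bin/bash
# Hypothetical sweep: hash the same set of files with different numbers
# of parallel md5sum processes, to see where file IO saturates.
dir=$(mktemp -d)
for i in 1 2 3 4; do
  head -c 1048576 /dev/urandom > "$dir/restart_$i.nc"   # 1 MiB dummy "restart"
done
for j in 1 2 4; do
  printf '== %s processes ==\n' "$j"
  time find "$dir" -name '*.nc' -print0 | xargs -0 -P "$j" -n 1 md5sum > /dev/null
done
rm -rf "$dir"
```

This uses xargs -P rather than GNU parallel, but the idea is the same: once the wall time stops dropping as you add processes, you are IO-bound and extra cores only add overhead.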

Ok, so here is an interesting update:

I am through the first month now. The output files 19910401.*.nc are, however, empty (time = UNLIMITED ; // (0 currently)).

And it continues to run (currently at 1991/05/06) in the same PBS job 71980789.gadi-pbs. (I changed the walltime to 5 hours for now, just in case.) So I guess it will run for another ~1.5 hours and then run out of walltime some time in the second month.

I used payu sweep && payu run -n 3 and assumed that three different PBS jobs were submitted one after another. Shouldn’t I be seeing a different PBS job number?

These are the diagnostic files. They aren’t always useful for diagnosing the status of a run, because the frequency of output to these files is determined by the configuration in diag_table.
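To illustrate: FMS-based models like MOM6 read a diag_table whose file entries set how often each diagnostic file is written. A generic sketch (the file and field names below are made up, not taken from this configuration):

```
"panan_005deg_jra55_ryf"
1991 4 1 0 0 0
# file entry: "name", output_freq, "freq units", file_format, "time axis units", "time axis name"
"ocean_daily",  1, "days", 1, "days", "time"
# field entry: "module", "field", "out_name", "file", "sampling", "reduction", "region", packing
"ocean_model", "SSH", "ssh", "ocean_daily", "all", "mean", "none", 2
```

So a daily-mean file only gets a record written once the first averaging window completes, which is why a file can legitimately show time = UNLIMITED ; // (0 currently) while the run is progressing normally.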

I don’t quite understand this.

In any case, I can see your run finished, does that mean your problem is resolved?

Yes, it’s solved. It was a combination of running out of walltime and then not setting the runtime back to 1 month.
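For anyone hitting the same thing: payu takes the segment length from the calendar section of config.yaml, so after a longer test segment you need to set it back yourself. A sketch of the relevant section (values illustrative; check your own config.yaml):

```yaml
# config.yaml (excerpt) -- illustrative values only
calendar:
  runtime:
    years: 0
    months: 1   # reset to 1 month after a longer test segment
    days: 0
walltime: '3:30:00'
```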