MOM6 crashes after initialization

I have been running mom6-panan-005 fine for 3 months now without a problem, submitting each month individually. Yesterday, I submitted another 3 months with payu run -n 3. The first month (i.e. April 1991) ran fine, but then MOM6 crashed in the second month right after initialization:

Exiting coupler_init at 20230209 135016.510

*** longjmp causes uninitialized stack frame ***: /scratch/e14/cs6673/mom6/work/panan_005deg_jra55_ryf/symmetric_FMS2-e7d09b7 terminated

This error is in the first 20,000 lines of mom6.err

My directory is here /home/142/cs6673/payu/panan_005deg_jra55_ryf

That’s a new one to me! Just to get the dumb questions out of the way: was your job terminated by PBS for running out of walltime or memory? The PBS outputs mom6_panan-005.{e,o}* are only readable by you by default.

Yes, it has indeed exceeded its walltime, which I set to 3:30. Normally it takes about 3 hours for 1 month. It shouldn’t take 3.5 hours to initialize, though, or whatever it’s trying to do.

I seem to recall payu taking quite a while to generate the manifests for the restart files, which you wouldn’t have seen on the first segment. Not sure what the right advice is here; maybe @aekiss would have a suggestion from experience with ACCESS-OM2-01?

Did you try to resubmit and did you get the same error? Often there’s little glitches on the NCI side that resolve themselves if you just try again.

Not yet, but can do it now

Yes, try running it again. Hopefully this is a transient NCI glitch.

I haven’t seen longjmp causes uninitialized stack frame since 2018 in a test version of ACCESS-OM2-01 on Raijin…

I guess I was suggesting that it’s only a weird error because the model was killed after running out of walltime (which was indeed the case). I suspect the root cause is that running from a restart is taking longer than the original cold-start segment, which could be a few things:

  • slow restart input/initialisation?
  • payu calculating a restart manifest, which takes a while?

Maybe. 135G of restarts is significant, also there are only 24 of them, so you’re not getting the full benefit of parallelising across the 48 cores.

To test how long it is taking you can just try

time payu setup

before your next run. This will compute the hashes for the restart manifest and give you an idea if that is a significant overhead.

I did have a notion to compute those as a post-processing job, or have a mode where this calculation was done outside the main PBS job, like a pre-processing step, but it wasn’t straightforward and didn’t seem necessary based on the use cases I’d tested on up till then which weren’t as taxing as this example.

For reference, this is the issue @aekiss was referring to.


Interesting, timing:

$ time parallel md5sum {} ::: $(find archive/restart002/ -iname "*.nc")
real	2m5.028s
user	6m40.891s
sys	4m7.886s

but for a single 7G file:

$ time md5sum archive/restart002/
9ca2d6455361062694d35609a05eec0d  archive/restart002/

real	0m16.347s
user	0m10.153s
sys	0m6.012s

I guess it is saturating the file IO, limiting the speed at which the files can be read in. So parallelisation is of some value, but not a universal panacea.

It is faster than serial, though not by as much as I would have thought:

$ time md5sum $(find archive/restart002/ -iname "*.nc")                                                                           

real    5m15.216s
user    3m19.317s
sys     1m54.614s

and using 4 processes is actually faster than 24:

$ time parallel -j 4 md5sum {} ::: $(find archive/restart002/ -iname "*.nc")                                                      

real    1m38.735s
user    3m27.332s
sys     2m26.171s
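To find the sweet spot more systematically, something like the following sweep could help (a sketch only: the dummy files, sizes, and process counts are made up; on a real run you would point find at archive/restartNNN/ instead and use larger -P values):

```shell
#!/bin/bash
# Hypothetical sweep: hash the same set of files with different numbers
# of parallel md5sum processes, to see where file IO saturates.
dir=$(mktemp -d)
for i in 1 2 3 4; do
  head -c 1048576 /dev/urandom > "$dir/restart_$i.nc"   # 1 MiB dummy "restart"
done
for j in 1 2 4; do
  printf '== %s processes ==\n' "$j"
  time find "$dir" -name '*.nc' -print0 | xargs -0 -P "$j" -n 1 md5sum > /dev/null
done
rm -rf "$dir"
```

This uses xargs -P rather than GNU parallel, but the idea is the same: once the wall time stops dropping as you add processes, you are IO-bound and extra cores only add overhead.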

Ok, so here is an interesting update:

I am through the first month now. The output files 19910401.*.nc are, however, empty (time = UNLIMITED ; // (0 currently)).

And it continues to run (currently at 1991/05/06) in the same PBS job 71980789.gadi-pbs. (I changed the walltime to 5 hours for now, just in case.) So I guess it will run for another ~1.5 hours and then run out of walltime some time in the second month.

I used payu sweep && payu run -n 3 and assumed that three different PBS jobs were submitted one after another. Shouldn’t I be seeing a different PBS job number?

These are the diagnostic files. They aren’t always useful for diagnosing the status of a run, because the frequency of output to these files is determined by the configuration in diag_table.
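To illustrate: FMS-based models like MOM6 read a diag_table whose file entries set how often each diagnostic file is written. A generic sketch (the file and field names below are made up, not taken from this configuration):

```
"panan_005deg_jra55_ryf"
1991 4 1 0 0 0
# file entry: "name", output_freq, "freq units", file_format, "time axis units", "time axis name"
"ocean_daily",  1, "days", 1, "days", "time"
# field entry: "module", "field", "out_name", "file", "sampling", "reduction", "region", packing
"ocean_model", "SSH", "ssh", "ocean_daily", "all", "mean", "none", 2
```

So a daily-mean file only gets a record written once the first averaging window completes, which is why a file can legitimately show time = UNLIMITED ; // (0 currently) while the run is progressing normally.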

I don’t quite understand this.

In any case, I can see your run finished, does that mean your problem is resolved?

Yes, it’s solved. It was a combination of running out of walltime and then not setting the runtime back to 1 month.
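For anyone hitting the same thing: payu takes the segment length from the calendar section of config.yaml, so after a longer test segment you need to set it back yourself. A sketch of the relevant section (values illustrative; check your own config.yaml):

```yaml
# config.yaml (excerpt) -- illustrative values only
calendar:
  runtime:
    years: 0
    months: 1   # reset to 1 month after a longer test segment
    days: 0
walltime: '3:30:00'
```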