An MPI error when running ACCESS-OM2

I got the error below

MPI_ABORT was invoked on rank 4359 in communicator MPI_COMM_WORLD

with errorcode 1.

when I was running ACCESS-OM2-01 experiments. Can anyone identify what this error means and how to fix it?

There’s not enough information here to tell what’s gone wrong.

  • Is the error repeatable (that is, does it happen again when you do payu sweep; payu run)? Sometimes there are transient hardware errors on gadi and you just need to try again.
  • Are there other error messages giving an indication of what component failed and why?

Thanks @aekiss . I tried rerunning it, and it showed the same error information. There is also a note in access-om2.err.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

You may or may not see output from other processes, depending on exactly when Open MPI kills them.

However, this error is not very informative. I used an old version of the executable in this experiment. Could this be the reason for the crash?

This is just saying the model called the function MPI_Abort(), which ends the program. There may be information about why this function was called either above or below the messages you’ve provided; it would help if you gave more context.
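One way to hunt for that context is to grep the error file with a few lines of leading context, since the real cause usually appears just before the abort message. A self-contained sketch (the log content here is a made-up stand-in; on gadi you would grep the real access-om2.err in your control directory):

```shell
# Create a stand-in log file for the demo, then search it.
# The "FATAL" line is invented for illustration only.
cat > /tmp/demo-access-om2.err <<'EOF'
FATAL from PE  4359: example component failure (stand-in line)
MPI_ABORT was invoked on rank 4359 in communicator MPI_COMM_WORLD
with errorcode 1.
EOF
# -B2 prints the two lines before each match, where the cause often is
grep -n -B2 "MPI_ABORT" /tmp/demo-access-om2.err
```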

Yes, more context and information is needed. Do you have a configuration that runs? If so, how is it different from the one that crashed?

I checked the error in access-om2.err, but I can only find this information.

MPI_ABORT was invoked on rank 4359 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

I also checked the ctrl_restore_of.e157690488 (the job error file), and it says below.

Currently Loaded Modulefiles:
 1) pbs   2) openmpi/4.1.3
/g/data/vk83/apps/base_conda/envs/payu-1.2.0/lib/python3.10/site-packages/payu/runlog.py:110: UserWarning: Error occured when attempting to commit runlog
payu: Model exited with error code 1; aborting.

I could get the experiment running with the old payu version (maybe the hh5 one?). When I switched to the new payu version in vk83, it no longer worked.

/g/data/vk83/apps/base_conda/envs/payu-1.2.0/lib/python3.10/site-packages/payu/runlog.py:110: UserWarning: Error occured when attempting to commit runlog

This is just warning that payu failed to commit changes to the experiment git repo. It isn’t the cause of your error.

I think you may need to provide the path to the access-om2.err file to get some assistance with why your experiment run crashed. If it is in a location that is inaccessible to other users, you may need to copy it to /scratch/public on gadi or some other location that is accessible to others.

Thanks @Aidan. The path of the experiment is /scratch/x77/hm1221/access-om2-01/01deg_jra55v13_ctrl_restore_off/.

Let me know if you can access this path.

Are there any clues in the ctrl_restore_of.e157831192 or ctrl_restore_of.o157831192 files?

I can’t see the contents as they’re only group readable and they’re owned by oz91.

If you want to give wider access run

chgrp x77 ctrl_restore_of.e157831192 ctrl_restore_of.o157831192
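For anyone following along, you can confirm the group-read bit is actually set once the group is changed. A self-contained sketch using a temporary stand-in file (on gadi you would run ls -l or stat on the real PBS output files):

```shell
# Demo with a temporary file: set and verify group-read permission.
f=$(mktemp)
chmod 640 "$f"          # owner read/write, group read-only
stat -c '%A %G' "$f"    # mode string should start with -rw-r-----
rm -f "$f"
```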

I just changed the group.

In the PBS output file you have a lot of errors that look like

[gadi-cpu-clx-0838.gadi.nci.org.au:1829370] PMIX ERROR: UNREACHABLE in file /jobfs/35258476.gadi-pbs/0/openmpi/4.0.1/source/openmpi-4.0.1/opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2079

A search of the forum yields a couple of posts with a similar error:

https://forum.access-hive.org.au/search?q=pmix%20error%20unreachable

though unfortunately there is no explicit solution; there are hints that it may be a problem with accessing files.

Has this worked in the past? If so what have you changed since then that it no longer works?

Unfortunately this is a COSIMA configuration, so not explicitly supported by ACCESS-NRI.

Thank you @Aidan. I have not changed anything in the configuration since my last successful run. The only difference is the version of payu, since my last successful run used the payu from hh5.

The last couple of lines in access-om2.out are shown here:

 Barotropic stability most nearly violated at T-cell (i,j) = (2824,2656), (lon,lat) = (    46.10,    86.67).

         The number of kmt-levels at this point is     72

         The dxt grid spacing (m) at this point is 0.250595E+04

         The dyt grid spacing (m) at this point is 0.463886E+04

         where the barotropic gravity wave speed is ~239.0 m/s.

         "dtbt" must be less than    9.000 sec.   dtbt =    6.750 sec.

Maybe it does not matter? Should I change the MPI version?

hh5 no longer exists - does your configuration refer to it?

No, it doesn’t. I only refer to ik11, x77, and e14.

The executables that I refer to are located in ik11, not vk83. Would that be a problem?

No, I don’t think so, as they run before they crash.

As an aside, you should be very careful storing the model configuration in /scratch. This filesystem is automatically “cleaned” of files more than 90 days old.

Generally it is better to store your model configurations in your $HOME directory.

You have this

modules:
  use:
      - /g/data/vk83/modules
  load:
      - access-om2/2025.12.000
      - model-tools/mppnccombine-fast/2025.07.000

which loads ACCESS-NRI supported tools and model versions, but you’re using executable paths from ik11.

I would remove the modules section entirely, in case it is setting some environment variables that are confusing MPI.
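That is, the relevant part of config.yaml would change from the block above to something like this (a sketch of that section only; keep the rest of your config unchanged):

```yaml
# modules section removed (or commented out) while testing, so the vk83
# modules do not set environment variables that conflict with the ik11
# executables:
# modules:
#   use:
#       - /g/data/vk83/modules
#   load:
#       - access-om2/2025.12.000
#       - model-tools/mppnccombine-fast/2025.07.000
```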

Thanks everyone, I just got the model running. I copied the namcouple file from one of the old runs into the new experiment, and it started working.
