An MPI error when running ACCESS-OM2

Hangyum · 7 January 2026 03:07

I got the error below

MPI_ABORT was invoked on rank 4359 in communicator MPI_COMM_WORLD
with errorcode 1.

when I was running ACCESS-OM2-01 experiments. Can anyone tell me what this error means and how to fix it?

aekiss · 7 January 2026 05:31

There’s not enough information here to tell what’s gone wrong.

Is the error repeatable (that is, does it happen again when you do payu sweep; payu run)? Sometimes there are transient hardware errors on gadi and you just need to try again.
Are there other error messages giving an indication of what component failed and why?
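One way to look for other error messages is to scan every log file in the run directory for error-like lines. The following is a minimal sketch, not part of payu: the keyword list and the file-name pattern are assumptions and will need adjusting for your experiment.

```python
import re
from pathlib import Path

# Hypothetical keyword list; extend it for the model components in use.
ERROR_RE = re.compile(r"error|fatal|abort|segmentation fault", re.IGNORECASE)
# Matches names such as access-om2.err, access-om2.out, or PBS files like job.e12345.
LOG_NAME_RE = re.compile(r"\.(err|out|[eo]\d+)$")


def find_error_lines(text):
    """Return (line_number, line) pairs for lines that look like error messages."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(text.splitlines(), start=1)
        if ERROR_RE.search(line)
    ]


def scan_logs(directory):
    """Map each log file name in `directory` to its error-like lines."""
    results = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and LOG_NAME_RE.search(path.name):
            hits = find_error_lines(path.read_text(errors="replace"))
            if hits:
                results[path.name] = hits
    return results
```

Calling `scan_logs(".")` from the experiment directory would then list candidate lines per file, which are easier to share than whole logs.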

Hangyum · 7 January 2026 23:38

Thanks @aekiss. I tried rerunning it, and it produced the same error. There is also a note in access-om2.err:

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on exactly when Open MPI kills them.

However, this message is not very informative. I am using an old version of the executable in this experiment. Could that be the reason for the crash?

Scott · 7 January 2026 23:51

This is just saying the model called the function MPI_Abort(), which ends the program. There may be information about why this function was called either above or below the messages you’ve provided. It would help if you gave more context.
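To pull out that surrounding context, something like the sketch below could be used. The function name and the default window sizes are illustrative assumptions, not anything provided by payu or Open MPI.

```python
def context_around(lines, needle="MPI_ABORT was invoked", before=20, after=5):
    """Return the lines surrounding the first line that contains `needle`."""
    for index, line in enumerate(lines):
        if needle in line:
            start = max(0, index - before)
            return lines[start:index + after + 1]
    return []  # needle not found
```

For example, `context_around(open("access-om2.err").read().splitlines())` would return up to 20 lines before and 5 lines after the abort message.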

aekiss · 8 January 2026 00:17

Yes, more context and information are needed. Do you have a configuration that runs? If so, how is it different from the one that crashed?
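A quick way to answer that question is to compare the files in the working and crashing configuration directories. This is a rough sketch only: the function names are hypothetical, it looks at top-level files only, and it compares raw bytes rather than interpreting namelists or YAML.

```python
from pathlib import Path


def compare_file_maps(working, failing):
    """Compare two {file_name: contents} mappings.

    Returns the names present on only one side and the names whose contents differ.
    """
    common = sorted(working.keys() & failing.keys())
    return {
        "only_in_working": sorted(working.keys() - failing.keys()),
        "only_in_failing": sorted(failing.keys() - working.keys()),
        "different": [name for name in common if working[name] != failing[name]],
    }


def read_top_level_files(directory):
    """Read every regular file directly inside `directory` as bytes."""
    return {p.name: p.read_bytes() for p in Path(directory).iterdir() if p.is_file()}


def compare_configs(working_dir, failing_dir):
    """Compare the top-level files of two configuration directories."""
    return compare_file_maps(
        read_top_level_files(working_dir), read_top_level_files(failing_dir)
    )
```

Any file listed under `different` is a candidate to inspect by hand, for instance with `diff`.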

Hangyum · 8 January 2026 23:31

I checked access-om2.err, but this is all I can find:

MPI_ABORT was invoked on rank 4359 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

I also checked ctrl_restore_of.e157690488 (the job error file), which says:

Currently Loaded Modulefiles:
 1) pbs   2) openmpi/4.1.3
/g/data/vk83/apps/base_conda/envs/payu-1.2.0/lib/python3.10/site-packages/payu/runlog.py:110: UserWarning: Error occured when attempting to commit runlog
payu: Model exited with error code 1; aborting.

I can get the experiment running with the old payu version (maybe the hh5 one?). Since I switched to the new payu version in vk83, it no longer works.

Aidan · 9 January 2026 01:18

/g/data/vk83/apps/base_conda/envs/payu-1.2.0/lib/python3.10/site-packages/payu/runlog.py:110: UserWarning: Error occured when attempting to commit runlog

This is just warning that payu failed to commit changes to the experiment git repo. It isn’t the cause of your error.

I think you may need to provide the path to the access-om2.err file to get some assistance with why your experiment run crashed. If it is in a location that is inaccessible to other users, you may need to copy it to /scratch/public on gadi or some other location that is accessible to others.

Hangyum · 9 January 2026 03:07

Thanks @Aidan. The path of the experiment is /scratch/x77/hm1221/access-om2-01/01deg_jra55v13_ctrl_restore_off/.

Let me know if you can access this path.

Aidan · 9 January 2026 04:01

Are there any clues in the ctrl_restore_of.e157831192 or ctrl_restore_of.o157831192 files?

I can’t see the contents as they’re only group readable and they’re owned by oz91.

If you want to give wider access, run:

chgrp x77 ctrl_restore_of.e157831192 ctrl_restore_of.o157831192
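To confirm who can read a file before and after changing its group, a small check along these lines may help. It is a sketch assuming a POSIX filesystem; `read_access` and `describe` are illustrative names.

```python
import stat
from pathlib import Path


def read_access(mode):
    """Report which permission classes can read a file, given its st_mode."""
    return {
        "owner": bool(mode & stat.S_IRUSR),
        "group": bool(mode & stat.S_IRGRP),
        "other": bool(mode & stat.S_IROTH),
    }


def describe(path):
    """Return the owning group name and read access for `path` (POSIX only)."""
    p = Path(path)
    return p.group(), read_access(p.stat().st_mode)
```

A file that reports group read access but no access for others is visible only to members of its owning group, which is why the group matters here.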

Hangyum · 9 January 2026 04:12

I just changed the group.

Aidan · 9 January 2026 05:13

In the PBS output file you have a lot of errors that look like

[gadi-cpu-clx-0838.gadi.nci.org.au:1829370] PMIX ERROR: UNREACHABLE in file /jobfs/35258476.gadi-pbs/0/openmpi/4.0.1/source/openmpi-4.0.1/opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2079

A search of the forum yields a couple of posts with a similar error:

https://forum.access-hive.org.au/search?q=pmix%20error%20unreachable

Unfortunately none of them gives an explicit solution, but there are hints that it may be a problem with accessing files.
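When there are many such lines, it can help to summarise them by node and status to see whether the errors are concentrated on a few hosts. The sketch below assumes every line has the same shape as the one quoted above; the regular expression is written for that shape only.

```python
import re
from collections import Counter

# Pattern modelled on the single PMIX line quoted in this thread.
PMIX_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] PMIX ERROR: (?P<status>\S+) "
    r"in file (?P<file>\S+) at line (?P<line>\d+)"
)


def summarise_pmix_errors(lines):
    """Count PMIX error lines per (host, status) pair."""
    counts = Counter()
    for line in lines:
        match = PMIX_RE.match(line.strip())
        if match:
            counts[(match["host"], match["status"])] += 1
    return counts
```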

Has this worked in the past? If so, what have you changed since it last worked?

Unfortunately this is a COSIMA configuration, so not explicitly supported by ACCESS-NRI.

Hangyum · 12 January 2026 03:32

Thank you @Aidan. I have not changed anything in the configuration since my last successful run. The only difference is the version of Payu: my last successful run used the Payu in hh5.

Hangyum · 12 January 2026 03:43

The last couple of lines in access-om2.out are shown here:

 Barotropic stability most nearly violated at T-cell (i,j) = (2824,2656), (lon,lat) = (    46.10,    86.67).
         The number of kmt-levels at this point is     72
         The dxt grid spacing (m) at this point is 0.250595E+04
         The dyt grid spacing (m) at this point is 0.463886E+04
         where the barotropic gravity wave speed is ~239.0 m/s.
         "dtbt" must be less than    9.000 sec.   dtbt =    6.750 sec.

Maybe it does not matter? Should I change the MPI version?
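Reading the last line of that output literally, the reported dtbt (6.750 s) is below the stated limit (9.000 s), so this diagnostic on its own does not appear to indicate a failure. A minimal sketch of that check, assuming the line always has the format shown above:

```python
import re

# Pattern written for the single diagnostic line quoted in this thread.
DTBT_RE = re.compile(
    r'"dtbt" must be less than\s+([\d.]+) sec\.\s+dtbt =\s+([\d.]+) sec\.'
)


def dtbt_check(line):
    """Return (limit, dtbt, within_limit) parsed from the line, or None if absent."""
    match = DTBT_RE.search(line)
    if match is None:
        return None
    limit, dtbt = float(match.group(1)), float(match.group(2))
    return limit, dtbt, dtbt < limit
```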

aekiss · 12 January 2026 04:43

hh5 no longer exists - does your configuration refer to it?

Hangyum · 12 January 2026 05:04

No, it doesn’t. It only refers to ik11, x77, and e14.

Hangyum · 12 January 2026 05:42

The executables I refer to are located in ik11, not vk83. Would that be a problem?

aekiss · 12 January 2026 06:16

No, I don’t think so, as they run before they crash.

Aidan · 12 January 2026 06:30

As an aside, you should be very careful storing the model configuration in /scratch. This filesystem is automatically “cleaned” of files more than 90 days old.

Generally it is better to store your model configurations in your $HOME directory.
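To see which files in a directory are already older than that threshold, a sketch like the following could be used. It assumes age is judged by modification time; the criterion actually applied by the filesystem clean-up may differ, and the function names are illustrative.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def older_than(entries, now, max_age_days=90):
    """Return the names whose timestamp is more than `max_age_days` before `now`."""
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return [name for name, timestamp in entries if timestamp < cutoff]


def old_files(directory, max_age_days=90):
    """List files under `directory` not modified within `max_age_days`."""
    entries = [
        (str(path), path.stat().st_mtime)
        for path in Path(directory).rglob("*")
        if path.is_file()
    ]
    return older_than(entries, time.time(), max_age_days)
```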

You have this

modules:
  use:
      - /g/data/vk83/modules
  load:
      - access-om2/2025.12.000
      - model-tools/mppnccombine-fast/2025.07.000

which loads ACCESS-NRI supported tools and model versions, but you’re using executable paths from ik11.

I would remove the modules section entirely, in case it is setting environment variables that are confusing MPI.

Hangyum · 12 January 2026 23:36

Thanks everyone, I got the model running. I copied the namcouple file from one of the old runs into the new experiment, and it started working.
