An MPI error when running ACCESS-OM2

I got the error below

MPI_ABORT was invoked on rank 4359 in communicator MPI_COMM_WORLD

with errorcode 1.

when I was running ACCESS-OM2-01 experiments. Can anyone identify what this error means and how to fix it?

There’s not enough information here to tell what’s gone wrong.

  • Is the error repeatable (that is, does it happen again when you do payu sweep; payu run)? Sometimes there are transient hardware errors on gadi and you just need to try again.
  • Are there other error messages giving an indication of what component failed and why?

Thanks @aekiss. I tried rerunning it, and it showed the same error. There is also a note in access-om2.err.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

You may or may not see output from other processes, depending on exactly when Open MPI kills them.

However, this message is not very informative. I am using an old version of the executable in this experiment. Could this be the reason for the crash?

This is just saying the model called the function MPI_Abort(), which ends the program. There may be information about why this function was called either above or below the messages you’ve provided, so it would help if you gave more context.
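As a rough sketch of how to pull out that context, the function below prints the lines around each MPI_ABORT message in a log file. The function name and the default line counts are my own choices for illustration, not part of payu or ACCESS-OM2.

```shell
# Print each MPI_ABORT message with surrounding lines, prefixed by line numbers.
# Usage: show_abort_context LOGFILE [LINES_BEFORE] [LINES_AFTER]
# (hypothetical helper; the defaults of 20 and 5 lines are arbitrary)
show_abort_context() {
    logfile="$1"
    before="${2:-20}"
    after="${3:-5}"
    grep -n -B "$before" -A "$after" 'MPI_ABORT was invoked' "$logfile"
}

# Example: show_abort_context access-om2.err 40
```

If the lines before the abort message contain nothing useful, the reason may have been written to a different file, such as the PBS job output or error files.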

Yes, more context and information is needed. Do you have a configuration that runs? If so, how is it different from the one that crashed?

I checked access-om2.err, but this is all I can find:

MPI_ABORT was invoked on rank 4359 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

I also checked ctrl_restore_of.e157690488 (the job error file), and it says the following:

Currently Loaded Modulefiles:
 1) pbs   2) openmpi/4.1.3
/g/data/vk83/apps/base_conda/envs/payu-1.2.0/lib/python3.10/site-packages/payu/runlog.py:110: UserWarning: Error occured when attempting to commit runlog
payu: Model exited with error code 1; aborting.

I can get the experiment running with the old payu version (maybe the hh5 one?). After I switched to the new payu version in vk83, it no longer worked.

/g/data/vk83/apps/base_conda/envs/payu-1.2.0/lib/python3.10/site-packages/payu/runlog.py:110: UserWarning: Error occured when attempting to commit runlog

This is just warning that payu failed to commit changes to the experiment git repo. It isn’t the cause of your error.

I think you may need to provide the path to the access-om2.err file to get some assistance with why your experiment run crashed. If it is in a location that is inaccessible to other users, you may need to copy it to /scratch/public on gadi or some other location that is accessible to others.
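One possible way to do that is sketched below. The function name and the destination subdirectory are made up for illustration; pick whatever location suits you.

```shell
# Copy a log file into a directory and make both readable by all users.
# Usage: share_log FILE DEST_DIR
# (hypothetical helper, not part of payu)
share_log() {
    src="$1"
    dest="$2"
    mkdir -p "$dest" &&
    chmod a+rx "$dest" &&
    cp "$src" "$dest/" &&
    chmod a+r "$dest/$(basename "$src")"
}

# Example (the subdirectory name is an assumption):
# share_log access-om2.err /scratch/public/hm1221_om2_debug
```

The directory needs read and execute permission for others as well as the file, otherwise the copy still cannot be opened.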

Thanks @Aidan. The path of the experiment is /scratch/x77/hm1221/access-om2-01/01deg_jra55v13_ctrl_restore_off/.

Let me know if you can access this path.

Are there any clues in the ctrl_restore_of.e157831192 or ctrl_restore_of.o157831192 files?

I can’t see the contents as they’re only group readable and they’re owned by oz91.

If you want to give wider access, run

chgrp x77 ctrl_restore_of.e157831192 ctrl_restore_of.o157831192

I just changed the group.

In the PBS output file you have a lot of errors that look like

[gadi-cpu-clx-0838.gadi.nci.org.au:1829370] PMIX ERROR: UNREACHABLE in file /jobfs/35258476.gadi-pbs/0/openmpi/4.0.1/source/openmpi-4.0.1/opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2079

A search of the forum yields a couple of posts with a similar error:

https://forum.access-hive.org.au/search?q=pmix%20error%20unreachable

Unfortunately none of them has an explicit solution, but there are hints that it may be a problem with accessing files.

Has this worked in the past? If so, what have you changed since it last worked?
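To get a feel for how widespread these messages are, something like the following could tally them by error kind. This is only a sketch; the function name is my own, and the pattern assumes the kind is written in capitals straight after "PMIX ERROR:", as in the line quoted above.

```shell
# Tally each distinct "PMIX ERROR: <KIND>" in a PBS output file,
# most frequent first. (hypothetical helper)
summarise_pmix_errors() {
    grep -o 'PMIX ERROR: [A-Z_-]*' "$1" | sort | uniq -c | sort -rn
}

# Example: summarise_pmix_errors ctrl_restore_of.o157831192
```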

Unfortunately this is a COSIMA configuration, so not explicitly supported by ACCESS-NRI.