"Run ACCESS-ESM" fails with error code 139

As a step towards creating a Spack build for ACCESS-ESM1.5, I have been trying to follow the “Run ACCESS-ESM1.5” instructions. I have run ESM1.5 according to these instructions at least four times, and each time the run fails with error code 139.
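
For reference, an exit code of 139 means the executable was killed by signal 11 (a segmentation fault), since the shell reports 128 + the signal number. A quick way to confirm the mapping:

kill -l $((139 - 128))   # prints SEGV, i.e. segmentation fault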

Looking at /g/data/tm70/pcl851/src/penguian/esm-pre-industrial/access.err, I see MPI failures in CICE4.1, with output like:

[gadi-cpu-clx-2636:39774:0:39774] ib_mlx5_log.c:168  Remote OP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
[gadi-cpu-clx-2636:39774:0:39774] ib_mlx5_log.c:168  DCI QP 0x148aa wqe[153]: SEND s-e [rqpn 0x6afd rlid 5649] [va 0x15150f3ef280 len 1162 lkey 0x5370c27] 
==== backtrace (tid:  39774) ====
 0 0x0000000000023cab uct_ib_mlx5_completion_with_err()  ???:0
 1 0x0000000000054970 uct_dc_mlx5_iface_set_ep_failed()  ???:0
 2 0x000000000004d398 uct_dc_mlx5_ep_handle_failure()  ???:0
 3 0x000000000004ff62 uct_dc_mlx5_iface_progress_ll()  :0
 4 0x000000000003ee9a ucp_worker_progress()  ???:0
 5 0x0000000000003397 mca_pml_ucx_progress()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/ompi/mca/pml/ucx/pml_ucx.c:515
 6 0x000000000002f72b opal_progress()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/opal/runtime/opal_progress.c:231
 7 0x000000000004f2d5 sync_wait_st()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/opal/threads/wait_sync.h:83
 8 0x000000000004f2d5 ompi_request_default_wait_all()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/ompi/request/req_wait.c:243
 9 0x000000000009213f PMPI_Waitall()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/gcc/ompi/mpi/c/profile/pwaitall.c:80
10 0x00000000000537ed ompi_waitall_f()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/intel/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
11 0x00000000006e5260 m_transfer_mp_waitrecv__()  ???:0
12 0x00000000006e4106 m_transfer_mp_recv__()  ???:0
13 0x00000000006243fc mod_oasis_advance_mp_oasis_advance_run_()  /g/data/p66/pbd562/test/t47-hxw/jan20/4.0.2/oasis3-mct/lib/psmile/src/mod_oasis_advance.F90:1130
14 0x00000000005ab868 mod_oasis_getput_interface_mp_oasis_get_r28_()  /g/data/p66/pbd562/test/t47-hxw/jan20/4.0.2/oasis3-mct/lib/psmile/src/mod_oasis_getput_interface.F90:760
15 0x0000000000452b7e cpl_interface_mp_from_ocn_()  ???:0
16 0x000000000040eba8 cice_runmod_mp_cice_run_()  ???:0
17 0x000000000040d312 MAIN__()  ???:0
18 0x000000000040d2a2 main()  ???:0
19 0x000000000003ad85 __libc_start_main()  ???:0
20 0x000000000040d1ae _start()  ???:0

I have also built using GitHub - penguian/access-esm-build-gadi (a fork used to migrate the build to GitHub repositories), and in that case I see

[gadi-cpu-clx-0421:1485991:0:1485991] ib_mlx5_log.c:168  Remote OP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
[gadi-cpu-clx-0421:1485991:0:1485991] ib_mlx5_log.c:168  DCI QP 0xacb8 wqe[142]: SEND s-e [rqpn 0x19ca8 rlid 301] [va 0x1499f2769180 len 1162 lkey 0x12cf5c] 
==== backtrace (tid:1485991) ====
 0 0x0000000000023cab uct_ib_mlx5_completion_with_err()  ???:0
 1 0x0000000000054970 uct_dc_mlx5_iface_set_ep_failed()  ???:0
 2 0x000000000004d398 uct_dc_mlx5_ep_handle_failure()  ???:0
 3 0x000000000004ff62 uct_dc_mlx5_iface_progress_ll()  :0
 4 0x000000000003ee9a ucp_worker_progress()  ???:0
 5 0x0000000000003397 mca_pml_ucx_progress()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/ompi/mca/pml/ucx/pml_ucx.c:515
 6 0x000000000002f72b opal_progress()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/opal/runtime/opal_progress.c:231
 7 0x000000000005c200 hcoll_ml_progress_impl()  ???:0
 8 0x0000000000023a92 _coll_ml_allreduce()  ???:0
 9 0x0000000000007bbc mca_coll_hcoll_reduce()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/ompi/mca/coll/hcoll/coll_hcoll_ops.c:278
10 0x0000000000086291 PMPI_Reduce()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/gcc/ompi/mpi/c/profile/preduce.c:139
11 0x0000000000086291 opal_obj_update()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/opal/class/opal_object.h:513
12 0x0000000000086291 PMPI_Reduce()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/gcc/ompi/mpi/c/profile/preduce.c:142
13 0x00000000000512c3 ompi_reduce_f()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/intel/ompi/mpi/fortran/mpif-h/profile/preduce_f.c:87
14 0x00000000005d8a40 mod_oasis_mpi_mp_oasis_mpi_sumr1_()  /home/599/mrd599/cylc-run/u-bp124/share/oasis3-mct_local/lib/psmile/src/mod_oasis_mpi.F90:1497
15 0x00000000007a5a9b mod_oasis_advance_mp_oasis_advance_avdiag_()  /home/599/mrd599/cylc-run/u-bp124/share/oasis3-mct_local/lib/psmile/src/mod_oasis_advance.F90:1984
16 0x0000000000756b39 mod_oasis_advance_mp_oasis_advance_run_()  /home/599/mrd599/cylc-run/u-bp124/share/oasis3-mct_local/lib/psmile/src/mod_oasis_advance.F90:1080
17 0x00000000005b1a34 mod_oasis_getput_interface_mp_oasis_put_r28_()  /home/599/mrd599/cylc-run/u-bp124/share/oasis3-mct_local/lib/psmile/src/mod_oasis_getput_interface.F90:567
18 0x000000000045ec78 cpl_interface_mp_into_atm_()  ???:0
19 0x000000000040ed31 cice_runmod_mp_cice_run_()  ???:0
20 0x000000000040d612 MAIN__()  ???:0
21 0x000000000040d5a2 main()  ???:0
22 0x000000000003ad85 __libc_start_main()  ???:0
23 0x000000000040d4ae _start()  ???:0
  1. Has anyone recently successfully run “Run ACCESS-ESM1.5”?
  2. Has anyone seen this type of MPI error previously?
  3. If so, how did you fix it?
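
(If anyone wants to look at the full set of traces rather than the excerpts above, something like the following pulls them all out of the error file; the path is the one above and the context length is just a guess at the trace depth:)

grep -n -A24 '==== backtrace' /g/data/tm70/pcl851/src/penguian/esm-pre-industrial/access.err | less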

Hi Paul,
I haven’t used that wiki specifically, but I did initiate my experiments using the repo mentioned in it:

That sets you up with the relevant input directories and executables. I found starting from that setup worked. I didn’t get errors until I started changing input files and directories.

Thanks David,
When was the last time you ran an ACCESS-ESM1.5 pre-industrial experiment from that repository?

It would have been close to a year ago that I first cloned the repo and ran a test case. Since then I’ve just adapted from the original.

I tried using the historical branch of GitHub - coecms/access-esm: Main Repository for ACCESS-ESM configurations to see whether it has the same problem, and it has different problems. It looks to me like the historical branch configuration is not compatible with the conda/analysis3-23.07 environment. In particular, I don’t know where the UMDIR environment variable is supposed to be set, or what it should be set to. Has anyone else (e.g. @Aidan, @MartinDix) recently run the historical branch configuration unchanged, out of the box?

[pcl851@gadi-login-04 access-esm]$ cat historical.e*
/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/stashmaster.py:259: UserWarning: 
Unable to load STASHmaster from version string, path does not exist
Path: $UMDIR/vn7.3/ctldata/STASHmaster/STASHmaster_A
Please check that the value of mule.stashmaster.STASHMASTER_PATH_PATTERN is correct for your site/configuration
  warnings.warn(msg)
/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/validators.py:198: UserWarning: 
File: work/atmosphere/restart_dump.astart
Field validation failures:
  Fields (1114,1115,1116)
Field grid longitudes inconsistent
  File grid : 0.0 to 358.125, spacing 1.875
  Field grid: 0.5 to 359.5, spacing 1.0
  Extents should be within 1 field grid-spacing
Field validation failures:
  Fields (4935,4937,6676,6715)
Skipping Field validation due to irregular lbcode: 
  Field lbcode: 31320
  warnings.warn(msg)
cdo    selyear (Warning): Year 101 not found!

cdo    selyear (Abort): No timesteps selected!
Currently Loaded Modulefiles:
 1) pbs   2) openmpi/4.1.4(default)  
payu: Model exited with error code 9; aborting.
[pcl851@gadi-login-04 access-esm]$ echo $UMDIR

Seems UMDIR is set in set_restart_year.sh, which is called in warm-start-payu.sh or warm-start-csiro.sh, depending on which you choose.
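
If in doubt, grepping the checked-out experiment directory should show exactly where UMDIR is set and used (assuming those scripts are part of your checkout):

grep -rn --include='*.sh' UMDIR .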

I tried changing

qsub -q normal -P tm70 -l walltime=6000 -l ncpus=384 -l mem=1536GB -N historical -l wd -j n -v PAYU_PATH=/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin,PAYU_FORCE=True,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_anu:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/access+gdata/hh5+gdata/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/python3.10 /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/payu-run

to

qsub -q normal -P tm70 -l walltime=6000 -l ncpus=384 -l mem=1536GB -N historical -l wd -j n -v UMDIR=/g/data/access/umdir,PAYU_PATH=/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin,PAYU_FORCE=True,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_anu:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/access+gdata/hh5+gdata/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/python3.10 /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/payu-run

In other words, I added UMDIR=/g/data/access/umdir to the -v environment variable list passed to qsub, and ran again. This time, the result was

$ cat historical.e107209592
/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/validators.py:198: UserWarning: 
File: work/atmosphere/restart_dump.astart
Field validation failures:
  Fields (1114,1115,1116)
Field grid latitudes inconsistent (STASH grid: 23)
  File            : 145 points from -90.0, spacing 1.25
  Field (Expected): 180 points from -89.5, spacing 1.25
  Field (Lookup)  : 180 points from 89.5, spacing -1.0
Field validation failures:
  Fields (4935,4937,6676,6715)
Skipping Field validation due to irregular lbcode: 
  Field lbcode: 31320
  warnings.warn(msg)
cdo    selyear (Warning): Year 101 not found!

cdo    selyear (Abort): No timesteps selected!
Currently Loaded Modulefiles:
 1) openmpi/4.1.4(default)   2) pbs  
payu: Model exited with error code 9; aborting.

So the STASHmaster messages are no longer displayed, but the validation still fails. I don’t know where year 101 is coming from, but config.yaml has

calendar:
    start:
        # Check also 'MODEL_BASIS_TIME' in atmosphere namelists,
        # 'inidate' in ice namelists
        year: 1850
...

So perhaps one of these namelists is misconfigured?
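
As a quick check of the comment’s suggestion, grepping the namelists should show what basis time and initial date the model is actually being given (the directory names are the ones payu lists above):

grep -rin MODEL_BASIS_TIME atmosphere/
grep -rin inidate ice/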

So @Aidan, does this mean that I am running the payu commands in the wrong order, or could there be something else misconfigured such that the warm-start-*.sh scripts are not called? Also, is it not necessary to define UMDIR when doing a cold start?

The cdo command is being called in pre.sh
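
Presumably it’s a cdo selyear call, roughly of this shape (the year and file names here are placeholders, not the actual arguments in pre.sh):

cdo selyear,101 restart_in.nc restart_out.nc   # aborts with "No timesteps selected" if year 101 is absent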

Is that helpful?

Debugging tip: GitHub doesn’t index non-default branches, so if you want to search for strings, either search the checked-out files locally, or use the older versions, which had each experiment config in a separate repo, as those are on the main branch. They’ve not diverged that much, so it’s a useful way to find things:

https://github.com/search?q=repo%3Acoecms%2Fesm-historical+cdo&type=code
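
Locally, git grep does the same job on whatever branch you have checked out, e.g.:

git grep -n cdo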

I just noticed that the historical branch has

restart: /g/data/access/payu/access-esm/restart/pre-industrial

in config.yaml. This does not make much sense to me. I was going to try changing it to

restart: /g/data/access/payu/access-esm/restart/historical

to see if that helps, but then I noticed that the restart directories are not interchangeable; they have different directory structures:

[pcl851@gadi-login-04 restart]$ pwd
/g/data/access/payu/access-esm/restart
[pcl851@gadi-login-04 restart]$ ls *
amip:
AM-09-t1.astart-19780101  restart_dump.astart  restart_dump.astart.orig

historical:
PI-01.astart-05410101

pmip-lm:
atmosphere  coupler  ice  ocean  README

pre-industrial:
atmosphere  coupler  ice  ocean

So changing from pre-industrial to historical won’t work.

I might have to just go back to trying pre-industrial again.

I tried again from scratch. Here is a summary of what I did and what I saw as output:

[pcl851@gadi-login-08 ~]$ cd /g/data/tm70/pcl851/src/coecms
[pcl851@gadi-login-08 coecms]$ module use /g/data/hh5/public/modules/
[pcl851@gadi-login-08 coecms]$ module load conda/analysis3-23.07
[pcl851@gadi-login-08 coecms]$ git clone https://github.com/coecms/access-esm
Cloning into 'access-esm'...
remote: Enumerating objects: 1625, done.
remote: Counting objects: 100% (1625/1625), done.
remote: Compressing objects: 100% (575/575), done.
remote: Total 1625 (delta 1042), reused 1621 (delta 1040), pack-reused 0
Receiving objects: 100% (1625/1625), 2.79 MiB | 16.16 MiB/s, done.
Resolving deltas: 100% (1042/1042), done.
[pcl851@gadi-login-08 access-esm]$ git checkout pre-industrial
branch 'pre-industrial' set up to track 'origin/pre-industrial'.
Switched to a new branch 'pre-industrial'
[pcl851@gadi-login-08 access-esm]$ git checkout -b pre-industrial-test
Switched to a new branch 'pre-industrial-test'
[pcl851@gadi-login-08 access-esm]$ git status -uno
On branch pre-industrial-test
nothing to commit (use -u to show untracked files)
[pcl851@gadi-login-08 access-esm]$ payu --version
payu 1.0.19
[pcl851@gadi-login-08 access-esm]$ payu init
laboratory path:  /scratch/tm70/pcl851/access-esm
binary path:  /scratch/tm70/pcl851/access-esm/bin
input path:  /scratch/tm70/pcl851/access-esm/input
work path:  /scratch/tm70/pcl851/access-esm/work
archive path:  /scratch/tm70/pcl851/access-esm/archive
[pcl851@gadi-login-08 access-esm]$ ls
atmosphere  config.yaml  coupler  ice  manifests  ocean  README.md
[pcl851@gadi-login-08 access-esm]$ which payu
/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/payu
[pcl851@gadi-login-08 access-esm]$ payu run
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P tm70 -l walltime=11400 -l ncpus=384 -l mem=1536GB -N pre-industrial -l wd -j n -v PAYU_PATH=/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_anu:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/access+gdata/hh5+gdata/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/python3.10 /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/payu-run
107218724.gadi-pbs
[pcl851@gadi-login-08 access-esm]$ qstat -wax

gadi-pbs: 
                                                                                                   Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
107208813.gadi-pbs             pcl851          normal-exec     historical       2277040    8   384  1536g 01:40 F 00:00:52
107209557.gadi-pbs             pcl851          normal-exec     historical       2173353    8   384  1536g 01:40 F 00:00:42
107209592.gadi-pbs             pcl851          normal-exec     historical        164690    8   384  1536g 01:40 F 00:00:54
107218724.gadi-pbs             pcl851          normal-exec     pre-industrial   3517094    8   384  1536g 03:10 F 00:00:49
[pcl851@gadi-login-08 access-esm]$ cat pre-industrial.e107218724 
Currently Loaded Modulefiles:
 1) openmpi/4.1.4(default)   2) pbs  
payu: Model exited with error code 139; aborting.
[pcl851@gadi-login-08 access-esm]$ grep -A2 forrtl access.err|grep um7.3x|wc -l
180
[pcl851@gadi-login-08 access-esm]$ grep -A2 forrtl access.err|grep cicexx|wc -l
12
[pcl851@gadi-login-08 access-esm]$ grep -A2 forrtl access.err|grep mom5xx|wc -l
180
[pcl851@gadi-login-08 access-esm]$ grep -i seg access.err|sort
[gadi-cpu-clx-1496:3517781:0:3517781] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1496:3517849:0:3517849] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1496:3517864:0:3517864] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1497:3482671:0:3482671] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1497:3482716:0:3482716] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1497:3482731:0:3482731] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1498:4121123:0:4121123] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1498:4121162:0:4121162] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1498:4121181:0:4121181] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1499:2842343:0:2842343] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1499:2842396:0:2842396] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1499:2842411:0:2842411] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
mpirun noticed that process rank 16 with PID 0 on node gadi-cpu-clx-1496 exited on signal 11 (Segmentation fault).
[pcl851@gadi-login-08 access-esm]$ grep -i seg access.err|grep -v mpirun|wc -l
12
[pcl851@gadi-login-08 access-esm]$ grep mpirun pre-industrial.o107218724 
mpirun  -wdir /scratch/tm70/pcl851/access-esm/work/access-esm/atmosphere -np 192  /scratch/tm70/pcl851/access-esm/work/access-esm/atmosphere/um7.3x : -wdir /scratch/tm70/pcl851/access-esm/work/access-esm/ocean -np 180  /scratch/tm70/pcl851/access-esm/work/access-esm/ocean/mom5xx : -wdir /scratch/tm70/pcl851/access-esm/work/access-esm/ice -np 12  /scratch/tm70/pcl851/access-esm/work/access-esm/ice/cicexx

It looks like 12 of the 192 UM ranks are failing with a segfault. Interestingly, the CPUs look to be contiguous. I will now contact NCI to see if I can get a better idea of what is happening.
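
One more check that may help narrow it down (this just regroups the “Caught signal 11” lines above): counting the segfault messages per node shows how clustered the failures are.

grep 'Caught signal 11' access.err | cut -d: -f1 | sort | uniq -c   # segfault count per node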

I have lodged HELP-193063 with NCI. In the meantime, could someone else with membership of the access, hh5 and tm70 projects please try to reproduce what I have done, so that I can rule out a misconfiguration of the pcl851 home directory?

That might be ok. Historical presumably starts with pre-industrial initial conditions? Is that right @Scott?

I have recreated your error but not managed to get any further than that.

I’m afraid I don’t have time to devote to debugging this. We really need assistance from @holger and/or @Scott at this point I think.

Yes, the historical run will branch off piControl. The basic flow would be spinup (until the model is stable) → piControl → historical (~150 years) → SSP, with the final state of each feeding into the next.

The problem might have to do with the corrupted ESM1.5 pre-industrial restart file that Rachel Law mentioned. See also Porting CSIRO/UMUI ACCESS-ESM1.5 ksh run script to payu

@MartinDix, do you have a working pre-industrial configuration for ACCESS-ESM1.5, and does it use ksh scripts or payu?

See Confused on running a pre-industrial test of ACCESS-ESM1.5 with payu - #17 by MartinDix

The problem is that the restart files used by access-esm/config.yaml at pre-industrial · coecms/access-esm · GitHub are inconsistent with the pre-industrial configuration. @MartinDix uses correct restart files at the commit Use restarts from PI-02 year 101 · coecms/access-esm@0f769ae · GitHub
@MartinDix Is the current GitHub - MartinDix/access-esm at pre-industrial ready to merge back into GitHub - coecms/access-esm at pre-industrial for now?

See also GitHub - penguian/access-esm at pre-industrial-build-gadi and in particular Comparing coecms:pre-industrial...penguian:pre-industrial-build-gadi · coecms/access-esm · GitHub
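
(The same comparison can be made locally, assuming the fork lives at github.com/penguian/access-esm and origin is the coecms clone from above:)

git remote add penguian https://github.com/penguian/access-esm
git fetch penguian
git diff origin/pre-industrial penguian/pre-industrial-build-gadi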