As a step towards creating a Spack build for ACCESS-ESM1.5, I have been trying to follow the “Run ACCESS-ESM1.5” instructions. I have attempted the run at least four times, and each time it fails with error code 139.
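
For reference, exit code 139 appears to follow the usual shell convention of 128 + signal number, i.e. signal 11 (SIGSEGV), so one of the executables is segfaulting. A quick, purely illustrative check of that arithmetic (not part of the run scripts):

```c
#include <signal.h>
#include <stdio.h>

int main(void) {
    /* Decode the reported exit status using the 128 + signal-number convention. */
    int status = 139;                              /* exit code from the failed run */
    printf("signal %d (SIGSEGV is %d)\n",          /* prints: signal 11 (SIGSEGV is 11) */
           status - 128, SIGSEGV);
    return 0;
}
```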
Looking at /g/data/tm70/pcl851/src/penguian/esm-pre-industrial/access.err, I see MPI failures in CICE4.1, with output like:
```
[gadi-cpu-clx-2636:39774:0:39774] ib_mlx5_log.c:168 Remote OP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
[gadi-cpu-clx-2636:39774:0:39774] ib_mlx5_log.c:168 DCI QP 0x148aa wqe[153]: SEND s-e [rqpn 0x6afd rlid 5649] [va 0x15150f3ef280 len 1162 lkey 0x5370c27]
==== backtrace (tid: 39774) ====
0 0x0000000000023cab uct_ib_mlx5_completion_with_err() ???:0
1 0x0000000000054970 uct_dc_mlx5_iface_set_ep_failed() ???:0
2 0x000000000004d398 uct_dc_mlx5_ep_handle_failure() ???:0
3 0x000000000004ff62 uct_dc_mlx5_iface_progress_ll() :0
4 0x000000000003ee9a ucp_worker_progress() ???:0
5 0x0000000000003397 mca_pml_ucx_progress() /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/ompi/mca/pml/ucx/pml_ucx.c:515
6 0x000000000002f72b opal_progress() /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/opal/runtime/opal_progress.c:231
7 0x000000000004f2d5 sync_wait_st() /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/opal/threads/wait_sync.h:83
8 0x000000000004f2d5 ompi_request_default_wait_all() /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/ompi/request/req_wait.c:243
9 0x000000000009213f PMPI_Waitall() /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/gcc/ompi/mpi/c/profile/pwaitall.c:80
10 0x00000000000537ed ompi_waitall_f() /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/intel/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
11 0x00000000006e5260 m_transfer_mp_waitrecv__() ???:0
12 0x00000000006e4106 m_transfer_mp_recv__() ???:0
13 0x00000000006243fc mod_oasis_advance_mp_oasis_advance_run_() /g/data/p66/pbd562/test/t47-hxw/jan20/4.0.2/oasis3-mct/lib/psmile/src/mod_oasis_advance.F90:1130
14 0x00000000005ab868 mod_oasis_getput_interface_mp_oasis_get_r28_() /g/data/p66/pbd562/test/t47-hxw/jan20/4.0.2/oasis3-mct/lib/psmile/src/mod_oasis_getput_interface.F90:760
15 0x0000000000452b7e cpl_interface_mp_from_ocn_() ???:0
16 0x000000000040eba8 cice_runmod_mp_cice_run_() ???:0
17 0x000000000040d312 MAIN__() ???:0
18 0x000000000040d2a2 main() ???:0
19 0x000000000003ad85 __libc_start_main() ???:0
20 0x000000000040d1ae _start() ???:0
```
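
For context, this trace shows CICE blocked in the coupler receiving fields from the ocean: from_ocn calls oasis_get, which completes non-blocking receives with MPI_Waitall, and the UCX/InfiniBand error surfaces while that wait is progressing. The communication pattern at the crash site is of this general form (an illustrative C sketch only, not the actual OASIS3-MCT/CICE code; the field count, sizes, tags, and ranks are made up):

```c
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { NFIELDS = 4, NPOINTS = 1024 };      /* made-up coupling field count and size */
    double fields[NFIELDS][NPOINTS];
    memset(fields, 0, sizeof fields);

    if (size >= 2 && rank == 1) {
        /* "Ice" side: post one non-blocking receive per coupling field,
         * then complete them all at once -- the MPI_Waitall seen in the trace. */
        MPI_Request reqs[NFIELDS];
        for (int i = 0; i < NFIELDS; ++i)
            MPI_Irecv(fields[i], NPOINTS, MPI_DOUBLE, /*source=*/0,
                      /*tag=*/100 + i, MPI_COMM_WORLD, &reqs[i]);
        MPI_Waitall(NFIELDS, reqs, MPI_STATUSES_IGNORE);
    } else if (size >= 2 && rank == 0) {
        /* "Ocean" side: matching sends for the receives above. */
        for (int i = 0; i < NFIELDS; ++i)
            MPI_Send(fields[i], NPOINTS, MPI_DOUBLE, /*dest=*/1,
                     /*tag=*/100 + i, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```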
I have also built ESM1.5 using [penguian/access-esm-build-gadi](https://github.com/penguian/access-esm-build-gadi) (a fork to be used to migrate the build to using GitHub repositories), and in that case I see:
```
[gadi-cpu-clx-0421:1485991:0:1485991] ib_mlx5_log.c:168 Remote OP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
[gadi-cpu-clx-0421:1485991:0:1485991] ib_mlx5_log.c:168 DCI QP 0xacb8 wqe[142]: SEND s-e [rqpn 0x19ca8 rlid 301] [va 0x1499f2769180 len 1162 lkey 0x12cf5c]
==== backtrace (tid:1485991) ====
0 0x0000000000023cab uct_ib_mlx5_completion_with_err() ???:0
1 0x0000000000054970 uct_dc_mlx5_iface_set_ep_failed() ???:0
2 0x000000000004d398 uct_dc_mlx5_ep_handle_failure() ???:0
3 0x000000000004ff62 uct_dc_mlx5_iface_progress_ll() :0
4 0x000000000003ee9a ucp_worker_progress() ???:0
5 0x0000000000003397 mca_pml_ucx_progress() /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/ompi/mca/pml/ucx/pml_ucx.c:515
6 0x000000000002f72b opal_progress() /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/opal/runtime/opal_progress.c:231
7 0x000000000005c200 hcoll_ml_progress_impl() ???:0
8 0x0000000000023a92 _coll_ml_allreduce() ???:0
9 0x0000000000007bbc mca_coll_hcoll_reduce() /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/ompi/mca/coll/hcoll/coll_hcoll_ops.c:278
10 0x0000000000086291 PMPI_Reduce() /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/gcc/ompi/mpi/c/profile/preduce.c:139
11 0x0000000000086291 opal_obj_update() /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/opal/class/opal_object.h:513
12 0x0000000000086291 PMPI_Reduce() /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/gcc/ompi/mpi/c/profile/preduce.c:142
13 0x00000000000512c3 ompi_reduce_f() /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/intel/ompi/mpi/fortran/mpif-h/profile/preduce_f.c:87
14 0x00000000005d8a40 mod_oasis_mpi_mp_oasis_mpi_sumr1_() /home/599/mrd599/cylc-run/u-bp124/share/oasis3-mct_local/lib/psmile/src/mod_oasis_mpi.F90:1497
15 0x00000000007a5a9b mod_oasis_advance_mp_oasis_advance_avdiag_() /home/599/mrd599/cylc-run/u-bp124/share/oasis3-mct_local/lib/psmile/src/mod_oasis_advance.F90:1984
16 0x0000000000756b39 mod_oasis_advance_mp_oasis_advance_run_() /home/599/mrd599/cylc-run/u-bp124/share/oasis3-mct_local/lib/psmile/src/mod_oasis_advance.F90:1080
17 0x00000000005b1a34 mod_oasis_getput_interface_mp_oasis_put_r28_() /home/599/mrd599/cylc-run/u-bp124/share/oasis3-mct_local/lib/psmile/src/mod_oasis_getput_interface.F90:567
18 0x000000000045ec78 cpl_interface_mp_into_atm_() ???:0
19 0x000000000040ed31 cice_runmod_mp_cice_run_() ???:0
20 0x000000000040d612 MAIN__() ???:0
21 0x000000000040d5a2 main() ???:0
22 0x000000000003ad85 __libc_start_main() ???:0
23 0x000000000040d4ae _start() ???:0
```
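
This second trace fails in a different place: oasis_mpi_sumr1 performs a global sum via MPI_Reduce (routed through Mellanox hcoll) for coupling-field diagnostics while CICE puts fields to the atmosphere (into_atm → oasis_put). The failing operation is of this general form (illustrative only; the communicator, root, and values are placeholders):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Placeholder per-rank partial sum; oasis_mpi_sumr1 reduces real
     * diagnostic sums over the coupling communicator. */
    double local = (double)rank + 1.0;
    double global = 0.0;

    /* The global sum that fails in the trace; with hcoll enabled this
     * MPI_Reduce is executed by Mellanox's collective library. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, /*root=*/0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g\n", global);

    MPI_Finalize();
    return 0;
}
```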
- Has anyone recently run ACCESS-ESM1.5 successfully by following the “Run ACCESS-ESM1.5” instructions?
- Has anyone seen this type of MPI error previously?
- If so, how did you fix it?