"Run ACCESS-ESM" fails with error code 139

As a step towards creating a Spack build for ACCESS-ESM1.5, I have been trying to follow the “Run ACCESS-ESM1.5” instructions. I have run ESM1.5 according to these instructions at least four times, and each time the run fails with error code 139.
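
For reference, an exit code of 139 means the executable was killed by signal 11 (a segmentation fault), since the shell reports 128 + the signal number. A quick way to confirm the mapping:

kill -l $((139 - 128))   # prints SEGV, i.e. segmentation fault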

Looking at /g/data/tm70/pcl851/src/penguian/esm-pre-industrial/access.err, I see MPI failures in CICE4.1, with output like:

[gadi-cpu-clx-2636:39774:0:39774] ib_mlx5_log.c:168  Remote OP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
[gadi-cpu-clx-2636:39774:0:39774] ib_mlx5_log.c:168  DCI QP 0x148aa wqe[153]: SEND s-e [rqpn 0x6afd rlid 5649] [va 0x15150f3ef280 len 1162 lkey 0x5370c27] 
==== backtrace (tid:  39774) ====
 0 0x0000000000023cab uct_ib_mlx5_completion_with_err()  ???:0
 1 0x0000000000054970 uct_dc_mlx5_iface_set_ep_failed()  ???:0
 2 0x000000000004d398 uct_dc_mlx5_ep_handle_failure()  ???:0
 3 0x000000000004ff62 uct_dc_mlx5_iface_progress_ll()  :0
 4 0x000000000003ee9a ucp_worker_progress()  ???:0
 5 0x0000000000003397 mca_pml_ucx_progress()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/ompi/mca/pml/ucx/pml_ucx.c:515
 6 0x000000000002f72b opal_progress()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/opal/runtime/opal_progress.c:231
 7 0x000000000004f2d5 sync_wait_st()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/opal/threads/wait_sync.h:83
 8 0x000000000004f2d5 ompi_request_default_wait_all()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/ompi/request/req_wait.c:243
 9 0x000000000009213f PMPI_Waitall()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/gcc/ompi/mpi/c/profile/pwaitall.c:80
10 0x00000000000537ed ompi_waitall_f()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/intel/ompi/mpi/fortran/mpif-h/profile/pwaitall_f.c:104
11 0x00000000006e5260 m_transfer_mp_waitrecv__()  ???:0
12 0x00000000006e4106 m_transfer_mp_recv__()  ???:0
13 0x00000000006243fc mod_oasis_advance_mp_oasis_advance_run_()  /g/data/p66/pbd562/test/t47-hxw/jan20/4.0.2/oasis3-mct/lib/psmile/src/mod_oasis_advance.F90:1130
14 0x00000000005ab868 mod_oasis_getput_interface_mp_oasis_get_r28_()  /g/data/p66/pbd562/test/t47-hxw/jan20/4.0.2/oasis3-mct/lib/psmile/src/mod_oasis_getput_interface.F90:760
15 0x0000000000452b7e cpl_interface_mp_from_ocn_()  ???:0
16 0x000000000040eba8 cice_runmod_mp_cice_run_()  ???:0
17 0x000000000040d312 MAIN__()  ???:0
18 0x000000000040d2a2 main()  ???:0
19 0x000000000003ad85 __libc_start_main()  ???:0
20 0x000000000040d1ae _start()  ???:0

I have also built using GitHub - penguian/access-esm-build-gadi (a fork used to migrate the build to GitHub repositories), and in that case I see

[gadi-cpu-clx-0421:1485991:0:1485991] ib_mlx5_log.c:168  Remote OP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
[gadi-cpu-clx-0421:1485991:0:1485991] ib_mlx5_log.c:168  DCI QP 0xacb8 wqe[142]: SEND s-e [rqpn 0x19ca8 rlid 301] [va 0x1499f2769180 len 1162 lkey 0x12cf5c] 
==== backtrace (tid:1485991) ====
 0 0x0000000000023cab uct_ib_mlx5_completion_with_err()  ???:0
 1 0x0000000000054970 uct_dc_mlx5_iface_set_ep_failed()  ???:0
 2 0x000000000004d398 uct_dc_mlx5_ep_handle_failure()  ???:0
 3 0x000000000004ff62 uct_dc_mlx5_iface_progress_ll()  :0
 4 0x000000000003ee9a ucp_worker_progress()  ???:0
 5 0x0000000000003397 mca_pml_ucx_progress()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/ompi/mca/pml/ucx/pml_ucx.c:515
 6 0x000000000002f72b opal_progress()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/opal/runtime/opal_progress.c:231
 7 0x000000000005c200 hcoll_ml_progress_impl()  ???:0
 8 0x0000000000023a92 _coll_ml_allreduce()  ???:0
 9 0x0000000000007bbc mca_coll_hcoll_reduce()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/ompi/mca/coll/hcoll/coll_hcoll_ops.c:278
10 0x0000000000086291 PMPI_Reduce()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/gcc/ompi/mpi/c/profile/preduce.c:139
11 0x0000000000086291 opal_obj_update()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/source/openmpi-4.0.2/opal/class/opal_object.h:513
12 0x0000000000086291 PMPI_Reduce()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/gcc/ompi/mpi/c/profile/preduce.c:142
13 0x00000000000512c3 ompi_reduce_f()  /jobfs/35249569.gadi-pbs/0/openmpi/4.0.2/build/intel/ompi/mpi/fortran/mpif-h/profile/preduce_f.c:87
14 0x00000000005d8a40 mod_oasis_mpi_mp_oasis_mpi_sumr1_()  /home/599/mrd599/cylc-run/u-bp124/share/oasis3-mct_local/lib/psmile/src/mod_oasis_mpi.F90:1497
15 0x00000000007a5a9b mod_oasis_advance_mp_oasis_advance_avdiag_()  /home/599/mrd599/cylc-run/u-bp124/share/oasis3-mct_local/lib/psmile/src/mod_oasis_advance.F90:1984
16 0x0000000000756b39 mod_oasis_advance_mp_oasis_advance_run_()  /home/599/mrd599/cylc-run/u-bp124/share/oasis3-mct_local/lib/psmile/src/mod_oasis_advance.F90:1080
17 0x00000000005b1a34 mod_oasis_getput_interface_mp_oasis_put_r28_()  /home/599/mrd599/cylc-run/u-bp124/share/oasis3-mct_local/lib/psmile/src/mod_oasis_getput_interface.F90:567
18 0x000000000045ec78 cpl_interface_mp_into_atm_()  ???:0
19 0x000000000040ed31 cice_runmod_mp_cice_run_()  ???:0
20 0x000000000040d612 MAIN__()  ???:0
21 0x000000000040d5a2 main()  ???:0
22 0x000000000003ad85 __libc_start_main()  ???:0
23 0x000000000040d4ae _start()  ???:0
  1. Has anyone recently successfully run “Run ACCESS-ESM1.5”?
  2. Has anyone seen this type of MPI error previously?
  3. If so, how did you fix it?
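
(If anyone wants to look at the full set of traces rather than the excerpts above, something like the following pulls them all out of the error file; the path is the one above and the context length is just a guess at the trace depth:)

grep -n -A24 '==== backtrace' /g/data/tm70/pcl851/src/penguian/esm-pre-industrial/access.err | less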

Hi Paul,
I haven’t used that wiki specifically, but I did initiate my experiments using the repo mentioned in it:

That sets you up with the relevant input directories and executables. I found starting from that setup worked. I didn’t get errors until I started changing input files and directories.

Thanks David,
When was the last time you ran an ACCESS-ESM1.5 pre-industrial experiment from that repository?

It would have been close to a year ago that I first cloned the repo and ran a test case. Since then I’ve just adapted from the original.

I tried using the historical branch of GitHub - coecms/access-esm: Main Repository for ACCESS-ESM configurations to see whether it has the same problem, and it has different problems. It looks to me like the historical branch configuration is not compatible with the conda/analysis3-23.07 environment. In particular, I don’t know where the UMDIR environment variable is supposed to be set, or what it should be set to. Has anyone else (e.g. @Aidan, @MartinDix) recently run the historical branch configuration unchanged, out of the box?

[pcl851@gadi-login-04 access-esm]$ cat historical.e*
/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/stashmaster.py:259: UserWarning: 
Unable to load STASHmaster from version string, path does not exist
Path: $UMDIR/vn7.3/ctldata/STASHmaster/STASHmaster_A
Please check that the value of mule.stashmaster.STASHMASTER_PATH_PATTERN is correct for your site/configuration
  warnings.warn(msg)
/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/validators.py:198: UserWarning: 
File: work/atmosphere/restart_dump.astart
Field validation failures:
  Fields (1114,1115,1116)
Field grid longitudes inconsistent
  File grid : 0.0 to 358.125, spacing 1.875
  Field grid: 0.5 to 359.5, spacing 1.0
  Extents should be within 1 field grid-spacing
Field validation failures:
  Fields (4935,4937,6676,6715)
Skipping Field validation due to irregular lbcode: 
  Field lbcode: 31320
  warnings.warn(msg)
cdo    selyear (Warning): Year 101 not found!

cdo    selyear (Abort): No timesteps selected!
Currently Loaded Modulefiles:
 1) pbs   2) openmpi/4.1.4(default)  
payu: Model exited with error code 9; aborting.
[pcl851@gadi-login-04 access-esm]$ echo $UMDIR

Seems UMDIR is set in set_restart_year.sh, which is called in warm-start-payu.sh or warm-start-csiro.sh, depending on which you choose.
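
If in doubt, grepping the checked-out experiment directory should show exactly where UMDIR is set and used (assuming those scripts are part of your checkout):

grep -rn --include='*.sh' UMDIR .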

I tried changing

qsub -q normal -P tm70 -l walltime=6000 -l ncpus=384 -l mem=1536GB -N historical -l wd -j n -v PAYU_PATH=/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin,PAYU_FORCE=True,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_anu:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/access+gdata/hh5+gdata/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/python3.10 /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/payu-run

to

qsub -q normal -P tm70 -l walltime=6000 -l ncpus=384 -l mem=1536GB -N historical -l wd -j n -v UMDIR=/g/data/access/umdir,PAYU_PATH=/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin,PAYU_FORCE=True,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_anu:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/access+gdata/hh5+gdata/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/python3.10 /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/payu-run

In other words, I added UMDIR=/g/data/access/umdir to the -v environment variable list passed to qsub, and ran again. This time, the result was

$ cat historical.e107209592
/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/validators.py:198: UserWarning: 
File: work/atmosphere/restart_dump.astart
Field validation failures:
  Fields (1114,1115,1116)
Field grid latitudes inconsistent (STASH grid: 23)
  File            : 145 points from -90.0, spacing 1.25
  Field (Expected): 180 points from -89.5, spacing 1.25
  Field (Lookup)  : 180 points from 89.5, spacing -1.0
Field validation failures:
  Fields (4935,4937,6676,6715)
Skipping Field validation due to irregular lbcode: 
  Field lbcode: 31320
  warnings.warn(msg)
cdo    selyear (Warning): Year 101 not found!

cdo    selyear (Abort): No timesteps selected!
Currently Loaded Modulefiles:
 1) openmpi/4.1.4(default)   2) pbs  
payu: Model exited with error code 9; aborting.

So the STASHmaster messages are no longer displayed, but the validation still fails. I don’t know where year 101 is coming from, but config.yaml has

calendar:
    start:
        # Check also 'MODEL_BASIS_TIME' in atmosphere namelists,
        # 'inidate' in ice namelists
        year: 1850
...

So perhaps one of these namelists is misconfigured?
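
As a quick check of the comment’s suggestion, grepping the namelists should show what basis time and initial date the model is actually being given (the directory names are the ones payu lists above):

grep -rin MODEL_BASIS_TIME atmosphere/
grep -rin inidate ice/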

So @Aidan, does this mean that I am running the payu commands in the wrong order, or could there be something else misconfigured such that the warm-start-*.sh scripts are not called? Also, is it not necessary to define UMDIR when doing a cold start?

The cdo command is being called in pre.sh
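
Presumably it’s a cdo selyear call, roughly of this shape (the year and file names here are placeholders, not the actual arguments in pre.sh):

cdo selyear,101 restart_in.nc restart_out.nc   # aborts with "No timesteps selected" if year 101 is absent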

Is that helpful?

Debugging tip: GitHub doesn’t index non-default branches, so if you want to search for strings, either search the checked-out files locally, or use the older versions, which had each experiment config in a separate repo, as those are on the main branch. They’ve not diverged that much, so it’s a useful way to find things:

https://github.com/search?q=repo%3Acoecms%2Fesm-historical+cdo&type=code
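
Locally, git grep does the same job on whatever branch you have checked out, e.g.:

git grep -n cdo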

I just noticed that the historical branch has

restart: /g/data/access/payu/access-esm/restart/pre-industrial

in config.yaml. This does not make much sense to me. I was going to try changing it to

restart: /g/data/access/payu/access-esm/restart/historical

to see if that helps, but then I noticed that the restart directories are not interchangeable; they have different directory structures:

[pcl851@gadi-login-04 restart]$ pwd
/g/data/access/payu/access-esm/restart
[pcl851@gadi-login-04 restart]$ ls *
amip:
AM-09-t1.astart-19780101  restart_dump.astart  restart_dump.astart.orig

historical:
PI-01.astart-05410101

pmip-lm:
atmosphere  coupler  ice  ocean  README

pre-industrial:
atmosphere  coupler  ice  ocean

So changing from pre-industrial to historical won’t work.

I might have to just go back to trying pre-industrial again.

I tried again from scratch. Here is a summary of what I did and what I saw as output:

[pcl851@gadi-login-08 ~]$ cd /g/data/tm70/pcl851/src/coecms
[pcl851@gadi-login-08 coecms]$ module use /g/data/hh5/public/modules/
[pcl851@gadi-login-08 coecms]$ module load conda/analysis3-23.07
[pcl851@gadi-login-08 coecms]$ git clone https://github.com/coecms/access-esm
Cloning into 'access-esm'...
remote: Enumerating objects: 1625, done.
remote: Counting objects: 100% (1625/1625), done.
remote: Compressing objects: 100% (575/575), done.
remote: Total 1625 (delta 1042), reused 1621 (delta 1040), pack-reused 0
Receiving objects: 100% (1625/1625), 2.79 MiB | 16.16 MiB/s, done.
Resolving deltas: 100% (1042/1042), done.
[pcl851@gadi-login-08 access-esm]$ git checkout pre-industrial
branch 'pre-industrial' set up to track 'origin/pre-industrial'.
Switched to a new branch 'pre-industrial'
[pcl851@gadi-login-08 access-esm]$ git checkout -b pre-industrial-test
Switched to a new branch 'pre-industrial-test'
[pcl851@gadi-login-08 access-esm]$ git status -uno
On branch pre-industrial-test
nothing to commit (use -u to show untracked files)
[pcl851@gadi-login-08 access-esm]$ payu --version
payu 1.0.19
[pcl851@gadi-login-08 access-esm]$ payu init
laboratory path:  /scratch/tm70/pcl851/access-esm
binary path:  /scratch/tm70/pcl851/access-esm/bin
input path:  /scratch/tm70/pcl851/access-esm/input
work path:  /scratch/tm70/pcl851/access-esm/work
archive path:  /scratch/tm70/pcl851/access-esm/archive
[pcl851@gadi-login-08 access-esm]$ ls
atmosphere  config.yaml  coupler  ice  manifests  ocean  README.md
[pcl851@gadi-login-08 access-esm]$ which payu
/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/payu
[pcl851@gadi-login-08 access-esm]$ payu run
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P tm70 -l walltime=11400 -l ncpus=384 -l mem=1536GB -N pre-industrial -l wd -j n -v PAYU_PATH=/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_anu:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/access+gdata/hh5+gdata/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/python3.10 /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/payu-run
107218724.gadi-pbs
[pcl851@gadi-login-08 access-esm]$ qstat -wax

gadi-pbs: 
                                                                                                   Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
107208813.gadi-pbs             pcl851          normal-exec     historical       2277040    8   384  1536g 01:40 F 00:00:52
107209557.gadi-pbs             pcl851          normal-exec     historical       2173353    8   384  1536g 01:40 F 00:00:42
107209592.gadi-pbs             pcl851          normal-exec     historical        164690    8   384  1536g 01:40 F 00:00:54
107218724.gadi-pbs             pcl851          normal-exec     pre-industrial   3517094    8   384  1536g 03:10 F 00:00:49
[pcl851@gadi-login-08 access-esm]$ cat pre-industrial.e107218724 
Currently Loaded Modulefiles:
 1) openmpi/4.1.4(default)   2) pbs  
payu: Model exited with error code 139; aborting.
[pcl851@gadi-login-08 access-esm]$ grep -A2 forrtl access.err|grep um7.3x|wc -l
180
[pcl851@gadi-login-08 access-esm]$ grep -A2 forrtl access.err|grep cicexx|wc -l
12
[pcl851@gadi-login-08 access-esm]$ grep -A2 forrtl access.err|grep mom5xx|wc -l
180
[pcl851@gadi-login-08 access-esm]$ grep -i seg access.err|sort
[gadi-cpu-clx-1496:3517781:0:3517781] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1496:3517849:0:3517849] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1496:3517864:0:3517864] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1497:3482671:0:3482671] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1497:3482716:0:3482716] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1497:3482731:0:3482731] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1498:4121123:0:4121123] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1498:4121162:0:4121162] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1498:4121181:0:4121181] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1499:2842343:0:2842343] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1499:2842396:0:2842396] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[gadi-cpu-clx-1499:2842411:0:2842411] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
mpirun noticed that process rank 16 with PID 0 on node gadi-cpu-clx-1496 exited on signal 11 (Segmentation fault).
[pcl851@gadi-login-08 access-esm]$ grep -i seg access.err|grep -v mpirun|wc -l
12
[pcl851@gadi-login-08 access-esm]$ grep mpirun pre-industrial.o107218724 
mpirun  -wdir /scratch/tm70/pcl851/access-esm/work/access-esm/atmosphere -np 192  /scratch/tm70/pcl851/access-esm/work/access-esm/atmosphere/um7.3x : -wdir /scratch/tm70/pcl851/access-esm/work/access-esm/ocean -np 180  /scratch/tm70/pcl851/access-esm/work/access-esm/ocean/mom5xx : -wdir /scratch/tm70/pcl851/access-esm/work/access-esm/ice -np 12  /scratch/tm70/pcl851/access-esm/work/access-esm/ice/cicexx

It looks like 12 of the 192 UM ranks are failing with a segfault. Interestingly, the CPUs look to be contiguous. I will now contact NCI to see if I can get a better idea of what is happening.
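
One more check that may help narrow it down (this just regroups the “Caught signal 11” lines above): counting the segfault messages per node shows how clustered the failures are.

grep 'Caught signal 11' access.err | cut -d: -f1 | sort | uniq -c   # segfault count per node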

I have lodged HELP-193063 with NCI. In the meantime, could someone else with membership of the access, hh5 and tm70 projects please try to reproduce what I have done, so that I can rule out a misconfiguration of the pcl851 home directory?

That might be ok. Historical presumably starts with pre-industrial initial conditions? Is that right @Scott?

I have recreated your error but not managed to get any further than that.

I’m afraid I don’t have time to devote to debugging this. We really need assistance from @holger and/or @Scott at this point I think.

Yes, the historical run will branch off piControl. The basic flow would be spinup (until the model is stable) → piControl → historical (~150 years) → SSP, with the final state of each feeding into the next.

The problem might have to do with the corrupted ESM1.5 pre-industrial restart file that Rachel Law mentioned. See also Porting CSIRO/UMUI ACCESS-ESM1.5 ksh run script to payu

@MartinDix, do you have a working pre-industrial configuration for ACCESS-ESM1.5, and does it use ksh scripts or payu?

See Confused on running a pre-industrial test of ACCESS-ESM1.5 with payu - #17 by MartinDix

The problem is that the restart files used by access-esm/config.yaml at pre-industrial · coecms/access-esm · GitHub are inconsistent with the pre-industrial configuration. @MartinDix uses correct restart files at the commit Use restarts from PI-02 year 101 · coecms/access-esm@0f769ae · GitHub
@MartinDix Is the current GitHub - MartinDix/access-esm at pre-industrial ready to merge back into GitHub - coecms/access-esm at pre-industrial for now?

See also GitHub - penguian/access-esm at pre-industrial-build-gadi and in particular Comparing coecms:pre-industrial...penguian:pre-industrial-build-gadi · coecms/access-esm · GitHub
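
(The same comparison can be made locally, assuming the fork lives at github.com/penguian/access-esm and origin is the coecms clone from above:)

git remote add penguian https://github.com/penguian/access-esm
git fetch penguian
git diff origin/pre-industrial penguian/pre-industrial-build-gadi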