Payu: resubmission log is not cleared/moved after a successful run

I am running MOM6 and using the resub.sh script, set in config.yaml, to resubmit jobs when common errors occur.
The resubmit.log doesn’t get moved anywhere after each run or when I do payu sweep. I have now hit the maximum resubmission count, even though these were not all attempts to run the same month. Is there a better way to handle this than deleting the log file or moving it somewhere?

Can you paste in the log file here, and the resub.sh script, or a link to it on a GitHub repo?

I’ve added “payu” to the title and added the payu tag. I hope you don’t mind.

I can’t upload these files, but this is resub.sh:

#!/usr/bin/bash

logfile='resubmit.log'
counterfile='resubmit.count'
outfile='mom6.err'

MAX_RESUBMISSIONS=6
date >> ${logfile}

# Define errors from which a resubmit is appropriate
declare -a errors=(
  "Segmentation fault: address not mapped to object"
  "Segmentation fault: invalid permissions for mapped object"
  "Transport retry count exceeded"
  "ORTE has lost communication with a remote daemon"
  "MPI_ERRORS_ARE_FATAL"
)

resub=false
for error in "${errors[@]}"
do
  if grep -q "${error}" ${outfile}
  then
     echo "Error found: ${error}" >> ${logfile}
     resub=true
     break
  else
     echo "Error not found: ${error}" >> ${logfile}
  fi
done

if ! ${resub}
then
  echo "Error not eligible for resubmission" >> ${logfile}
  exit 0
fi

if [ -f "${counterfile}" ]
then
  PAYU_N_RESUB=$(cat ${counterfile})
else
  echo "Reset resubmission counter" >> ${logfile}
  PAYU_N_RESUB=${MAX_RESUBMISSIONS}
fi

echo "Resubmission counter: ${PAYU_N_RESUB}" >> ${logfile}

if [[ "${PAYU_N_RESUB}" -gt 0 ]]
then
  # Sweep and re-run
  ${PAYU_PATH}/payu sweep >> ${logfile}
  ${PAYU_PATH}/payu run -n ${PAYU_N_RUNS} >> ${logfile}
  # Decrement resub counter and save to counter file
  ((PAYU_N_RESUB=PAYU_N_RESUB-1))
  echo "${PAYU_N_RESUB}" > ${counterfile}
else
  echo "Resubmit limit reached ... " >> ${logfile}
  rm ${counterfile}
fi

echo "" >> ${logfile}

And here is the log file:

Tue Apr 11 12:47:34 AEST 2023
Error found: Segmentation fault: address not mapped to object
Reset resubmission counter
Resubmission counter: 6
laboratory path:  /scratch/e14/cs6673/mom6
binary path:  /scratch/e14/cs6673/mom6/bin
input path:  /scratch/e14/cs6673/mom6/input
work path:  /scratch/e14/cs6673/mom6/work
archive path:  /scratch/e14/cs6673/mom6/archive
Moving log mom6_panan-005.o78996287
Moving log mom6_panan-005.e78996287
Removing work path /scratch/e14/cs6673/mom6/work/panan_005deg_jra55_ryf
Removing symlink /home/142/cs6673/payu/panan_005deg_jra55_ryf/work
79008476.gadi-pbs
payu: warning: Job request includes 27 unused CPUs.
payu: warning: CPU request increased from 3717 to 3744
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P oz91 -l walltime=18000 -l ncpus=3744 -l mem=14976GB -l jobfs=10GB -N mom6_panan-005 -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin,PAYU_N_RUNS=19,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -l storage=gdata/e14+gdata/hh5+gdata/ik11+gdata/x77+scratch/e14 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/payu-run

Sat Apr 15 13:16:10 AEST 2023
Error not found: Segmentation fault: address not mapped to object
Error not found: Segmentation fault: invalid permissions for mapped object
Error found: Transport retry count exceeded
Resubmission counter: 5
laboratory path:  /scratch/e14/cs6673/mom6
binary path:  /scratch/e14/cs6673/mom6/bin
input path:  /scratch/e14/cs6673/mom6/input
work path:  /scratch/e14/cs6673/mom6/work
archive path:  /scratch/e14/cs6673/mom6/archive
Moving log mom6_panan-005.o79333676
Moving log mom6_panan-005.e79333676
Moving log mom6_panan-005.o79349441
Moving log mom6_panan-005.e79349441
Moving log mom6_panan-005.o79365785
Moving log mom6_panan-005.e79365785
Moving log mom6_panan-005.o79386696
Moving log mom6_panan-005.e79386696
Moving log mom6_panan-005.o79406062
Moving log mom6_panan-005.e79406062
Moving log mom6_panan-005.o79420868
Moving log mom6_panan-005.e79420868
Moving log mom6_panan-005.o79434231
Moving log mom6_panan-005.e79434231
Moving log mom6_panan-005.o79445319
Moving log mom6_panan-005.e79445319
Moving log mom6_panan-005.o79458386
Moving log mom6_panan-005.e79458386
Removing work path /scratch/e14/cs6673/mom6/work/panan_005deg_jra55_ryf
Removing symlink /home/142/cs6673/payu/panan_005deg_jra55_ryf/work
79473613.gadi-pbs
payu: warning: Job request includes 27 unused CPUs.
payu: warning: CPU request increased from 3717 to 3744
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P oz91 -l walltime=18000 -l ncpus=3744 -l mem=14976GB -l jobfs=10GB -N mom6_panan-005 -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin,PAYU_N_RUNS=3,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -l storage=gdata/e14+gdata/hh5+gdata/ik11+gdata/x77+scratch/e14 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/payu-run

Mon Apr 17 17:02:32 AEST 2023
Error found: Segmentation fault: address not mapped to object
Resubmission counter: 4
laboratory path:  /scratch/e14/cs6673/mom6
binary path:  /scratch/e14/cs6673/mom6/bin
input path:  /scratch/e14/cs6673/mom6/input
work path:  /scratch/e14/cs6673/mom6/work
archive path:  /scratch/e14/cs6673/mom6/archive
Removing work path /scratch/e14/cs6673/mom6/work/panan_005deg_jra55_ryf
Removing symlink /home/142/cs6673/payu/panan_005deg_jra55_ryf/work
79646817.gadi-pbs
payu: warning: Job request includes 27 unused CPUs.
payu: warning: CPU request increased from 3717 to 3744
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P oz91 -l walltime=18000 -l ncpus=3744 -l mem=14976GB -l jobfs=10GB -N mom6_panan-005 -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin,PAYU_N_RUNS=59,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -l storage=gdata/e14+gdata/hh5+gdata/ik11+gdata/x77+scratch/e14 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/payu-run

Wed Apr 19 08:39:30 AEST 2023
Error found: Segmentation fault: address not mapped to object
Resubmission counter: 3
laboratory path:  /scratch/e14/cs6673/mom6
binary path:  /scratch/e14/cs6673/mom6/bin
input path:  /scratch/e14/cs6673/mom6/input
work path:  /scratch/e14/cs6673/mom6/work
archive path:  /scratch/e14/cs6673/mom6/archive
Moving log mom6_panan-005.o79639702
Moving log mom6_panan-005.e79639702
Moving log mom6_panan-005.o79646817
Moving log mom6_panan-005.e79646817
Moving log mom6_panan-005.o79665502
Moving log mom6_panan-005.e79665502
Moving log mom6_panan-005.o79691918
Moving log mom6_panan-005.e79691918
Moving log mom6_panan-005.o79715198
Moving log mom6_panan-005.e79715198
Moving log mom6_panan-005.o79735754
Moving log mom6_panan-005.e79735754
Moving log mom6_panan-005.o79755925
Moving log mom6_panan-005.e79755925
Moving log mom6_panan-005.o79768215
Moving log mom6_panan-005.e79768215
Moving log mom6_panan-005.o79805445
Moving log mom6_panan-005.e79805445
Moving log mom6_panan-005.o79849682
Moving log mom6_panan-005.e79849682
Moving log mom6_panan-005.o79891317
Moving log mom6_panan-005.e79891317
Moving log mom6_panan-005.o79932996
Moving log mom6_panan-005.e79932996
Moving log mom6_panan-005.o79960270
Moving log mom6_panan-005.e79960270
Removing work path /scratch/e14/cs6673/mom6/work/panan_005deg_jra55_ryf
Removing symlink /home/142/cs6673/payu/panan_005deg_jra55_ryf/work
80000702.gadi-pbs
payu: warning: Job request includes 27 unused CPUs.
payu: warning: CPU request increased from 3717 to 3744
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P oz91 -l walltime=18000 -l ncpus=3744 -l mem=14976GB -l jobfs=10GB -N mom6_panan-005 -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin,PAYU_N_RUNS=47,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -l storage=gdata/e14+gdata/hh5+gdata/ik11+gdata/x77+scratch/e14 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/payu-run

Wed Apr 19 22:01:16 AEST 2023
Error found: Segmentation fault: address not mapped to object
Resubmission counter: 2
laboratory path:  /scratch/e14/cs6673/mom6
binary path:  /scratch/e14/cs6673/mom6/bin
input path:  /scratch/e14/cs6673/mom6/input
work path:  /scratch/e14/cs6673/mom6/work
archive path:  /scratch/e14/cs6673/mom6/archive
Moving log mom6_panan-005.o79990195
Moving log mom6_panan-005.e79990195
Moving log mom6_panan-005.o80000702
Moving log mom6_panan-005.e80000702
Moving log mom6_panan-005.o80021406
Moving log mom6_panan-005.e80021406
Moving log mom6_panan-005.o80065011
Moving log mom6_panan-005.e80065011
Moving log mom6_panan-005.o80114837
Moving log mom6_panan-005.e80114837
Removing work path /scratch/e14/cs6673/mom6/work/panan_005deg_jra55_ryf
Removing symlink /home/142/cs6673/payu/panan_005deg_jra55_ryf/work
80185076.gadi-pbs
payu: warning: Job request includes 27 unused CPUs.
payu: warning: CPU request increased from 3717 to 3744
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P oz91 -l walltime=18000 -l ncpus=3744 -l mem=14976GB -l jobfs=10GB -N mom6_panan-005 -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin,PAYU_N_RUNS=43,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -l storage=gdata/e14+gdata/hh5+gdata/ik11+gdata/x77+scratch/e14 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/payu-run

Thu Apr 20 09:37:40 AEST 2023
Error not found: Segmentation fault: address not mapped to object
Error not found: Segmentation fault: invalid permissions for mapped object
Error not found: Transport retry count exceeded
Error not found: ORTE has lost communication with a remote daemon
Error not found: MPI_ERRORS_ARE_FATAL
Error not eligible for resubmission
Thu Apr 20 17:41:47 AEST 2023
Error not found: Segmentation fault: address not mapped to object
Error not found: Segmentation fault: invalid permissions for mapped object
Error not found: Transport retry count exceeded
Error not found: ORTE has lost communication with a remote daemon
Error not found: MPI_ERRORS_ARE_FATAL
Error not eligible for resubmission
Fri Apr 21 11:12:48 AEST 2023
Error found: Segmentation fault: address not mapped to object
Resubmission counter: 1
laboratory path:  /scratch/e14/cs6673/mom6
binary path:  /scratch/e14/cs6673/mom6/bin
input path:  /scratch/e14/cs6673/mom6/input
work path:  /scratch/e14/cs6673/mom6/work
archive path:  /scratch/e14/cs6673/mom6/archive
Removing work path /scratch/e14/cs6673/mom6/work/panan_005deg_jra55_ryf
Removing symlink /home/142/cs6673/payu/panan_005deg_jra55_ryf/work
80562873.gadi-pbs
payu: warning: Job request includes 27 unused CPUs.
payu: warning: CPU request increased from 3717 to 3744
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P oz91 -l walltime=18000 -l ncpus=3744 -l mem=14976GB -l jobfs=10GB -N mom6_panan-005 -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin,PAYU_N_RUNS=36,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -l storage=gdata/e14+gdata/hh5+gdata/ik11+gdata/x77+scratch/e14 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/payu-run

Fri Apr 21 12:06:49 AEST 2023
Error found: Segmentation fault: address not mapped to object
Resubmission counter: 0
Resubmit limit reached ... 
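
For anyone wanting to check how the resubmissions in a log like this are spread over time, here is a minimal sketch. It assumes the layout that resub.sh writes above: a `date` line, followed later by a `Resubmission counter:` line whenever a resubmit is considered. The function name `summarise_resubmits` is hypothetical and not part of payu.

```shell
#!/usr/bin/bash
# Hypothetical helper (not part of payu): print the timestamp that precedes
# each "Resubmission counter:" entry in a log written by resub.sh.
summarise_resubmits() {
  awk '
    # Remember the most recent line that looks like `date` output.
    /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) / { stamp = $0 }
    # Report it alongside the counter value (third field on the line).
    /^Resubmission counter:/ { print stamp " -> counter " $3 }
  ' "$1"
}
```

Calling `summarise_resubmits resubmit.log` prints one line per counter entry, which makes it easier to see whether the counter was used up by one stubborn run or by unrelated failures days apart.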


You’re missing the userscript hook that removes the counter file after a successful run. See here

Your config.yaml looks like this:

userscripts:
  error: resub.sh

So change it to:

userscripts:
  error: resub.sh
  run: rm -f resubmit.count

and you should get the correct behaviour.
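
To illustrate why the extra hook matters, here is a minimal sketch of the counter logic from resub.sh with the payu calls stripped out. `on_error` and `on_success` are hypothetical stand-ins for the `error` and `run` userscripts; the file name and `MAX_RESUBMISSIONS` are taken from the script above. It is a simulation only, not what payu itself executes.

```shell
#!/usr/bin/bash
# Simulation of the resubmission counter, assuming the same file name and
# limit as resub.sh. No payu commands are run here.

MAX_RESUBMISSIONS=6
counterfile='resubmit.count'

# Stand-in for the "error" userscript: returns 0 if a resubmit would happen,
# 1 if the limit has been reached.
on_error() {
  local n
  if [ -f "${counterfile}" ]; then
    n=$(cat "${counterfile}")
  else
    n=${MAX_RESUBMISSIONS}
  fi
  if [ "${n}" -gt 0 ]; then
    # Decrement and save, as resub.sh does after resubmitting.
    echo $((n - 1)) > "${counterfile}"
    return 0
  fi
  rm -f "${counterfile}"
  return 1
}

# Stand-in for the "run" userscript: a successful run resets the counter.
on_success() {
  rm -f "${counterfile}"
}
```

In an empty scratch directory, three calls to `on_error` leave `3` in resubmit.count. Without `on_success`, the next unrelated failure continues from there; with it, the counter file is gone after a good run and the next failure starts again from the full allowance of 6.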

FYI, none of this in your config.yaml is strictly necessary:

storage:
  gdata:
    - e14
    - x77
    - ik11

platform:
  nodesize: 48

mpi:
  module: openmpi/4.1.2

payu checks the manifest files for storage paths, and adds the correct storage flags for you. As long as there are manifest files this should work, and if there aren’t you can generate them with payu setup at any time.

The default nodesize is 48, so this setting is only required if you’re using the Broadwell, Skylake or Sapphire Rapids CPUs/queues. See the NCI docs for details of the physical hardware for each queue.
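
The node size also explains the two warnings in your log ("CPU request increased from 3717 to 3744" and "27 unused CPUs"): the request is rounded up to a whole number of 48-core nodes. The sketch below only reproduces that arithmetic; `round_up_ncpus` is a hypothetical helper, not payu’s own code.

```shell
#!/usr/bin/bash
# Round a CPU request up to a whole number of nodes.
# Hypothetical helper for illustration; not taken from payu.
round_up_ncpus() {
  local ncpus=$1 nodesize=$2
  # Integer ceiling division gives the number of nodes needed.
  local nodes=$(( (ncpus + nodesize - 1) / nodesize ))
  echo $(( nodes * nodesize ))
}

requested=3717
granted=$(round_up_ncpus "${requested}" 48)
echo "nodes: $(( granted / 48 )), ncpus: ${granted}, unused: $(( granted - requested ))"
# prints "nodes: 78, ncpus: 3744, unused: 27"
```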

The correct openmpi module is loaded based on what was used to compile your executable, so you shouldn’t need to specify it explicitly. Arguably, specifying it is an anti-pattern, as it may lead to the incorrect openmpi library being used if the executable is updated. It should only be set when the correct module cannot be inferred, or when there is some special reason to load a specific openmpi module.

Thanks @Aidan, I will adjust my config and test it when we continue the run.
