I’ve made those changes to the domain sizing to no avail.
Does the stack trace in access-om3.err show anything useful?
If it’s clear the failure is in a component - look in the work/logs folder for that component.
If there are no line numbers in the trace, or the error looks related to esmf or nuopc, have a look for PETxxxx files in the work directory and see what they say.
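Something like this will usually surface the first error each PET hit (just a sketch; adjust the path if your work directory is laid out differently):
$ grep -m1 ERROR work/PET*.ESMF_LogFile
The first PET to log an ERROR is usually the one that actually triggered the abort.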
Thanks for that suggestion @anton
The stack trace in access-om3.err contains the following:
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread-2.28.s 00001477015E3D10 Unknown Unknown Unknown
libmpi.so.40.30.5 000014770209A9DE Unknown Unknown Unknown
libopen-pal.so.40 00001476FC7CDD33 opal_progress Unknown Unknown
libopen-pal.so.40 00001476FC7CDEE5 ompi_sync_wait_mt Unknown Unknown
libmpi.so.40.30.5 000014770209E4F8 ompi_comm_nextcid Unknown Unknown
libmpi.so.40.30.5 00001477020AAB66 ompi_comm_create_ Unknown Unknown
libmpi.so.40.30.5 000014770207BFD0 PMPI_Comm_create_ Unknown Unknown
libmpi_mpifh.so 00001477023CA80E Unknown Unknown Unknown
access-om3-MOM6 0000000002D67DC2 mpp_mod_mp_get_pe 134 mpp_util_mpi.inc
access-om3-MOM6 0000000002E316EF mpp_mod_mp_mpp_in 80 mpp_comm_mpi.inc
access-om3-MOM6 0000000002C507B6 fms_mod_mp_fms_in 367 fms.F90
access-om3-MOM6 0000000001B8D94B mom_cap_mod_mp_in 537 mom_cap.F90
Line 537 of ./config_src/drivers/nuopc_cap/mom_cap.F90 is
call set_calendar_type (NOLEAP)
which is embedded in some logic to determine the kind of calendar. I’m not sure if that line reference (taken from MOM6/config_src/drivers/nuopc_cap/mom_cap.F90 at dev/access · ACCESS-NRI/MOM6 · GitHub) is relevant to what I’m using, as the line above is contained in subroutine InitializeAdvertise, which isn’t referred to in the stack trace.
Here are the contents of the PET00.ESMF_LogFile
$ more work/PET00.ESMF_LogFile
20250225 143228.947 ERROR PET00 src/addon/NUOPC/src/NUOPC_Base.F90:2108 Invalid argument - Fixx_rofi is not a StandardName in the NUOPC_FieldDictionary!
20250225 143228.947 ERROR PET00 src/addon/NUOPC/src/NUOPC_Base.F90:486 Invalid argument - Passing error in return code
20250225 143228.947 ERROR PET00 med.F90:913 Invalid argument - Passing error in return code
20250225 143228.948 ERROR PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:2898 Invalid argument - Phase 'IPDv03p1' Initialize for modelComp 1: MED did not return ESMF_SUCCESS
20250225 143228.948 ERROR PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:1331 Invalid argument - Passing error in return code
20250225 143228.948 ERROR PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:2898 Invalid argument - Phase 'IPDv02p1' Initialize for modelComp 1: ESM0001 did not return ESMF_SUCCESS
20250225 143228.948 ERROR PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:1326 Invalid argument - Passing error in return code
20250225 143228.948 ERROR PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:483 Invalid argument - Passing error in return code
20250225 143228.948 ERROR PET00 esmApp.F90:134 Invalid argument - Passing error in return code
20250225 143228.948 INFO PET00 Finalizing ESMF
I was doing some work with ACCESS-CM3 in my home directory. I’ve now restarted that work on a separate drive (/g/data/gb02). Maybe it’s best to purge what I’ve done so far in my home directories and start afresh.
This normally means the fd.yaml is inconsistent with the executable version being used.
It’s a bit hard to connect this with the stack trace though; possibly the stack trace is not from the processor that caused the abort? It might just be waiting at this point and have been told to abort by a different processor.
The line numbers are modified by the patches at build time (currently access-om3/MOM6/patches/mom_cap.F90.patch at 4f278cc1af1c278a765f5f9738add889d3166ed5 · COSIMA/access-om3 · GitHub), so they can be quite hard to follow.
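One quick sanity check (a sketch, using the field name from the PET log above and assuming you run it from the control directory that contains fd.yaml):
$ grep -n Fixx_rofi fd.yaml
If that returns nothing, the field dictionary doesn’t define a field the executable is advertising, which would fit the inconsistency described above.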
@anton - there were some recent changes to fd.yaml here:
and we are using the prerelease module:
modules:
use:
- /g/data/vk83/prerelease/modules
load:
- access-om3/pr30-5
Is there a chance that these are now inconsistent?
It depends on which version of CMEPS is used in pr30-5. The major changes in fd.yaml occurred in cmeps 0.14.60. You can check the differences between cmeps 0.14.59 and 0.14.60 here Comparing cmeps0.14.59...cmeps0.14.60 · ESCOMP/CMEPS · GitHub
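If you have a CMEPS checkout handy, the same comparison can be done locally. This is just a sketch; it assumes the dictionary lives at mediator/fd_cesm.yaml in that repository and that both tags have been fetched:
$ git diff cmeps0.14.59 cmeps0.14.60 -- mediator/fd_cesm.yaml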
It should be ok: access-om3/pr30-5 uses Release 0.3.1 · COSIMA/access-om3 · GitHub, and the fd.yaml in the regional branch is consistent with that release.
It sounds like @Paul.Gregory might have accidentally used one from a CM3 test branch.
Ahh, thanks @anton and @minghangli. The instructions actually point to the dev-1deg_jra55do_iaf branch (to reduce the number of branches that need updating), so this may be the issue!
@Paul.Gregory, when you rerun, can you please try switching which branch you download?
Under the heading
“Download your other configuration files from an ACCESS_OM3 run”
Can you swap
mkdir -p ~/access-om3
cd ~/access-om3
module use /g/data/vk83/modules
module load payu/1.1.5
payu clone -b expt -B dev-1deg_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3
cd access-rom3
To
mkdir -p ~/access-om3
cd ~/access-om3
module use /g/data/vk83/modules
module load payu/1.1.5
payu clone -b expt -B dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3
cd access-rom3
The remainder of the instructions may be a little different as the original text that you are changing will be different (and some of the changes may now not be necessary).
Oh I see! This branch should then run fine with the default binary (2025.01.0) and not need the one in pr30-5, as mom_symmetric is now on by default.
Even better! Thanks Anton
@Paul.Gregory, here is an alternative (and better) thing to try.
In your config.yaml file, can you change to this:
modules:
use:
- /g/data/vk83/modules
load:
- access-om3/2025.01.0
- nco/5.0.5
Ok. Here are my morning’s efforts.
- Delete my ~/access-om3/ directory
- From my home directory:
$ git clone --branch dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/
Then:
mkdir -p ~/access-om3
cd ~/access-om3
module use /g/data/vk83/modules
module load payu/1.1.5
payu clone -b expt -B dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3
cd access-rom3
Now to edit the input files.
In MOM_input
- All paths are correct, i.e. no need to remove ‘forcing/’ directory.
- There are no OBC_SEGMENT entries.
- The NUOPC section already exists at the end of the MOM_input file.
In config.yaml
- Change the scratch path to /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/
- exe: access-om3-MOM6 is already fixed.
- Change the module path to:
use:
- /g/data/vk83/modules
load:
- access-om3/2025.01.0
- nco/5.0.5
- setup is already commented out.
In datm_in
- The mask and mesh files are already set. Note - they are the same file.
- Set nx_global and ny_global to 140 and 249 (see the quick check after the drof_in section below).
In drof_in
- The mask and mesh files are already set. Note - they are the same file.
- Set nx_global and ny_global to 140 and 249.
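A quick check that the datm_in and drof_in edits both took effect (a sketch, run from the control directory):
$ grep -E "nx_global|ny_global" datm_in drof_in
Both files should report 140 and 249.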
In input.nml
- parameter_filename is already set.
In nuopc.runconfig
- ocn_ntasks = 100 already set
- ocn_rootpe = 0 already set
- start_ymd = 20130101 already set
- stop_n = 2 already set
- stop_option = ndays already set
- restart_n = 2 already set
- restart_option = ndays already set
- mesh_mask = ./INPUT/access-rom3-ESMFmesh.nc already set
- mesh_ocn = ./INPUT/access-rom3-ESMFmesh.nc already set
- component_list: MED ATM OCN ROF already set
- ICE_model = sice already set
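A quick way to confirm the above in one pass (a sketch; the key names are exactly as listed):
$ grep -E "start_ymd|stop_option|restart_option|mesh_mask|mesh_ocn|component_list|ICE_model" nuopc.runconfig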
In nuopc.runseq
- already cleared of ‘ice’ entries
In diag_table
- output options set
Now to run from ~/access-rom3
Loading payu/dev-20250220T210827Z-39e4b9b
ERROR: payu/dev-20250220T210827Z-39e4b9b cannot be loaded due to a conflict.
HINT: Might try "module unload payu/1.1.5" first.
ok
$ module list
Currently Loaded Modulefiles:
1) pbs
$ module use /g/data/vk83/prerelease/modules
$ module load payu/dev
Loading payu/dev-20250220T210827Z-39e4b9b
Loading requirement: singularity
$ payu setup
laboratory path: /scratch/gb02/pag548/access-om3
binary path: /scratch/gb02/pag548/access-om3/bin
input path: /scratch/gb02/pag548/access-om3/input
work path: /scratch/gb02/pag548/access-om3/work
archive path: /scratch/gb02/pag548/access-om3/archive
Metadata and UUID generation is disabled. Experiment name used for archival: access-rom3
payu: error: work path already exists: /scratch/gb02/pag548/access-om3/work/access-rom3.
payu sweep and then payu run
$ payu sweep
laboratory path: /scratch/gb02/pag548/access-om3
binary path: /scratch/gb02/pag548/access-om3/bin
input path: /scratch/gb02/pag548/access-om3/input
work path: /scratch/gb02/pag548/access-om3/work
archive path: /scratch/gb02/pag548/access-om3/archive
Metadata and UUID generation is disabled. Experiment name used for archival: access-rom3
Removing work path /scratch/gb02/pag548/access-om3/work/access-rom3
Removing symlink /home/548/pag548/access-om3/access-rom3/work
$ payu run
payu: warning: Job request includes 44 unused CPUs.
payu: warning: CPU request increased from 100 to 144
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P gb02 -l walltime=01:00:00 -l ncpus=144 -l mem=100GB -l jobfs=10GB -N 1deg_jra55do_ia -l wd -j n -v PAYU_PATH=/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250220T210827Z-39e4b9b/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/vk83/prerelease/modules:/g/data/vk83/modules:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -l storage=gdata/qv56+gdata/vk83 -- /g/data/vk83/prerelease/./apps/conda_scripts/payu-dev-20250220T210827Z-39e4b9b.d/bin/python /g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250220T210827Z-39e4b9b/bin/payu-run
135974940.gadi-pbs
Error remains the same. Stack trace from access-om3.err
[gadi-cpu-clx-2426.gadi.nci.org.au:1404924] PMIX ERROR: UNREACHABLE in file /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2198
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread-2.28.s 000014F8FE573D10 Unknown Unknown Unknown
libmpi.so.40.30.7 000014F8FF02C169 Unknown Unknown Unknown
libopen-pal.so.40 000014F8F9D68923 opal_progress Unknown Unknown
libopen-pal.so.40 000014F8F9D68AD5 ompi_sync_wait_mt Unknown Unknown
libmpi.so.40.30.7 000014F8FF02FC78 ompi_comm_nextcid Unknown Unknown
libmpi.so.40.30.7 000014F8FF03C346 ompi_comm_create_ Unknown Unknown
libmpi.so.40.30.7 000014F8FF00DA00 PMPI_Comm_create_ Unknown Unknown
libmpi_mpifh.so 000014F8FF35C81E Unknown Unknown Unknown
access-om3-MOM6 0000000002F59512 mpp_mod_mp_get_pe 138 mpp_util_mpi.inc
access-om3-MOM6 0000000003025B48 mpp_mod_mp_mpp_in 80 mpp_comm_mpi.inc
access-om3-MOM6 0000000002E302F6 fms_mod_mp_fms_in 367 fms.F90
access-om3-MOM6 0000000001BE52F0 mom_cap_mod_mp_in 545 mom_cap.F90
Cheers
@Paul.Gregory thanks for a thorough description! Actually, it is best if you only do one of my suggested changes. The issue we are trying to test now is that we think the branch we were using is not compatible with the executable we were using, so we need to update either the branch or the executable.
Can you try switching config.yaml back to:
modules:
use:
- /g/data/vk83/prerelease/modules
load:
- access-om3/pr30-5
Sorry for the confusion!
Ok I’ve got @mmr0’s config up and running now. It seems to make progress and actually run, because the following directory now exists:
/scratch/gb02/pag548/access-om3/archive/access-rom3-MR/output000/
However, the run fails after a few minutes with:
ls: cannot access 'archive/output000/access-om3.cice.r.*': No such file or directory
cal: unknown month name: om3.cice*.????
These grids are specified in config.yaml
- /g/data/vk83/configurations/inputs/access-om3/cice/grids/global.1deg/2024.05.14/grid.nc
- /g/data/vk83/configurations/inputs/access-om3/cice/grids/global.1deg/2024.05.14/kmt.nc
- /g/data/vk83/configurations/inputs/access-om3/cice/initial_conditions/global.1deg/2023.07.28/iced.1900-01-01-10800.nc
In an earlier global MOM6 run (1deg_jra55do_ryf) the following files exist in /scratch/gb02/pag548/access-om3/archive/1deg_jra55do_ryf/output000/
access-om3.cice.1900-01.nc
access-om3.cicem.1900-01.nc
I tried to copy them into /scratch/gb02/pag548/access-om3/archive/access-rom3-MR/output000/ but I generate the same error.
I’m guessing I’ve completed the first stage in the model run because the stdout/stderr files 1deg_jra55do_ia.* are fully written. The set of PBS job files build_intake_ds.sh.* has been created. The stderr from the build_intake_ds.sh.e* file is:
Downloading data from 'https://raw.githubusercontent.com/ACCESS-NRI/schema/e9055da95093ec2faa555c090fc5af17923d1566/au.org.access-nri/model/output/file-metadata/1-0-1.json' to file '/home/548/pag548/.cache/pooch/8e3c08344f0361af426ae185c86d446e-1-0-1.json'.
Traceback (most recent call last):
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable
Are there extra NCI projects or ACCESS-NRI permissions I need access to?
Hi @Paul.Gregory
This is sounding promising. Are there netcdf files related to MOM in the output000 folder?
Thanks for bringing this error up as a few people have received this error – it hasn’t been fatal for the model run but it is something we should look into.
We are not running CICE, so the call to CICE outputs shouldn’t be happening. But as we are the first people trying to run without CICE, I am wondering if there is something in the workflow that is hardwired to include CICE? @Aidan or @anton might know more, and whether this is something we can fix in our configurations or whether I should raise it somewhere else.
Note that Monday is a public holiday in Canberra so responses may be delayed.
Ok that’s a good reminder. I’ll double check the inputs to ensure no cice-related processes are active.
That error is because you’re trying to download a file from GitHub, and I’m guessing this was on a PBS compute node, which has no internet access.
@CharlesTurner might have some idea if this is expected.
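If it is expected, one possible workaround (a sketch only, using the URL and cache path from the traceback above; it assumes the tool just needs the cached copy to exist) is to pre-fetch the schema on a login node:
$ mkdir -p ~/.cache/pooch
$ wget -O ~/.cache/pooch/8e3c08344f0361af426ae185c86d446e-1-0-1.json https://raw.githubusercontent.com/ACCESS-NRI/schema/e9055da95093ec2faa555c090fc5af17923d1566/au.org.access-nri/model/output/file-metadata/1-0-1.json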
It looks like you’re using the hh5 conda/analysis3 environment:
Are we not using the xp65 environments for the intake catalogue generation?
I’d need to look into the cice issue, but @anton may have some idea.
Yeah, this is an issue that is fixed in more recent versions of access-nri-intake; see Ship schema with package · Issue #185 · ACCESS-NRI/access-nri-intake-catalog · GitHub.
The version in the hh5 environment is quite old now. Using the xp65 environment should fix the issue. I can provide more details on Tuesday if someone doesn’t do so first.
Regarding the CICE error, I suspect the Payu configuration still includes running a userscript after the model completes to postprocess CICE output. But because you have no CICE output, this fails. Again this is easy to fix and I can provide more detail on Tuesday.
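If you want to look before then, the relevant entry should show up with something like this (a sketch; the exact section and script names are assumptions until you check):
$ grep -n -A3 userscripts config.yaml
Commenting out whichever entry there runs the CICE postprocessing should stop the job from looking for access-om3.cice.r.* files.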
Morning.
I tried using payu with the xp65 modules.
$ module use /g/data/xp65/public/modules
$ module load conda/analysis3
Then load payu
$ module use /g/data/vk83/prerelease/modules
$ module load payu/dev
But this generates a PE error.
FATAL from PE 0: time_interp_external 2: time 734872 (20130105.000050 is after range of list 734868-734872(20130101.000000 - 20130105.000000),file=INPUT/forcing_obc_segment_001.nc,field=u_segment_001
in mom_cap.F90
Are there other steps required when using xp65 conda? Here are my loaded modules at runtime.
$ module list
Currently Loaded Modulefiles:
1) pbs 2) singularity 3) conda/analysis3-24.12(access-med:analysis3) 4) payu/dev-20250220T210827Z-39e4b9b(dev)
EDIT: I tried to run again with the standard config and I generated this error again. So it looks like I’ve broken something and this PE error is not related to xp65 conda.
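A quick way to see what the OBC forcing file actually covers (a sketch; it assumes ncdump is available from the loaded environment or a netcdf module, and the file and field names are taken from the error above):
$ ncdump -v time INPUT/forcing_obc_segment_001.nc | tail -n 20
From the error, the model is asking for 20130105.000050, which is just past the last record in the file (20130105.000000).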