I’ve made those changes to the domain sizing to no avail.
Does the stack trace in access-om3.err show anything useful?
If it's clear the failure is in a component, look in the work/logs folder for that component.
If there are no line numbers in the trace, or the error looks related to ESMF or NUOPC, have a look for the PETxxxx files in the work directory and see what they say.
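If it helps, something like this pulls the error lines out of all the PET logs in one go (just a sketch; adjust the path if your work directory differs):

from pathlib import Path

# Print any ERROR lines from the ESMF PET log files in the work directory.
for pet_log in sorted(Path("work").glob("PET*.ESMF_LogFile")):
    for line in pet_log.read_text().splitlines():
        if "ERROR" in line:
            print(f"{pet_log.name}: {line}")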
Thanks for that suggestion @anton
The stack trace in access-om3.err contains the following:
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread-2.28.s 00001477015E3D10 Unknown Unknown Unknown
libmpi.so.40.30.5 000014770209A9DE Unknown Unknown Unknown
libopen-pal.so.40 00001476FC7CDD33 opal_progress Unknown Unknown
libopen-pal.so.40 00001476FC7CDEE5 ompi_sync_wait_mt Unknown Unknown
libmpi.so.40.30.5 000014770209E4F8 ompi_comm_nextcid Unknown Unknown
libmpi.so.40.30.5 00001477020AAB66 ompi_comm_create_ Unknown Unknown
libmpi.so.40.30.5 000014770207BFD0 PMPI_Comm_create_ Unknown Unknown
libmpi_mpifh.so 00001477023CA80E Unknown Unknown Unknown
access-om3-MOM6 0000000002D67DC2 mpp_mod_mp_get_pe 134 mpp_util_mpi.inc
access-om3-MOM6 0000000002E316EF mpp_mod_mp_mpp_in 80 mpp_comm_mpi.inc
access-om3-MOM6 0000000002C507B6 fms_mod_mp_fms_in 367 fms.F90
access-om3-MOM6 0000000001B8D94B mom_cap_mod_mp_in 537 mom_cap.F90
Line 537 of ./config_src/drivers/nuopc_cap/mom_cap.F90 is
call set_calendar_type (NOLEAP)
which is embedded in some logic to determine the kind of calendar. I’m not sure if that line reference (taken from MOM6/config_src/drivers/nuopc_cap/mom_cap.F90 at dev/access · ACCESS-NRI/MOM6 · GitHub) is relevant to what I’m using, as that line is contained in subroutine InitializeAdvertise, which isn’t referred to in the stack trace.
Here are the contents of the PET00.ESMF_LogFile
$ more work/PET00.ESMF_LogFile
20250225 143228.947 ERROR PET00 src/addon/NUOPC/src/NUOPC_Base.F90:2108 Invalid argument - Fixx_rofi is not a StandardName in the NUOPC_FieldDictionary!
20250225 143228.947 ERROR PET00 src/addon/NUOPC/src/NUOPC_Base.F90:486 Invalid argument - Passing error in return code
20250225 143228.947 ERROR PET00 med.F90:913 Invalid argument - Passing error in return code
20250225 143228.948 ERROR PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:2898 Invalid argument - Phase 'IPDv03p1' Initialize for modelComp 1: MED did not return ESMF_SUCCESS
20250225 143228.948 ERROR PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:1331 Invalid argument - Passing error in return code
20250225 143228.948 ERROR PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:2898 Invalid argument - Phase 'IPDv02p1' Initialize for modelComp 1: ESM0001 did not return ESMF_SUCCESS
20250225 143228.948 ERROR PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:1326 Invalid argument - Passing error in return code
20250225 143228.948 ERROR PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:483 Invalid argument - Passing error in return code
20250225 143228.948 ERROR PET00 esmApp.F90:134 Invalid argument - Passing error in return code
20250225 143228.948 INFO PET00 Finalizing ESMF
I was doing some work w/ACCESS-CM3 in my home directory. I’ve now restarted that work on a separate drive (/g/data/gb02). Maybe it’s best to purge what I’ve done so far in my home directories and start afresh.
This normally means the fd.yaml is inconsistent with the executable version being used.
It’s a bit hard to connect this with the stack trace though; possibly the stack trace is not from the processor that caused the abort? It might just be waiting at this point and have been told to abort by a different processor.
The line numbers are modified by the patches applied at build time (currently access-om3/MOM6/patches/mom_cap.F90.patch at 4f278cc1af1c278a765f5f9738add889d3166ed5 · COSIMA/access-om3 · GitHub), so they can be quite hard to follow.
@anton - there were some recent changes to fd.yaml here:
and we are using the prerelease module:
modules:
  use:
    - /g/data/vk83/prerelease/modules
  load:
    - access-om3/pr30-5
Is there a chance that these are now inconsistent?
It depends on which version of CMEPS is used in pr30-5. The major changes in fd.yaml occurred in cmeps 0.14.60. You can check the differences between cmeps 0.14.59 and 0.14.60 here Comparing cmeps0.14.59...cmeps0.14.60 · ESCOMP/CMEPS · GitHub
It should be ok - access-om3/pr30-5 uses Release 0.3.1 · COSIMA/access-om3 · GitHub, and the fd.yaml in the regional branch is consistent with that release.
It sounds like @Paul.Gregory might have accidentally used one from a CM3 test branch.
Ahh – thanks @anton and @minghangli. The instructions actually point to the dev-1deg_jra55do_iaf branch (to reduce the number of branches that need updating), so this may be the issue!
@Paul.Gregory – when you rerun, can you please try switching which branch you download.
Under the heading “Download your other configuration files from an ACCESS_OM3 run”, can you swap
mkdir -p ~/access-om3
cd ~/access-om3
module use /g/data/vk83/modules
module load payu/1.1.5
payu clone -b expt -B dev-1deg_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3
cd access-rom3
To
mkdir -p ~/access-om3
cd ~/access-om3
module use /g/data/vk83/modules
module load payu/1.1.5
payu clone -b expt -B dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3
cd access-rom3
The remainder of the instructions may be a little different as the original text that you are changing will be different (and some of the changes may now not be necessary).
Oh I see! This branch should run fine with the default binary (2025.01.0) then, and not need the one in pr30-5, as mom_symmetric is now on by default.
Even better! Thanks Anton
@Paul.Gregory – an alternative (and better) thing to try
In your config.yaml file, can you change to this:
modules:
  use:
    - /g/data/vk83/modules
  load:
    - access-om3/2025.01.0
    - nco/5.0.5
Ok. Here are my morning’s efforts.
- Delete my ~/access-om3/ directory
- From my home directory:
$ git clone --branch dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/
Then:
mkdir -p ~/access-om3
cd ~/access-om3
module use /g/data/vk83/modules
module load payu/1.1.5
payu clone -b expt -B dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3
cd access-rom3
Now to edit the input files.
In MOM_input
- All paths are correct, i.e. no need to remove ‘forcing/’ directory.
- There are no OBC_SEGMENT entries.
- The NUOPC section already exists at the end of the MOM_input file.
In config.yaml
- Change the scratch path to /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/
- exe: access-om3-MOM6 is already fixed.
- Change the module path to:
  use:
    - /g/data/vk83/modules
  load:
    - access-om3/2025.01.0
    - nco/5.0.5
- setup is already commented out.
In datm_in
- The mask and mesh files are already set. Note - they are the same file.
- Set nx_global and ny_global to 140 and 249.
In drof_in
- The mask and mesh files are already set. Note - they are the same file.
- Set nx_global and ny_global to 140 and 249.
In input.nml
- parameter_filename is already set.
In nuopc.runconfig
- ocn_ntasks = 100 - already set
- ocn_rootpe = 0 - already set
- start_ymd = 20130101 - already set
- stop_n = 2 - already set
- stop_option = ndays - already set
- restart_n = 2 - already set
- restart_option = ndays - already set
- mesh_mask = ./INPUT/access-rom3-ESMFmesh.nc - already set
- mesh_ocn = ./INPUT/access-rom3-ESMFmesh.nc - already set
- component_list: MED ATM OCN ROF - already set
- ICE_model = sice - already set
In nuopc.runseq
- already cleared of ‘ice’ entries
In diag_table
- output options set
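Before running, a quick cross-check on the nx_global/ny_global values above: the product 140 x 249 should match the number of elements in the ESMF mesh (a rough, untested sketch; I’m assuming the mesh’s element dimension is named elementCount, as is typical for ESMF mesh files):

import xarray as xr

# nx_global * ny_global in datm_in / drof_in should equal the mesh element count.
mesh = xr.open_dataset("INPUT/access-rom3-ESMFmesh.nc")
print(mesh.sizes["elementCount"], 140 * 249)   # both should be 34860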
Now to run from ~/access-rom3
Loading payu/dev-20250220T210827Z-39e4b9b
ERROR: payu/dev-20250220T210827Z-39e4b9b cannot be loaded due to a conflict.
HINT: Might try "module unload payu/1.1.5" first.
ok
$ module list
Currently Loaded Modulefiles:
1) pbs
$ module use /g/data/vk83/prerelease/modules
$ module load payu/dev
Loading payu/dev-20250220T210827Z-39e4b9b
Loading requirement: singularity
$ payu setup
laboratory path: /scratch/gb02/pag548/access-om3
binary path: /scratch/gb02/pag548/access-om3/bin
input path: /scratch/gb02/pag548/access-om3/input
work path: /scratch/gb02/pag548/access-om3/work
archive path: /scratch/gb02/pag548/access-om3/archive
Metadata and UUID generation is disabled. Experiment name used for archival: access-rom3
payu: error: work path already exists: /scratch/gb02/pag548/access-om3/work/access-rom3.
payu sweep and then payu run
$ payu sweep
laboratory path: /scratch/gb02/pag548/access-om3
binary path: /scratch/gb02/pag548/access-om3/bin
input path: /scratch/gb02/pag548/access-om3/input
work path: /scratch/gb02/pag548/access-om3/work
archive path: /scratch/gb02/pag548/access-om3/archive
Metadata and UUID generation is disabled. Experiment name used for archival: access-rom3
Removing work path /scratch/gb02/pag548/access-om3/work/access-rom3
Removing symlink /home/548/pag548/access-om3/access-rom3/work
$ payu run
payu: warning: Job request includes 44 unused CPUs.
payu: warning: CPU request increased from 100 to 144
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P gb02 -l walltime=01:00:00 -l ncpus=144 -l mem=100GB -l jobfs=10GB -N 1deg_jra55do_ia -l wd -j n -v PAYU_PATH=/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250220T210827Z-39e4b9b/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/vk83/prerelease/modules:/g/data/vk83/modules:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -l storage=gdata/qv56+gdata/vk83 -- /g/data/vk83/prerelease/./apps/conda_scripts/payu-dev-20250220T210827Z-39e4b9b.d/bin/python /g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250220T210827Z-39e4b9b/bin/payu-run
135974940.gadi-pbs
Error remains the same. Stack trace from access-om3.err:
[gadi-cpu-clx-2426.gadi.nci.org.au:1404924] PMIX ERROR: UNREACHABLE in file /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2198
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread-2.28.s 000014F8FE573D10 Unknown Unknown Unknown
libmpi.so.40.30.7 000014F8FF02C169 Unknown Unknown Unknown
libopen-pal.so.40 000014F8F9D68923 opal_progress Unknown Unknown
libopen-pal.so.40 000014F8F9D68AD5 ompi_sync_wait_mt Unknown Unknown
libmpi.so.40.30.7 000014F8FF02FC78 ompi_comm_nextcid Unknown Unknown
libmpi.so.40.30.7 000014F8FF03C346 ompi_comm_create_ Unknown Unknown
libmpi.so.40.30.7 000014F8FF00DA00 PMPI_Comm_create_ Unknown Unknown
libmpi_mpifh.so 000014F8FF35C81E Unknown Unknown Unknown
access-om3-MOM6 0000000002F59512 mpp_mod_mp_get_pe 138 mpp_util_mpi.inc
access-om3-MOM6 0000000003025B48 mpp_mod_mp_mpp_in 80 mpp_comm_mpi.inc
access-om3-MOM6 0000000002E302F6 fms_mod_mp_fms_in 367 fms.F90
access-om3-MOM6 0000000001BE52F0 mom_cap_mod_mp_in 545 mom_cap.F90
Cheers
@Paul.Gregory thanks for a thorough description! Actually, it is best if you only do one of my suggested changes. The issue we are trying to test now is that we think the branch we were using is not compatible with the executable we were using, so we need to update either the branch or the executable.
Can you try switching config.yaml back to:
modules:
  use:
    - /g/data/vk83/prerelease/modules
  load:
    - access-om3/pr30-5
Sorry for the confusion!
Ok now I have progress. The job runs for about a minute. Things begin to happen, but it still fails.
The error from access-om3.err is:
FATAL from PE 82: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) = 81 1 0
FATAL from PE 86: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) = 306 1 0
Is this an error in my config files?
Or an issue with the regional meshes generated downstream of the COSIMA mom6 notebook?
I note there are two files in my INPUT_DIR:
access-rom3-nomask-ESMFmesh.nc
access-rom3-ESMFmesh.nc
Yet I only refer to one of them in my input configurations, e.g. from datm_in:
model_maskfile = "./INPUT/access-rom3-nomask-ESMFmesh.nc"
model_meshfile = "./INPUT/access-rom3-nomask-ESMFmesh.nc"
Note the inputs in my config.yaml are:
- /g/data/vk83/configurations/inputs/access-om3/share/meshes/share/2024.09.16/JRA55do-datm-ESMFmesh.nc
- /g/data/vk83/configurations/inputs/access-om3/share/meshes/share/2024.09.16/JRA55do-drof-ESMFmesh.nc
- /g/data/vk83/configurations/inputs/access-om3/share/grids/global.1deg/2020.10.22/topog.nc
- /g/data/vk83/configurations/inputs/access-om3/mom/surface_salt_restoring/global.1deg/2020.05.30/salt_sfc_restore.nc
- /g/data/vk83/configurations/inputs/access-om3/cice/grids/global.1deg/2024.05.14/grid.nc
- /g/data/vk83/configurations/inputs/access-om3/cice/grids/global.1deg/2024.05.14/kmt.nc
- /g/data/vk83/configurations/inputs/access-om3/cice/initial_conditions/global.1deg/2023.07.28/iced.1900-01-01-10800.nc
- /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/hgrid.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/vcoord.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/bathymetry.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_tracers.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_eta.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_vel.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_001.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_002.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_003.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_004.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/grid_spec.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/ocean_mosaic.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/access-rom3-ESMFmesh.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/access-rom3-nomask-ESMFmesh.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/land_mask.nc
In nuopc.runconfig, there should be two lines using the version with the mask:
Yep. That’s already set.
mesh_mask = ./INPUT/access-rom3-ESMFmesh.nc
mesh_ocn = ./INPUT/access-rom3-ESMFmesh.nc
I’ll inspect all my mesh/domain .nc files to check the dimensions.
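Something like this should do it (a rough xarray sketch; adjust the file names to whatever is in the input directory):

import xarray as xr

# Print the dimensions of each grid/mask file to compare against the 140 x 249 domain.
for f in ["bathymetry.nc", "land_mask.nc", "access-rom3-ESMFmesh.nc"]:
    ds = xr.open_dataset(f)
    print(f, dict(ds.sizes))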
Just bumping this. I still can’t get this configuration to work. If someone is able to upload a working configuration somewhere, it would be helpful to diagnose my problems.
For the record, I had a look at my mesh files. The bathymetry.nc, land_mask.nc and ocean_mask.nc files all have the same dimensions: 140 x 249.
The access-rom3-ESMFmesh.nc is unstructured and therefore has no fixed x and y dimensions.
The error message:
FATAL from PE 82: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) = 81 1 0
is generated from the following code in subroutine InitializeRealize in mom_cap.F90:
if (abs(maskmesh(n) - mask(n)) > 0) then
  frmt = "('ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - "//&
      "MOM n, maskMesh(n), mask(n) = ',3(i8,2x))"
  write(err_msg, frmt)n,maskmesh(n),mask(n)
  call MOM_error(FATAL, err_msg)
endif
So this error occurs for two values of n (81, 306), where n ranges from 1 to numownedelements, a value obtained via the ESMF routine ESMF_MeshGet.
In both cases, maskMesh(n) = 1 while mask(n) = 0.
Based on @mmr0’s run on a simpler grid and smaller domain, I decided to reduce the job CPU count to 48. This generated a very different error:
FATAL from PE 11: MOM_domains_init: The product of the two components of layout, 10 10, is not the number of PEs used, 48.
FATAL from PE 15: MOM_domains_init: The product of the two components of layout, 10 10, is not the number of PEs used, 48.
etc
This error is generated from subroutine MOM_domains_init in MOM_domains.F90:
if (layout(1)*layout(2) /= PEs_used .and. (.not. mask_table_exists) ) then
write(mesg,'("MOM_domains_init: The product of the two components of layout, ", &
& 2i4,", is not the number of PEs used, ",i5,".")') &
layout(1), layout(2), PEs_used
call MOM_error(FATAL, mesg)
endif
So there are clear rules relating the components of layout to the number of PEs used.
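As a quick illustration of that rule (just a sketch of the check in the quoted code, ignoring mask tables):

# List the (layout_x, layout_y) pairs whose product equals the PE count,
# i.e. the layouts that pass the MOM_domains_init check when no mask table is used.
def valid_layouts(pes):
    return [(i, pes // i) for i in range(1, pes + 1) if pes % i == 0]

print(valid_layouts(48))    # (6, 8), (8, 6), etc. work for 48 PEs; 10x10 cannot
print(valid_layouts(100))   # a 10x10 layout needs exactly 100 PEs without a mask table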
The mask_table file is specified in MOM_layout:
This is an ASCII file that contains the following:
$ more mask_table.2.10x10
2
10, 10
5,7
6,7
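If I’m reading this correctly it follows the usual FMS mask_table convention (first line = number of masked blocks, second line = layout, then one masked block per line), which a small script can confirm (a sketch, untested):

# Rough parse of the mask table, assuming the usual FMS format.
with open("mask_table.2.10x10") as f:
    n_masked = int(f.readline())
    layout = tuple(int(x) for x in f.readline().split(","))
    masked = [tuple(int(x) for x in line.split(",")) for line in f if line.strip()]

print(n_masked, layout, masked)                 # 2 (10, 10) [(5, 7), (6, 7)]
print(layout[0] * layout[1] - n_masked, "PEs")  # 98 active PEs if this table is used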
The logic of the above Fortran code states that the error message can only ever be printed if mask_table_exists is false, i.e. the code ignores the presence of the mask_table.2.10x10 file. Which is interesting.
I commented out the MASKTABLE line in MOM_layout but it doesn’t seem to make a difference.
The MOM_domain file does contain the following:
LAYOUT = 10,10
So I did some digging into how MOM6 reads the input files. I found the logic which reads in MOM_domain, which is contained in:
- subroutines get_param_real, get_param_int etc. in MOM_file_parser.F90
- subroutine initialize_MOM in MOM.F90
- subroutine MOM_layout in MOM_domains.F90
But I haven’t yet been able to find any code which reads in datm_in and drof_in.
I would like to try a configuration that enforces 100 CPUs to align with the 10x10 layout, but payu contains the following code:
# Increase the CPUs to accommodate the cpu-per-node request
if n_cpus > max_cpus_per_node and (node_increase or node_misalignment):
    # Number of requested nodes
    n_nodes = 1 + (n_cpus - 1) // n_cpus_per_node
    n_cpu_request = max_cpus_per_node * n_nodes
    n_inert_cpus = n_cpu_request - n_cpus

    print('payu: warning: Job request includes {n} unused CPUs.'
          ''.format(n=n_inert_cpus))

    # Increase CPU request to match the effective node request
    n_cpus = max_cpus_per_node * n_nodes

    # Update the ncpus field in the config
    if n_cpus != n_cpus_request:
        print('payu: warning: CPU request increased from {n_req} to {n}'
              ''.format(n_req=n_cpus_request, n=n_cpus))
So it always bumps the 100 CPUs up to 144.
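For reference, plugging this run’s numbers into that logic by hand (48-core gadi nodes assumed) reproduces the two warnings above:

# Worked example of the quoted payu logic for a 100-CPU request on 48-core nodes.
n_cpus, n_cpus_per_node, max_cpus_per_node = 100, 48, 48

n_nodes = 1 + (n_cpus - 1) // n_cpus_per_node   # 1 + 99 // 48 = 3 nodes
n_cpu_request = max_cpus_per_node * n_nodes     # 3 * 48 = 144 CPUs requested
n_inert_cpus = n_cpu_request - n_cpus           # 44 unused CPUs, as in the warning

print(n_nodes, n_cpu_request, n_inert_cpus)     # 3 144 44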
@mmr0 - were you able to get your simple configuration running on 16 CPUs with a 4x4 layout?
Anyways - chasing my tail a bit here and I’m way out of my MOM6 depth (joke!).
I didn’t find much useful information at : Welcome to MOM6’s documentation! — MOM6 0.2a3 documentation
I can try running some of the test cases here: Getting started · NOAA-GFDL/MOM6-examples Wiki · GitHub to see how the LAYOUT, CPUs and domain decomposition work.
Hi @Paul.Gregory - I haven’t forgotten! I am going to go through the instructions over the next couple of days to see if I can replicate your error.
@mmr0 has placed her configuration setup here, along with this notebook for generating the input files: cosima-recipes/Recipes/regional-mom6-forced-by-access-om2.ipynb at main · mmr0/cosima-recipes · GitHub
@Aidan has some instructions here on how to upload your configuration to github – any chance that you can please try uploading so we can see your files?
In the mesh files x and y are stacked. You can use numpy reshape to put them into a regular grid.
e.g. with an xarray dataset which contains the mesh file:
mesh_ds.elementMask.values.reshape(249,140)
arguments to reshape are (NY, NX)
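For example, something like the following would flag the offending cells (a sketch, untested; I’m assuming the variable in ocean_mask.nc is called mask, so adjust the name to whatever the file actually uses):

import numpy as np
import xarray as xr

mesh_ds = xr.open_dataset("access-rom3-ESMFmesh.nc")
mask_ds = xr.open_dataset("ocean_mask.nc")

mesh_mask = mesh_ds.elementMask.values.reshape(249, 140)   # (NY, NX)
ocn_mask = mask_ds["mask"].values                          # variable name assumed

# Cells marked ocean in the ESMF mesh but land in the MOM mask: the same
# condition (maskMesh = 1, mask = 0) that triggers the FATAL in mom_cap.F90.
bad = np.argwhere((mesh_mask == 1) & (ocn_mask == 0))
print(len(bad), "inconsistent cells")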
This sounds like the problem - the ocean_mask.nc and access-rom3-ESMFmesh.nc are probably inconsistent.
Try setting AUTO_MASKTABLE = True in MOM_input instead.
This is correct behaviour - on the gadi normal queue, for core counts greater than one node, a PBS job must request a whole number of nodes (i.e. multiples of 48 processors). It doesn’t affect how many processors MOM uses, which is set in nuopc.runconfig.
Hi @mmr0. Your configuration setup here (GitHub - mmr0/access-om3-configs at tassie-test) is missing a MOM_layout file.
Could you add that to the repo please?
I’ve tried to replicate your configuration setup and your notebook (i.e. a 4x4 decomposition of Tasmania at 0.1 resolution) but it fails immediately with an MPI error when I provide it with my own MOM_layout file.
I’m going to run your config from a clean directory (when it has a MOM_layout file) and see what happens.
Cheers