ACCESS-ROM3 setup instructions

I’ve made those changes to the domain sizing to no avail.

Does the stack trace in access-om3.err show anything useful?

If it’s clear the failure is in a component - look in the work/logs folder for that component.

If there are no line numbers in the trace, or the error looks related to ESMF or NUOPC, have a look for PETxxxx files in the work directory and see what they say.
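
If there are a lot of PET files, something like the sketch below (paths assumed relative to the experiment directory) pulls out just the first ERROR line from each PET log:

# Scan every PET log in the work directory and print the first ERROR on each
# (paths are assumptions; adjust to wherever your work directory sits).
from pathlib import Path

for pet in sorted(Path("work").glob("PET*.ESMF_LogFile")):
    errors = [line.rstrip() for line in pet.open() if " ERROR " in line]
    if errors:
        print(f"{pet.name}: {errors[0]}")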

Thanks for that suggestion @anton

The stack trace in access-om3.err contains the following

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libpthread-2.28.s  00001477015E3D10  Unknown               Unknown  Unknown
libmpi.so.40.30.5  000014770209A9DE  Unknown               Unknown  Unknown
libopen-pal.so.40  00001476FC7CDD33  opal_progress         Unknown  Unknown
libopen-pal.so.40  00001476FC7CDEE5  ompi_sync_wait_mt     Unknown  Unknown
libmpi.so.40.30.5  000014770209E4F8  ompi_comm_nextcid     Unknown  Unknown
libmpi.so.40.30.5  00001477020AAB66  ompi_comm_create_     Unknown  Unknown
libmpi.so.40.30.5  000014770207BFD0  PMPI_Comm_create_     Unknown  Unknown
libmpi_mpifh.so    00001477023CA80E  Unknown               Unknown  Unknown
access-om3-MOM6    0000000002D67DC2  mpp_mod_mp_get_pe         134  mpp_util_mpi.inc
access-om3-MOM6    0000000002E316EF  mpp_mod_mp_mpp_in          80  mpp_comm_mpi.inc
access-om3-MOM6    0000000002C507B6  fms_mod_mp_fms_in         367  fms.F90
access-om3-MOM6    0000000001B8D94B  mom_cap_mod_mp_in         537  mom_cap.F90

Line 537 of ./config_src/drivers/nuopc_cap/mom_cap.F90 is

         call set_calendar_type (NOLEAP)

which is embedded in some logic to determine the kind of calendar. I’m not sure if that line reference (taken from MOM6/config_src/drivers/nuopc_cap/mom_cap.F90 at dev/access · ACCESS-NRI/MOM6 · GitHub) is relevant to what I’m using, as that line above is contained in

subroutine InitializeAdvertise

which isn’t referred to in the stack trace.

Here are the contents of the PET00.ESMF_LogFile

$ more work/PET00.ESMF_LogFile 
20250225 143228.947 ERROR            PET00 src/addon/NUOPC/src/NUOPC_Base.F90:2108 Invalid argument  - Fixx_rofi is not a StandardName in the NUOPC_FieldDictionary!
20250225 143228.947 ERROR            PET00 src/addon/NUOPC/src/NUOPC_Base.F90:486 Invalid argument  - Passing error in return code
20250225 143228.947 ERROR            PET00 med.F90:913 Invalid argument  - Passing error in return code
20250225 143228.948 ERROR            PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:2898 Invalid argument  - Phase 'IPDv03p1' Initialize for modelComp 1: MED did not return ESMF_SUCCESS
20250225 143228.948 ERROR            PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:1331 Invalid argument  - Passing error in return code
20250225 143228.948 ERROR            PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:2898 Invalid argument  - Phase 'IPDv02p1' Initialize for modelComp 1: ESM0001 did not return ESMF_SUCCESS
20250225 143228.948 ERROR            PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:1326 Invalid argument  - Passing error in return code
20250225 143228.948 ERROR            PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:483 Invalid argument  - Passing error in return code
20250225 143228.948 ERROR            PET00 esmApp.F90:134 Invalid argument  - Passing error in return code
20250225 143228.948 INFO             PET00 Finalizing ESMF

I was doing some work with ACCESS-CM3 in my home directory. I’ve now restarted that work on a separate drive (/g/data/gb02). Maybe it’s best to purge what I’ve done so far in my home directories and start afresh.

This normally means the fd.yaml is inconsistent with the executable version being used.
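
A quick sanity check is whether the field NUOPC is complaining about (here Fixx_rofi) actually appears in the fd.yaml the run picked up. A simple text search is enough for that, e.g. (assuming fd.yaml sits in the control directory):

# Check whether the field NUOPC rejected appears anywhere in the run's fd.yaml.
# This is just a text search; it doesn't parse the dictionary structure.
field = "Fixx_rofi"
with open("fd.yaml") as f:
    found = any(field in line for line in f)
print(f"{field} present in fd.yaml: {found}")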

It’s a bit hard to connect this with the stack trace though; possibly the stack trace is not for the processor that caused the abort? It might just be waiting at this point and told to abort by a different processor.

The line numbers are modified by the patches at build time (currently access-om3/MOM6/patches/mom_cap.F90.patch at 4f278cc1af1c278a765f5f9738add889d3166ed5 · COSIMA/access-om3 · GitHub), so they can be quite hard to follow.

@anton - there were some recent changes to fd.yaml here:

and we are using the prerelease module:

modules:
    use:
        - /g/data/vk83/prerelease/modules
    load:
        - access-om3/pr30-5

Is there a chance that these are now inconsistent?

It depends on which version of CMEPS is used in pr30-5. The major changes in fd.yaml occurred in cmeps 0.14.60. You can check the differences between cmeps 0.14.59 and 0.14.60 here Comparing cmeps0.14.59...cmeps0.14.60 · ESCOMP/CMEPS · GitHub

It should be ok,

access-om3/pr30-5

uses Release 0.3.1 · COSIMA/access-om3 · GitHub and the fd.yaml in the regional branch is consistent with that release.

It sounds like @Paul.Gregory might have accidentally used one from a CM3 test branch.

Ahh – thanks @anton and @minghangli. The instructions actually point to the dev-1deg_jra55do_iaf branch (to reduce the number of branches that need updating), so this may be the issue!

@Paul.Gregory – when you rerun, can you please try switching which branch you download

Under the heading

“Download your other configuration files from an ACCESS_OM3 run”

Can you swap


mkdir -p ~/access-om3
cd ~/access-om3
module use /g/data/vk83/modules
module load payu/1.1.5
payu clone -b expt -B dev-1deg_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3
cd access-rom3

To


mkdir -p ~/access-om3
cd ~/access-om3
module use /g/data/vk83/modules
module load payu/1.1.5
payu clone -b expt -B dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3
cd access-rom3

The remainder of the instructions may be a little different as the original text that you are changing will be different (and some of the changes may now not be necessary).

Oh I see! This branch should run fine with the default binary (2025.01.0) then, and not need the one in pr30-5, as mom_symmetric is now on by default.

Even better! Thanks Anton
@Paul.Gregory – an alternative (and better) thing to try

In your config.yaml file can you change to this

modules:
    use:
        - /g/data/vk83/modules
    load:
        - access-om3/2025.01.0
        - nco/5.0.5

Ok. Here are my morning’s efforts.

  1. Delete my ~/access-om3/ directory
  2. From my home directory:
$ git clone --branch dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/

Then:

mkdir -p ~/access-om3
cd ~/access-om3
module use /g/data/vk83/modules
module load payu/1.1.5
payu clone -b expt -B dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3
cd access-rom3

Now to edit the input files.

In MOM_input

  • All paths are correct, i.e. no need to remove the ‘forcing/’ directory.
  • There are no OBC_SEGMENT entries.
  • The NUOPC section already exists at the end of the MOM_input file.

In config.yaml

  • Change the scratch path to /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/
  • exe: access-om3-MOM6 is already fixed.
  • Change the module path to
    use:
        - /g/data/vk83/modules
    load:
        - access-om3/2025.01.0
        - nco/5.0.5
  • setup is already commented out.

In datm_in

  • The mask and mesh files are already set. Note - they are the same file.
  • Set nx_global and ny_global to 140 and 249

In drof_in

  • The mask and mesh files are already set. Note - they are the same file.
  • Set nx_global and ny_global to 140 and 249 (see the sketch after this list)
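
For reference, the relevant stanza in datm_in and drof_in ends up looking roughly like the sketch below (shown for datm_in; drof_in uses the analogous group). The group and variable names here follow the usual CDEPS layout, so check them against your own files:

&datm_nml
  ! other entries omitted; paths and sizes as described above
  model_maskfile = "./INPUT/access-rom3-nomask-ESMFmesh.nc"
  model_meshfile = "./INPUT/access-rom3-nomask-ESMFmesh.nc"
  nx_global = 140
  ny_global = 249
/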

In input.nml

  • parameter_filename is already set.

In nuopc.runconfig

  • ocn_ntasks = 100 already set
  • ocn_rootpe = 0 already set
  • start_ymd = 20130101 already set
  • stop_n = 2 already set
  • stop_option = ndays already set
  • restart_n = 2 already set
  • restart_option = ndays already set
  • mesh_mask = ./INPUT/access-rom3-ESMFmesh.nc already set
  • mesh_ocn = ./INPUT/access-rom3-ESMFmesh.nc already set
  • component_list: MED ATM OCN ROF already set
  • ICE_model = sice already set

In nuopc.runseq

  • already cleared of ‘ice’ entries

In diag_table

  • output options set

Now to run from ~/access-rom3

Loading payu/dev-20250220T210827Z-39e4b9b
  ERROR: payu/dev-20250220T210827Z-39e4b9b cannot be loaded due to a conflict.
    HINT: Might try "module unload payu/1.1.5" first.

ok

$ module list
Currently Loaded Modulefiles:
 1) pbs  
$ module use /g/data/vk83/prerelease/modules
$ module load payu/dev
Loading payu/dev-20250220T210827Z-39e4b9b
  Loading requirement: singularity
$ payu setup
laboratory path:  /scratch/gb02/pag548/access-om3
binary path:  /scratch/gb02/pag548/access-om3/bin
input path:  /scratch/gb02/pag548/access-om3/input
work path:  /scratch/gb02/pag548/access-om3/work
archive path:  /scratch/gb02/pag548/access-om3/archive
Metadata and UUID generation is disabled. Experiment name used for archival: access-rom3
payu: error: work path already exists: /scratch/gb02/pag548/access-om3/work/access-rom3.
             payu sweep and then payu run

$ payu sweep
laboratory path:  /scratch/gb02/pag548/access-om3
binary path:  /scratch/gb02/pag548/access-om3/bin
input path:  /scratch/gb02/pag548/access-om3/input
work path:  /scratch/gb02/pag548/access-om3/work
archive path:  /scratch/gb02/pag548/access-om3/archive
Metadata and UUID generation is disabled. Experiment name used for archival: access-rom3
Removing work path /scratch/gb02/pag548/access-om3/work/access-rom3
Removing symlink /home/548/pag548/access-om3/access-rom3/work

$ payu run
payu: warning: Job request includes 44 unused CPUs.
payu: warning: CPU request increased from 100 to 144
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P gb02 -l walltime=01:00:00 -l ncpus=144 -l mem=100GB -l jobfs=10GB -N 1deg_jra55do_ia -l wd -j n -v PAYU_PATH=/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250220T210827Z-39e4b9b/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/vk83/prerelease/modules:/g/data/vk83/modules:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -l storage=gdata/qv56+gdata/vk83 -- /g/data/vk83/prerelease/./apps/conda_scripts/payu-dev-20250220T210827Z-39e4b9b.d/bin/python /g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250220T210827Z-39e4b9b/bin/payu-run
135974940.gadi-pbs

Error remains the same. Stack trace from access-om3.err

[gadi-cpu-clx-2426.gadi.nci.org.au:1404924] PMIX ERROR: UNREACHABLE in file /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2198
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libpthread-2.28.s  000014F8FE573D10  Unknown               Unknown  Unknown
libmpi.so.40.30.7  000014F8FF02C169  Unknown               Unknown  Unknown
libopen-pal.so.40  000014F8F9D68923  opal_progress         Unknown  Unknown
libopen-pal.so.40  000014F8F9D68AD5  ompi_sync_wait_mt     Unknown  Unknown
libmpi.so.40.30.7  000014F8FF02FC78  ompi_comm_nextcid     Unknown  Unknown
libmpi.so.40.30.7  000014F8FF03C346  ompi_comm_create_     Unknown  Unknown
libmpi.so.40.30.7  000014F8FF00DA00  PMPI_Comm_create_     Unknown  Unknown
libmpi_mpifh.so    000014F8FF35C81E  Unknown               Unknown  Unknown
access-om3-MOM6    0000000002F59512  mpp_mod_mp_get_pe         138  mpp_util_mpi.inc
access-om3-MOM6    0000000003025B48  mpp_mod_mp_mpp_in          80  mpp_comm_mpi.inc
access-om3-MOM6    0000000002E302F6  fms_mod_mp_fms_in         367  fms.F90
access-om3-MOM6    0000000001BE52F0  mom_cap_mod_mp_in         545  mom_cap.F90

Cheers

@Paul.Gregory thanks for a thorough description! Actually, it is best if you only do one of my suggested changes. The issue we are trying to test now is that we think the branch we were using is not compatible with the executable we were using, so we need to update either the branch or the executable.

Can you try switching config.yaml back to:

 modules:
    use:
        - /g/data/vk83/prerelease/modules
    load:
        - access-om3/pr30-5

Sorry for the confusion!

Ok now I have progress. The job runs for about a minute. Things begin to happen, but it still fails.

The error from access-om3.err is

FATAL from PE    82: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) =       81         1         0

FATAL from PE    86: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) =      306         1         0

Is this an error in my config files?

Or an issue with the regional meshes generated downstream of the COSIMA mom6 notebook?

I note there are two files in my INPUT_DIR:
access-rom3-nomask-ESMFmesh.nc
access-rom3-ESMFmesh.nc

Yet I only refer to one of them in my input configurations, e.g. from datm_in:

  model_maskfile = "./INPUT/access-rom3-nomask-ESMFmesh.nc"
  model_meshfile = "./INPUT/access-rom3-nomask-ESMFmesh.nc"

Note the inputs in my config.yaml are:

    - /g/data/vk83/configurations/inputs/access-om3/share/meshes/share/2024.09.16/JRA55do-datm-ESMFmesh.nc
    - /g/data/vk83/configurations/inputs/access-om3/share/meshes/share/2024.09.16/JRA55do-drof-ESMFmesh.nc
    - /g/data/vk83/configurations/inputs/access-om3/share/grids/global.1deg/2020.10.22/topog.nc
    - /g/data/vk83/configurations/inputs/access-om3/mom/surface_salt_restoring/global.1deg/2020.05.30/salt_sfc_restore.nc
    - /g/data/vk83/configurations/inputs/access-om3/cice/grids/global.1deg/2024.05.14/grid.nc
    - /g/data/vk83/configurations/inputs/access-om3/cice/grids/global.1deg/2024.05.14/kmt.nc
    - /g/data/vk83/configurations/inputs/access-om3/cice/initial_conditions/global.1deg/2023.07.28/iced.1900-01-01-10800.nc
    - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/hgrid.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/vcoord.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/bathymetry.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_tracers.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_eta.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_vel.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_001.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_002.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_003.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_004.nc  
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/grid_spec.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/ocean_mosaic.nc 
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/access-rom3-ESMFmesh.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/access-rom3-nomask-ESMFmesh.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/land_mask.nc 

In nuopc.runconfig, there should be two lines using the version with the mask:


Yep. That’s already set.

    mesh_mask = ./INPUT/access-rom3-ESMFmesh.nc
    mesh_ocn = ./INPUT/access-rom3-ESMFmesh.nc    

I’ll inspect all my mesh/domain .nc files to check the dimensions.

Just bumping this. I still can’t get this configuration to work. If someone is able to upload a working configuration somewhere, it would be helpful to diagnose my problems.

For the record, I had a look at my mesh files. The bathymetry.nc, land_mask.nc and ocean_mask.nc files all have the same dimensions: 140 x 249.

The access-rom3-ESMFmesh.nc file is unstructured and therefore has no fixed x and y dimensions.
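
As a minimal sanity check I can at least confirm the mesh element count matches the structured grid size (the dimension name here is assumed from the standard ESMF unstructured mesh file format):

# Confirm the unstructured mesh has 140 x 249 = 34860 elements
# (dimension name "elementCount" assumed from the ESMF mesh format).
import xarray as xr

mesh = xr.open_dataset("access-rom3-ESMFmesh.nc")
print(mesh.sizes["elementCount"], 140 * 249)  # both should be 34860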

The error message:

FATAL from PE 82: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) = 81 1 0 

is generated from the following code in subroutine InitializeRealize in mom_cap.F90

 if (abs(maskmesh(n) - mask(n)) > 0) then
  frmt = "('ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - "//&
  "MOM n, maskMesh(n), mask(n) = ',3(i8,2x))"
  write(err_msg, frmt)n,maskmesh(n),mask(n)

So this error occurs for two values of n (81, 306), where n ranges from 1 to numownedelements, and numownedelements is obtained from the external ESMF routine ESMF_MeshGet.

In both these cases, the value of maskMesh(n)=1 where mask(n)=0.

Based on @mmr0’s experience running on a simple grid and smaller domain, I decided to reduce the job CPUs to 48. This generated a very different error:

FATAL from PE    11: MOM_domains_init: The product of the two components of layout,   10  10, is not the number of PEs used,    48.

FATAL from PE    15: MOM_domains_init: The product of the two components of layout,   10  10, is not the number of PEs used,    48.
etc

This error is generated from subroutine MOM_domains_init in MOM_domains.F90

    if (layout(1)*layout(2) /= PEs_used .and. (.not. mask_table_exists) ) then
      write(mesg,'("MOM_domains_init: The product of the two components of layout, ", &
            &      2i4,", is not the number of PEs used, ",i5,".")') &
            layout(1), layout(2), PEs_used
      call MOM_error(FATAL, mesg)
    endif

So there are clear rules regarding the components of layout and PE numbers.

The mask_table file is specified in MOM_layout:

MASKTABLE = mask_table.2.10x10

This is an ASCII file that contains the following:

$ more mask_table.2.10x10 
2
10, 10
5,7
6,7

The logic of the above Fortran code states that the ERROR message can only ever be printed if mask_table_exists is false, i.e. the code is not registering the presence of the mask_table.2.10x10 file. Which is interesting.

I commented out the MASKTABLE line in MOM_layout but it doesn’t seem to make a difference.

The MOM_domain file does contain the following:

LAYOUT = 10,10    

So I did some digging into how MOM6 reads the input files. I found the logic which reads in MOM_domain, that is contained in:

  • subroutines get_param_real, get_param_int etc. in MOM_file_parser.f90
  • subroutine initialize_MOM in MOM.F90.
  • subroutine MOM_layout in MOM_domains.F90

But I haven’t yet been able to find any code which reads in datm_in and drof_in.

I would like to try a configuration that enforces 100 CPUs to align with the 10x10 layout, but payu contains the following code:

    # Increase the CPUs to accommodate the cpu-per-node request
    if n_cpus > max_cpus_per_node and (node_increase or node_misalignment):

        # Number of requested nodes
        n_nodes = 1 + (n_cpus - 1) // n_cpus_per_node
        n_cpu_request = max_cpus_per_node * n_nodes
        n_inert_cpus = n_cpu_request - n_cpus

        print('payu: warning: Job request includes {n} unused CPUs.'
              ''.format(n=n_inert_cpus))

        # Increase CPU request to match the effective node request
        n_cpus = max_cpus_per_node * n_nodes

        # Update the ncpus field in the config
        if n_cpus != n_cpus_request:
            print('payu: warning: CPU request increased from {n_req} to {n}'
                  ''.format(n_req=n_cpus_request, n=n_cpus))

So it always bumps the 100 CPUs up to 144.
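
Working the quoted logic through with gadi’s 48 cores per node (my assumption for both max_cpus_per_node and n_cpus_per_node) reproduces the numbers in the warning:

# Reproduce payu's node rounding for this job (48 cores per gadi node assumed)
n_cpus = 100
n_cpus_per_node = 48
n_nodes = 1 + (n_cpus - 1) // n_cpus_per_node   # 3 nodes
n_cpu_request = n_cpus_per_node * n_nodes       # 144 CPUs requested
print(n_nodes, n_cpu_request, n_cpu_request - n_cpus)  # 3 144 44 unused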

@mmr0 - were you able to get your simple configuration running on 16 CPUs with a 4x4 layout?

Anyways - chasing my tail a bit here and I’m way out of my MOM6 depth (joke!).

I didn’t find much useful information at: Welcome to MOM6’s documentation! — MOM6 0.2a3 documentation

I can try running some of the test cases here: Getting started · NOAA-GFDL/MOM6-examples Wiki · GitHub, to see how the LAYOUT, CPU counts and domain decomposition work.

Hi @Paul.Gregory - I haven’t forgotten! I am going to go through the instructions over the next couple of days to see if I can replicate your error.

@mmr0 has placed her configuration setup here

@Aidan has some instructions here on how to upload your configuration to github – any chance that you can please try uploading so we can see your files?


In the mesh files, x and y are stacked. You can use numpy reshape to put them back onto a regular grid.

e.g. with an xarray dataset which contains the mesh file:

mesh_ds.elementMask.values.reshape(249,140)

arguments to reshape are (NY, NX)
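
So a quick consistency check along these lines would be the sketch below (file names from this thread; the variable name in ocean_mask.nc is assumed to be mask, worth confirming with ncdump):

# Compare the ESMF mesh element mask (reshaped to NY, NX) against the
# MOM6 ocean mask; a nonzero count means the two files disagree.
import xarray as xr

mesh = xr.open_dataset("access-rom3-ESMFmesh.nc")
ocean = xr.open_dataset("ocean_mask.nc")

mesh_mask = mesh.elementMask.values.reshape(249, 140)  # (NY, NX)
mom_mask = ocean["mask"].values                        # variable name assumed
print("mismatched cells:", int((mesh_mask != mom_mask).sum()))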

This sounds like the problem - the ocean_mask.nc and access-rom3-ESMFmesh.nc are probably inconsistent.

Try setting

AUTO_MASKTABLE = True in MOM_input instead

This is correct behaviour - i.e. on the gadi normal queue, for core counts greater than one node, a PBS job must request a multiple of a full node (48 processors). It doesn’t affect how many processors MOM uses - that is set in nuopc.runconfig.


Hi @mmr0. Your configuration setup here GitHub - mmr0/access-om3-configs at tassie-test is missing a MOM_layout file.

Could you add that to the repo please?

I’ve tried to replicate your configuration setup and your notebook (i.e. a 4x4 decomposition of Tasmania at 0.1 resolution), but it fails immediately with an MPI error when I provide my own MOM_layout file.

I’m going to run your config from a clean directory (when it has a MOM_layout file) and see what happens.

Cheers