ACCESS-ROM3 setup instructions

Paul.Gregory · 25 February 2025 03:35

I’ve made those changes to the domain sizing to no avail.

anton · 25 February 2025 04:22

Does the stack trace in access-om3.err show anything useful ?

If it’s clear the failure is in a component - look in the work/logs folder for that component.

If there’s no line numbers in the trace, or the error looks related to esmf or nuopc, have a look for PETxxxx files in the work directory, and see what they say.

Paul.Gregory · 25 February 2025 05:12

Thanks for that suggestion @anton

The stack trace in access-om3.err contains the following

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libpthread-2.28.s  00001477015E3D10  Unknown               Unknown  Unknown
libmpi.so.40.30.5  000014770209A9DE  Unknown               Unknown  Unknown
libopen-pal.so.40  00001476FC7CDD33  opal_progress         Unknown  Unknown
libopen-pal.so.40  00001476FC7CDEE5  ompi_sync_wait_mt     Unknown  Unknown
libmpi.so.40.30.5  000014770209E4F8  ompi_comm_nextcid     Unknown  Unknown
libmpi.so.40.30.5  00001477020AAB66  ompi_comm_create_     Unknown  Unknown
libmpi.so.40.30.5  000014770207BFD0  PMPI_Comm_create_     Unknown  Unknown
libmpi_mpifh.so    00001477023CA80E  Unknown               Unknown  Unknown
access-om3-MOM6    0000000002D67DC2  mpp_mod_mp_get_pe         134  mpp_util_mpi.inc
access-om3-MOM6    0000000002E316EF  mpp_mod_mp_mpp_in          80  mpp_comm_mpi.inc
access-om3-MOM6    0000000002C507B6  fms_mod_mp_fms_in         367  fms.F90
access-om3-MOM6    0000000001B8D94B  mom_cap_mod_mp_in         537  mom_cap.F90

Line 537 of ./config_src/drivers/nuopc_cap/mom_cap.F90 is

         call set_calendar_type (NOLEAP)

which is embedded in some logic to determine the kind of calendar. I’m not sure if that line reference (taken from MOM6/config_src/drivers/nuopc_cap/mom_cap.F90 at dev/access · ACCESS-NRI/MOM6 · GitHub) is relevant to what I’m using, as that line above is contained in

subroutine InitializeAdvertise

which isn’t referred to in the stack trace.

Here are the contents of the PET00.ESMF_LogFile

$ more work/PET00.ESMF_LogFile 
20250225 143228.947 ERROR            PET00 src/addon/NUOPC/src/NUOPC_Base.F90:2108 Invalid argument  - Fixx_rofi is not a StandardName in the NUOPC_FieldDictionary!
20250225 143228.947 ERROR            PET00 src/addon/NUOPC/src/NUOPC_Base.F90:486 Invalid argument  - Passing error in return code
20250225 143228.947 ERROR            PET00 med.F90:913 Invalid argument  - Passing error in return code
20250225 143228.948 ERROR            PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:2898 Invalid argument  - Phase 'IPDv03p1' Initialize for modelComp 1: MED did not return ESMF_SUCCESS
20250225 143228.948 ERROR            PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:1331 Invalid argument  - Passing error in return code
20250225 143228.948 ERROR            PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:2898 Invalid argument  - Phase 'IPDv02p1' Initialize for modelComp 1: ESM0001 did not return ESMF_SUCC
ESS
20250225 143228.948 ERROR            PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:1326 Invalid argument  - Passing error in return code
20250225 143228.948 ERROR            PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:483 Invalid argument  - Passing error in return code
20250225 143228.948 ERROR            PET00 esmApp.F90:134 Invalid argument  - Passing error in return code
20250225 143228.948 INFO             PET00 Finalizing ESMF

I was doing some work w/ACCESS-CM3 in my home directory. I’ve now restarted that work on a seperate drive (/g/data/gb02). Maybe it’s best to purge what I’ve done so far in my home directories and start afresh.

anton · 25 February 2025 05:38

This normally means the fd.yaml is inconsistent with the executable version being used.

Its a bit hard to connect this with the stack trace though, possibly the stack trace is not for the processor that cause the abort? It might be just waiting at this point and told to abort by a different processor.

The lines numbers are modified by the patches at build time (currently access-om3/MOM6/patches/mom_cap.F90.patch at 4f278cc1af1c278a765f5f9738add889d3166ed5 · COSIMA/access-om3 · GitHub ) so can be quite hard to follow.

helen · 25 February 2025 22:05

@anton - there were some recent changed to fd.yaml here:

and we are using the prerelease module:

modules:
    use:
        - /g/data/vk83/prerelease/modules
    load:
        - access-om3/pr30-5

Is there a chance that these are now inconsistent?

minghangli · 25 February 2025 22:12

It depends on which version of CMEPS is used in pr30-5. The major changes in fd.yaml occurred in cmeps 0.14.60. You can check the differences between cmeps 0.14.59 and 0.14.60 here Comparing cmeps0.14.59...cmeps0.14.60 · ESCOMP/CMEPS · GitHub

anton · 25 February 2025 22:19

It should be ok,

access-om3/pr30-5

uses Release 0.3.1 · COSIMA/access-om3 · GitHub and the fd.yaml in the regional branch is consistent with that release.

It sound like @Paul.Gregory might have accidentally used one from a CM3 test branch

helen · 25 February 2025 22:42

Ahh – Thanks @anton and @minghangli the instructions actually point to the dev-1deg_jra55do_iaf branch (to reduce the number of branches that need updating) - so this may be the issue!

@Paul.Gregory – when you rerun, can you please try switching which branch you download

Under the heading

“Download your other configuration files from an ACCESS_OM3 run”

Can you swap


mkdir -p ~/access-om3

cd ~/access-om3

module use /g/data/vk83/modules

module load payu/1.1.5

payu clone -b expt -B dev-1deg_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3

cd access-rom3

To


mkdir -p ~/access-om3

cd ~/access-om3

module use /g/data/vk83/modules

module load payu/1.1.5

payu clone -b expt -B dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3

cd access-rom3

The remainder of the instructions may be a little different as the original text that you are changing will be different (and some of the changes may now not be necessary).

anton · 25 February 2025 22:53

Oh I see! - This branch should run fine with the default binary (2025.01.0) then and not need the one in pr30-5 as mom_symmettric is now on by default

helen · 25 February 2025 23:06

Even better! Thanks Anton
@Paul.Gregory – an alternative (and better) thing to try

In your config.yaml file can you change to this

modules:
    use:
        - /g/data/vk83/modules
    load:
        - access-om3/2025.01.0
        - nco/5.0.5

Paul.Gregory · 26 February 2025 01:10

Ok. Here are my morning’s efforts.

Delete my ~/access-om3/ directory
From my home directory:

$ git clone --branch dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/

Then.

mkdir -p ~/access-om3

cd ~/access-om3

module use /g/data/vk83/modules

module load payu/1.1.5

payu clone -b expt -B dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3

cd access-rom3

Now to edit the input files.

In MOM_input

All paths are correct, i.e. no need to remove ‘forcing/’ directory.
There are no OBC_SEGMENT entries.
The NUOPC section already exists at the end of the MOM_input file.

In config.yaml

Change the scratch path to /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/
exe: access-om3-MOM6 is already fixed.
Change the module path to

    use:
        - /g/data/vk83/modules
    load:
        - access-om3/2025.01.0
        - nco/5.0.5

setup is already commented out.

In datm_in

The mask and mesh files are already set. Note - they are the same file.
Set nx_global and ny_global to 140 and 249

In drof_in

The mask and mesh files are already set. Note - they are the same file.
Set nx_global and ny_global to 140 and 249

In input.nml

parameter_filename is already set.

In nuopcy.runconfig

ocn_ntasks = 100 already set
ocn_rootpe = 0 already set
start_ymd = 20130101 already set
stop_n = 2 already set
stop_option = ndays already set
restart_n = 2 already set
restart_option = ndays already set
mesh_mask = ./INPUT/access-rom3-ESMFmesh.nc already set
mesh_ocn = ./INPUT/access-rom3-ESMFmesh.nc already set
component_list: MED ATM OCN ROF already set
ICE_model = sice already set

In nuopc.runseq

already cleared of ‘ice’ entries

In diag_table

output options set

Now to run from ~/access-rom3

Loading payu/dev-20250220T210827Z-39e4b9b
  ERROR: payu/dev-20250220T210827Z-39e4b9b cannot be loaded due to a conflict.
    HINT: Might try "module unload payu/1.1.5" first.

ok

$ module list
Currently Loaded Modulefiles:
 1) pbs  
$ module use /g/data/vk83/prerelease/modules
$ module load payu/dev
Loading payu/dev-20250220T210827Z-39e4b9b
  Loading requirement: singularity
$ payu setup
laboratory path:  /scratch/gb02/pag548/access-om3
binary path:  /scratch/gb02/pag548/access-om3/bin
input path:  /scratch/gb02/pag548/access-om3/input
work path:  /scratch/gb02/pag548/access-om3/work
archive path:  /scratch/gb02/pag548/access-om3/archive
Metadata and UUID generation is disabled. Experiment name used for archival: access-rom3
payu: error: work path already exists: /scratch/gb02/pag548/access-om3/work/access-rom3.
             payu sweep and then payu run

$ payu sweep
laboratory path:  /scratch/gb02/pag548/access-om3
binary path:  /scratch/gb02/pag548/access-om3/bin
input path:  /scratch/gb02/pag548/access-om3/input
work path:  /scratch/gb02/pag548/access-om3/work
archive path:  /scratch/gb02/pag548/access-om3/archive
Metadata and UUID generation is disabled. Experiment name used for archival: access-rom3
Removing work path /scratch/gb02/pag548/access-om3/work/access-rom3
Removing symlink /home/548/pag548/access-om3/access-rom3/work

$ payu run
payu: warning: Job request includes 44 unused CPUs.
payu: warning: CPU request increased from 100 to 144
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P gb02 -l walltime=01:00:00 -l ncpus=144 -l mem=100GB -l jobfs=10GB -N 1deg_jra55do_ia -l wd -j n -v PAYU_PATH=/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250220T210827Z-39e4b9b/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/vk83/prerelease/modules:/g/data/vk83/modules:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -l storage=gdata/qv56+gdata/vk83 -- /g/data/vk83/prerelease/./apps/conda_scripts/payu-dev-20250220T210827Z-39e4b9b.d/bin/python /g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250220T210827Z-39e4b9b/bin/payu-run
135974940.gadi-pbs

Error remains the same. Stack trace from access-om3.err

[gadi-cpu-clx-2426.gadi.nci.org.au:1404924] PMIX ERROR: UNREACHABLE in file /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c a
t line 2198
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libpthread-2.28.s  000014F8FE573D10  Unknown               Unknown  Unknown
libmpi.so.40.30.7  000014F8FF02C169  Unknown               Unknown  Unknown
libopen-pal.so.40  000014F8F9D68923  opal_progress         Unknown  Unknown
libopen-pal.so.40  000014F8F9D68AD5  ompi_sync_wait_mt     Unknown  Unknown
libmpi.so.40.30.7  000014F8FF02FC78  ompi_comm_nextcid     Unknown  Unknown
libmpi.so.40.30.7  000014F8FF03C346  ompi_comm_create_     Unknown  Unknown
libmpi.so.40.30.7  000014F8FF00DA00  PMPI_Comm_create_     Unknown  Unknown
libmpi_mpifh.so    000014F8FF35C81E  Unknown               Unknown  Unknown
access-om3-MOM6    0000000002F59512  mpp_mod_mp_get_pe         138  mpp_util_mpi.inc
access-om3-MOM6    0000000003025B48  mpp_mod_mp_mpp_in          80  mpp_comm_mpi.inc
access-om3-MOM6    0000000002E302F6  fms_mod_mp_fms_in         367  fms.F90
access-om3-MOM6    0000000001BE52F0  mom_cap_mod_mp_in         545  mom_cap.F90

Cheers

helen · 26 February 2025 01:26

@Paul.Gregory thanks for a thorough description! Actually, it is best if you only do one of my suggested changes. The issue was are tyring to test now is that we think the branch we were using is not compatible with the executable we were using. So we need to update either the branch or the executable.

Can you try switching config.yaml back to:

 modules:
    use:
        - /g/data/vk83/prerelease/modules
    load:
        - access-om3/pr30-5

Sorry for the confusion!

helen · 3 April 2025 22:21

21 posts were split to a new topic: ESMF mesh and MOM6 domain masks are inconsistent

Paul.Gregory · 7 March 2025 05:02

Ok I’ve got @mmr0 's config up and running now. It seems to make progress and actually run because the following directory now exists:

/scratch/gb02/pag548/access-om3/archive/access-rom3-MR/output000/

However the run fails after a few minutes with

ls: cannot access 'archive/output000/access-om3.cice.r.*': No such file or directory
cal: unknown month name: om3.cice*.????

These grids are specified in config.yaml

    - /g/data/vk83/configurations/inputs/access-om3/cice/grids/global.1deg/2024.05.14/grid.nc
    - /g/data/vk83/configurations/inputs/access-om3/cice/grids/global.1deg/2024.05.14/kmt.nc
    - /g/data/vk83/configurations/inputs/access-om3/cice/initial_conditions/global.1deg/2023.07.28/iced.1900-01-01-10800.nc

In an earlier global MOM6 run (1deg_jra55do_ryf) the following files exist in /scratch/gb02/pag548/access-om3/archive/1deg_jra55do_ryf/output000/

access-om3.cice.1900-01.nc
access-om3.cicem.1900-01.nc

I tried to copy them into

/scratch/gb02/pag548/access-om3/archive/access-rom3-MR/output000/

But I generate the same error.

I’m guessing I’ve completed the first stage in the model run because the stdout/sterr files 1deg_jra55do_ia.* are fully written. The set of PBS job files build_intake_ds.sh.* have been created. The stderr from the build_intake_ds.sh.e* file is

Downloading data from 'https://raw.githubusercontent.com/ACCESS-NRI/schema/e9055da95093ec2faa555c090fc5af17923d1566/au.org.access-nri/model/o
utput/file-metadata/1-0-1.json' to file '/home/548/pag548/.cache/pooch/8e3c08344f0361af426ae185c86d446e-1-0-1.json'.
Traceback (most recent call last):
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_
connection
    raise err
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_
connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable

Are there extra NCI projects or ACCESS-NRI permissions I need access to?

helen · 7 March 2025 05:45

Hi @Paul.Gregory
This is sounding promising. Are there netcdf files related to MOM in the output000 folder?

Thanks for bringing this error up as a few people have received this error – it hasn’t been fatal for the model run but it is something we should look into.

We are not running CICE – so the call to CICE outputs shouldn’t be happening – but as we are the first people trying to run without CICE I am wondering if there is something in the workflow that is hardwired to include CICE? @Aidan or @anton might know more – and if it is in the realm of something we can fix in our configurations or whether I should raise this one somewhere?

Note that Monday is a public holiday in Canberra so responses may be delayed.

Paul.Gregory · 7 March 2025 05:55

Ok that’s a good reminder. I’ll double check the inputs to ensure all cice related processes aren’t active.

Aidan · 7 March 2025 06:38

That error is because you’re trying to download a file from GitHub and I’m guessing this was on a PBS compute note, which has no internet access.

@CharlesTurner might have some idea if this is expected.

It looks like you’re using the hh5 conda/analysis3 environment:

Are we not using the xp65 environments for the intake catalogue generation?

I’d need to look into the cice issue, but @anton may have some idea.

dougiesquire · 7 March 2025 06:50

Yeah this an issue that is fixed in more recent versions of access-nri-intake - see Ship schema with package · Issue #185 · ACCESS-NRI/access-nri-intake-catalog · GitHub.

The version in hh5 environment is quite old now. Using the xp65 environment should fix the issue. I can provide more details on Tuesday if someone doesn’t do so first.

dougiesquire · 7 March 2025 06:57

Regarding the CICE error, I suspect the Payu configuration still includes running a userscript after the model completes to postprocess CICE output. But because you have no CICE output, this fails. Again this is easy to fix and I can provide more detail on Tuesday.

Paul.Gregory · 11 March 2025 00:17

Morning.

I tried using payu with the xp65 modules.

$ module use /g/data/xp65/public/modules
$ module load conda/analysis3

Then load payu

$ module use /g/data/vk83/prerelease/modules
$ module load payu/dev

But this generates a PE error.

FATAL from PE     0: time_interp_external 2: time 734872 (20130105.000050 is after range of list 734868-734872(20130101.000000 - 201301
05.000000),file=INPUT/forcing_obc_segment_001.nc,field=u_segment_001

in mom_cap.F90

Are there other steps required when using xp65 conda? Here are my loaded modules at runtime.

$ module list
Currently Loaded Modulefiles:
 1) pbs   2) singularity   3) conda/analysis3-24.12(access-med:analysis3)   4) payu/dev-20250220T210827Z-39e4b9b(dev)

EDIT : I tried to run again with the standard config and I generated this error again.

So it looks like I’ve broken something and this PE error is not related to xp65 conda.

Topic		Replies	Views
Creating a regional ACCESS-OM3 TWG 21centuryweather	20	265	20 February 2025
ESMF mesh and MOM6 domain masks are inconsistent Regional MOM6	45	182	5 June 2025
21st March 2025 - How to run an ACCESS-OM2 or OM3 model 2025 training program cosima , access-om2 , payu , training	4	207	24 March 2025
COSIMA TWG Meeting Minutes 2023 TWG meeting , twg , notes , minutes	15	1759	11 December 2023
Setting build options for OM3 Regional MOM6 spack	47	234	4 August 2025

ACCESS-ROM3 setup instructions

Related topics