ACCESS-ROM3 setup instructions

I’ve made those changes to the domain sizing to no avail.

Does the stack trace in access-om3.err show anything useful?

If it’s clear the failure is in a particular component, look in the work/logs folder for that component.

If there are no line numbers in the trace, or the error looks related to ESMF or NUOPC, have a look for the PETxxxx files in the work directory and see what they say.
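For example, something like this will pull out any error lines they contain (an illustrative one-liner; adjust the path if your work directory is elsewhere):

$ grep -n "ERROR" work/PET*.ESMF_LogFile | head -40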

Thanks for that suggestion @anton

The stack trace in access-om3.err contains the following

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libpthread-2.28.s  00001477015E3D10  Unknown               Unknown  Unknown
libmpi.so.40.30.5  000014770209A9DE  Unknown               Unknown  Unknown
libopen-pal.so.40  00001476FC7CDD33  opal_progress         Unknown  Unknown
libopen-pal.so.40  00001476FC7CDEE5  ompi_sync_wait_mt     Unknown  Unknown
libmpi.so.40.30.5  000014770209E4F8  ompi_comm_nextcid     Unknown  Unknown
libmpi.so.40.30.5  00001477020AAB66  ompi_comm_create_     Unknown  Unknown
libmpi.so.40.30.5  000014770207BFD0  PMPI_Comm_create_     Unknown  Unknown
libmpi_mpifh.so    00001477023CA80E  Unknown               Unknown  Unknown
access-om3-MOM6    0000000002D67DC2  mpp_mod_mp_get_pe         134  mpp_util_mpi.inc
access-om3-MOM6    0000000002E316EF  mpp_mod_mp_mpp_in          80  mpp_comm_mpi.inc
access-om3-MOM6    0000000002C507B6  fms_mod_mp_fms_in         367  fms.F90
access-om3-MOM6    0000000001B8D94B  mom_cap_mod_mp_in         537  mom_cap.F90

Line 537 of ./config_src/drivers/nuopc_cap/mom_cap.F90 is

         call set_calendar_type (NOLEAP)

which is embedded in some logic that determines the kind of calendar. I’m not sure if that line reference (taken from MOM6/config_src/drivers/nuopc_cap/mom_cap.F90 at dev/access · ACCESS-NRI/MOM6 · GitHub) is relevant to what I’m using, as the line above is contained in

subroutine InitializeAdvertise

which isn’t referred to in the stack trace.

Here are the contents of the PET00.ESMF_LogFile

$ more work/PET00.ESMF_LogFile 
20250225 143228.947 ERROR            PET00 src/addon/NUOPC/src/NUOPC_Base.F90:2108 Invalid argument  - Fixx_rofi is not a StandardName in the NUOPC_FieldDictionary!
20250225 143228.947 ERROR            PET00 src/addon/NUOPC/src/NUOPC_Base.F90:486 Invalid argument  - Passing error in return code
20250225 143228.947 ERROR            PET00 med.F90:913 Invalid argument  - Passing error in return code
20250225 143228.948 ERROR            PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:2898 Invalid argument  - Phase 'IPDv03p1' Initialize for modelComp 1: MED did not return ESMF_SUCCESS
20250225 143228.948 ERROR            PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:1331 Invalid argument  - Passing error in return code
20250225 143228.948 ERROR            PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:2898 Invalid argument  - Phase 'IPDv02p1' Initialize for modelComp 1: ESM0001 did not return ESMF_SUCCESS
20250225 143228.948 ERROR            PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:1326 Invalid argument  - Passing error in return code
20250225 143228.948 ERROR            PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:483 Invalid argument  - Passing error in return code
20250225 143228.948 ERROR            PET00 esmApp.F90:134 Invalid argument  - Passing error in return code
20250225 143228.948 INFO             PET00 Finalizing ESMF

I was doing some work with ACCESS-CM3 in my home directory. I’ve now restarted that work on a separate drive (/g/data/gb02). Maybe it’s best to purge what I’ve done so far in my home directories and start afresh.

This normally means the fd.yaml is inconsistent with the executable version being used.
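For context, the field dictionary is just a YAML list of the standard names the mediator is allowed to couple; each field needs an entry roughly like the sketch below (illustrative only; the units and description shown for Fixx_rofi are my assumptions, not the real entry):

field_dictionary:
    entries:
        # every field exchanged through the mediator needs an entry here;
        # otherwise the NUOPC_FieldDictionary lookup fails as in the log above
        - standard_name: Fixx_rofi
          canonical_units: kg m-2 s-1
          description: frozen runoff flux (assumed description, check the actual entry)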

It’s a bit hard to connect this with the stack trace though; possibly the stack trace is not for the processor that caused the abort? It might just be waiting at this point and have been told to abort by a different processor.

The line numbers are modified by the patches at build time (currently access-om3/MOM6/patches/mom_cap.F90.patch at 4f278cc1af1c278a765f5f9738add889d3166ed5 · COSIMA/access-om3 · GitHub), so they can be quite hard to follow.

@anton - there were some recent changes to fd.yaml here:

and we are using the prerelease module:

modules:
    use:
        - /g/data/vk83/prerelease/modules
    load:
        - access-om3/pr30-5

Is there a chance that these are now inconsistent?

It depends on which version of CMEPS is used in pr30-5. The major changes in fd.yaml occurred in cmeps 0.14.60. You can check the differences between cmeps 0.14.59 and 0.14.60 here: Comparing cmeps0.14.59...cmeps0.14.60 · ESCOMP/CMEPS · GitHub
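If you’d rather check locally than on GitHub, a quick sketch (the field dictionary path mediator/fd_cesm.yaml is an assumption; check where it sits in your checkout):

$ git clone https://github.com/ESCOMP/CMEPS.git
$ cd CMEPS
$ git diff cmeps0.14.59 cmeps0.14.60 -- mediator/fd_cesm.yaml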

It should be ok,

access-om3/pr30-5

uses Release 0.3.1 · COSIMA/access-om3 · GitHub and the fd.yaml in the regional branch is consistent with that release.

It sounds like @Paul.Gregory might have accidentally used one from a CM3 test branch.

Ahh – thanks @anton and @minghangli, the instructions actually point to the dev-1deg_jra55do_iaf branch (to reduce the number of branches that need updating), so this may be the issue!

@Paul.Gregory – when you rerun, can you please try switching which branch you download

Under the heading

“Download your other configuration files from an ACCESS_OM3 run”

Can you swap


mkdir -p ~/access-om3
cd ~/access-om3
module use /g/data/vk83/modules
module load payu/1.1.5
payu clone -b expt -B dev-1deg_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3
cd access-rom3

To


mkdir -p ~/access-om3
cd ~/access-om3
module use /g/data/vk83/modules
module load payu/1.1.5
payu clone -b expt -B dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3
cd access-rom3

The remainder of the instructions may differ slightly, as the original text you are changing will be different (and some of the changes may no longer be necessary).

Oh I see! This branch should run fine with the default binary (2025.01.0) then, and not need the one in pr30-5, as mom_symmetric is now on by default.

Even better! Thanks Anton
@Paul.Gregory – an alternative (and better) thing to try

In your config.yaml file, can you change the modules block to this:

modules:
    use:
        - /g/data/vk83/modules
    load:
        - access-om3/2025.01.0
        - nco/5.0.5

Ok. Here are my morning’s efforts.

  1. Delete my ~/access-om3/ directory
  2. From my home directory:
$ git clone --branch dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/

Then:

mkdir -p ~/access-om3
cd ~/access-om3
module use /g/data/vk83/modules
module load payu/1.1.5
payu clone -b expt -B dev-regional_jra55do_iaf https://github.com/ACCESS-NRI/access-om3-configs/ access-rom3
cd access-rom3

Now to edit the input files.

In MOM_input

  • All paths are correct, i.e. no need to remove the ‘forcing/’ directory.
  • There are no OBC_SEGMENT entries.
  • The NUOPC section already exists at the end of the MOM_input file.

In config.yaml

  • Change the scratch path to /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/
  • exe: access-om3-MOM6 is already fixed.
  • Change the module path to
    use:
        - /g/data/vk83/modules
    load:
        - access-om3/2025.01.0
        - nco/5.0.5
  • setup is already commented out.

In datm_in

  • The mask and mesh files are already set. Note - they are the same file.
  • Set nx_global and ny_global to 140 and 249

In drof_in

  • The mask and mesh files are already set. Note - they are the same file.
  • Set nx_global and ny_global to 140 and 249

In input.nml

  • parameter_filename is already set.

In nuopc.runconfig

  • ocn_ntasks = 100 already set
  • ocn_rootpe = 0 already set
  • start_ymd = 20130101 already set
  • stop_n = 2 already set
  • stop_option = ndays already set
  • restart_n = 2 already set
  • restart_option = ndays already set
  • mesh_mask = ./INPUT/access-rom3-ESMFmesh.nc already set
  • mesh_ocn = ./INPUT/access-rom3-ESMFmesh.nc already set
  • component_list: MED ATM OCN ROF already set
  • ICE_model = sice already set
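A quick way to confirm the settings listed above in one go (assuming these names appear verbatim in the file):

$ grep -E "ocn_ntasks|ocn_rootpe|start_ymd|stop_n|stop_option|restart_n|restart_option|mesh_mask|mesh_ocn|component_list|ICE_model" nuopc.runconfig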

In nuopc.runseq

  • already cleared of ‘ice’ entries

In diag_table

  • output options set

Now to run from ~/access-rom3

Loading payu/dev-20250220T210827Z-39e4b9b
  ERROR: payu/dev-20250220T210827Z-39e4b9b cannot be loaded due to a conflict.
    HINT: Might try "module unload payu/1.1.5" first.

ok

$ module list
Currently Loaded Modulefiles:
 1) pbs  
$ module use /g/data/vk83/prerelease/modules
$ module load payu/dev
Loading payu/dev-20250220T210827Z-39e4b9b
  Loading requirement: singularity
$ payu setup
laboratory path:  /scratch/gb02/pag548/access-om3
binary path:  /scratch/gb02/pag548/access-om3/bin
input path:  /scratch/gb02/pag548/access-om3/input
work path:  /scratch/gb02/pag548/access-om3/work
archive path:  /scratch/gb02/pag548/access-om3/archive
Metadata and UUID generation is disabled. Experiment name used for archival: access-rom3
payu: error: work path already exists: /scratch/gb02/pag548/access-om3/work/access-rom3.
             payu sweep and then payu run

$ payu sweep
laboratory path:  /scratch/gb02/pag548/access-om3
binary path:  /scratch/gb02/pag548/access-om3/bin
input path:  /scratch/gb02/pag548/access-om3/input
work path:  /scratch/gb02/pag548/access-om3/work
archive path:  /scratch/gb02/pag548/access-om3/archive
Metadata and UUID generation is disabled. Experiment name used for archival: access-rom3
Removing work path /scratch/gb02/pag548/access-om3/work/access-rom3
Removing symlink /home/548/pag548/access-om3/access-rom3/work

$ payu run
payu: warning: Job request includes 44 unused CPUs.
payu: warning: CPU request increased from 100 to 144
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P gb02 -l walltime=01:00:00 -l ncpus=144 -l mem=100GB -l jobfs=10GB -N 1deg_jra55do_ia -l wd -j n -v PAYU_PATH=/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250220T210827Z-39e4b9b/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/vk83/prerelease/modules:/g/data/vk83/modules:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -l storage=gdata/qv56+gdata/vk83 -- /g/data/vk83/prerelease/./apps/conda_scripts/payu-dev-20250220T210827Z-39e4b9b.d/bin/python /g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250220T210827Z-39e4b9b/bin/payu-run
135974940.gadi-pbs

The error remains the same. Stack trace from access-om3.err:

[gadi-cpu-clx-2426.gadi.nci.org.au:1404924] PMIX ERROR: UNREACHABLE in file /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2198
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libpthread-2.28.s  000014F8FE573D10  Unknown               Unknown  Unknown
libmpi.so.40.30.7  000014F8FF02C169  Unknown               Unknown  Unknown
libopen-pal.so.40  000014F8F9D68923  opal_progress         Unknown  Unknown
libopen-pal.so.40  000014F8F9D68AD5  ompi_sync_wait_mt     Unknown  Unknown
libmpi.so.40.30.7  000014F8FF02FC78  ompi_comm_nextcid     Unknown  Unknown
libmpi.so.40.30.7  000014F8FF03C346  ompi_comm_create_     Unknown  Unknown
libmpi.so.40.30.7  000014F8FF00DA00  PMPI_Comm_create_     Unknown  Unknown
libmpi_mpifh.so    000014F8FF35C81E  Unknown               Unknown  Unknown
access-om3-MOM6    0000000002F59512  mpp_mod_mp_get_pe         138  mpp_util_mpi.inc
access-om3-MOM6    0000000003025B48  mpp_mod_mp_mpp_in          80  mpp_comm_mpi.inc
access-om3-MOM6    0000000002E302F6  fms_mod_mp_fms_in         367  fms.F90
access-om3-MOM6    0000000001BE52F0  mom_cap_mod_mp_in         545  mom_cap.F90

Cheers

@Paul.Gregory thanks for a thorough description! Actually, it is best if you only do one of my suggested changes. The issue we are trying to test now is that we think the branch we were using is not compatible with the executable we were using, so we need to update either the branch or the executable.

Can you try switching config.yaml back to:

modules:
    use:
        - /g/data/vk83/prerelease/modules
    load:
        - access-om3/pr30-5

Sorry for the confusion!

21 posts were split to a new topic: ESMF mesh and MOM6 domain masks are inconsistent

Ok, I’ve got @mmr0’s config up and running now. It seems to make progress and actually run, because the following directory now exists:

/scratch/gb02/pag548/access-om3/archive/access-rom3-MR/output000/

However the run fails after a few minutes with

ls: cannot access 'archive/output000/access-om3.cice.r.*': No such file or directory
cal: unknown month name: om3.cice*.????

These grids are specified in config.yaml

    - /g/data/vk83/configurations/inputs/access-om3/cice/grids/global.1deg/2024.05.14/grid.nc
    - /g/data/vk83/configurations/inputs/access-om3/cice/grids/global.1deg/2024.05.14/kmt.nc
    - /g/data/vk83/configurations/inputs/access-om3/cice/initial_conditions/global.1deg/2023.07.28/iced.1900-01-01-10800.nc

In an earlier global MOM6 run (1deg_jra55do_ryf) the following files exist in /scratch/gb02/pag548/access-om3/archive/1deg_jra55do_ryf/output000/

access-om3.cice.1900-01.nc
access-om3.cicem.1900-01.nc

I tried to copy them into

/scratch/gb02/pag548/access-om3/archive/access-rom3-MR/output000/

But I generated the same error.

I’m guessing I’ve completed the first stage of the model run because the stdout/stderr files 1deg_jra55do_ia.* are fully written. The set of PBS job files build_intake_ds.sh.* has been created. The stderr from the build_intake_ds.sh.e* file is

Downloading data from 'https://raw.githubusercontent.com/ACCESS-NRI/schema/e9055da95093ec2faa555c090fc5af17923d1566/au.org.access-nri/model/output/file-metadata/1-0-1.json' to file '/home/548/pag548/.cache/pooch/8e3c08344f0361af426ae185c86d446e-1-0-1.json'.
Traceback (most recent call last):
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable

Are there extra NCI projects or ACCESS-NRI permissions I need access to?

Hi @Paul.Gregory
This is sounding promising. Are there netcdf files related to MOM in the output000 folder?

Thanks for bringing this up, as a few people have received this error – it hasn’t been fatal for the model run, but it is something we should look into.

We are not running CICE – so the call to CICE outputs shouldn’t be happening – but as we are the first people trying to run without CICE, I am wondering if there is something in the workflow that is hardwired to include CICE? @Aidan or @anton might know more – and whether this is something we can fix in our configurations or something I should raise elsewhere.

Note that Monday is a public holiday in Canberra so responses may be delayed.

Ok, that’s a good reminder. I’ll double check the inputs to ensure no CICE-related processes are active.

That error is because you’re trying to download a file from GitHub, and I’m guessing this was on a PBS compute node, which has no internet access.

@CharlesTurner might have some idea if this is expected.

It looks like you’re using the hh5 conda/analysis3 environment:

Are we not using the xp65 environments for the intake catalogue generation?

I’d need to look into the cice issue, but @anton may have some idea.

Yeah, this is an issue that is fixed in more recent versions of access-nri-intake - see Ship schema with package · Issue #185 · ACCESS-NRI/access-nri-intake-catalog · GitHub.

The version in the hh5 environment is quite old now. Using the xp65 environment should fix the issue. I can provide more details on Tuesday if someone doesn’t do so first.


Regarding the CICE error, I suspect the Payu configuration still includes running a userscript after the model completes to postprocess CICE output. But because you have no CICE output, this fails. Again this is easy to fix and I can provide more detail on Tuesday.
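For reference, these post-run hooks live under the userscripts section of config.yaml, so the thing to look for and comment out is something like the hypothetical entry below (the script name is made up; yours will differ):

userscripts:
    # post-run script that globs access-om3.cice* files; fails when CICE is not run
    archive: /usr/bin/bash ./scripts/concat_ice_daily.sh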


Morning.

I tried using payu with the xp65 modules.

$ module use /g/data/xp65/public/modules
$ module load conda/analysis3

Then load payu

$ module use /g/data/vk83/prerelease/modules
$ module load payu/dev

But this generates a PE error.

FATAL from PE     0: time_interp_external 2: time 734872 (20130105.000050 is after range of list 734868-734872(20130101.000000 - 20130105.000000),file=INPUT/forcing_obc_segment_001.nc,field=u_segment_001

in mom_cap.F90

Are there other steps required when using xp65 conda? Here are my loaded modules at runtime.

$ module list
Currently Loaded Modulefiles:
 1) pbs   2) singularity   3) conda/analysis3-24.12(access-med:analysis3)   4) payu/dev-20250220T210827Z-39e4b9b(dev) 

EDIT: I tried to run again with the standard config and generated this error again.

So it looks like I’ve broken something and this PE error is not related to xp65 conda.
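One thing worth checking, given that message, is whether INPUT/forcing_obc_segment_001.nc still covers the requested run dates: the model is asking for 20130105.000050, just past the end of the listed range. A generic way to inspect the file’s time axis (the variable name time is an assumption):

$ ncdump -v time INPUT/forcing_obc_segment_001.nc | tail -20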