ACCESS-ROM3 setup instructions

Regarding a name - @mlipson mentioned there’s also a similar discussion happening for the regional land atmosphere model. Should we adopt the same naming convention for both?

We currently use AUTO_MASKTABLE for the access-om3 0.25deg configurations, and it has performed well in my tests. You can find more details here: Automated Runtime Land Block Elimination by alperaltuntas · Pull Request #263 · NCAR/MOM6 · GitHub. One thing to note: if you run the model with a small number of CPUs, the computational grid is divided into fewer domains, and since AUTO_MASKTABLE relies on having enough domains to apply the land-sea masking correctly, this can cause issues with the masking process.
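
For reference, AUTO_MASKTABLE is an ordinary MOM6 runtime parameter, so a quick way to check whether a configuration has it switched on (just a sketch, assuming the parameter is set in MOM_input in the control directory) is:

$ grep AUTO_MASKTABLE MOM_input
AUTO_MASKTABLE = True

If it is absent or set to False, no automatic land-block elimination is applied.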

Hi @Helen

I’m back from holidays and I’ve worked through your extensive instructions. Thanks for making the effort to document them. Here are some changes I had to make.

The om3 config branch
dev-regional_jra55do_ryf
doesn’t exist, so I changed it to
dev-regional_jra55do_iaf

I didn’t compile my own MOM6 executable.

After running the COSIMA regional-mom6 notebook, I had to manually change the paths in my config.yaml file from:

    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/init_eta.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/init_vel.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing_obc_segment_001.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing_obc_segment_002.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing_obc_segment_003.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing_obc_segment_004.nc  

to

    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_tracers.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_eta.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_vel.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_001.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_002.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_003.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_004.nc 

(i.e. adding the additional forcing subdirectory)
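
A quick sanity check before submitting (just a sketch, using the directory above) is to confirm the staged inputs really do live under the new forcing path:

$ ls /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/

which should list (at least) the seven netCDF files referenced in config.yaml.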

However, the payu run job submission fails. From the master stderr file:

$ more 1deg_jra55do_ia.e135836516 
Loading access-om3/pr30-5
  Loading requirement: access-om3-nuopc/0.3.1-fvr7qaw
Currently Loaded Modulefiles:
 1) access-om3-nuopc/0.3.1-fvr7qaw   3) nco/5.0.5   5) openmpi/4.1.7(default)  
 2) access-om3/pr30-5                4) pbs        
payu: Model exited with error code 1; aborting.

The archived stderr file suggests I have some MPI issues.

$ more archive/error_logs/access-om3.135836516.gadi-pbs.err 
 (t_initf) Read in prof_inparm namelist from: drv_in
 (t_initf) Using profile_disable=          F
 (t_initf)       profile_timer=                      4
 (t_initf)       profile_depth_limit=                4
 (t_initf)       profile_detail_limit=               2
 (t_initf)       profile_barrier=          F
 (t_initf)       profile_outpe_num=                  1
 (t_initf)       profile_outpe_stride=               0
 (t_initf)       profile_single_file=      F
 (t_initf)       profile_global_stats=     T
 (t_initf)       profile_ovhd_measurement= F
 (t_initf)       profile_add_detail=       F
 (t_initf)       profile_papi_enable=      F
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 22 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[gadi-cpu-clx-2325.gadi.nci.org.au:519572] PMIX ERROR: UNREACHABLE in file /jobfs/78105093.gadi-pbs/0/openmpi/4.1.5/source/openmpi-4.1.5/opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2198
...
etc.

Followed by

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libpthread-2.28.s  000014AF3CCA0D10  Unknown               Unknown  Unknown
libpthread-2.28.s  000014AF3CC9A2E5  Unknown               Unknown  Unknown
libmpi.so.40.30.5  000014AF3D757CF8  Unknown               Unknown  Unknown
libopen-pal.so.40  000014AF37E8AD33  opal_progress         Unknown  Unknown
libopen-pal.so.40  000014AF37E8AEE5  ompi_sync_wait_mt     Unknown  Unknown
libmpi.so.40.30.5  000014AF3D75B4F8  ompi_comm_nextcid     Unknown  Unknown
libmpi.so.40.30.5  000014AF3D767B66  ompi_comm_create_     Unknown  Unknown
libmpi.so.40.30.5  000014AF3D738FD0  PMPI_Comm_create_     Unknown  Unknown
libmpi_mpifh.so    000014AF3DA8780E  Unknown               Unknown  Unknown
access-om3-MOM6    0000000002D67DC2  mpp_mod_mp_get_pe         134  mpp_util_mpi.inc
access-om3-MOM6    0000000002E316EF  mpp_mod_mp_mpp_in          80  mpp_comm_mpi.inc
access-om3-MOM6    0000000002C507B6  fms_mod_mp_fms_in         367  fms.F90
access-om3-MOM6    0000000001B8D94B  mom_cap_mod_mp_in         537  mom_cap.F90
access-om3-MOM6    00000000009F07F2  Unknown               Unknown  Unknown

Any ideas where I’ve tripped up the Open MPI setup? :thinking:

Hi Paul, I also had this issue (sorry, I didn’t document it). I found I needed to increase the memory in config.yaml for regional OM3 versus the old regional model. That is my usual solution to this particular error.

My domain is small & coarse for testing, and my PBS config looks like this:

queue: normal
ncpus: 16
jobfs: 10GB
mem: 128GB

Thanks @Paul.Gregory and @mmr0
Paul, when you get a chance can you please report back on whether Madi’s suggestion worked? If it doesn’t then I will investigate further, but if it does then we should edit the wiki to suggest a larger memory!

Good catch, thanks! I edited the name the other day.


This is going to be an ongoing issue until a large pull request goes through. The pull request places the files in a different folder than the original, which means that at the moment half the people using this get the output files in a forcing subdirectory and half do not. Once the pull request goes through we can standardise everything a bit better.


I still can’t run my configuration. When I try @mmr0’s config, payu setup generates the error:

ValueError: Insufficient cpus for the ocn pelayout in nuopc.runconfig

This suggests my grid is perhaps different to hers.

I used your grid specifications, i.e. both nx_global and ny_global are set to 10 in datm_in and drof_in. (They remain equal to 360 in ice_in).
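
(For anyone following along, these settings are easy to check from the control directory - a quick sketch:

$ grep -E 'n[xy]_global' datm_in drof_in ice_in

In my case datm_in and drof_in both report 10, and ice_in reports 360.)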

I’ve tried

queue: normal
ncpus: 100
jobfs: 10GB
mem: 192GB

and

queue: normal
ncpus: 100
jobfs: 10GB
mem: 500GB

Both give the same MPI errors.

My bathymetry file has dimensions (ny, nx) of 249 x 140.
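
(I check those dimensions by dumping the netCDF header - a sketch, assuming the file written by the regional-mom6 notebook is called bathymetry.nc and sits in the same input directory:

$ ncdump -h /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/bathymetry.nc

The dimension sizes listed at the top of the header should match 249 x 140.)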

Maybe I’ve made a mistake somewhere while editing the configuration file, or in the notebook for generating the regional domain.

Or do I have a configuration/environment issue with loading my MPI modules? Is openmpi/4.1.7(default) the desired MPI environment?

Hi Paul. Yes sorry, I’m set up for nx_global=4 and ny_global=4.

When I got the SIGTERM error, my PBS log file (run_dir/archive/pbs_logs/1deg_jra55do_ia.o134401685) looked like this, with memory requested and memory used equal:

  
======================================================================================
                  Resource Usage on 2025-02-07 15:53:27:
   Job Id:             134401685.gadi-pbs
   Project:            jk72
   Exit Status:        1
   Service Units:      0.18
   NCPUs Requested:    16                     NCPUs Used: 16
                                           CPU Time Used: 00:03:07
   Memory Requested:   64.0GB                Memory Used: 64.0GB
   Walltime requested: 02:00:00            Walltime Used: 00:00:20
   JobFS requested:    10.0GB                 JobFS used: 0B
======================================================================================

so I just doubled my request. My successful run looks like this:

   Memory Requested:   128.0GB               Memory Used: 76.16GB

Still with 16 CPUs. Note - I’m not suggesting you change your number of CPUs, since you will need to make other changes! Just providing context - I presume you will need significantly more memory than I am using.

Hi Paul. The number of CPUs needed is set in nuopc.runconfig – I have:

ocn_ntasks = 100
ocn_nthreads = 1
ocn_pestride = 1
ocn_rootpe = 0

which means start at PE 0 (rootpe) and step by 1 (pestride) for 100 tasks (ntasks), i.e. PEs 0 to 99,

and in config.yaml I have an associated:

ncpus: 100

I think you want the two numbers to match. But if you want to do a quick test then increase ncpus in config.yaml.
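
A quick consistency check (just a sketch, run from the control directory):

$ grep ocn_ntasks nuopc.runconfig
$ grep ncpus config.yaml

Both should report the same number (100 in my setup).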

(note the discussion earlier by @ashjbarnes and @minghangli, I may need to revisit this 10x10 division).

I have just seen @mmr0’s reply - maybe changing ncpus in config.yaml to 16 might help?

I don’t want to be that guy (but I do really): a lot of confusion could be avoided if your configs were pushed to GitHub repos, then others could just inspect them.

Is that covered in the instructions? If not maybe it should be? Or do we need a separate knowledge-base topic for that?


Ok I tried @mmr0’s config. I set nx_global and ny_global to 4.
I had to set ocn_ntasks = 16 in nuopc.runconfig, otherwise payu setup threw the error above.

Here is my stdout:

======================================================================
                  Resource Usage on 2025-02-24 16:32:31:
   Job Id:             135858666.gadi-pbs
   Project:            gb02
   Exit Status:        1
   Service Units:      0.37
   NCPUs Requested:    16                     NCPUs Used: 16              
                                           CPU Time Used: 00:02:52        
   Memory Requested:   128.0GB               Memory Used: 76.03GB         
   Walltime requested: 02:00:00            Walltime Used: 00:00:21        
   JobFS requested:    10.0GB                 JobFS used: 0B              
======================================================================

The file access-om3.err still shows an MPI error.

 (t_initf) Read in prof_inparm namelist from: drv_in
 (t_initf) Using profile_disable=          F
 (t_initf)       profile_timer=                      4
 (t_initf)       profile_depth_limit=                4
 (t_initf)       profile_detail_limit=               2
 (t_initf)       profile_barrier=          F
 (t_initf)       profile_outpe_num=                  1
 (t_initf)       profile_outpe_stride=               0
 (t_initf)       profile_single_file=      F
 (t_initf)       profile_global_stats=     T
 (t_initf)       profile_ovhd_measurement= F
 (t_initf)       profile_add_detail=       F
 (t_initf)       profile_papi_enable=      F
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[gadi-cpu-clx-2210.gadi.nci.org.au:2024069] 15 more processes have sent help message help-mpi-api.txt / mpi-abort
[gadi-cpu-clx-2210.gadi.nci.org.au:2024069] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

When I revert ocn_ntasks = 100 and nx_global and ny_global to 100, my stdout contains:


======================================================================
                  Resource Usage on 2025-02-24 17:20:31:
   Job Id:             135865764.gadi-pbs
   Project:            gb02
   Exit Status:        1
   Service Units:      2.08
   NCPUs Requested:    144                    NCPUs Used: 144             
                                           CPU Time Used: 00:12:03        
   Memory Requested:   128.0GB               Memory Used: 53.62GB         
   Walltime requested: 02:00:00            Walltime Used: 00:00:26        
   JobFS requested:    10.0GB                 JobFS used: 0B              
======================================================================

So I’m not running out of memory. Note - I also don’t know why NCPUs = 144. I specify 100 CPUs in config.yaml, but something else must override it, e.g.:

$ grep 144 *
env.yaml:PBS_NCPUS: '144'
job.yaml:Resource_List.mpiprocs: '144'
job.yaml:Resource_List.ncpus: '144'
job.yaml:resources_used.ncpus: '144'

I still get the MPI error, and it isn’t caused by running out of memory.

Thanks Aidan, you raise a good point - will do this.


Ah sorry for sending us all down that rabbit hole :sweat_smile:

I think NCPUs = 144 because the request gets rounded up to whole nodes (Gadi’s normal queue nodes have 48 cores, so 100 cpus becomes 3 nodes, i.e. 144) – something I need to improve on in the instructions, but I don’t think this is the current issue.

Can you please check for other error messages?
Do you have a rundir or an archive directory containing other log files? Any error messages in those?

One thought I did have: if you needed to change the location of the forcing files in config.yaml, did you also check the file paths in MOM_input and MOM_override? The model should not be looking for these in a forcing folder.

i.e.

SURFACE_HEIGHT_IC_FILE = "forcing/init_eta.nc" !

Should still be

SURFACE_HEIGHT_IC_FILE = "init_eta.nc" !
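
A quick way to catch any stray paths (a sketch, run from the control directory):

$ grep 'forcing/' MOM_input MOM_override

which should return nothing if the paths are as expected.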

All suggestions are welcome - we don’t know until we try! I appreciate that you and @Paul.Gregory are stress-testing these instructions as it is better to iron out the issues now!

Yep, I’ve checked all the paths in MOM_input and all references to forcing/ have been removed from the pathnames.

I think I’ll wait for @mmr0 to commit her configuration so I can try something which we know works. If I still generate an MPI-related error, it can be isolated to my own gadi environment and not anything related to the ROM3 repo.


I think there is an error in the instructions:

nx_global, ny_global in drof_in, datm_in etc should be the x and y size of the grid being used (not the processor layout).

(For completeness, these are the size of the grid defined by the ESMF mesh file used in the mediator/coupler. At this point we are using the same grid in the mediator and the ocean model, so it’s the size of the ocean grid. In the future it might be worth using a different grid in the mediator, especially when coupling with the atmosphere.)

Thanks @anton – that is rather a big error that didn’t report an error message for me!

@Paul.Gregory, @mmr0, @Lizzie, @PSpence – see the above post by Anton: nx_global, ny_global in drof_in, datm_in are meant to be set to the grid size rather than the tiling.

For my example this should be

nx_global = 140
ny_global = 249

If you are running a different domain then you can find these numbers in MOM_input as

NIGLOBAL = 140
NJGLOBAL = 249

(i.e. nx_global = NIGLOBAL, ny_global = NJGLOBAL)
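
For example, a quick way to pull these out of an existing MOM_input (a sketch; output trimmed):

$ grep -E 'N[IJ]GLOBAL' MOM_input
NIGLOBAL = 140
NJGLOBAL = 249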

In a similar vein, ocn_nx and ocn_ny will need altering in nuopc.runconfig.

You will need to fix this in your runs, and I will fix it in the above wiki.

Feel free to submit a pull request :rofl: - see Simulation does not fail or give warnings despite mismatch between ice_in and grid.nc file · Issue #246 · COSIMA/access-om3 · GitHub
