Regarding a name - @mlipson mentioned there’s also a similar discussion happening for the regional land atmosphere model. Should we adopt the same naming convention for both?
We currently use AUTO_MASKTABLE for access-om3 0.25deg configurations, and it has performed well in my tests. You can find more details here: Automated Runtime Land Block Elimination by alperaltuntas · Pull Request #263 · NCAR/MOM6 · GitHub. One thing to note is that if you run the model with a small number of CPUs, the computational grid is divided into fewer domains or partitions. Since AUTO_MASKTABLE relies on having enough domains to apply land-sea masking correctly, this can cause issues with the masking process.
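For reference, it is a single runtime parameter; a minimal sketch of how we switch it on in MOM_input (or MOM_override), assuming the parameter name from the linked PR:

AUTO_MASKTABLE = True    ! Automatically eliminate processor blocks that are entirely land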
Hi @Helen
I’m back from holidays and I’ve worked through your extensive instructions. Thanks for making the effort to document them. Here are some changes I had to make:
The om3 config branch dev-regional_jra55do_ryf doesn’t exist, so I changed it to dev-regional_jra55do_iaf.
I didn’t compile my own MOM6 executable.
After running the COSIMA regional-mom6 notebook, I had to manually change the paths in my config.yaml file from:
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/init_eta.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/init_vel.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing_obc_segment_001.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing_obc_segment_002.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing_obc_segment_003.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing_obc_segment_004.nc
to
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_tracers.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_eta.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_vel.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_001.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_002.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_003.nc
- /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/forcing_obc_segment_004.nc
(i.e. adding the additional forcing subdirectory)
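For context, these entries are items of the input: list in config.yaml, i.e. roughly (first entries shown):

input:
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_tracers.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_eta.nc
    - /scratch/gb02/pag548/regional_mom6_configs/tassie-access-om2-forced/forcing/init_vel.nc

If I understand payu correctly, it stages these files into the work directory, so MOM_input itself keeps referring to the bare filenames.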
However, the payu run job submission fails. From the master stderr file:
$ more 1deg_jra55do_ia.e135836516
Loading access-om3/pr30-5
Loading requirement: access-om3-nuopc/0.3.1-fvr7qaw
Currently Loaded Modulefiles:
1) access-om3-nuopc/0.3.1-fvr7qaw 3) nco/5.0.5 5) openmpi/4.1.7(default)
2) access-om3/pr30-5 4) pbs
payu: Model exited with error code 1; aborting.
The archived stderr files suggest I have some MPI issues:
$ more archive/error_logs/access-om3.135836516.gadi-pbs.err
(t_initf) Read in prof_inparm namelist from: drv_in
(t_initf) Using profile_disable= F
(t_initf) profile_timer= 4
(t_initf) profile_depth_limit= 4
(t_initf) profile_detail_limit= 2
(t_initf) profile_barrier= F
(t_initf) profile_outpe_num= 1
(t_initf) profile_outpe_stride= 0
(t_initf) profile_single_file= F
(t_initf) profile_global_stats= T
(t_initf) profile_ovhd_measurement= F
(t_initf) profile_add_detail= F
(t_initf) profile_papi_enable= F
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 22 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[gadi-cpu-clx-2325.gadi.nci.org.au:519572] PMIX ERROR: UNREACHABLE in file /jobfs/78105093.gadi-pbs/0/openmpi/4.1.5/source/openmpi-4.1.5/opal/mca/pmix/pmix3x/pmix/src/server/pmix_server.c at line 2198
...
etc.
Followed by
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread-2.28.s 000014AF3CCA0D10 Unknown Unknown Unknown
libpthread-2.28.s 000014AF3CC9A2E5 Unknown Unknown Unknown
libmpi.so.40.30.5 000014AF3D757CF8 Unknown Unknown Unknown
libopen-pal.so.40 000014AF37E8AD33 opal_progress Unknown Unknown
libopen-pal.so.40 000014AF37E8AEE5 ompi_sync_wait_mt Unknown Unknown
libmpi.so.40.30.5 000014AF3D75B4F8 ompi_comm_nextcid Unknown Unknown
libmpi.so.40.30.5 000014AF3D767B66 ompi_comm_create_ Unknown Unknown
libmpi.so.40.30.5 000014AF3D738FD0 PMPI_Comm_create_ Unknown Unknown
libmpi_mpifh.so 000014AF3DA8780E Unknown Unknown Unknown
access-om3-MOM6 0000000002D67DC2 mpp_mod_mp_get_pe 134 mpp_util_mpi.inc
access-om3-MOM6 0000000002E316EF mpp_mod_mp_mpp_in 80 mpp_comm_mpi.inc
access-om3-MOM6 0000000002C507B6 fms_mod_mp_fms_in 367 fms.F90
access-om3-MOM6 0000000001B8D94B mom_cap_mod_mp_in 537 mom_cap.F90
access-om3-MOM6 00000000009F07F2 Unknown Unknown Unknown
Any ideas where I’ve tripped up the Open MPI setup?
Hi Paul, I also had this issue (sorry, I didn’t document it). I found I needed to increase the memory in config.yaml for regional OM3 versus the old regional model. That is my usual solution to this particular error.
My domain is small & coarse for testing, and my PBS config looks like this:
queue: normal
ncpus: 16
jobfs: 10GB
mem: 128GB
Thanks @Paul.Gregory and @mmr0
Paul, when you get a chance can you please report back on whether Madi’s suggestion worked? If it doesn’t, I will investigate further, but if it does, we should edit the wiki to suggest a larger memory!
Good catch! Thanks, I edited the name the other day.
This is going to be an ongoing issue until a large pull request goes through. That pull request places the files in a different folder than the original, which means that at the moment half the people using this get the outputs in the forcing folder and half do not. Once the pull request goes through we can standardise everything a bit better.
I still can’t run my configuration. When I try @mmr0’s config, payu setup generates the error
ValueError: Insufficient cpus for the ocn pelayout in nuopc.runconfig
which suggests my grid is maybe different to hers.
I used your grid specifications, i.e. both nx_global and ny_global are set to 10 in datm_in and drof_in. (They remain equal to 360 in ice_in.)
I’ve tried
queue: normal
ncpus: 100
jobfs: 10GB
mem: 192GB
and
queue: normal
ncpus: 100
jobfs: 10GB
mem: 500GB
with the same MPI errors.
My bathymetry file has dimensions (ny, nx) of 249 x 140.
Maybe I’ve made a mistake somewhere while editing the configuration file, or in the notebook for generating the regional domain. Or do I have a configuration/environment issue with loading my MPI modules? Is openmpi/4.1.7(default) the desired MPI environment?
Hi Paul. Yes, sorry, I’m set up for nx_global=4 and ny_global=4.
When I got the SIGTERM error, my PBS log file (run_dir/archive/pbs_logs/1deg_jra55do_ia.o134401685) looked like this, with memory requested and memory used equal:
======================================================================================
Resource Usage on 2025-02-07 15:53:27:
Job Id: 134401685.gadi-pbs
Project: jk72
Exit Status: 1
Service Units: 0.18
NCPUs Requested: 16 NCPUs Used: 16
CPU Time Used: 00:03:07
Memory Requested: 64.0GB Memory Used: 64.0GB
Walltime requested: 02:00:00 Walltime Used: 00:00:20
JobFS requested: 10.0GB JobFS used: 0B
======================================================================================
so I just doubled my request. My successful run looks like this:
Memory Requested: 128.0GB Memory Used: 76.16GB
Still with 16 CPUs. Note - I’m not suggesting you change your number of CPUs, since you will need to make other changes! Just providing context - I presume you will need significantly more memory than I am using.
Hi Paul. The number of CPUs needed is set in nuopc.runconfig – I have:
ocn_ntasks = 100
ocn_nthreads = 1
ocn_pestride = 1
ocn_rootpe = 0
which means start at rank 0 (rootpe) and step by 1 (pestride) for 100 tasks (ntasks), i.e. the ocean uses ranks 0-99,
and in config.yaml I have an associated:
ncpus: 100
I think you want the two numbers to match. But if you want to do a quick test then increase ncpus in config.yaml.
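As a sketch (not something I have tested at this size), a consistent pairing for a 16-CPU quick test would be:

in nuopc.runconfig:
ocn_ntasks = 16
ocn_nthreads = 1
ocn_pestride = 1
ocn_rootpe = 0

and in config.yaml:
ncpus: 16

In general the ocean uses ranks rootpe, rootpe + pestride, …, rootpe + pestride*(ntasks-1), so ncpus has to cover at least rootpe + pestride*(ntasks-1) + 1 ranks.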
(note the discussion earlier by @ashjbarnes and @minghangli, I may need to revisit this 10x10 division).
I have just seen @mmr0’s reply - maybe changing ncpus in config.yaml to 16 might help?
I don’t want to be that guy (but I do really): a lot of confusion could be avoided if your configs were pushed to GitHub repos; then others could just inspect them.
Is that covered in the instructions? If not, maybe it should be? Or do we need a separate knowledge-base topic for that?
Ok, I tried @mmr0’s config. I set nx_global and ny_global to 4.
I had to set ocn_ntasks = 16 in nuopc.runconfig, otherwise payu setup threw the error.
Here is my stdout
======================================================================
Resource Usage on 2025-02-24 16:32:31:
Job Id: 135858666.gadi-pbs
Project: gb02
Exit Status: 1
Service Units: 0.37
NCPUs Requested: 16 NCPUs Used: 16
CPU Time Used: 00:02:52
Memory Requested: 128.0GB Memory Used: 76.03GB
Walltime requested: 02:00:00 Walltime Used: 00:00:21
JobFS requested: 10.0GB JobFS used: 0B
======================================================================
The file access-om3.err still shows an MPI error.
(t_initf) Read in prof_inparm namelist from: drv_in
(t_initf) Using profile_disable= F
(t_initf) profile_timer= 4
(t_initf) profile_depth_limit= 4
(t_initf) profile_detail_limit= 2
(t_initf) profile_barrier= F
(t_initf) profile_outpe_num= 1
(t_initf) profile_outpe_stride= 0
(t_initf) profile_single_file= F
(t_initf) profile_global_stats= T
(t_initf) profile_ovhd_measurement= F
(t_initf) profile_add_detail= F
(t_initf) profile_papi_enable= F
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[gadi-cpu-clx-2210.gadi.nci.org.au:2024069] 15 more processes have sent help message help-mpi-api.txt / mpi-abort
[gadi-cpu-clx-2210.gadi.nci.org.au:2024069] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
When I revert ocn_ntasks = 100 and nx_global and ny_global to 100, my stdout contains:
======================================================================
Resource Usage on 2025-02-24 17:20:31:
Job Id: 135865764.gadi-pbs
Project: gb02
Exit Status: 1
Service Units: 2.08
NCPUs Requested: 144 NCPUs Used: 144
CPU Time Used: 00:12:03
Memory Requested: 128.0GB Memory Used: 53.62GB
Walltime requested: 02:00:00 Walltime Used: 00:00:26
JobFS requested: 10.0GB JobFS used: 0B
======================================================================
So I’m not running out of memory. Note - I also don’t know why NCPUs = 144. I specify 100 CPUs in config.yaml, but something else must override it, e.g.
$ grep 144 *
env.yaml:PBS_NCPUS: '144'
job.yaml:Resource_List.mpiprocs: '144'
job.yaml:Resource_List.ncpus: '144'
job.yaml:resources_used.ncpus: '144'
I still get the MPI error, which isn’t caused by running out of memory.
Thanks Aidan, you raise a good point, will do this.
Ah sorry for sending us all down that rabbit hole
I think NCPUs = 144 just reflects the request being rounded up to whole nodes – something I need to improve on in the instructions, but I don’t think this is the current issue.
Can you please check for other error messages? Do you have a rundir or an archive directory which contains other log files? Any error messages in these?
One thought I did have: if you needed to change the location of the forcing files in config.yaml, did you check that you also changed the location in MOM_input and MOM_override? The model should not be looking for these in a forcing folder,
i.e.
SURFACE_HEIGHT_IC_FILE = "forcing/init_eta.nc" !
Should still be
SURFACE_HEIGHT_IC_FILE = "init_eta.nc" !
All suggestions are welcome - we don’t know until we try! I appreciate that you and @Paul.Gregory are stress-testing these instructions as it is better to iron out the issues now!
Yep, I’ve checked all the paths in MOM_input and all references to forcing/ have been removed from pathnames.
I think I’ll wait for @mmr0 to commit her configuration so I can try something which we know works. If I still generate an MPI-related error, it can be isolated to my own gadi environment and not anything related to the ROM3 repo.
I think there is an error in the instructions: nx_global and ny_global in drof_in, datm_in etc. should be the x and y size of the grid being used (not the processor layout).
(For completeness, these are the size of the grid defined by the ESMF mesh file used in the mediator/coupler. At this point we are using the same grid in the mediator and the ocean model, so it’s the size of the ocean grid. In the future it might be worth using a different grid in the mediator, especially when coupling with the atmosphere.)
Thanks @anton – that is rather a big error that didn’t report an error message for me!
@Paul.Gregory, @mmr0, @Lizzie @PSpence – see the above post by Anton: nx_global, ny_global in drof_in, datm_in are meant to be set to the grid size rather than the tiling.
For my example this should be
nx_global = 140
ny_global = 249
If you are running a different domain then you can find these numbers in MOM_input as
NIGLOBAL = 140
NJGLOBAL = 249
(i.e. nx_global = NIGLOBAL, ny_global = NJGLOBAL)
In a similar vein, ocn_nx and ocn_ny will need altering in nuopc.runconfig.
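Concretely, that means editing nx_global / ny_global inside the &datm_nml group of datm_in and the &drof_nml group of drof_in (a sketch based on my understanding of the CDEPS namelists), and in nuopc.runconfig something like:

ocn_nx = 140
ocn_ny = 249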
You will need to fix this in your runs, and I will fix it in the above wiki.
Feel free to submit a pull request - see Simulation does not fail or give warnings despite mismatch between ice_in and grid.nc file · Issue #246 · COSIMA/access-om3 · GitHub