ACCESS-rAM3 'Flagship' Experiments

A thread to document large rAM3 experiments used as ‘flagships’ for the 21stCenturyWeather Centre-of-Excellence.

Here is the domain for the first experiment. We are using BARRA-R2 to initialise the land-surface, which constrains the outer nest extents to lie within the BARRA-R2 domain.

I had to make the following changes to the rAM3 defaults to cycle for two days.

  1. Changed the rose-suite metadata to allow rose edit to view four resolutions: python2 setup_metadata 4 4 (note: this will erase your existing rose-suite.conf file)
  2. Doubled NPROC to 36,32 for the 5 km resolution
  3. Increased NPROC to 54,48 for 1 km resolution
  4. Doubled WALL_CLOCK_LIMIT from 1800 (30 mins) to 3600 (60 mins) to run the 6-hour forecast tasks at 1 km resolution (they take about 40 minutes using 58x48 processors).
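For reference, items 2–4 are edits to the suite's rose-suite.conf (most easily made through rose edit). A sketch of the kind of entries involved is below; the rg01_rs0N_* keys are illustrative, following the rg01_rs0N_ioproc naming that appears later in this thread, and are not necessarily the suite's real variable names, so check rose edit for the exact keys.

# Illustrative rose-suite.conf fragment only; key names are assumptions
[jinja2:suite.rc]
# wall clock: was 1800 s; the 6-hour 1 km forecasts take ~40 minutes
WALL_CLOCK_LIMIT=3600
# 5 km nest decomposition (doubled)
rg01_rs02_nproc='36,32'
# 1 km nest decomposition
rg01_rs03_nproc='54,48'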

The current suite takes about 5.5 hours of wall time to simulate 24 hours; the forecast tasks themselves take just over 4 hours.

Running 24 hours of forecasts costs about 15 kSU and consumes about 750 GB of disk space on /scratch.

Here are the details of this configuration:

I repeated this configuration with rg01_rs0[1-3]_ioproc=48, which made no noticeable difference.

Further work:

  1. Try OpenMP threads. What is the UM namelist control to activate this? And will this require recompilation of the UM?
  2. Incorporate @m.lipson's World Cover surface data for the 1 km nest.
  3. Run some scaling tests for the 5 km and 1 km domains.

I’ve encountered some NPROC restrictions with the outer resolution. When I applied NPROC 36,32 to the 12 km nest, the model stopped with:

? Error message: Too many processors in the North-South direction. The maximum permitted is 32

What is causing this error? Is this a restriction of the GAL9 configuration? (I can run the RAL3P2 model with more CPUs.)


OpenMP doesn’t require a recompile: just set the environment variable $OMP_NUM_THREADS (normally done in the Cylc task configuration) and make sure the PBS number of CPUs is (MPI ranks) * (OMP threads).
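For illustration, a minimal sketch of what that looks like in a Cylc suite.rc runtime section (the task name and numbers below are made up, not taken from the rAM3 suite):

# Hypothetical task: 64 MPI ranks x 2 OpenMP threads = 128 PBS CPUs
[runtime]
    [[um_forecast_example]]
        [[[directives]]]
            -l ncpus = 128
        [[[environment]]]
            OMP_NUM_THREADS = 2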

The NPROC restriction is because there’s a minimum grid size for each MPI process: e.g. the halo of cells sent to the northern MPI rank can’t overlap with the halo of cells sent to the southern MPI rank.


I think the number of threads is required to be > 1 if you are using I/O servers; you may already be using them.


Thanks for sharing what you’ve been doing @Paul.Gregory.

Thanks Scott.

I found that in this suite the variable is set in site/nci-gadi/suite-adds.rc. You change
{% set UM_ATM_OMP=1 %}
to
{% set UM_ATM_OMP=2 %}

The Jinja macro atmos_resources(nx, ny, omp, mem) then computes the overall PBS CPU request using
-l ncpus = {{(nx * ny + ios) * omp}}
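As a worked example, using the 1 km nest’s 54,48 decomposition quoted earlier and assuming ios = 48 (the value in the AUS2200 macro further down, which may not be what this suite uses):

# -l ncpus = {{(nx * ny + ios) * omp}}
#          = (54 * 48 + 48) * 2 = (2592 + 48) * 2 = 5280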

There is appreciable speed-up for only a small increase in SUs, especially for the nests with lower CPU allocations.


Hi Paul

I just went through the introduction to running ACCESS-rAM3 here, and would like to run a simulation like this flagship experiment but with different nest regions. Would you suggest starting from your suite u-dq126, or from u-dg767/u-dg768?

Regards, Qinggang

Hi @qinggangg

It depends on how you want to configure your nests and what land surface model you are going to use.

How many nests are you planning to use, and how many grid points in each?

I’m about to swap u-dq126 over to a different ancillary suite which uses the World Cover data for a better representation of urban land surface regions. That may not be important for your experiments.

I’ll be in the office tomorrow if you want to catch up in person. It might be easier than discussing the pros/cons here.

Hi Paul

I am happy to catch up in person tomorrow. Do you have a preferred time?

I will sketch out the nests I have in mind for our discussion tomorrow. The outer domains are the same as BARRA-C2 and BARRA-R2, with two inner domains over the Great Barrier Reef region. I would like to use the same land surface model as C2 and R2.

The two inner domains will have grid spacings of 4.4 km and 1.1 km, both driven by the larger R2 nest.

I’m free all day.

BTW @Scott created some tools and notebooks to help configure your UM nests.

I’ve put them in a repo here:

Have a look at UM_configuration_tools/notebooks/Flagship_domain.ipynb in the 21centuryweather/UM_configuration_tools repository on GitHub.


Great. I’ve booked board room 406 for 11 am tomorrow; hope that works for you.

I will have a look at the repository before then.

An update.

With @mlipson’s help we have incorporated the World Cover ancillaries into the flagship experiment. The differences in the urban and vegetation fields around Sydney and across the full 1 km domain can be seen in UM_configuration_tools/notebooks/Check_Flagship_worldcover.ipynb in the 21centuryweather/UM_configuration_tools repository on GitHub.

There are issues with the suite when increasing the number of processors beyond 52x48 for the 1 km domain: every attempted increase (80x72, 72x70, 56x54) fails. The UM forecast tasks consume increasing amounts of walltime and fail intermittently.

I checked a 56x54 task which initially failed but then succeeded after resubmission. The stdout file (containing the solver residuals for every timestep) and stderr of the two runs are identical up to the point of failure. When using more processors, the UM forecast completes but the job then hangs; there is no error message explaining why it fails in the UM write-dump and STASH field-gathering routines, only the traceback below.

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libifcoremt.so.5   000014D48FFF2555  for__signal_handl     Unknown  Unknown
libpthread-2.28.s  000014D48D45F990  Unknown               Unknown  Unknown
libucp.so.0.0.0    000014D488BDB7D0  ucp_worker_progre     Unknown  Unknown
libmpi.so.40.30.5  000014D490608BFF  mca_pml_ucx_recv      Unknown  Unknown
libmpi.so.40.30.5  000014D4906E00BB  mca_coll_basic_ga     Unknown  Unknown
libmpi.so.40.30.5  000014D490771D30  PMPI_Gatherv          Unknown  Unknown
libmpi_mpifh.so    000014D490ADF5FC  Unknown               Unknown  Unknown
um-atmos.exe       0000000003389BFE  mpl_gatherv_               98  mpl_gatherv.F90
um-atmos.exe       0000000000663EAB  gather_field_mpl_         222  gather_field_mpl.F90
um-atmos.exe       0000000000663606  gather_field_mod_         127  gather_field.F90
um-atmos.exe       0000000000658848  stash_gather_fiel         399  stash_gather_field.F90
um-atmos.exe       000000000076F73F  general_gather_fi         413  general_gather_field.F90
um-atmos.exe       0000000000BDF0BB  um_writdump_mod_m         522  um_writdump.F90
um-atmos.exe       0000000000BDD65C  dumpctl_mod_mp_du         207  dumpctl.F90
um-atmos.exe       00000000004F065D  u_model_4a_mod_mp         452  u_model_4A.F90
um-atmos.exe       000000000040CA38  um_shell_mod_mp_u         748  um_shell.F90
um-atmos.exe       00000000004093F8  MAIN__                     60  um_main.F90
um-atmos.exe       00000000004093A2  Unknown               Unknown  Unknown
libc-2.28.so       000014D48D0B17E5  __libc_start_main     Unknown  Unknown
um-atmos.exe       00000000004092AE  Unknown               Unknown  Unknown

I assume the job fails because I have not correctly configured an I/O server. @srennie from the BoM kindly pointed me to this page on the UKMO Trac site containing UM hints for running on gadi (the link requires a MOSRS account): https://code.metoffice.gov.uk/trac/nwpscience/wiki/bomnwpscience/AccessNWP_SuiteOptimisations

It also contains links on configuring the I/O server, which I’ll implement in the coming weeks.

In addition to improving I/O, another option is to replicate the AUS2200 suite. Dale Roberts wrote a lot of documentation about the optimisations he achieved with AUS2200 when running on the gadi Sapphire Rapids nodes; see here:

These optimisations are visible in the AUS2200 suite at:

https://code.metoffice.gov.uk/trac/roses-u/browser/c/s/1/4/2/trunk/site/nci-gadi/suite-adds.rc

{% macro atmos_resources(nx, ny, omp, mem) %}
    {% set ios = 48 %}
    {% if ( 'normal' == UM_ATM_NCI_QUEUE ) or ( 'express' == UM_ATM_NCI_QUEUE ) %}
        {% set cores_per_node = 48 %}
    {% elif ( 'normalbw' == UM_ATM_NCI_QUEUE ) or ( 'expressbw' == UM_ATM_NCI_QUEUE ) %}
        {% set cores_per_node = 28 %}
    {% elif ( 'normalsl' == UM_ATM_NCI_QUEUE ) %}
        {% set cores_per_node = 32 %}
    {% elif ( 'normalsr' == UM_ATM_NCI_QUEUE ) or ( 'expresssr' == UM_ATM_NCI_QUEUE ) %}
        {% set cores_per_node = 104 %}
    {% endif %}
    {% set threads_per_node = 96 %}
    {% set nnodes = (( nx * ny + ios ) * omp / threads_per_node)|round(0,'ceil')|int %}
        [[[ directives ]]]
            -q          = {{UM_ATM_NCI_QUEUE}}
            -l ncpus    = {{ nnodes * cores_per_node }}
            -l mem      = {{mem * (nx * ny + ios) * omp}}mb
            -l jobfs    = {{mem * (nx * ny + ios) * omp}}mb
        [[[ environment ]]]
            UM_ATM_NPROCY   = {{ny}}
            UM_ATM_NPROCX   = {{nx}}
            OMP_NUM_THREADS = {{omp}}
            FLUME_IOS_NPROC = {{ios}}
            #ATMOS_LAUNCHER = mpirun -n {{nx * ny + ios}} --map-by node:PE={{omp}} --rank-by core
            {# spr needs special binding options thanks to many NUMA nodes #}
            {% if ( ( 'normalsr' == UM_ATM_NCI_QUEUE ) or ( 'expresssr' == UM_ATM_NCI_QUEUE ) ) and ( threads_per_node < cores_per_node ) %}
            ATMOS_LAUNCHER  = mpirun -n {{nx * ny + ios}} --map-by ppr:$(( {{threads_per_node}} / $PBS_NCI_NUMA_PER_NODE / {{omp}} )):numa:PE={{omp}} --rank-by core
            {% else %}
            ATMOS_LAUNCHER  = mpirun -n {{nx * ny + ios}} --map-by ppr:{{(threads_per_node/omp/2)|round|int}}:socket:PE={{omp}} --rank-by core
            {% endif %}
            ROMIO_HINTS = /home/563/dr4292/hints.txt
            OMPI_MCA_io = romio321
        [[[job]]]
            execution time limit =  PT{{WALL_CLOCK_LIMIT}}S
{% endmacro %}
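To make the macro’s arithmetic concrete, here is a hedged worked example; the decomposition and thread count are illustrative (borrowed from values discussed earlier in this thread), not the AUS2200 suite’s actual settings:

# Illustrative values: nx=54, ny=48, ios=48, omp=2, UM_ATM_NCI_QUEUE='normal' (cores_per_node=48)
#   MPI ranks     = nx*ny + ios              = 2592 + 48 = 2640
#   total threads = ranks * omp              = 5280
#   nnodes        = ceil(5280 / 96)          = 55
#   -l ncpus      = nnodes * cores_per_node  = 55 * 48 = 2640
# i.e. two threads per physical core, which appears to rely on hyperthreading
# (threads_per_node = 96 vs cores_per_node = 48 on the normal queue). Note this differs
# from the simpler -l ncpus = {{(nx * ny + ios) * omp}} used in the flagship suite above.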

The parameters in that Jinja macro that I’m unfamiliar with are

            ROMIO_HINTS = /home/563/dr4292/hints.txt
            OMPI_MCA_io = romio321

which are MPI-related environment variables. See https://wordpress.cels.anl.gov/romio/2008/09/26/system-hints-hints-via-config-file/
and

I might ask Dale for a copy of his ROMIO_HINTS file, as the current location doesn’t have read access.

Is anyone else familiar with ROMIO_HINTS and OMPI_MCA_io ?
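My rough understanding is that ROMIO_HINTS points ROMIO (the MPI-IO layer) at a hints file, while OMPI_MCA_io = romio321 tells Open MPI to use its ROMIO component for MPI-IO. A hedged sketch of pointing a task at your own copy of a hints file (the path below is hypothetical):

        [[[ environment ]]]
            # hypothetical path to a readable copy of the hints file
            ROMIO_HINTS = $HOME/romio_hints.txt
            OMPI_MCA_io = romio321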

BTW does anyone have @dale.roberts’s contact details?

I’d like to know what the contents of /home/563/dr4292/hints.txt are.

Thanks.

@Paul.Gregory here it is.

### Striping to match input file striping for next run
striping_factor 8
striping_unit 5242880
cb_nodes 8
cb_buffer_size 8388608

Note that this is tuned for AUS2200; you’ll need to adapt these values to your model. The striping_factor and cb_nodes settings were chosen such that no aggregation occurred between the I/O server tasks before writing. The 8 MB cb_buffer_size was picked because it is larger than 1/8th the size of a 64-bit STASH field. According to my notes the 5 MB Lustre stripe aligns with the field sizes in the model dump, which I guess also improves read speed for the restarts? Again, you’ll need to adapt these values for your configuration.
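To spell out that arithmetic for another grid (the grid size below is made up, purely to show the calculation):

# For an nx x ny 64-bit STASH field written through cb_nodes aggregators:
#   field_bytes     = nx * ny * 8
#   cb_buffer_size >= field_bytes / cb_nodes
# e.g. a made-up 1500 x 1200 grid with cb_nodes = 8:
#   field_bytes     = 1500 * 1200 * 8 = 14,400,000 bytes (~13.7 MiB)
#   field_bytes / 8 = 1,800,000 bytes (~1.7 MiB)  ->  the 8 MiB cb_buffer_size above is ample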


Thanks a lot Dale. Appreciated.