ACCESS-rAM3 'Flagship' Experiments

A thread to document large rAM3 experiments used as ‘flagships’ for the 21stCenturyWeather Centre-of-Excellence.

Here is the domain for the first experiment. We are using BARRA-R2 to initialise the land-surface, which constrains the outer nest extents to lie within the BARRA-R2 domain.

I had to make the following changes to the rAM3 defaults to cycle for two days.

  1. Changed the rose-suite metadata to allow rose edit to view four resolutions: python2 setup_metadata 4 4 (note: this will erase your existing rose-suite.conf file)
  2. Doubled NPROC to 36,32 for the 5 km resolution
  3. Increased NPROC to 54,48 for 1 km resolution
  4. Doubled WALL_CLOCK_LIMIT from 1800 (30 mins) to 3600 (60 mins) to run the 6-hour forecast tasks at 1 km resolution (they take about 40 minutes using 58x48 processors).
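For reference, items 2–4 are edits to the suite's rose-suite.conf (most easily made through rose edit). A sketch of the kind of entries involved is below; the rg01_rs0N_* keys are illustrative, following the rg01_rs0N_ioproc naming that appears later in this thread, and are not necessarily the suite's real variable names, so check rose edit for the exact keys.

# Illustrative rose-suite.conf fragment only; key names are assumptions
[jinja2:suite.rc]
# wall clock: was 1800 s; the 6-hour 1 km forecasts take ~40 minutes
WALL_CLOCK_LIMIT=3600
# 5 km nest decomposition (doubled)
rg01_rs02_nproc='36,32'
# 1 km nest decomposition
rg01_rs03_nproc='54,48'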

The current suite takes about 5.5 hours of wall time to simulate 24 hours; the forecast tasks themselves take just over 4 hours.

Running 24 hours of forecasts costs about 15 kSU and consumes about 750 GB of disk space on /scratch.

Here are the details of this configuration:

I repeated this configuration with rg01_rs0[1-3]_ioproc=48, which made no noticeable difference.

Further work:

  1. Try OpenMP threads. What is the UM namelist control to activate this? And will this require recompilation of the UM?
  2. Incorporate @m.lipson's World Cover surface data for the 1 km nest.
  3. Run some scaling tests for the 5 km and 1 km domains.

I’ve encountered some NPROC restrictions with the outer resolution. When I applied NPROC 36,32 to the 12 km nest, the model stopped with:

? Error message: Too many processors in the North-South direction. The maximum permitted is 32

What is causing this error? Is this a restriction of the GAL9 configuration? (I can run the RAL3P2 model with more CPUs.)


OpenMP doesn’t require a recompile: just set the environment variable $OMP_NUM_THREADS (normally done in the Cylc task configuration) and make sure the PBS number of CPUs is (MPI ranks) * (OMP threads).
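For illustration, a minimal sketch of what that looks like in a Cylc suite.rc runtime section (the task name and numbers below are made up, not taken from the rAM3 suite):

# Hypothetical task: 64 MPI ranks x 2 OpenMP threads = 128 PBS CPUs
[runtime]
    [[um_forecast_example]]
        [[[directives]]]
            -l ncpus = 128
        [[[environment]]]
            OMP_NUM_THREADS = 2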

The NPROC restriction is because there’s a minimum grid size for each MPI process: e.g. the halo of cells sent to the northern MPI rank can’t overlap with the halo of cells sent to the southern MPI rank.


I think the number of threads is required to be > 1 if you are using I/O servers; you may already be using them.


Thanks for sharing what you’ve been doing @Paul.Gregory.

Thanks Scott.

I found that in this suite the variable is set in site/nci-gadi/suite-adds.rc. You change
{% set UM_ATM_OMP=1 %}
to
{% set UM_ATM_OMP=2 %}

The Jinja macro atmos_resources(nx, ny, omp, mem) then computes the overall PBS CPU request using
-l ncpus = {{(nx * ny + ios) * omp}}
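As a worked example, using the 1 km nest’s 54,48 decomposition quoted earlier and assuming ios = 48 (the value in the AUS2200 macro further down, which may not be what this suite uses):

# -l ncpus = {{(nx * ny + ios) * omp}}
#          = (54 * 48 + 48) * 2 = (2592 + 48) * 2 = 5280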

There is appreciable speed-up for only a small increase in SUs, especially for the nests with lower CPU allocations.


Hi Paul

I just went through the introduction to running ACCESS-rAM3 here, and would like to run a simulation like this flagship experiment but with different nest regions. Would you suggest starting from your suite u-dq126, or from u-dg767/u-dg768?

Regards, Qinggang

Hi @qinggangg

It depends on how you want to configure your nests and what land surface model you are going to use.

How many nests are you planning to use, and how many grid points in each?

I’m about to swap u-dq126 over to a different ancillary suite which uses the World Cover data for a better representation of urban land surface regions. That may not be important for your experiments.

I’ll be in the office tomorrow if you want to catch up in person. It might be easier than discussing the pros/cons here.

Hi Paul

I am happy to catch up in person tomorrow. Do you have a preferred time?

I will sketch out the nests I have in mind for our discussion tomorrow. The outer domains are the same as BARRA-C2 and BARRA-R2, with two inner domains over the Great Barrier Reef region. I would like to use the same land surface model as C2 and R2.

The two inner domains will have grid spacings of 4.4 km and 1.1 km, both driven by the larger R2 nest.

I’m free all day.

BTW @Scott created some tools and notebooks to help configure your UM nests.

I’ve put them in a repo here:

Have a look at UM_configuration_tools/notebooks/Flagship_domain.ipynb in the 21centuryweather/UM_configuration_tools repository on GitHub.


Great. I’ve booked board room 406 for 11 am tomorrow; hope that works for you.

I will have a look at the repository before then.

An update.

With @mlipson’s help we have incorporated the World Cover ancillaries into the flagship experiment. The differences in the urban and vegetation fields around Sydney and across the full 1 km domain can be seen in UM_configuration_tools/notebooks/Check_Flagship_worldcover.ipynb in the 21centuryweather/UM_configuration_tools repository on GitHub.

There are issues with the suite when increasing the number of processors beyond 52x48 for the 1 km domain: every attempted increase (80x72, 72x70, 56x54) fails. The UM forecast tasks consume increasing amounts of walltime and fail intermittently.

I checked a 56x54 task which initially failed but then succeeded after resubmission. The stdout file (containing the solver residuals for every timestep) and stderr of the two runs are identical up to the point of failure. When using more processors, the UM forecast completes but the job then hangs; there is no error message explaining why it fails in the UM write-dump and STASH field-gathering routines, only the traceback below.

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libifcoremt.so.5   000014D48FFF2555  for__signal_handl     Unknown  Unknown
libpthread-2.28.s  000014D48D45F990  Unknown               Unknown  Unknown
libucp.so.0.0.0    000014D488BDB7D0  ucp_worker_progre     Unknown  Unknown
libmpi.so.40.30.5  000014D490608BFF  mca_pml_ucx_recv      Unknown  Unknown
libmpi.so.40.30.5  000014D4906E00BB  mca_coll_basic_ga     Unknown  Unknown
libmpi.so.40.30.5  000014D490771D30  PMPI_Gatherv          Unknown  Unknown
libmpi_mpifh.so    000014D490ADF5FC  Unknown               Unknown  Unknown
um-atmos.exe       0000000003389BFE  mpl_gatherv_               98  mpl_gatherv.F90
um-atmos.exe       0000000000663EAB  gather_field_mpl_         222  gather_field_mpl.F90
um-atmos.exe       0000000000663606  gather_field_mod_         127  gather_field.F90
um-atmos.exe       0000000000658848  stash_gather_fiel         399  stash_gather_field.F90
um-atmos.exe       000000000076F73F  general_gather_fi         413  general_gather_field.F90
um-atmos.exe       0000000000BDF0BB  um_writdump_mod_m         522  um_writdump.F90
um-atmos.exe       0000000000BDD65C  dumpctl_mod_mp_du         207  dumpctl.F90
um-atmos.exe       00000000004F065D  u_model_4a_mod_mp         452  u_model_4A.F90
um-atmos.exe       000000000040CA38  um_shell_mod_mp_u         748  um_shell.F90
um-atmos.exe       00000000004093F8  MAIN__                     60  um_main.F90
um-atmos.exe       00000000004093A2  Unknown               Unknown  Unknown
libc-2.28.so       000014D48D0B17E5  __libc_start_main     Unknown  Unknown
um-atmos.exe       00000000004092AE  Unknown               Unknown  Unknown

I assume the job fails because I have not correctly configured an I/O server. @srennie from the BoM kindly pointed me to this page on the UKMO Trac site containing UM hints for running on gadi (the link requires a MOSRS account): https://code.metoffice.gov.uk/trac/nwpscience/wiki/bomnwpscience/AccessNWP_SuiteOptimisations

It also contains links on configuring the I/O server, which I’ll implement in the coming weeks.

In addition to improving I/O, another option is to replicate the AUS2200 suite. Dale Roberts wrote a lot of documentation about the optimisations he achieved with AUS2200 when running on the gadi Sapphire Rapids nodes; see here:

These optimisations are visible in the AUS2200 suite at:

https://code.metoffice.gov.uk/trac/roses-u/browser/c/s/1/4/2/trunk/site/nci-gadi/suite-adds.rc

{% macro atmos_resources(nx, ny, omp, mem) %}
    {% set ios = 48 %}
    {% if ( 'normal' == UM_ATM_NCI_QUEUE ) or ( 'express' == UM_ATM_NCI_QUEUE ) %}
        {% set cores_per_node = 48 %}
    {% elif ( 'normalbw' == UM_ATM_NCI_QUEUE ) or ( 'expressbw' == UM_ATM_NCI_QUEUE ) %}
        {% set cores_per_node = 28 %}
    {% elif ( 'normalsl' == UM_ATM_NCI_QUEUE ) %}
        {% set cores_per_node = 32 %}
    {% elif ( 'normalsr' == UM_ATM_NCI_QUEUE ) or ( 'expresssr' == UM_ATM_NCI_QUEUE ) %}
        {% set cores_per_node = 104 %}
    {% endif %}
    {% set threads_per_node = 96 %}
    {% set nnodes = (( nx * ny + ios ) * omp / threads_per_node)|round(0,'ceil')|int %}
        [[[ directives ]]]
            -q          = {{UM_ATM_NCI_QUEUE}}
            -l ncpus    = {{ nnodes * cores_per_node }}
            -l mem      = {{mem * (nx * ny + ios) * omp}}mb
            -l jobfs    = {{mem * (nx * ny + ios) * omp}}mb
        [[[ environment ]]]
            UM_ATM_NPROCY   = {{ny}}
            UM_ATM_NPROCX   = {{nx}}
            OMP_NUM_THREADS = {{omp}}
            FLUME_IOS_NPROC = {{ios}}
            #ATMOS_LAUNCHER = mpirun -n {{nx * ny + ios}} --map-by node:PE={{omp}} --rank-by core
            {# spr needs special binding options thanks to many NUMA nodes #}
            {% if ( ( 'normalsr' == UM_ATM_NCI_QUEUE ) or ( 'expresssr' == UM_ATM_NCI_QUEUE ) ) and ( threads_per_node < cores_per_node ) %}
            ATMOS_LAUNCHER  = mpirun -n {{nx * ny + ios}} --map-by ppr:$(( {{threads_per_node}} / $PBS_NCI_NUMA_PER_NODE / {{omp}} )):numa:PE={{omp}} --rank-by core
            {% else %}
            ATMOS_LAUNCHER  = mpirun -n {{nx * ny + ios}} --map-by ppr:{{(threads_per_node/omp/2)|round|int}}:socket:PE={{omp}} --rank-by core
            {% endif %}
            ROMIO_HINTS = /home/563/dr4292/hints.txt
            OMPI_MCA_io = romio321
        [[[job]]]
            execution time limit =  PT{{WALL_CLOCK_LIMIT}}S
{% endmacro %}
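To make the macro’s arithmetic concrete, here is a hedged worked example; the decomposition and thread count are illustrative (borrowed from values discussed earlier in this thread), not the AUS2200 suite’s actual settings:

# Illustrative values: nx=54, ny=48, ios=48, omp=2, UM_ATM_NCI_QUEUE='normal' (cores_per_node=48)
#   MPI ranks     = nx*ny + ios              = 2592 + 48 = 2640
#   total threads = ranks * omp              = 5280
#   nnodes        = ceil(5280 / 96)          = 55
#   -l ncpus      = nnodes * cores_per_node  = 55 * 48 = 2640
# i.e. two threads per physical core, which appears to rely on hyperthreading
# (threads_per_node = 96 vs cores_per_node = 48 on the normal queue). Note this differs
# from the simpler -l ncpus = {{(nx * ny + ios) * omp}} used in the flagship suite above.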

The parameters in that Jinja macro that I’m unfamiliar with are

            ROMIO_HINTS = /home/563/dr4292/hints.txt
            OMPI_MCA_io = romio321

which are MPI-related environment variables. See https://wordpress.cels.anl.gov/romio/2008/09/26/system-hints-hints-via-config-file/
and

I might ask Dale for a copy of his ROMIO_HINTS file, as the current location doesn’t have read access.

Is anyone else familiar with ROMIO_HINTS and OMPI_MCA_io ?
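My rough understanding is that ROMIO_HINTS points ROMIO (the MPI-IO layer) at a hints file, while OMPI_MCA_io = romio321 tells Open MPI to use its ROMIO component for MPI-IO. A hedged sketch of pointing a task at your own copy of a hints file (the path below is hypothetical):

        [[[ environment ]]]
            # hypothetical path to a readable copy of the hints file
            ROMIO_HINTS = $HOME/romio_hints.txt
            OMPI_MCA_io = romio321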

BTW does anyone have @dale.roberts’s contact details?

I’d like to know what the contents of /home/563/dr4292/hints.txt are.

Thanks.

@Paul.Gregory here it is.

### Striping to match input file striping for next run
striping_factor 8
striping_unit 5242880
cb_nodes 8
cb_buffer_size 8388608

Note that this is tuned for AUS2200; you’ll need to adapt these values to your model. The striping_factor and cb_nodes settings were chosen such that no aggregation occurred between the I/O server tasks before writing. The 8 MB cb_buffer_size was picked because it is larger than 1/8th the size of a 64-bit STASH field. According to my notes the 5 MB Lustre stripe aligns with the field sizes in the model dump, which I guess also improves read speed for the restarts? Again, you’ll need to adapt these values for your configuration.
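To spell out that arithmetic for another grid (the grid size below is made up, purely to show the calculation):

# For an nx x ny 64-bit STASH field written through cb_nodes aggregators:
#   field_bytes     = nx * ny * 8
#   cb_buffer_size >= field_bytes / cb_nodes
# e.g. a made-up 1500 x 1200 grid with cb_nodes = 8:
#   field_bytes     = 1500 * 1200 * 8 = 14,400,000 bytes (~13.7 MiB)
#   field_bytes / 8 = 1,800,000 bytes (~1.7 MiB)  ->  the 8 MiB cb_buffer_size above is ample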


Thanks a lot Dale. Appreciated.