Compiler optimisation flags for MOM5

Hi,

I would like to follow up on an interesting presentation by @spencerwong and @manodeep, where they mentioned that new compiler optimisation flags have helped speed up ESM1.6 and ESM1.5 by a substantial amount.

Would it be possible to please point me to the optimisation flags that were used to compile MOM5 in this faster configuration? I’m interested in using these tricks for my GFDL model, which is built off the same codebase.

Regards,

David

Hi David,

The speedups we have seen have primarily come from switching over to the sapphirerapids queue and improving the load balancing between the ocean and atmosphere. While I have tested a few compiler flags, nothing has given a consistent speedup in MOM5. @spencerwong can chime in too - my memory is not exactly a reliable narrator :sweat_smile:

@dkhutch Are you specifically looking for MOM5 speedup, or are you interested in ESM1.5/ESM1.6 performance improvement (say for the pre-industrial config)?

Thanks!
Manodeep

1 Like

Thanks Manodeep. Ok, good to know you haven’t necessarily got MOM5 to run faster. For the GFDL coupled model, yes, it’s just about getting the MOM5 compiler optimisation to the best settings. It might still be worth checking what the latest settings are, because I’ve tended to lag behind by several years (e.g. I don’t know whether I should be using a oneAPI compiler, or even how I would set that up).

In regards to ESM1.5, I would certainly be interested to know if there are ways of updating existing runs (which use the ACCESS-NRI supported releases) to go a bit faster. It was mentioned in the notes that ESM1.5 could speed up from ~65 min to ~58 min per year, with the SU cost dropping from ~950 to ~810 per year. I would love to take advantage of that, even for runs that are already in progress. My runs take a really long time, and as long as I document the changes, I see no issue with switching over mid-way through.

Hi David,

We are currently working on detailed optimisation for ESM1.6, and planning to backport the improvements to a new 1.5 release once the ESM1.6 work is done. That work includes switching to the latest oneAPI compiler and more recent versions of software dependencies (including OpenMPI), moving to the sapphirerapids queue, and changing some of the parameters.

For example, the best-case throughput I have seen for ESM1.5 is ~26 years/wall-day on the sapphirerapids queue (4 nodes, 55 min wall-time and 775 SU cost for a 1-year run). Would it be useful for me to list the specific config changes that boost performance (with the caveat that what holds for the oneAPI-compiled binary may not hold for the released exe compiled with classic Intel)?
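As a sanity check, the quoted throughput follows directly from the per-year wall-time. This is plain arithmetic on the numbers above, assuming back-to-back 1-year submissions with no queue wait:

```python
# Throughput implied by a 55-minute wall-time per model year,
# assuming runs go back-to-back with no time lost in the queue.
minutes_per_wall_day = 24 * 60
wall_minutes_per_model_year = 55

years_per_wall_day = minutes_per_wall_day / wall_minutes_per_model_year
print(round(years_per_wall_day, 1))  # -> 26.2, i.e. ~26 model years/wall-day
```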

Thanks!
Manodeep (& Spencer)

2 Likes

Hi Manodeep and Spencer,
Yes that would be awesome to see the config changes that would give this kind of performance, thanks!
Regards David

Hi David,

There are four sets of changes for the ESM1.5 PI config: i) config.yaml, to change the queue, the CPU partitioning, and an MPI parameter that seems to boost performance; ii) atmosphere/um_env.yaml, for the UM cores and layout; iii) atmosphere/namelists, to change segment sizes; and iv) ocean/input.nml, for the ocean layout.

config.yaml changes

1. Add the following to swap over to sapphirerapids, and specify what each node on the sapphirerapids queue contains:

   ```yaml
   queue: normalsr
   platform:
     nodesize: 104
     nodemem: 512
   ```

2. Update the atmosphere ncpus to 240, and the ocean ncpus to 156.

3. Add the following:

   ```yaml
   mpi:
     flags:
       - --mca mpi_yield_when_idle true
   ```
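Pulling the config.yaml fragments together, the result might look roughly like the sketch below. The `submodels` nesting and the submodel names are assumptions based on a typical payu config for ESM1.5 (and other submodels, e.g. sea ice and the coupler, are omitted here) - check against your own file rather than copying this verbatim:

```yaml
# Sketch only: surrounding keys and submodel names follow a typical
# payu ESM1.5 config.yaml and may differ from yours.
queue: normalsr
platform:
  nodesize: 104   # cores per sapphirerapids node
  nodemem: 512    # memory (GB) per node

submodels:
  - name: atmosphere
    model: um
    ncpus: 240    # 16 x 15 UM decomposition
  - name: ocean
    model: mom
    ncpus: 156    # 13 x 12 MOM5 layout
  # ... remaining submodels (ice, coupler) unchanged ...

mpi:
  flags:
    - --mca mpi_yield_when_idle true
```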

atmosphere/um_env.yaml

- Change UM_ATM_NPROCX to ‘16’, UM_ATM_NPROCY to ‘15’, and UM_NPES to ‘240’ (the quotes around the layout core numbers are necessary)
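As a fragment, that um_env.yaml change would look like the following (assuming these keys sit at the top level of the file, as they do in a standard ESM1.5 config):

```yaml
# UM decomposition: 16 x 15 = 240 cores.
# The quotes are required - these values are read as strings.
UM_ATM_NPROCX: '16'
UM_ATM_NPROCY: '15'
UM_NPES: '240'
```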

atmosphere/namelists

- Set all three parameters A_CONVECT_SEG_SIZE, A_SW_SEG_SIZE and A_LW_SEG_SIZE to 24

ocean/input.nml

- Change layout to 13,12 under &ocean_model_nml
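The layouts have to stay consistent with the ncpus values in config.yaml, and together they determine the node count. A quick check using only the numbers listed above (any additional submodel cores, e.g. sea ice, are not counted here):

```python
# Check the decompositions against the requested core counts, and
# see how the job packs onto 104-core sapphirerapids nodes.
import math

um_cores = 16 * 15      # UM_ATM_NPROCX x UM_ATM_NPROCY
ocean_cores = 13 * 12   # MOM5 ocean layout
total = um_cores + ocean_cores

assert um_cores == 240      # must match UM_NPES and the atmosphere ncpus
assert ocean_cores == 156   # must match the ocean ncpus

nodes = math.ceil(total / 104)
print(total, nodes)  # -> 396 4 (consistent with the 4-node runs above)
```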

Hopefully making these changes should be straightforward at your end; otherwise we can hop on a zoom call to sort that out.

Note the results will not be bitwise-identical with a previous config since the ocean layout is being changed.

Tagging @georgyfalster - you might be interested in this as well.

Thanks!

1 Like

Thanks Manodeep. I’m trying a run now with these changes. I’ll let you know what performance I get.

Hi @manodeep @spencerwong ,
I can confirm that with these changes, my ACCESS-ESM1.5 run can now complete 1 year in ~57 min, costing 789 SUs. This is a really nice improvement as the run would previously take more like 64-65 min and cost 930-960 SUs per year.
Really great to be able to take advantage of these changes!!

I’m tagging a few paleo people who might also want to know about this.
@HIMADRI_SAINI @LaurieM @BenjaminAnthonisz @gpontes @YanxuanD @jbrown @georgyfalster .

7 Likes

Fantastic! Thanks for reporting back the performance improvement :smiley:

Hmmm… I suspect this is a system problem on Gadi, but in the last 24 hours all of my jobs have timed out unexpectedly, multiple times. Even with a 2-hour time limit, jobs are failing to complete (when they should take more like 1 hour!). This is affecting both normal and normalsr jobs. I can’t make sense of any of it. Wondering if Gadi is having a bad couple of days. :man_shrugging:

Thanks David - yes, something happened on Gadi at the end of the week and performance was down nearly 2x. Thankfully, performance seems to be back to normal now.

amazing, thanks!! I’ll hopefully implement this in my runs next year (I am, as always, out of compute)