COSIMA TWG Meeting Minutes 2023

Summary of my notes from the COSIMA TWG meeting today. Please add whatever I’ve missed/misrepresented (this is a wiki post so can be directly edited, or add a comment below to expand or add additional points to topics that were raised)

Date : 2023-02-08
Attendees : Micael Oliveira @micael, Andrew Kiss @aekiss, Angus Gibson @angus-g, Harshula Jayasuriya @harshula, Aidan Heerdegen @Aidan, Siobhan O’Farrell @sofarrell

Licensing

  • Restating of previous discussions. The choice between permissive and non-permissive has been made: definitely want permissive. The choice now is between a license that requires changes to be contributed back when code is redistributed and one where this is not required.
  • Decided need to provide these two choices and some contextual information, pros/cons etc, to main community to make decision. @micael agreed to start a topic outlining this information.

ERA5

  • JRA55-do will stop being updated in the near future (link to definitive info?). Groups such as GFDL are looking to develop an ocean driving product based on ERA5. Potentially makes the work to integrate ERA5 forcing into ACCESS-OM2 even more important:
    – ACCESS-OM2 will no longer be able to be run with up to date forcing product to investigate recent events of scientific and/or societal importance
    – ACCESS-OM3 is less well characterised/understood, which makes evaluating a forcing product more difficult. Comparing both products in ACCESS-OM2 makes it easier to assess differences between them, informing OM3 work.

Update on ACCESS-OM3 source build

  • All components now build using CMake, except the driver, which is not yet done. Was going to write one, then saw that a generic driver, ESMx, was available. Will investigate using this.

MPI libraries on Gadi

  • @micael did some benchmarks using the Octopus code, which has some similarities with MOM6 (it solves time-dependent differential equations on a grid using finite differences). He sees that Intel MPI is slower than OpenMPI for communication (as expected) but faster for the memory and CPU bound parts of the code (somewhat surprising). Despite slower communication, this leads to faster overall times for one specific case, using ~1000 cores. For profiling of MOM6, we should make sure to test both libraries.

Update on progress with ACCESS-OM2 spack build

  • @harshula gave a short overview of what is needed to make a spack package file, which is required to build a software project with spack. Illustrated this with two examples: libaccessom2 which has a complete and well configured build system, and oasis3-mct which does not.
  • Entire ACCESS-OM2 build completes automatically in a few minutes on his laptop with gcc compiler.

Not quite: communications are slower with Intel MPI, but the CPU and memory bound parts of the code (i.e., the ones that don’t perform any communications) are faster.

Please edit the post to reflect what you said @micael.

I think it is better to make the “minutes” as accurate as possible, rather than require readers to look at all the comments to get the full picture. Doesn’t mean you have to delete the comment, it is useful to know there was a problem that was updated.

Thanks for the correction, I must’ve misunderstood you. That seems even weirder than I thought!

Summary of my notes from the COSIMA TWG meeting today. Please add whatever I’ve missed/misrepresented (this is a wiki post so can be directly edited, or add a comment below to expand or add additional points to topics that were raised)

Date : 2023-03-08
Attendees : Micael Oliveira @micael, Andrew Kiss @aekiss, Aidan Heerdegen @Aidan, Siobhan O’Farrell @sofarrell, Andy Hogg @AndyHoggANU, Adele Morrison @adele157, Kieran Ricardo @kieranricardo, Paul Leopardi @paulleopardi, Matt Chamberlain @matthew.chamberlain, Martin Dix @MartinDix, Dougie Squire @dougiesquire

MOM6-SIS2 scaling

@micael described scaling tests of regional pan-antarctic simulation

  • Scaling tests show how many cores it is possible to use while not wasting too many resources. Runs tried to stay close to real production runs, so that the time spent in Init and finalise is realistic
  • Ocean component scales better than the ice component and the atmospheric forcing. Not an issue as most time is spent in the ocean part.
  • Might be worth changing the IO layout to improve IO performance.
  • Rui Yang performed a more detailed analysis of what is happening in the code during one run (1/10 degree with 962 cores). Main findings:
    • High MPI overhead: a lot of waiting due to load imbalance. This seems to come from the open boundaries
    • Little vectorisation

Questions/Suggestions:

  • Is boundary forcing read in every timestep? Files are daily, not sure if they are being read in once for a day
  • Maybe optimise layout
  • Could try testing global 0.1 scaling and see what difference boundaries make
  • Could ask for advice on a MOM6 forum
  • Could we make boundary cells smaller, give them less work to do, or change affinity and assign more CPUs
  • Compressed and chunked in time? Angus made sure chunk size is 1 in time

ACCESS-OM3 Plans

  • A lot depends on CMIP7 post meeting
  • Interest in ACCESS-CM3 and ACCESS-ESM3 based on ACCESS-OM3. Tight timelines for IPCC reporting. Spin-up is time consuming. Hard to do in parallel and hit deadlines.
  • A two-pronged approach is possible: the above, plus a fallback approach of adding ESM components to CM2
  • No progress on decision yet. @AndyHoggANU working on notes to get out to people to get a decision. Chicken-egg problem
  • COSIMA can benefit from ACCESS-CM3, get OM3 up and running to show possible. Would be logical to use it. Timelines still quite uncertain
  • Hitting IPCC timeline means impact, but also international scrutiny of model, which is beneficial
  • COSIMA shift priorities to global, perhaps 1 deg global demonstrator which could be picked up by coupled model => implies shared codebase.
  • Also issue of porting of WOMBAT to MOM6.
  • CM work can start with the NCAR 1 degree configuration @dougiesquire got working. Can use this as a test-bed to work out coupler, land-masks, fractional land cells etc. UM coupling still under development. Have an AMIP config working with a data model. Not proper two-way coupling yet. No technical challenges to it working.
  • UM work can be done in parallel with OM3 and bolted on later
  • Yes, COSIMA wants to go down this path to facilitate CMIP plan
  • Initial conditions ok. Forcing will use CDEPS datm and driver models to provide JRA or ERA5 forcing to the model. Not planning to port yatm. Should mean more forcing products can be used, and flexibility in blending forcing products.
  • Low res configs @dougiesquire has set up are using CESM input files and parameters. Do we want to use these, and topography? Or harmonise with previous OM2 configs?
  • Start with CESM inputs and gradually change?
  • Benefits: OM3 being part of CM3/ESM3 would be a big tick on the COSIMA grant proposal and an important achievement. If we know it can be ready in time, it will have a leveraging effect on the community. If we can’t manage it, the impetus will fade. Needs thought about the plan, who is available etc. If we delay too much it will not be an option. Risks in delay are large. Risks in starting are small.
  • Planning 1/4, 1/10 and 1/25 global. Not sure if 1 degree is necessary? 1 degree is useful from a technical pov, testing WOMBAT etc.
  • @paulleopardi if 1 degree useful for optimisation, might be useful. Use fewer cores and vectors might end up similar length.
  • For climate/ESM, 1 degree is still the main workhorse. 0.25 degree is used, but that is the highest resolution currently used
  • Look at points of failure, technical challenges, performance bottlenecks. Identify pitfalls and how quickly they can be dealt with.
    – Some questions about maturity of C-grid CICE6? Coupling is via A-grid, even though model on C-grid. Might be dealt with at Bluelink meeting.
  • Should ACCESS-NRI formally be part of CICE6 consortium? Some resourcing requirements (FTEs per year)
  • Focus on CMIP7 potentially puts Wavewatch on the back burner. Work for CM3 doesn’t include a wave component. Already have a coupled version with Wavewatch, but will be focussing on coupled model parameterisations.
  • Wavewatch is important for atmospheric boundary layer. But don’t put in immediately. Could have there and turn on? Plan was always MOM6/CICE6 first, so sooner that is ready sooner we can add in Wavewatch.
  • Close to distributing a technically working ACCESS-OM3 with Wavewatch that interested parties could test and give feedback on
  • Vision is to be able to switch out components for testing, so data ocean for testing wavewatch
  • What are next steps of MOM6/CICE6 configuration? @dougiesquire has a “thrown together” config, low level of confidence in outputs. Happy to share, but very likely to have issues. Made them work with payu.
  • Initial WW3 coupled MOM6-SIS2 config (GitHub - shuoli-code/MOM6_WW3_SIS2_coupled)
  • Configs using development version of WW3 for coupling to CICE6. Parameter settings seemed to be missing/confusing.
  • Put @dougiesquire’s config on GitHub? Will work out of the box with new build system executable
  • Build system is 90% done.
  • Merge payu driver.

ParallelIO in CICE6

  • Same as that in CICE5? Does that need to be worked on?
  • There is ParallelIO in CICE6. Not sure how performant it is.
  • Parallel scaling at high resolution a later concern, focussing on climate initially.

@micael I missed some of the detail of the scaling stuff at the beginning, so please do fix that up, I was intentionally fairly vague because I wasn’t exactly sure.

Notes from the COSIMA TWG meeting today. Feel free to add whatever I’ve missed or to modify anything that I got wrong.

Date : 2023-04-12
Attendees : Micael Oliveira @micael, Andrew Kiss @aekiss, Aidan Heerdegen @Aidan, Paul Leopardi @paulleopardi, Martin Dix @MartinDix, Dougie Squire @dougiesquire, Rui Yang @rui.yang, Russ Fiedler @russfiedler, Angus Gibson @angus-g

ACCESS-OM3 Status

  • We now have something that compiles.
  • We are using the CESM driver.
  • There are still a few things to fix, but it should be possible to test the model.
  • Next step is to try to run some tests.
  • Spack is being used to build some of the dependencies on Gadi.
  • Spack configuration has some issues:
    • Packages are installed on /scratch.
    • Some environment modules do not modify the paths, so the libraries cannot be found by the loader
  • payu support also needs some fixes and improvements.

MOM6-SIS2 scaling and profiling

  • Still not clear what is causing the MPI imbalance.
  • Imbalance present both in global and regional model.
  • Rui Yang is looking into profiling the I/O.
  • Micael will get in touch with Marshall Ward to see if he has any ideas.

ACCESS-NRI Intake catalog

  • ACCESS-NRI wants to provide a catalog of intake-esm catalogs.
  • Dougie has been working on this and has a proof-of-concept ready.
  • Feedback is most welcome at this stage.

Hi folks,
Not sure if this is the right place to ask but I see WW3 mentioned in these notes…

I’m aware ACCESS-NRI has plans to couple WW3 to ACCESS, is that part of the full coupled model/ESM, or specifically in the COSIMA space, ie ACCESS-OM?
Asking as I know things have moved on a lot in recent years, with hooks being added for OASIS coupling for the UM (e.g. Extend parameter set for coupled modelling · Issue #206 · NOAA-EMC/WW3 (github.com)). Just as a sort of FYI, we had a postdoc here in CSIRO about 8 years ago who did early investigations into coupling ACCESS to WW3 (focussing on the atmospheric side). I really need to reclaim some disk space on Gadi, so I’m just in the process of archiving her work. It’s back with v7.3 of the UM, so it’s probably so out of date as to be useless anyway, but I thought I’d just put a callout here in case anyone wants it kept slightly more accessible for reference in future NRI work?
Honestly I think recent progress in WW3 would negate anything she’s done, I’m just keeping it for historical record, but if it’s of interest please let me know.

cheers
Claire

For ACCESS-OM3 (MOM6-CICE6-WW3) COSIMA has switched to using ESMF/NUOPC coupling (specifically CMEPS) instead of OASIS, as discussed here. The vision is for this to be coupled to the UM (again with NUOPC) for ACCESS-CM3 - e.g. see slides 17-22 of my presentation at the 2022 COSIMA workshop. ACCESS-OM3 is in very early development, and currently using the latest development version of WW3, i.e. newer than the current 6.07.1 release which is 4 years old.

Thanks Andrew, that’s useful. Most of my info on COSIMA is from your Bluelink Science Meeting preso which did not go into so much detail :slight_smile:
And yeah, it’s pretty hard to justify using the current production version of WW3 for much developmental work at the moment given its age and the large amount of additional capability that’s since been added!
Thanks, I will duly consider Elodie’s work to be entirely historical and proceed to archive.

Notes from last week’s COSIMA TWG. Feel free to add whatever I’ve missed or to modify anything that I got wrong.

Date : 2023-08-09
Attendees : Micael Oliveira @micael, Andrew Kiss @aekiss, Aidan Heerdegen @Aidan, Paul Leopardi @paulleopardi, Dougie Squire @dougiesquire, Rui Yang @rui.yang, Angus Gibson @angus-g, Harshula Jayasuriya @Harshula, Ezhil Kannadasan @ezhilsabareesh8, Siobhan O’Farrell @sofarrell

MOM6 Test Cases

  • Call is out for MOM6 test cases (see A call for community model configurations that use MOM6)
  • Question about how to run the tests in practice from a technical point of view (see Technical requirements for MOM6 node testing):
    • How to launch tests on Gadi now that access-dev is going to be decommissioned? Should we use GitHub Actions?
    • ACCESS-NRI is working on setting up CI infrastructure (VM on Nirin running GitHub runners)
    • Tests can be set up while waiting for infrastructure. Start with one test (e.g., Panan) to work out process.
    • Should tests go to a git repository, just like mom6-examples? Too much maintenance?
      • Depends on how similar tests are to production. Should be fairly close, so production configs get tested.
      • Maybe we could patch production configs for tests - eg shorter run, fewer diagnostics
      • We should use Payu. NRI needs to be broader, since other models use rose/cylc.
    • How to deal with cases that aren’t bitwise reproducible, but close enough? We can start with manual checks. MOM6 PRs usually have notes as to whether bitwise reproducibility is expected
    • MOM consortium requires nomination of someone to sign off on tests. Suggestion for Angus to do it, plus someone else as backup.
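
One way the "bitwise reproducible vs close enough" distinction could eventually be automated is sketched below. This is only an illustration under assumptions: the function name, the flat list-of-floats input and the tolerance are hypothetical, and a real check would read model output and choose tolerances per field.

```python
import math
import struct


def compare_fields(reference, candidate, rel_tol=1e-12, abs_tol=0.0):
    """Classify two equal-length sequences of floats.

    Returns "bitwise" if every value has an identical 64-bit pattern,
    "close" if all values agree within the given tolerances, and
    "different" otherwise.
    """
    if len(reference) != len(candidate):
        raise ValueError("fields have different lengths")

    pairs = list(zip(reference, candidate))

    # Compare the raw IEEE 754 bytes so that, e.g., 0.0 and -0.0 differ
    if all(struct.pack("<d", a) == struct.pack("<d", b) for a, b in pairs):
        return "bitwise"

    if all(math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol) for a, b in pairs):
        return "close"

    return "different"


print(compare_fields([1.0, 2.0], [1.0, 2.0]))
print(compare_fields([1.0, 2.0], [1.0 + 1e-15, 2.0]))
print(compare_fields([1.0, 2.0], [1.1, 2.0]))
```

A "close" result would still need a manual look, as noted above, unless the PR states that answers are expected to change.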

ACCESS-OM3 Status

  • Setting up new configuration to be close to OM2
    • grids etc done, JRA55-do done, separate PRs, need to merge once crash is fixed; parameter settings still need attention.
  • Ongoing comparison of ACCESS-OM3 and CIME builds. Still some differences in results, but closing in. Checking compiler flags, including for dependencies.
  • Found mismatched constants between model components
    • defaults in code; can be overridden via input at runtime
    • see what parameters CIME fixes
  • Will tag configurations and executables matching CIME builds as a starting point, and have history for reproducibility.
  • Siobhan noted that CICE6-WW3 configuration doesn’t run properly because DOCN needs to supply additional fields, and NUOPC cap for CICE6 needs options to pick up ocean currents from DOCN.

ACCESS-OM3 Release Process

  • A release of ACCESS-OM3 should include all the things needed to reproduce a given run: codebase, spack env, configs, inputs.
  • We would like to have the same version number in all tags
  • Open questions:
    • how to version the inputs?
    • how about CI docker images - tag those?
  • ACCESS-NRI release team working on these issues.
    • If NRI does releases, NRI will take responsibility for tests, updates to software stack
    • OM3 can be a test case to see what works
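
The wish for the same version number in all tags could be checked mechanically. The sketch below assumes the tags have already been collected into a dictionary; the component names and version strings are hypothetical, and how the tags would be gathered (e.g. from git or from input directory names) is left open.

```python
from collections import defaultdict


def release_is_consistent(tags):
    """True if every release component carries the same tag."""
    return len(set(tags.values())) == 1


def group_by_tag(tags):
    """Map each tag to the sorted list of components that carry it."""
    groups = defaultdict(list)
    for component, tag in sorted(tags.items()):
        groups[tag].append(component)
    return dict(groups)


# Hypothetical tags for the four parts of a release listed above
tags = {"codebase": "1.2.3", "spack env": "1.2.3", "configs": "1.2.3", "inputs": "1.2.2"}

if not release_is_consistent(tags):
    for tag, components in group_by_tag(tags).items():
        print(f"{tag}: {', '.join(components)}")
```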

Notes from last week’s COSIMA TWG. Feel free to add whatever I’ve missed or to modify anything that I got wrong.

Date : 2023-09-13
Attendees : Micael Oliveira @micael, Andrew Kiss @aekiss, Aidan Heerdegen @Aidan, Dougie Squire @dougiesquire, Angus Gibson @angus-g, Ezhil Kannadasan @ezhilsabareesh8, Siobhan O’Farrell @sofarrell, Jo Basevi @jo-basevi, Martin Dix @MartinDix

ACCESS-OM3 Update

  • 2 releases:
    • 0.1.0: this mimics the CESM version we started with (over one year old)
    • 0.2.0: updated to newer CESM - nearly cutting edge
  • Refined release process:
    • all inputs, exes, configs, spack, CI tagged simultaneously, even if unchanged from previous tag
    • input directories are also named for the same tag
    • development tag is x, e.g. 0.x.0 is the 0.* dev branch
  • Work on configurations:
    • One git repository for config, different flavours (e.g., forcing, grid resolution) are branches
    • Currently 2 long-lived branches in MOM6-CICE6
      • the CESM compset, unmodified - for testing only
      • 1deg JRA55do RYF - OM3 candidate
    • work already done to be like OM2, but need to update these to be compatible with 0.2.0
    • main not used for configurations - just a README explaining to check out a branch
    • some documentation on git practices Git practices · COSIMA/access-om3 Wiki · GitHub

AH: suggests as you are already using complete paths in config.yaml to individual files (not dirs) you can then do away with having a tagged dir for each release, and only update paths to files that have changed. You can use a database for finding out which configs use a given input file.
MO: we’ll see how we go with the current plan - it’s not that onerous.
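
The "database for finding out which configs use a given input file" could be as simple as an inverted index from file path to configuration. The sketch below is a minimal in-memory version; the configuration names and file paths are hypothetical placeholders, and in practice the paths would be read from each configuration's config.yaml.

```python
from collections import defaultdict


def build_input_index(config_inputs):
    """Invert a mapping of config -> input paths into path -> configs."""
    index = defaultdict(set)
    for config, paths in config_inputs.items():
        for path in paths:
            index[path].add(config)
    # Sorted lists give a stable, readable result
    return {path: sorted(configs) for path, configs in index.items()}


# Hypothetical configurations and input files, for illustration only
config_inputs = {
    "config_a": ["/inputs/grid.nc", "/inputs/topog.nc"],
    "config_b": ["/inputs/grid.nc", "/inputs/forcing.nc"],
}

index = build_input_index(config_inputs)
print(index["/inputs/grid.nc"])
```

With such an index, only configurations that reference a changed file would need their paths updated.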

ACCESS-OM3 Plans

  • Short term:
    • keep working on configs
    • parallelisation, scalability, processor layout options with NUOPC
    • at present all components are using all 48 cores, so components run one after the other

AK: would be good to check scaling for the sort of core count we expect to use, eg ~200 cores for 1deg. Want to use more than 1 node.
MO: need time per iteration for each component as a function of core count, so concurrently running components complete in a similar time. Currently all components run in serial, whereas in OM2 they run in parallel. Probably we want a combination of both.
DS: there are files in config giving core counts for components on different machines - could be useful as a reference for scaling. For fully active config, atmospheric component is hardwired in driver to never run concurrently with land or ice, so they should be overlapped on PEs.
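
To illustrate the balancing problem MO describes, the sketch below takes time per iteration as a function of core count for two components that run concurrently, and picks the split of a fixed core budget that minimises the time of the slower one. All timings are hypothetical, only two components are considered, and the 192-core budget (four 48-core nodes) is an assumption for the example.

```python
def best_core_split(ocean_times, ice_times, total_cores):
    """Choose core counts for two concurrently running components.

    Each *_times dict maps core count -> measured time per iteration.
    The step time is set by the slower component, so we minimise that,
    breaking ties by the smaller idle time.
    """
    best = None
    for ocean_cores, t_ocean in ocean_times.items():
        for ice_cores, t_ice in ice_times.items():
            if ocean_cores + ice_cores > total_cores:
                continue
            step_time = max(t_ocean, t_ice)
            idle_time = abs(t_ocean - t_ice)
            candidate = (step_time, idle_time, ocean_cores, ice_cores)
            if best is None or candidate < best:
                best = candidate
    if best is None:
        raise ValueError("no combination fits within total_cores")
    step_time, idle_time, ocean_cores, ice_cores = best
    return {
        "ocean_cores": ocean_cores,
        "ice_cores": ice_cores,
        "step_time": step_time,
        "idle_time": idle_time,
    }


# Hypothetical seconds per iteration at each core count
OCEAN_TIMES = {96: 4.0, 144: 2.8, 168: 2.4, 192: 2.2}
ICE_TIMES = {24: 2.5, 48: 1.4, 96: 0.9}

print(best_core_split(OCEAN_TIMES, ICE_TIMES, total_cores=192))
```

A mixed layout, with some components sharing cores and running one after the other while others run concurrently, would need a more general search than this two-component case.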

MD: Kieran finds CICE restart makes whole thing grind to a halt
MO: is CICE using PIO?
MD: unsure.
DS: compiling with CIME.
MO: should be PIO then.

MD: can use current CESM-based OM3 now for coupling with UM.
AH: could start using spack to help with build.
MO: there’s a lot of logic in the cmake that you don’t want in spack - cross-dependencies between components - compilation needs to be in a particular order - can’t just compile all components separately and then link. eg driver needs to come last. And there are a couple of patches.
MO: easy to change cmake to compile all components or a subset as library without driver.

Payu Updates

  • New topic for payu updates: Payu updates at NCI - #2 by Aidan
  • “module use” implemented
  • Jo working on auto-archiving outputs from scratch by payu - should replace sync scripts
  • date-based restart pruning, following tidy_restarts - can specify pandas-style time frequency
  • issues to resolve re. collation and sync of final restarts
  • future plans: embedding and tracking uuids for reproducibility / provenance
  • facilitate multiple experiments per control directory: automatically create run branches based on uuid, also name work and archive directories with uuid to work around limited name-space issues
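
The idea behind date-based restart pruning can be sketched as follows. This is not payu's implementation: it is a simplified stand-in that uses a fixed `timedelta` spacing rather than a pandas-style frequency string, and the function name and arguments are hypothetical.

```python
from datetime import date, timedelta


def restarts_to_prune(restart_dates, min_spacing, keep_latest=1):
    """Return the restart dates that could be deleted.

    A restart is kept if it is the first one, or if it is at least
    min_spacing after the previously kept one. The most recent
    keep_latest restarts are always kept so the run can continue.
    """
    ordered = sorted(restart_dates)
    protected = set(ordered[-keep_latest:]) if keep_latest > 0 else set()
    last_kept = None
    prune = []
    for restart in ordered:
        if last_kept is None or restart - last_kept >= min_spacing:
            last_kept = restart
        elif restart not in protected:
            prune.append(restart)
    return prune


# Hypothetical monthly restarts covering 18 months
monthly = [date(1900 + i // 12, i % 12 + 1, 1) for i in range(18)]
pruned = restarts_to_prune(monthly, min_spacing=timedelta(days=365))
print(len(pruned), "of", len(monthly), "restarts could be pruned")
```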

MO: runlogs - should only be on for production runs, not development test runs
• we’ll force users to create their run fork (or at least branch), without needing payu
• forbid direct pushes to config branches - require PRs
• runlog off by default
• config main branch only has README explaining that a branch needs to be checked out and runlog activated
AH:
• this saves work for a few developers but adds work for many users
• recommends activating runlog by default, as users will forget
MO: can be implemented by turning runlog on in tagged commits, since users should not be using dev branches.
AH: this could be a CI check.
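
A CI check along the lines AH suggests might look like the sketch below. It assumes the setting appears as a top-level `runlog: true` or `runlog: false` line in config.yaml and deliberately uses naive line matching rather than a YAML parser, so it is an illustration of the idea rather than a robust check. The function name and the default are hypothetical.

```python
def runlog_enabled(config_text, default=False):
    """Report whether a top-level 'runlog' key is switched on.

    Only handles the simple scalar form 'runlog: <value>'. Indented
    (nested) keys and more complex YAML are ignored.
    """
    for raw_line in config_text.splitlines():
        # Drop trailing comments and whitespace
        line = raw_line.split("#", 1)[0].rstrip()
        if line.startswith("runlog:"):
            value = line.split(":", 1)[1].strip().lower()
            return value in ("true", "yes", "on")
    return default


# Hypothetical config contents for a tagged commit and a dev branch
tagged_config = "queue: normal\nrunlog: true\n"
dev_config = "queue: normal\nrunlog: false  # development only\n"

print(runlog_enabled(tagged_config), runlog_enabled(dev_config))
```

In CI, a tagged commit would fail the check whenever `runlog_enabled` returns False for its config.yaml.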
