COSIMA TWG Meeting Minutes 2024

Summary from TWG meeting today - please correct and elaborate as needed.

Date: 2024-01-17

Attendees:

MO

  • profiling: runs hanging when CICE has >76 cores, apparently when reading from the mesh, even when IO is serial. Unlikely to use this many cores at 1 deg so capped profiling to max 76 CICE cores.
  • will probably want to run MOM & CICE concurrently rather than sequentially, since MOM6 scales better than CICE6, and therefore give more cores to MOM6 than CICE6 - as was done in ACCESS-OM2.
  • expect to have scaling plots to show next week
  • AH: whether to run atm sequentially or concurrently depends on relative resolution of ocean and atm

EK: WW3

  • suggestions from Stefan (BOM) - eg turn on ice scattering and dissipation in wave model
  • have 5 dissipation models and 2 scattering models - trying these out
  • implications on what fields are passed from CICE → WW3
  • but floe size not active in CICE6 yet - tried turning on - requires passing wave spectrum to CICE - this is happening but unsure what spectral types to transfer
  • SO: will chat to @NoahDay & @lgbennetts about this; also there was a relevant presentation from Cecilia Bitz at MOSSI2022 [but not on the MOSSI youtube channel] re. dissipation option 4; definitely want to put spectrum into CICE6 but unclear whether full spectrum is needed
  • EK: has confirmed the full spectrum (25 components) is getting through to CICE
  • SO: will confirm if @NoahDay used floe size bins the same as current CICE defaults
  • EK: for testing has set categories to 12 based on Noah’s experience
  • enabled floe size distribution output
  • SO: not all WW3 options require FSD, some can work with default option (all floes the same notional size), but CICE would not interact with waves without FSD
  • EK can also read in a netcdf FSD file
  • SO: can generate internal spectrum within CICE
  • Stefan also suggested disabling interpolation in time as not suitable with coupling; also not sure about Langmuir turbulence and non-breaking-wave induced mixing; currently turned on; also 11 parameters in MOM6 for that, following Shuo Li’s config
  • AH: non-breaking wave mixing is controversial - unclear whether to use this
  • SO: would also need to look into how it’s hooked into turbulence closure in MOM - check what the US teams are doing - will find contacts to follow up

HJ:

  • Spack v0.21 failed to compile parallelio with the netcdf version we are using - is it ok to pin to the old working version or should we use a newer one?
  • MO: if versioning is truly semantic it should be fine
  • AS: CICE6 says 2.5 & 2.6 are supported so 2.5.10 should be fine for CICE6
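
As a rough illustration of the compatibility reasoning above, the sketch below checks whether a version string falls within a supported major.minor series. The 2.5 and 2.6 series and the 2.5.10 version come from the notes above; the function names and the SUPPORTED_SERIES constant are hypothetical and not part of Spack, parallelio or CICE6.

```python
# Illustrative only: is a given version within a supported major.minor series?
# SUPPORTED_SERIES, parse_version() and is_supported() are hypothetical names.

SUPPORTED_SERIES = {(2, 5), (2, 6)}  # series noted above as supported by CICE6


def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'major.minor.patch' into integers (no pre-release handling)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_supported(version: str) -> bool:
    """Under semantic versioning, patch releases within a series stay compatible."""
    major, minor, _patch = parse_version(version)
    return (major, minor) in SUPPORTED_SERIES


if __name__ == "__main__":
    for candidate in ("2.5.2", "2.5.10", "2.7.0"):
        print(candidate, is_supported(candidate))
```

This only holds if the package really follows semantic versioning, which is the caveat MO raised.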

AS:

  • openmpi symlinks bug raised last year - a patch now done but 4.1.7 won’t be released until late Q1 2024
  • patch in mom6-cice6 config to turn on netcdf4 and parallel read/write - seems to improve performance; will be more significant with more cores
  • when updated can also turn on parallelio for other components
  • working with CICE people to improve netcdf error handling
  • looking into wave-ice interaction with EK
  • merging updates into ww3 configs
  • might be a bug in ww3 history output
  • MO: how long will it take NCI to install updated openmpi
  • AS: will request from NCI helpdesk when released

DS:

  • added coupling diagnostics - revealed unexpected things but sorted out - just misinterpretations
  • coupling side of wombat ready to have somebody look at - will also ask NCAR people to look at it
  • started porting wombat to generic tracers
  • going on 3wks leave at start of Feb
  • AH: CESM MOM6 workshop coming up - won’t be sending anyone in person but would be good to have a presentation to update people
  • AK: overlaps with AMOS - presentations could be similar - I’ll share slides and welcome any input
  • porting wombat to generic tracers - need to decide how to handle wombat versioning - eg @pearseb has updated to have ~25 tracers - will initially port the old version before updating - the old version system used include files but this isn’t working anymore - versioning a bit confusing
  • proposal: 1st port old wombat to generic tracers, tag it, then update to Pearse’s version
  • merging versions could be difficult - need to arrange a meeting
  • want to share with NCAR before going on leave

CMIP timeline

  • AH: will be a meeting at AMOS (lunchtime Tues) re. fast track v2 CMIP which might shed light on timelines
  • AH: running candidate models ready for testing ideally by end of June 2024 but maybe as late as Dec
  • AS: need to clarify what we are aiming at with OM3
  • DS: high-level planning meeting sometime soon would be useful
  • AH: broader meeting than TWG - include a few community folks
  • AH: would be good to have an overview of what needs doing and a straw-man timeline
  • MO, AS: good to have a task list and timeline eg gantt chart eg on github — see this topic to discuss

MO: creation of 1 or 2 repos for helper scripts/utils

  • already have one (om3utils) but want it to be a tested, documented python package
  • also need a place to dump scripts eg for reproducibility
  • propose having both, and moving things between them as needed
  • DS: can become unclear where to find things unless scope of each repo is clearly defined
  • MO: want package to be useful and reusable
  • DS: don’t all scripts already meet that?
  • MO: maybe not
  • DS: a dump repo would still need structure (subdirs), docs (READMEs) and code review, as the scripts need to be working, and should link the git commit in the metadata of the output
  • MO: agreed - call it om3-scripts
  • DS: include environment.yaml to record the conda env
  • DS: at some point we may want to bundle scripts into a python package
  • MO: will move om3-utils repo to COSIMA org - should be reviewed
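
A minimal sketch of the "link git commit in metadata of output" idea, assuming the script lives in a git checkout. The helper names, the attribute names and the example script name are hypothetical; how the attributes get attached to a given output file (e.g. as netCDF global attributes) is left out.

```python
# Illustrative only: record the current git commit alongside a script's output.
# current_commit() and build_metadata() are hypothetical helpers.
import subprocess
from datetime import datetime, timezone


def current_commit(repo_dir: str = ".") -> str:
    """Return the HEAD commit hash of repo_dir, or 'unknown' if unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a git repository
        return "unknown"
    return result.stdout.strip()


def build_metadata(script_name: str, repo_dir: str = ".") -> dict[str, str]:
    """Collect provenance attributes to attach to an output file."""
    return {
        "created_by": script_name,
        "git_commit": current_commit(repo_dir),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # "make_grid.py" is a made-up script name
    print(build_metadata("make_grid.py"))
```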

Next meeting
1-2pm Wed 31 Jan to avoid AMOS

Summary from TWG - please correct and elaborate as needed.

Date: 2024-01-31

Attendees:

The meeting was entirely dedicated to discussing the ACCESS-OM3 development
timeline and OM3 priorities for contributing to ACCESS-CM3 and ACCESS-ESM3 in
time for CMIP7.

Current plans for OM3 configurations

AK shared some slides related to model development workflow and planned OM3
configurations:

AH:

  • OM3 is currently in the “Model Timesteps” stage (see slides above).
  • OM3 1deg configuration at the “Preliminary Optimization” stage.
  • CM3 is not there yet and some work remains to be done.

MD: OM3 configurations are still based on the corresponding OM2 ones. Need to
change the grid type to take advantage of C-grid features.

AH:

  • There are several planned OM3 configurations, plus some configurations that might or might not be developed.
  • A MOM6-CICE6 configuration will be used for CM3
  • A MOM6-CICE6-WOMBAT configuration will be used for ESM3
  • Configuration for CMIP will be 0.25deg MOM6-CICE6-WOMBAT

Scientific options and configuration development

AM: Will any of these configurations include ice-shelf cavities? Maybe a new
line is needed in the table?

SO: Ice-shelf cavities require high resolution. 0.25deg might not be enough.

MD: What scientific options do we want to use for CMIP?

AS: C-grid for sure. Too early to add landfast ice.

AH: Need to distinguish:

  • scientific options for configurations that we want to facilitate that are of interest for the research community (e.g. land-fast ice, waves)
  • scientific options to explore for CMIP (e.g. c-grid)

AH: Regarding MOM6, we might explore different vertical coordinates (isopycnal vs Z*)
AM:

  • We could do like for the MOM6-SIS2 global configuration: run with different coordinates and compare results.
  • We could also compare with NCAR/GFDL configurations. Good to do during the optimization step of the configuration development.

AH: At which phase to do that?
SO: At preliminary optimization/evaluation.

AM: At which resolution? Is 0.25deg worth doing? Scientific community is more interested in 0.1deg.
AH:

  • The question is rather in which order we do the work.
  • We can probably reuse a lot of the work done for 0.25deg for the 0.1deg configurations.

DS: There are lots of options to test for WOMBAT.

AS: How quickly do we want to have the more experimental ice-related options?
AM: We want them for the 0.1deg configuration for the scientific community.

MD: It’s probably not worth putting too much effort into the 1deg configurations.
SI: Keep it as a fast option, but not for CMIP 7.
AK: 1deg could be very useful for tests and continuous integration.

OM3 CMIP 7 Timeline
AH: Timeline? What to prioritize: 0.25deg MOM6-CICE6 → 0.25deg MOM6-CICE6-WOMBAT

AK: Cheap configuration needed for testing when updating codebase.

DS: What work is required to go from one resolution to another?

AH/AK: For OM2, most work went into updating the topography. Then remapping
weights for OASIS exchange grids and tuning.

Consensus: 1deg MOM6-CICE6 → 0.25deg MOM6-CICE6 → 0.25deg MOM6-CICE6-WOMBAT

AH: What priority for WW3 configurations?
AK: only useful for scientific community interested in waves. We can probably keep these configurations in sync with the others.
AH: Need to ask community about interest.

AH: Proposal to have CMIP7 configurations ready for full evaluation by
mid-year.

AH: Will WOMBAT be okay with this?
AM: Will need WOMBAT by mid-year or later?
MD/AH: Might come a bit later.

DS: Do we need to update the grids and topography?
AH: Yes.
MO: We have the tools and the workflow. Now just need to do it.

AK: Still missing C-grid in CICE.
AS: This looks doable
AK: There are some known drawbacks of using C-grid. Need to be aware of it.
AS: C-grid in CICE is, as a feature, considered finished.
AK: But not all features are available for C-grid.
AK: Issue with mediator/coupler as it uses A-grids internally
KR: With CMEPS, all fields need to be on the same grid.

Task assignment
Minghang: 0.25deg configuration
Micael: topography and grids, scaling and performance optimization
Anton: CICE
Dougie: WOMBAT
Ezhil: ?

Project management
AK: We will set up a project dedicated to CMIP7 on the COSIMA Github
organization. All members should try to update existing issues and add missing
issues.

Next meeting
Back to usual schedule: 11am Wed 14 Feb.

COSIMA TWG

Date: 2024-02-14

Attendees: Andrew Kiss, Anton Steketee, Micael Oliveira, Minghang Li, Martin Dix, Harshula, Angus, Ezhil Kannadasan, Aidan (Apologies Dougie, AH)

0.25 Degree Config:


AK has started issue #101 in the ACCESS-OM3 repository: develop MOM6-CICE6 025deg_jra55do_ryf based around the MOM6-CICE6 1 degree configuration 1deg_jra55do_ryf and the ACCESS-OM2 0.25° configuration.

Project Board:


AS: We could add analysis notebooks + some regular runs on 1 deg configuration

AK: For OM2 we had a figures directory in the ACCESS-OM2 report repo with the metrics of interest, aimed to be a living approach but one that is mutually acceptable. Probably need to start a new repo for that. We could also include performance metrics.

  • One notebook for each figure / very simple metrics can work well.

  • Start with manually generated intake catalogue - for om3 analysis.

Aidan: This could be an application for the MED Team (Mike Tetley’s) live diagnostics tool or jupyter intake scripts. We could add a hook from payu to generate an intake catalogue + load this into the live diagnostics. Initial approach: use Intake, manually generate intake catalog; then use payu auto-catalog when available

  • Action for Anton : Follow up with Romain + Mike + Aidan.

Processor Layout:


MO: Writing scripts / tools to analyse parallel performance (currently in a branch of the om3-utils git repo), using the trace generated by ESMF, which includes everything that goes through the driver & mediator. The profiling separates the timing into code ‘regions’ (e.g. for coupling, timesteps, etc).

Adding cores mostly gives very poor efficiency. MO proposes shifting to running ice+atm+ocean simultaneously, with just enough cores that ice + atm run faster than ocean.
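
The reasoning behind that proposal can be sketched as follows: when components run sequentially their times add, whereas when they run concurrently the slowest component sets the pace, so the other components only need enough cores to keep up with the ocean. The timings below are invented for illustration (not measurements), only the ice component is shown, and all names are hypothetical.

```python
# Illustrative only: timings are made-up numbers, not measurements.
# Hypothetical seconds per model day for each component at a given core count.
OCEAN_TIME = {48: 30.0, 96: 18.0, 192: 12.0}
ICE_TIME = {4: 25.0, 8: 14.0, 16: 9.0, 32: 7.0}


def sequential_time(ocean_s: float, ice_s: float) -> float:
    """Components run one after another on the same cores: times add."""
    return ocean_s + ice_s


def concurrent_time(ocean_s: float, ice_s: float) -> float:
    """Components run side by side on separate cores: the slowest sets the pace."""
    return max(ocean_s, ice_s)


def min_ice_cores(ocean_cores: int) -> int:
    """Smallest tabulated ice core count whose runtime does not exceed the ocean's."""
    ocean_s = OCEAN_TIME[ocean_cores]
    for cores in sorted(ICE_TIME):
        if ICE_TIME[cores] <= ocean_s:
            return cores
    raise ValueError("no tabulated ice core count keeps up with the ocean")


if __name__ == "__main__":
    for ocean_cores in sorted(OCEAN_TIME):
        ice_cores = min_ice_cores(ocean_cores)
        total = concurrent_time(OCEAN_TIME[ocean_cores], ICE_TIME[ice_cores])
        print(ocean_cores, ice_cores, total)
```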

MOM6 is approx 5x slower than in access-om2 per model year. This is with a different number of time steps and MOM6 is fairly different compared to MOM5. MOM6 is / would be dominating the timing if components run simultaneously, so looking at the internal timing within MOM could be the next step. Otherwise we could go straight to a 0.25 degree config, rather than work too much on optimising at 1 degree.
MD asked whether it was related to nuopc, but we don’t think so. ESMF profiling specifically isolates MOM6 timestepping as the culprit.

Angus suggested investigating compiler flags, or more comprehensive profiling in MOM.

Aidan suggested investigating the IO but MO has checked that the profiling IO is mostly outside the MOM code.

Spack:


Harshula:

  • Transitioning to new spack versions (v0.20 to v0.21).
  • For OM2: Downgrading to PIO 2.5.2 from 2.5.10 (forum post). Plus removing nci-openmpi pseudo package, and using the openmpi system build directly. At this point this only impacts the ACCESS-NRI OM2 build. The ACCESS-NRI & COSIMA build give identical data outputs.

Repro-CI


Aidan: Adding repro-CI for access-om2 to test compilation and bitwise reproducibility for output / results from OM2 runs. This will allow comparisons between versions + code changes etc and is something we will need to investigate for OM3.

Waves


Andrew Kiss - spoke to Alberto Meucci from Uni Melb at AMOS who is interested in our wave parameter choices and output. Meeting to be organised to gather input from wave modellers on our WW3 parameter choices.

Next Meeting: March 6

Summary from TWG meeting today. I definitely missed some of the details of the Spack discussion. Please correct and elaborate as needed.

Date: 2024-03-06

Attendees:

1deg MOM6-CICE6 scaling

MO shared some results

  • MOM6 scaling plot:
    • Time taken
      • MOM total runtime includes time waiting for other components. Other components’ times don’t include this.
      • MOM6 takes 80-90% of total runtime
    • Parallel efficiency
      • Going to more than one core drops efficiency (more comms, regions of serial code). Possibly drops more than we would like
      • Region with worst efficiency is ocean surface forcing - don’t remember seeing anything like this for pan-antarctic configs. AK: probably not IO. AH: There were OM2 issues with chunking that NH had to fix. AK: that would be outside of the issue region.
    • Fraction of time spent in different regions
      • Surface forcing takes more and more time with ncpus.
      • AHogg: something happening beyond a single node. MO: need to investigate
      • DS: Does this include ocean surface forcing stuff in MOM NUOPC cap. MO: No, think it’s the stuff in MOM
  • Varied number of cores assigned to each component, keeping total number of cores the same
    • The ocean benefits the most from having more cores
    • Conclusion: give lots of cores to ocean and only a few to everything else
  • OCN-MED exchange
    • Giving more cores to ocean makes OCN-to-MED faster, MED-to-OCN slower
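
For reference, the speedup and parallel efficiency shown in scaling plots like these are usually computed relative to a reference run, as sketched below. The runtimes are invented for illustration and are not MO's results; the function and variable names are hypothetical.

```python
# Illustrative only: the runtimes below are invented, not measured results.

def speedup(t_ref: float, t_n: float) -> float:
    """How many times faster the run is than the reference run."""
    return t_ref / t_n


def parallel_efficiency(n_ref: int, t_ref: float, n: int, t_n: float) -> float:
    """Efficiency relative to a reference run: (t_ref * n_ref) / (t_n * n).

    1.0 means perfect scaling; lower values mean extra cores are partly wasted
    (e.g. on communication or serial regions).
    """
    return (t_ref * n_ref) / (t_n * n)


# Hypothetical wall-clock seconds for the same run at different core counts.
RUNTIMES = {1: 960.0, 2: 500.0, 4: 270.0, 8: 160.0}

if __name__ == "__main__":
    n_ref = min(RUNTIMES)
    t_ref = RUNTIMES[n_ref]
    for n, t_n in sorted(RUNTIMES.items()):
        eff = parallel_efficiency(n_ref, t_ref, n, t_n)
        print(n, round(speedup(t_ref, t_n), 2), round(eff, 2))
```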

MD: How does runtime compare when running in parallel on a single node, relative to running components sequentially? MO: roughly the same

MO: will put issue on GitHub with summary and plots

AHogg: Nice framework for doing this for other models (e.g. CM3). MO: ESMF level profiling (e.g. timing on NUOPC phases) will be available for all models, but degree of profiling within a component depends on what’s implemented in that component.

AHogg: Can we do some long runs? MO: yes, things are still cheap even if they’re inefficient. Increasing the timestep is an obvious low hanging fruit for getting things more efficient - currently timestep is shorter than OM2. AHogg: We have compute. Let’s get some longer runs underway.

OM3 releases and component updates

MO: 6 months since last component update. Lots of new stuff. Worth doing another update? Main issue is that configurations will require changes.

DS: only going to get more disruptive, so yes.

MO: do we want to update to latest CESM version or latest version of components?

AK: last time I checked, CESM wasn’t up to latest CICE that includes C grid.

AS: my parallel IO and date bug work will only be in latest CICE.

DS: let’s open an issue to keep track of what versions are being used and keep track of the process?

MO: yes, let’s do this any time we want to update components.

AH: So there’s a requirement that everyone is working from the same versions.

MO: No, there’s a process for tagging versions, but developers can choose their own versions and build themselves. For next update of components, I will do a release - suggest that in parallel ACCESS-NRI release team goes through the process themselves and see how things work out.

AH: Sure, but we have a requirement for an ACCESS-OM3 spack package.

MO: That exists.

AH: What about dependencies?

MO: external dependencies (e.g. ESMF, FMS, PIO) are taken from spack packages, model components are pulled as git sub-modules.

AS: Harshula would prefer that individual model components are built with individual spack packages.

MO: We will never be able to say that OM3 is just a list of spack dependencies because it’s a single exe and we apply patches to individual components at build time.

AHogg: this will also be the case for ESM3, CM3 etc. So we need a process for this.

AH: the sooner we start the better. Probably can compile to a single executable with current design. What’s important is that all dependencies are handled by spack so that we can easily switch between versions. Now’s the time for OM3 dev team and release team to start working out how releases will work. A user story from OM3 developers would be helpful to the release team to help them improve workflows etc.

MO: we are currently using the ACCESS-OM3 spack package to build OM3.

Where to open issues and replicating updates across configs

DS: Would be good to have guidelines around where to open OM3 issues. Currently have issues spread across configuration repos and the access-om3 repo.

AK: I’ve been putting things on access-om3.

DS: Shall we only open issues in config repo if the issue is only relevant to that config, everything else in the access-om3 repo?

MO: Preference to open issue where you plan to open PR.

TG: Could you use a platform (e.g. Zenhub) to group issues?

AS: Let’s try raising everything in access-om3 repo and see how that goes.

DS: it would be good to set up a workflow on GitHub to automate cherry-picking commits across configs.

All: Agreed.

AS: we should have reproducibility CI first.

Reproducibility CI

DS: we need reproducibility CI for ACCESS-OM3 configs urgently. We’ve already accidentally merged a few config-breaking PRs. The ACCESS-NRI release team have set up infrastructure for this. Let’s use it.

AH: big issue is where to run - currently ssh into Gadi. Need to store ssh secrets etc in the repo. Anywhere you do that might need to be pretty well locked down (limited who has access). Or schedule tests from a fork on ACCESS-NRI org?

AS: Is it not possible to set up a GitHub runner to avoid needing to ssh?

TG: We looked at that. There are a few security holes in that approach.

AH: Will probably do that in the future, but it doesn’t really solve the need to have privileged access.

AH: There are also complications around using the CI Gadi account - admins don’t like attaching multiple projects. That’s why we have everything in vk83.

AH: We only run checks when we open a PR from a dev branch to a release branch. Would it suit your use case to schedule tests?

MO: Ideally not. We’d like to run with every PR. Should be possible on GitHub. Should also set up short tests that run on GitHub runners, e.g. Payu setup and checks. Would have to mock filesystem or something to get things to work.

AH: I’ve gone with a different layout of inputs for ACCESS-OM2. Might make it difficult to have a seamless transition from COSIMA to ACCESS-NRI.

DS: DS and AS to meet and chat about ACCESS-OM3 repro CI then reach out to release team.

AK: Note that the release team has found that OM2 doesn’t reproduce across restarts. There’s a whole range of what we mean by “reproduce” - need a whole suite of tests

Documentation

AK: Putting together something as a discussion on ACCESS-OM3. Need a coordinated way to document what we’ve done and how people can use it. Also need versioning and need to keep documentation synchronised.

AH: Heads up: working through the versioning currently - think we have a model for that which will allow us to update old versions.

MO: Re documentation sync: standard approach is to put documentation source in repo and use Sphinx or MkDocs to deploy to GitHub Pages or Read the Docs. Questions: do we keep everything in om3 repo - do we also have config-specific documentation?

Next meeting

Next meeting date may be changed as a few people are away. Will update time in announce topic.

TWG summary from last week - a bit scrappy and incomplete so please fill in anything I missed.

Date: 2024-03-21

Attendees:

Performance scaling

ML - working on Micael’s performance tools, trying to reproduce results - 3 issues

  1. cesm driver fails to transfer some settings correctly to esmf - inconsistent with esmf docs - has worked out a workaround
  2. env vars in env section of config.yaml
  3. can’t run with >64 cores - hangs - cice problem?

suggests documenting these issues

MO: 1 a known issue - runconfig profiling settings ignored - need to use env vars

MO, AS: 3. a known problem in cice - can’t use >76 cice cores. Not a hard limit - due to a parameter setting - to do with not using roundrobin; not relevant since cice doesn’t scale to that core count anyway

MO: surface forcing the culprit for bad MOM6 scaling - specific to NUOPC cap; not seen in panan etc which use FMS - now trying to identify in more detail - a lot of load imbalance - worse if launching many jobs at once - an IO issue?? but nothing obviously IO related in code region. mom_surface_forcing file. Adding extra profiling regions. Trial and error.

DS: cap converts ESMF fields to MOM fields

AS: is it reading salt restoring?

DS: looks like it - there are some salinity restoring io calls (see time_interp_external)

Model evaluation

AK: ENKF-C may be worth looking at for model-obs comparison

DS: Clothilde was planning to try this for eReefs - see how they went with it

Input directory structure

DS: issue with moving all inputs to vk83 for repro CI - how to structure it? Poll - vote! issue Move inputs to `vk83` · Issue #115 · COSIMA/access-om3 · GitHub

  • option A: version at top level
  • option B: version at innermost level

explicit full path specification for all individual input files in config.yaml
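
To make the two options concrete, the sketch below builds the path to one input file under each layout. The root directory, version label, component and file names are all hypothetical placeholders, and the real layouts may have more directory levels than shown here.

```python
# Illustrative only: ROOT, the version label and the file names are placeholders.
from pathlib import PurePosixPath

ROOT = PurePosixPath("/path/to/inputs")  # hypothetical root directory


def option_a(version: str, component: str, filename: str) -> PurePosixPath:
    """Option A: version at the top level, <root>/<version>/<component>/<file>."""
    return ROOT / version / component / filename


def option_b(version: str, component: str, filename: str) -> PurePosixPath:
    """Option B: version at the innermost level, <root>/<component>/<version>/<file>."""
    return ROOT / component / version / filename


if __name__ == "__main__":
    print(option_a("v1", "mom", "grid.nc"))
    print(option_b("v1", "mom", "grid.nc"))
```

With option A a single version label covers the whole input set, whereas with option B each group of files carries its own version; in both cases the full path written in config.yaml records which version was used.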

MO: sandbox 0.x.0 → 0.2.0 easier if versioned at top level but no strong pref

DS: linking version of input to version of exe - might be a pain if we want to do a lot of updates. But flipping could lead to a lot of versions that never really got used

AK: use symlinks?

AHeer: Kelsey says symlinks will burn you in the end. Flipped model (option B) is easier and clearer for users and doesn’t need symlinks. That’s what is being done and best for OM2 release

AH: let’s just go with flipped (option B) then, since no strong opinions

MO: sandbox could be useful for dev - some way to build test exes / configs to play with without doing a release - how we set things up for devs can be independent of how we do releases

AHeer: has to be on vk85 or tm70

AS: git-lfs ? each dev with their own fork?

AHeer: quickly run into file size limits with high res

DS: have to pay for lots of storage - not too expensive, maybe $5-10/mo for OM3 (without forcing files)

AHeer: try it out?

AH: happy to cover storage charges

AHeer: Or Tiledb - that does actual diffs on binary files (unlike git-lfs) - has a free version

AHeer: both manifests and git-lfs store hashes

DS: but git-lfs also stores revision history

AS: each file change doubles the storage for that file (doesn’t store deltas)

DS: could get very expensive - to investigate before deciding - actually probably unaffordable (see Slack)
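
A back-of-envelope sketch of why this adds up: if every revision of a file is kept as a full copy (no deltas), the storage for that file is its size times the number of revisions. The file names, sizes and revision counts below are invented, and no pricing is assumed.

```python
# Illustrative only: file names, sizes and revision counts are made up.

def lfs_storage_gb(file_size_gb: float, n_revisions: int) -> float:
    """Storage used if every revision is kept as a full copy (no deltas)."""
    return file_size_gb * n_revisions


def total_storage_gb(files: dict[str, tuple[float, int]]) -> float:
    """Sum storage over files given as name -> (size in GB, number of revisions)."""
    return sum(lfs_storage_gb(size, revs) for size, revs in files.values())


# Hypothetical input files: (size in GB, number of revisions stored)
EXAMPLE_FILES = {
    "topog.nc": (0.5, 4),
    "ocean_hgrid.nc": (2.0, 2),
    "restoring.nc": (1.0, 3),
}

if __name__ == "__main__":
    print(total_storage_gb(EXAMPLE_FILES), "GB")
```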

Namelist discussion: diabatic_first

DS: diabatic_first - we set it to true - do we mind if we set it to false (default), as it changes order of ops for generic tracers to be closer to MOM5-wombat - updates tracers in dynamic step

AH: don’t know why this is true

DS: setting comes from ncar - all our cosima mom6 configs and mom6-examples have it false

AH: will ask Marshall

AK: is it related to NUOPC cap?

DS: will check

Restart issue

EK: looking into restart file issue and looking at parameter and 0.25 restart - runs well except for restart

Next meeting

3 April, usual time

Summary of today’s TWG - I didn’t catch everything so please edit to add/correct as needed.

Date: 2024-04-03

Attendees:

Offline BGC

DS, AK: MOM6 offline tracers to be explored - potentially very useful capability for BGC dev, parameter tuning, spinup (esp. CMIP7) and science - see Offline tracer transport for BGC · Issue #123 · COSIMA/access-om3 · GitHub

ACCESS-OM3 component update

MO: updating model components, following CESM

  • updating spack env
  • CESM has own fork of FMS but not easy to mix so updated to latest stable FMS release from GFDL - seems to work - now need to compile FMS with special config options to activate the old API
    • DS: MOM abstracts the version (FMS1 vs 2) - is this FMS3?
    • MO: not sure
  • Should we build CESM with these updated components, now that we’ve diverged with our new configs?
    • AK: only do that if we need to run CESM configs for debugging one of our OM3 configs
    • AS: field dict has changed and would need updating

MOM6 version choice and MOM6 node

DS: are we still happy with tracking CESM? What about when we have our own MOM6 node?

  • AG: GFDL say we need to nominate somebody to sign off on PRs. Then up to us to set up test infrastructure to approve PRs.
  • AM: Would be good to do - should it be by AG or somebody at NRI?
  • AK: wait until we have adopted NRI’s test framework from OM2
  • AH: this is underway
  • AHogg: will our system meet requirements?
  • AG: no formal requirement - we just need to be happy with it
  • AH: using pytest - can be controlled via workflow dispatch, very flexible
  • MO: Need 1 example to run, and a way to run it, then can expand to other examples
  • AHogg: would be good to have AG as one of the approvers, but also to have NRI. Get Tommy’s testing/deployment running, then tell GFDL we’re ready.
  • AS: once test infrastructure is established we can incrementally add tests to suit what we care about
  • MO: CESM is currently using nearly the latest MOM6 so currently no big motivation to use GFDL - but might not always be the case
  • DS: will there ever be things in the NCAR CESM fork we need that are not in the main?
  • AH: are MOM6 nodes obliged to run MOM6-main?
  • AG: not necessarily - GFDL use a much newer dev branch but there are periodic PRs to main for everyone to approve

Profiling & benchmarking

MO:

  • one region of MOM6 code (surface forcing) was not scaling with more cores - narrowed down to reproducible sum
  • but config set CICE6 max_blocks to a very large number - allocates a lot of memory - huge CICE6 mem footprint - then affected MOM6 apparently because CICE mem too big for cache to hold both MOM6 and CICE6 data
  • resolved by a more reasonable max_blocks: parallel scaling improved, but still not great
  • profiling paused for now as MOM6 now has a new feature to mask land tiles automatically at runtime to match the number of cores - want to use this for profiling. MOM6 land proc mask not relevant to CICE which uses a very different approach. Newer CICE6 can also automatically determine max_blocks.
  • AH: NetCDF chunk size in output is auto-determined - mppnccombine-fast assumes the same proc land mask for all files
  • MO: but auto-masking sets io cores (io_layout) to 1 - not what we want for production but useful for profiling, and we can read in previously auto-generated proc land mask in production configs
  • AH: have we looked into parallel io for MOM6?
  • MO: not sure, and might not be performant to gather to one core and then redistribute for parallel io
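
Some background on the "reproducible sum" mentioned above: a plain floating-point sum depends on the order of the additions, so a global sum can change when the domain is split across a different number of cores. One general way to avoid this is to convert values to integers at a fixed resolution, add those exactly, and convert back. The sketch below is a toy version of that idea under stated assumptions; it is not MOM6's implementation, and SCALE and the function names are hypothetical.

```python
# Illustrative only: a toy order-independent sum, not MOM6's implementation.

SCALE = 2**20  # hypothetical fixed-point resolution (roughly 1e-6)


def naive_sum(values: list[float]) -> float:
    """Plain left-to-right floating-point sum; the result depends on order."""
    total = 0.0
    for v in values:
        total += v
    return total


def fixed_point_sum(values: list[float]) -> float:
    """Convert each value to a scaled integer, add exactly, convert back."""
    total = sum(round(v * SCALE) for v in values)  # Python ints do not overflow
    return total / SCALE


if __name__ == "__main__":
    a = [1e16, 1.0, -1e16, 1.0]
    b = [1.0, 1.0, 1e16, -1e16]  # same numbers, different order
    print(naive_sum(a), naive_sum(b))
    print(fixed_point_sum(a), fixed_point_sum(b))
```

The exact integer arithmetic is what makes the result independent of ordering; the price is extra work per value, and in a parallel model a global reduction across ranks.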

Documentation

  • MO, AK: 2 main options (1 is AK’s preference) - see Documentation · COSIMA/access-om3 · Discussion #120 · GitHub
  • AH: can it be defined in a data structure? say, doc.yaml in each branch - also makes it easier to systematically extract data
  • DS: or no specificity at all - just have one doc with a common section followed by a section for each config which is free-form text extracted from each repo branch
  • AH: maintainability a problem if free-form, and unclear to doc writer what is needed
  • AH: use submodules?
  • MO: not simple to use sphinx or mkdocs with submodules
  • AHogg: try something and see how it works - how it looks to the user and how much work to update continuously
  • AHogg: and will this scale to the other ACCESS models? eg ACCESS-CM3 and ACCESS-ESM3
  • MD: no discussion of documentation for CM/ESM yet - would likely follow OM3’s lead
  • AH: may be fewer configs in climate models since they don’t have choice of forcing?
  • AHogg: but in future there will be multiple resolutions
  • MD: ESM will have a lot more configs than CM

Licensing

AH: what licence to release OM2 under? Software licence · Issue #264 · COSIMA/access-om2 · GitHub

  • AH: MOM5 is GPL3
  • MO: so no choice - code must be distributed under terms of GPL3. So need to check that all component licences are compatible - and they are.
  • AH: what about configs?
  • MO: doesn’t matter - these are input, not code - might have IP but not something we need to deal with. Are we distributing code? Licences are about distribution, not the use it is put to.
  • AH: were there custom licences that weren’t being adhered to? CICE? OASIS?
  • AS: does new CICE licence not matter since it didn’t have one back when we forked the code? Licence was added about 2017ish.
  • MO: there’s an issue about this - Not complying with licensing · Issue #67 · COSIMA/cice5 · GitHub
  • AH: so just need a GPL-compatible licence for OM2?
  • MO: might not strictly need a license for OM2 but it’s easy to add one
  • AH: yes clearer just to have one
  • DS: there is a little code in the configs, eg shell scripts
  • AH: so use Apache? CMIP - data has different licence (eg cc-by) from the code

Next meeting

17 April, usual time

Summary of today’s TWG - I didn’t catch everything so please edit to add/correct as needed.

Date: 2024-05-01

Attendees:

New Meeting Organiser

AS will take on organising / hosting meetings. MO has moved to a different role at ACCESS-NRI

CM3 - Update

KR has run the prototype CM3 for two model years - with CICE6, MOM6 coupled to the UM. There is an energy balance issue showing as SST growth / warm bias.

Appears to be heading towards similar performance to CM2, but will need work to optimise this. It’s using 576 cores, but running sequentially, rather than the end goal of running simultaneously. Parallel efficiency is ~60% above 96 cores. (In CM2, UM runs on 576 cores, MOM on 80 cores and CICE on ~16 cores. Looks like we will end up similar.)

CM3 prototype is based on January OM3 build but will update soon.

Sea-ice cycle looks surprisingly good for this stage of development.

Complete CM3 configuration is a Rose/Cylc suite; MD to organise trying to move it to a private GitHub repo.

OM3 - 1 degree

Micael is hoping to look at the proportion of cores between components, through automated scripting in OM3-utils. So far he has investigated increasing core counts and how this impacts run time.

Scaling information on OM2:

ACCESS-NRI needs to provide some scaling information to users who are submitting grant applications / doing runs etc.

Initially we just need number of cores, how many SUs to do a run, and a reference to the ACCESS-OM2 paper. AK to provide details used in previous NCMAS applications for a first pass at this information.

ESM1.6:

There are plans forming for updating ESM1.5 with newer model versions to create an ESM1.6. ESM1.6 is a fallback option for CMIP7 Fast Track, since it looks like ESM3 development won’t be sufficiently progressed to meet that timeline. The main interest is updated WOMBAT, CABLE3 and possibly MOM5. However, WOMBAT development is being moved to the generic tracer framework. In theory, this can be used with ESM1.5/6, but some additional work will be required to get things working with the OASIS-MCT coupler. Is that high priority, or should the focus be ESM3?

DS to book a meeting with CSIRO stakeholders + wombat developers + AH.

CICE

AS has identified that the area fields used within CICE are inconsistent with the area fields used by MOM + NUOPC. They are calculated assuming square grid cells, which is not accurate for a round globe and especially problematic for the tripole. In the AUSCOM build of CICE5, code was added to read these areas from the grid file; however, in CICE6 many extra fields have been added to support the C-grid, meaning a new grid file could become bloated. AS to talk to CICE-consortium/NCAR/NOAA about whether they would prefer loading this information from the MOM “supergrid” or from an updated CICE grid file, before implementing one option as a code change.

025 Degree profiling

  • Minghang is progressing with the 0.25 deg work. Minghang will have a go at making plots that are similar to the ones that Micael made for 1 degree.
  • Has trouble running latest OM3 build with <240 cores.
  • MOM init time is slow, needs investigating because it is impractically slow (10-20mins).
  • Performance about 10% better with land masking on.
  • Best performance is currently at 192 cores; need to investigate why it drops off so much at higher core counts.

DS to book a meeting with MO, ML + anyone else interested to try to clarify and scope the best steps to profile efficiently and finalise core counts.

Next meeting

15 May, 11:00AM AEST

Summary of today’s TWG - I didn’t catch everything so please edit to add/correct as needed.

Date: 2024-05-15

Attendees:

Dougie Squire (DS), Andrew Kiss (AK), Martin Dix (MD), Ezhil Kannadasan (EK), Micael Oliveira (MO), Anton Steketee (AS), Andy Hogg (AH), Minghang Li (ML), Siobhan O’Farrell (SO)

AK updates

  • @kial has taken a close look at how water fluxes are managed in ACCESS-OM2 - very difficult to get water budget to close (in annual average) but it’s possible with enough care
  • AK has written up a document outlining how water balance works - will share on the forum
  • Need to do a much more careful job in ACCESS-OM3. Need detailed documentation and demonstrations on how to close budgets
  • AS: Will Hobbs and colleagues have been looking at freshwater fluxes out of sea-ice - there is confusion and apparent errors. SO: not quite true - they’ve been using the wrong variables. But there is one known error in CM2.
  • AK: does CM2 do any freshwater flux balancing? MD: no we don’t do anything to try and correct. SO: Initial checks showed things balanced okay. That may not be true for all runs
  • CMIP7 meeting this Friday. Let Andrew know if there are any updates

DS updates

  • ACCESS-OM3 inputs have been moved to vk83 and configs updated
  • We’ve decided to prioritise allowing generic tracers in ACCESS-OM2 and ESM1.5 since we’d like to use the generic WOMBAT code in ESM1.6
  • Turning on generic tracers in ACCESS-OM3 has no performance impact when no generic tracers are configured. So only need one exe for both non-bgc and bgc runs.
  • Should have repro CI on our configs soon. Once set up, we will make official request to become MOM node. Create fork in ACCESS-NRI org?
  • Hakaseh has confirmed issue with ice-to-ocean algae and nitrate fluxes in WOMBAT in ACCESS-OM2. Simple fix that Hakaseh has offered to implement.

ML updates

  • Showed 025 scaling plots that show that (ocean nodes) / (ice nodes) = 9 is a good choice. Will extend to larger core counts
  • MO: Should decide which partition to use - consider max core count and charge. Probably want to do this sooner rather than later
  • MO: Really should rebuild executable for different partitions. As simple as reconcretizing and building on a node in the target partition
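As a rough illustration of applying the ocean/ice node ratio above when choosing a layout, here is a minimal sketch. The function name and the rounding rule are assumptions made for illustration, not part of any OM3 tooling.

```python
def split_nodes(total_nodes, ocean_to_ice_ratio=9):
    """Split nodes between ocean and ice, as close as possible to the ratio.

    Always keeps at least one node for each component.
    """
    if total_nodes < 2:
        raise ValueError("need at least 2 nodes to split between ocean and ice")
    ice = max(1, round(total_nodes / (ocean_to_ice_ratio + 1)))
    ocean = total_nodes - ice
    return ocean, ice


for n in (10, 20, 40):
    ocean, ice = split_nodes(n)
    print(f"{n} nodes -> ocean {ocean}, ice {ice}")
```

Totals that are not a multiple of 10 cannot hit the ratio exactly, so the split is only approximate in those cases.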

EK updates

  • a recent update to WW3 means that we can no longer generate mod_def.ww3 - investigating
  • 1deg crashing in Kara Sea. AK: Worth looking at whether OM3 is crashing in Kara Str. Had to apply Rayleigh damping in various locations in ACCESS-OM2 (Indonesian straits at 1°, Kara Strait at 0.25° and at 0.1°). Also should check out topog.

AS updates

  • Off to US for CICE meeting and 3 weeks of leave. DS will take over TWG organisation for this period.

Next meeting

5th June, 11:00AM AEST

Here’s my attempt to summarise today’s TWG. Didn’t capture everything, so please add and correct as needed.

Date: 2024-06-05

Attendees:

DS:

  • attended MOM6 dev meeting

    • good that we now have a formal connection
    • Work is proceeding in multiple centres that we should be more closely paying attention to
    • much discussion on BGC
      • NCAR MARBL BGC model
        • ocean model agnostic
        • have been working on getting it working with MOM6
        • needed changes to NUOPC cap and a number of other places in MOM6 src
        • Mike Levy is leading this
        • PR imminent
        • changes overlap somewhat with Dougie’s generic tracer changes
        • they’ve had similar issues in handling BGC in vanishing layers, e.g. remineralising from sediment into vanished layer
    • COBALT - new project to overhaul it to produce COBALT v3 - still a generic tracer but will need MOM6 code to compile - so may need to include MOM6 code to run in MOM5
    • Angus told Bob we’re ready to become a MOM6 node
      • they’re happy for that to happen
      • they’re interested in tests with regional model - hopefully will happen once we have regression testing set up
      • we’re just waiting on a PR from release team to get testing working
      • AG - email from Marshall outlining process in getting on the MOM6 review team
    • Discussion of sinking schemes (cc @pearseb) - MOM6 sinking scheme is simple - only sink at a single rate - would like spatially variable sinking - generic tracer sinking is handled separately but still constant rate unless you implement yourself when updating sources.

Generic tracer wombat

  • running in OM3
  • trying to get running in ESM1.6
  • now running generic in ACCESS-OM2
  • a few differences, unclear if they are worrisome, will post on forum for feedback - main difference is detritus
  • Dougie must have misunderstood earlier conversation about WOMBAT virtual fluxes. Need to do virtual flux corrections to surface bgc fluxes to account for using salt restoring. Awkward because not possible without extending the generic tracer API to pass salt flux. Apparently not done in BLING, COBALT etc?

Using OM2 spack dev tools - mostly works well, some annoyances

We are nearly at the point of doing science parameter test runs with 0.25° ACCESS-OM3 config

ACCESS-OM2 BGC releases aren’t restart-reproducible (2x1day differs from 1x2day run) - only BGC tracers differ.

  • WOMBAT restarts - 2 files, one is for most tracers, the other for sediments, handled in different code sections - might be what is breaking repro?
  • Problem might go away when we use generic tracers for WOMBAT in ACCESS-OM2?
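The restart-reproducibility test (2x1day vs 1x2day) and one way it can fail can be sketched with a toy model. This is purely illustrative and is not WOMBAT code: the two-variable "model", its coefficients and the function names are all hypothetical, and whether a lost or mishandled restart field is the actual cause here is still only a suspicion.

```python
def step(state):
    """Advance a toy (tracer, sediment) model by one day."""
    tracer, sediment = state
    new_tracer = 0.95 * tracer + 0.05 * sediment
    new_sediment = 0.9 * sediment + 0.1 * tracer
    return (new_tracer, new_sediment)


def run(state, ndays):
    """Run the toy model for ndays from the given state."""
    for _ in range(ndays):
        state = step(state)
    return state


def restart_roundtrip(state, save_sediment=True):
    """Mimic writing and re-reading restarts, optionally losing the sediment field."""
    tracer, sediment = state
    return (tracer, sediment if save_sediment else 0.0)


initial = (1.0, 0.5)
one_by_two = run(initial, 2)  # single 2-day run

# Two 1-day runs with a complete restart in between.
good = run(restart_roundtrip(run(initial, 1)), 1)
# Two 1-day runs where the sediment field is not restored from the restart.
bad = run(restart_roundtrip(run(initial, 1), save_sediment=False), 1)

print("complete restart reproduces:", good == one_by_two)
print("incomplete restart reproduces:", bad == one_by_two)
```

The comparison is bitwise (exact equality), which is the standard a restart-repro check needs to meet.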

OM2 release plans