COSIMA TWG Meeting Minutes 2024

TWG summary from last week - a bit scrappy and incomplete so please fill in anything I missed.

Date: 2024-03-21

Attendees:

Performance scaling

ML - working on Micael’s performance tools, trying to reproduce results - 3 issues

  1. cesm driver fails to transfer some settings correctly to esmf - inconsistent with esmf docs - has worked out a workaround
  2. env vars in env section of config.yaml
  3. can’t run with >64 cores - hangs - cice problem?

suggests documenting these issues

MO: 1 a known issue - runconfig profiling settings ignored - need to use env vars

MO, AS: 3. a known problem in cice - can’t use >76 cice cores. Not a hard limit - due to a parameter setting - to do with not using roundrobin; not relevant since cice doesn’t scale to that core count anyway

MO: surface forcing the culprit for bad MOM6 scaling - specific to NUOPC cap; not seen in panan etc which use FMS - now trying to identify in more detail - a lot of load imbalance - worse if launching many jobs at once - an IO issue?? but nothing obviously IO related in code region. mom_surface_forcing file. Adding extra profiling regions. Trial and error.

DS: cap converts ESMF fields to MOM fields

AS: is it reading salt restoring

DS: looks like it - there are some salinity restoring io calls (see time_interp_external)

Model evaluation

AK: ENKF-C may be worth looking at for model-obs comparison

DS: Clothilde was planning to try this for eReefs - see how they went with it

Input directory structure

DS: issue with moving all inputs to vk83 for repro CI - how to structure it? Poll - vote! issue Move inputs to `vk83` · Issue #115 · COSIMA/access-om3 · GitHub

  • option A: version at top level
  • option B: version at innermost level

explicit full path specification for all individual input files in config.yaml

MO: sandbox 0.x.0 → 0.2.0 easier if versioned at top level but no strong pref

DS: linking version of input to version of exe - might be a pain if we want to do a lot of updates. But flipping could lead to a lot of versions that never really got used

AK: use symlinks?

AHeer: Kelsey say symlinks will burn you in the end. Flipped model (option B) is easier and clearer for users and doesn’t need symlinks. That’s what is being done and best for OM2 release

AH: let’s just go with flipped (option B) then, since no strong opinions

MO: sandbox could be useful for dev - some way to build test exes / configs to play with without doing a release - how we set things up for devs can be independent of how we do releases

AHeer: has to be on vk85 or tm70

AS: git-lfs ? each dev with their own fork?

AHeer: quickly run into file size limits with high res

DS: have to pay for lots of storage - not too expensive, maybe $5-10/mo for OM3 (without forcing files)

AHeer: try it out?

AH: happy to cover storage charges

AHeer: Or Tiledb - that does actual diffs on binary files (unlike git-lfs) - has a free version

AHeer: both manifests and git-lfs store hashes

DS: but git-lfs also stores revision history

AS: each file change doubles the storage for that file (doesn’t store deltas)

DS: could get very expensive - to investigate before deciding - actually probably unaffordable Slack

Namelist disussion: diabatic_first

DS: Namelist disussion: diabatic_first - we set to true - do we mind if we set it false (default) as it changes order of ops for generic tracers to be closer to MOM5-wombat - updates tracers in dynamic step

AH: don’t know why this is true

DS: setting comes from ncar - all our cosima mom6 configs and mom6-examples have it false

AH: will ask Marshall

AK: is it related to NUOPC cap?

DS: will check

Restart issue

EK: looking into restart file issue and looking at parameter and 0.25 restart - runs well except for restart

Next meeting

3 April, usual time