TWG summary from last week - a bit scrappy and incomplete so please fill in anything I missed.
Date: 2024-03-21
Attendees:
- Anton Steketee (AS) @anton
- Andrew Kiss (AK) @aekiss
- Andy Hogg (AH) @AndyHoggANU
- Aidan Heerdegen (AHeer) @Aidan
- Ezhil Kannadasan (EK) @ezhilsabareesh8
- Dougie Squire (DS) @dougiesquire
- Micael Oliveira (MO) @micael
- Martin Dix (MD) @MartinDix
- Minghang Li (ML) @minghangli
Performance scaling
ML - working on Micael’s performance tools, trying to reproduce results - 3 issues
1. cesm driver fails to transfer some settings correctly to esmf - inconsistent with esmf docs - has worked out a workaround
2. env vars in env section of config.yaml
3. can’t run with >64 cores - hangs - cice problem?
suggests documenting these issues
MO: 1. is a known issue - runconfig profiling settings are ignored - need to use env vars instead
MO, AS: 3. a known problem in cice - can’t use >76 cice cores. Not a hard limit - due to a parameter setting - to do with not using roundrobin; not relevant since cice doesn’t scale to that core count anyway
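As a rough illustration of MO's point that profiling has to be switched on through env vars rather than the runconfig settings, the sketch below renders an `env` section of the kind that could be pasted into config.yaml. `ESMF_RUNTIME_PROFILE` and `ESMF_RUNTIME_PROFILE_OUTPUT` are environment variables described in the ESMF documentation; the values chosen and the helper name `render_env_section` are assumptions for illustration, not the settings ML or MO actually used.

```python
def render_env_section(env_vars):
    """Return YAML text for an `env:` mapping (hypothetical helper)."""
    lines = ["env:"]
    for name, value in env_vars.items():
        # Quote values so YAML does not reinterpret e.g. ON as a boolean.
        lines.append(f'    {name}: "{value}"')
    return "\n".join(lines)


# Assumed example values; check the ESMF docs for the options you need.
profiling_env = {
    "ESMF_RUNTIME_PROFILE": "ON",
    "ESMF_RUNTIME_PROFILE_OUTPUT": "SUMMARY",
}

print(render_env_section(profiling_env))
```

Keeping the variables in one dictionary makes it easy to document exactly which settings were needed to reproduce a given set of profiling results.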
MO: surface forcing the culprit for bad MOM6 scaling - specific to NUOPC cap; not seen in panan etc which use FMS - now trying to identify in more detail - a lot of load imbalance - worse if launching many jobs at once - an IO issue?? but nothing obviously IO related in code region. mom_surface_forcing file. Adding extra profiling regions. Trial and error.
DS: cap converts ESMF fields to MOM fields
AS: is it reading salt restoring?
DS: looks like it - there are some salinity restoring io calls (see time_interp_external)
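One common way to put a number on the load imbalance MO mentioned is the ratio of the slowest rank's time to the mean time across ranks for a profiled region. The sketch below assumes per-rank timings have already been extracted from the profiling output; the function name and the timings are made up for illustration and are not results from the meeting.

```python
def load_imbalance(times):
    """Ratio of the slowest rank's time to the mean; 1.0 means balanced."""
    if not times:
        raise ValueError("need at least one timing")
    mean = sum(times) / len(times)
    if mean == 0:
        raise ValueError("mean time is zero, ratio is undefined")
    return max(times) / mean


# Made-up per-rank timings (seconds) for a single profiled region.
rank_times = [1.0, 1.0, 1.0, 3.0]
print(f"imbalance = {load_imbalance(rank_times):.2f}")
```

Comparing this ratio between a single job and many jobs launched at once would be one way to test the suspicion that the imbalance is IO-related.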
Model evaluation
AK: ENKF-C may be worth looking at for model-obs comparison
DS: Clothilde was planning to try this for eReefs - see how they went with it
Input directory structure
DS: issue with moving all inputs to vk83 for repro CI - how to structure it? Poll - vote! See issue #115 "Move inputs to `vk83`" in COSIMA/access-om3 on GitHub
- option A: version at top level
- option B: version at innermost level
explicit full path specification for all individual input files in config.yaml
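To make the two options concrete, here is a minimal sketch of how a full input path would be assembled under each layout. The root directory, component name, version label and file name are all hypothetical placeholders, and placing the version directly under a component directory for option B is an assumption about what "innermost level" means; see the issue for the actual proposals.

```python
from pathlib import PurePosixPath


def option_a(root, version, component, filename):
    """Version at top level: <root>/<version>/<component>/<filename>."""
    return PurePosixPath(root) / version / component / filename


def option_b(root, version, component, filename):
    """Version at innermost level: <root>/<component>/<version>/<filename>."""
    return PurePosixPath(root) / component / version / filename


# Hypothetical names, for illustration only.
args = ("/path/to/inputs", "v1", "mom", "topog.nc")
print(option_a(*args))  # /path/to/inputs/v1/mom/topog.nc
print(option_b(*args))  # /path/to/inputs/mom/v1/topog.nc
```

Under option A a new version creates a whole new tree, whereas under option B only the inputs that change get a new version directory, which is why explicit full paths for each file in config.yaml go naturally with it.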
MO: sandbox 0.x.0 → 0.2.0 easier if versioned at top level but no strong pref
DS: linking version of input to version of exe - might be a pain if we want to do a lot of updates. But flipping could lead to a lot of versions that never really got used
AK: use symlinks?
AHeer: Kelsey says symlinks will burn you in the end. Flipped model (option B) is easier and clearer for users and doesn’t need symlinks. That’s what is being done and best for OM2 release
AH: let’s just go with flipped (option B) then, since no strong opinions
MO: sandbox could be useful for dev - some way to build test exes / configs to play with without doing a release - how we set things up for devs can be independent of how we do releases
AHeer: has to be on vk85 or tm70
AS: git-lfs ? each dev with their own fork?
AHeer: quickly run into file size limits with high res
DS: have to pay for lots of storage - not too expensive, maybe $5-10/mo for OM3 (without forcing files)
AHeer: try it out?
AH: happy to cover storage charges
AHeer: Or Tiledb - that does actual diffs on binary files (unlike git-lfs) - has a free version
AHeer: both manifests and git-lfs store hashes
DS: but git-lfs also stores revision history
AS: each file change doubles the storage for that file (doesn’t store deltas)
DS: could get very expensive - to investigate before deciding - actually probably unaffordable (see Slack)
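AS's point about storage can be turned into a back-of-envelope estimate: if every revision of a file is kept as a full copy with no deltas, total storage grows linearly with the number of changes. The sketch below assumes the file size stays the same across revisions; the function name and the numbers are illustrative, not measured OM3 input sizes or real pricing.

```python
def full_copy_storage_gb(file_size_gb, n_changes):
    """Storage used when the original plus every changed revision is kept whole."""
    if n_changes < 0:
        raise ValueError("n_changes must be non-negative")
    return file_size_gb * (1 + n_changes)


# Illustrative only: a 2 GB file changed once, then five times.
print(full_copy_storage_gb(2.0, 1))  # 4.0, i.e. the first change doubles it
print(full_copy_storage_gb(2.0, 5))  # 12.0
```

Summing this over all input files and an expected update frequency gives a quick upper bound to compare against a storage budget before committing to the approach.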
Namelist discussion: diabatic_first
DS: diabatic_first - we set it to true - do we mind if we set it to false (the default)? That changes the order of ops for generic tracers to be closer to MOM5-wombat - updates tracers in dynamic step
AH: don’t know why this is true
DS: setting comes from NCAR - all our COSIMA MOM6 configs and MOM6-examples have it false
AH: will ask Marshall
AK: is it related to NUOPC cap?
DS: will check
Restart issue
EK: looking into restart file issue and looking at parameter and 0.25 restart - runs well except for restart
Next meeting
3 April, usual time