Hello all.
TL;DR: I spent a long time chasing down a fault in the ACCESS-rAM3 EC GRIB reconfiguration task which turned out to be caused by changing the Intel compiler version used to build UMv13.5.
Background: the CoE 21stCenturyWeather wished to obtain an updated version of ACCESS-rAM3 because of:
- its ability to use OSTIA SST input
- a required fix to convective rainfall initialisation in the GAL9 model
The latest branch used by Chermelle is: https://code.metoffice.gov.uk/trac/roses-u/browser/b/y/3/9/5/u-by395_nci_access_ram3
I made a branch of this branch and merged in my changes, which were confined to:
- Building a UM executable with Sapphire Rapids optimisation
- I/O improvements
- Different domain definitions
The suite would fail at ec_um_recon with:
???????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 30
? Error from routine: RCF_RESET_DATA_SOURCE
? Error message: Section 0 Item 7 : Required field is not in input dump!
? Error from processor: 0
? Error number: 3
????????????????????????????????????????????????????????????????????????????????
Comparing the job outputs of the old and new tasks showed that the list of OUTPUT fields in the new version's log file was far more expansive. The error is caused by trying to reconfigure new fields that aren't in the original EC GRIB files.
The namelists loaded in both recon jobs were identical. The extra STASH variables the newer executable wanted were primarily atmospheric tracers, e.g.:
0 71 480000 1 33 74 11320 ATM TRACER 74 AFTER TS
0 71 480000 1 33 156 17142 WATER TRACER GRAUPEL (WTRAC%QGR)
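For anyone wanting to repeat the comparison, here is a minimal sketch of pulling the extra fields out of two recon logs with diff. The log files and their contents below are made-up stand-ins for the real job.out files, not the actual UM log layout:

```shell
#!/bin/sh
# Illustrative stand-ins for the old and new ec_um_recon logs.
cat > old_recon.log <<'EOF'
OUTPUT FIELDS
0 71 480000 1  0  2 11002 U COMPNT OF WIND AFTER TIMESTEP
0 71 480000 1  0  3 11003 V COMPNT OF WIND AFTER TIMESTEP
EOF
cat > new_recon.log <<'EOF'
OUTPUT FIELDS
0 71 480000 1  0  2 11002 U COMPNT OF WIND AFTER TIMESTEP
0 71 480000 1  0  3 11003 V COMPNT OF WIND AFTER TIMESTEP
0 71 480000 1 33 74 11320 ATM TRACER 74 AFTER TS
0 71 480000 1 33 156 17142 WATER TRACER GRAUPEL (WTRAC%QGR)
EOF
# Lines prefixed '>' are fields requested only by the new executable:
diff old_recon.log new_recon.log | grep '^>'
```

With the real logs you would point the two `cat` stand-ins at the actual job.out files; only the tracer fields should fall out of the diff.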
So I created this suite: https://code.metoffice.gov.uk/trac/roses-u/browser/b/y/3/9/5/Troubleshoot, which is a clone of u-by395/nci_access_ram3.
I then began changing this suite from the standard config into the Flagship configuration one step at a time. You can see the revision log here: https://code.metoffice.gov.uk/trac/roses-u/log/b/y/3/9/5/Troubleshoot. To summarise:
- Set BUILD_MODE to new and compiled a new executable with Sapphire Rapids optimisation. The EC reconfiguration task and the 12 km outer Lismore domain ran fine.
- Enabled OSTIA inputs. I cleaned the suite (i.e. removed all data in ~/cylc-run/u-by395) and ran the suite again. It worked fine.
- Then I changed to the Intel compiler used in the UMv13.0 Flagship. This reproduced the ec_um_recon failure detailed above, i.e. asking for STASH variables in the output dump (primarily atmospheric tracers) that weren't present in the input dump.
The change was:
#module load intel-compiler/2021.5.0
module load intel-compiler-llvm/2025.0.4
I've reverted to intel-compiler/2021.5.0, and a suite running a UMv13.5 executable built for Sapphire Rapids works fine.
Why did I use intel-compiler-llvm/2025.0.4 for the first Flagship? Because NCI recommends it. See https://opus.nci.org.au/spaces/Help/pages/213942400/Sapphire+Rapids+Compute+Nodes:
"We recommend using the latest Intel LLVM Compiler for these nodes (currently 2023.0.0, check for updated versions installed on Gadi with module avail intel-compiler-llvm) with options to build your code for use on all nodes in Gadi."
Why a UM reconfiguration executable compiled with a different Fortran compiler would ask for such different STASH outputs is a mystery to me.
Has anyone ever encountered similar behaviour before?
To be honest I still can't quite believe that this was the issue, so if anyone wants to check my sanity, you can run this suite: https://code.metoffice.gov.uk/trac/roses-u/browser/b/y/3/9/5/Troubleshoot?rev=333075 and this one: https://code.metoffice.gov.uk/trac/roses-u/browser/b/y/3/9/5/Troubleshoot?rev=333077 and see if you get the same answers.
If people are interested, I kept all the build and job logs for fcm_make_um, fcm_make2_um and ec_um_recon for those two suites. I haven't done a detailed comparison yet; it might make for interesting reading.
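When I get to that comparison, the most likely place for the answer is the compile lines in the fcm_make2_um logs, since the classic module provides ifort and the LLVM module provides ifx, and the two may apply different flags or preprocessor defines. A minimal sketch, with made-up compile lines standing in for the real log contents:

```shell
#!/bin/sh
# Illustrative stand-ins for one compile line from each build log.
cat > build_2021.log <<'EOF'
ifort -O2 -xSAPPHIRERAPIDS -c rcf_reset_data_source.F90
EOF
cat > build_2025.log <<'EOF'
ifx -O2 -xSAPPHIRERAPIDS -c rcf_reset_data_source.F90
EOF
# Split each line into tokens, then list tokens unique to either build
# (compiler name, flags, -D macros); comm needs sorted input.
tr ' ' '\n' < build_2021.log | sort > a.txt
tr ' ' '\n' < build_2025.log | sort > b.txt
comm -3 a.txt b.txt
```

On the real logs you would loop this over every compile line; any `-D` macro that appears in only one column would be a strong suspect for the change in STASH behaviour.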
But on the other hand, given the limited lifespan of the UM, is it worth investigating further?