Output STASH fields for reconfiguration change depending on Intel compiler version used to build UMv13.5

Hello all.

TL;DR: I spent a long time chasing down a fault in the ACCESS-rAM3 EC grib reconfiguration task, which is caused by changing the Intel compiler version used to build UMv13.5.

Background: The CoE 21stCenturyWeather wished to obtain an updated version of ACCESS-rAM3 because of

  • Its ability to use OSTIA SST input
  • A required fix for convection rainfall initialisation for the GAL9 model

The latest branch used by Chermelle is: https://code.metoffice.gov.uk/trac/roses-u/browser/b/y/3/9/5/u-by395_nci_access_ram3

I made a branch of this branch and merged my changes, which were confined to

  • Building a UM executable with Sapphire Rapids optimisation
  • I/O improvements
  • Different domain definitions

The suite would fail at ec_um_recon with

???????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 30
?  Error from routine: RCF_RESET_DATA_SOURCE
?  Error message: Section   0 Item     7 : Required field is not in input dump!
?  Error from processor: 0
?  Error number: 3
????????????????????????????????????????????????????????????????????????????????

Comparing the job outputs for the old and new tasks showed that the list of OUTPUT fields in the log file was far more extensive in the new version. The error is caused by trying to reconfigure new fields that aren’t in the original EC grib files.

The namelists loaded in both recon jobs were identical. The extra STASH variables the job wanted to use with the newer executable were primarily atmospheric tracers, e.g.

0   71  480000   1    33    74 11320  ATM TRACER 74               AFTER TS
0   71  480000   1    33   156 17142  WATER TRACER GRAUPEL (WTRAC%QGR)
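For anyone wanting to repeat the comparison, here is a minimal Python sketch of how the extra OUTPUT fields can be picked out of two job logs. It assumes that every field line in the log starts with a digit, as the two example lines above do; the function names are made up for illustration, and the filter may need adjusting for the real log layout.

```python
def field_lines(log_text):
    """Collect whitespace-normalised lines that start with a digit.

    Assumption: STASH field lines in the recon log begin with a digit,
    like the two example lines quoted in this post.
    """
    lines = set()
    for raw in log_text.splitlines():
        line = " ".join(raw.split())  # collapse runs of spaces
        if line and line[0].isdigit():
            lines.add(line)
    return lines


def extra_fields(old_log_text, new_log_text):
    """Return field lines present in the new log but not in the old one."""
    return sorted(field_lines(new_log_text) - field_lines(old_log_text))
```

Reading the two job logs into strings and passing them to `extra_fields` gives the list of fields requested only by the newer executable.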

So I created this suite: https://code.metoffice.gov.uk/trac/roses-u/browser/b/y/3/9/5/Troubleshoot

which was a clone of u-by395/nci_access_ram3.

I then began changing this suite from the standard config into the Flagship configuration one step at a time. You can see the revision log here: https://code.metoffice.gov.uk/trac/roses-u/log/b/y/3/9/5/Troubleshoot. To summarise:

  1. Set BUILD_MODE to new and compiled a new executable with sapphire rapids optimisation. The ec reconfiguration task and the 12km outer Lismore domain ran fine.
  2. Enabled OSTIA inputs. I cleaned the suite (i.e. removed all data in ~/cylc-run/u-by395) and ran the suite again. It worked fine.
  3. Then I changed to the Intel compiler used in the UMv13.0 Flagship. This reproduced the ec_um_recon failure detailed above, i.e. the task asked for STASH variables in the output dump (primarily atmospheric tracers) that weren’t present in the input dump.

The change is

 #module load intel-compiler/2021.5.0
module load intel-compiler-llvm/2025.0.4

I’ve reverted to intel-compiler/2021.5.0, and a suite running with a UMv13.5 executable built for Sapphire Rapids works fine.

Why did I use intel-compiler-llvm/2025.0.4 for the first flagship? Because NCI recommends it. See https://opus.nci.org.au/spaces/Help/pages/213942400/Sapphire+Rapids+Compute+Nodes

We recommend using the latest Intel LLVM Compiler for these nodes (currently 2023.0.0, check for updated versions installed on Gadi with module avail intel-compiler-llvm ) with options to build your code for use on all nodes in Gadi.

Why a UM reconfiguration executable compiled with different Fortran compilers will ask for such different STASH outputs is a mystery to me.

Has anyone ever encountered similar behaviour before?

To be honest I still can’t quite believe that this was the issue, so if anyone wants to check my sanity, you can run this suite: https://code.metoffice.gov.uk/trac/roses-u/browser/b/y/3/9/5/Troubleshoot?rev=333075 ,
and this one : https://code.metoffice.gov.uk/trac/roses-u/browser/b/y/3/9/5/Troubleshoot?rev=333077 ,
and see if you get the same answers.

If people are interested, I kept all the build and job logs for fcm_make_um, fcm_make2_um and ec_um_recon for those two suites. I haven’t done a detailed comparison yet. It might make for interesting reading.

But on the other hand, given the limited lifespan of the UM, is it worth investigating further?

Check the namelists of your two reconfiguration runs in their work directories to make sure they were actually running the same config. I would be very surprised if the compiler choice made an impact.

Ok, I’ve copied ~/cylc-run/u-by395/work/20220226T0000Z/ec_um_recon_000/ for both configurations and compared them.

All input ASCII files are identical.

A quick iris script says the atmos.recontmp files in both directories are also identical.
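In case it helps anyone repeat the check, this is a rough sketch of a byte-level comparison of two copied work directories using only the Python standard library. The function names are hypothetical, and the two directory arguments stand for wherever the copies were saved. Note that it compares raw bytes, which is stricter than comparing field contents in iris, so a binary file could be flagged here even if its data match.

```python
import hashlib
from pathlib import Path


def file_digests(directory):
    """Map each file's path (relative to directory) to its SHA-256 digest."""
    root = Path(directory)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            sha = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large dumps are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    sha.update(chunk)
            digests[str(path.relative_to(root))] = sha.hexdigest()
    return digests


def compare_dirs(dir_a, dir_b):
    """Return (only in a, only in b, present in both but different)."""
    a, b = file_digests(dir_a), file_digests(dir_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differing = sorted(name for name in set(a) & set(b) if a[name] != b[name])
    return only_a, only_b, differing
```

If all three returned lists are empty, the two directories hold the same files with the same contents.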