ACCESS-rAM3 Release 1.0 Feedback

About

This topic is a catch-all location for feedback for the ACCESS-rAM3 Release 1.0.

Please reply to this topic if you have feedback on the ACCESS-rAM3 release, whether to point out problems encountered or to highlight what worked well.

If your feedback is involved and will require specific help, feel free to
create a new topic on the ACCESS-Hive Forum.

If you’re not sure, reply here and your query can be moved to a separate topic if required.

Thanks NRI for the new release, and congratulations on the improvements implemented. I’ve just successfully completed a test of the latest branch access_rel_1 for the ancil and running suites.

One thing whose implications I hadn’t fully understood was the new netCDF functionality (as I had been working from my own branch). I know this is a complex issue, and we asked for netCDF support, but I think there is scope to improve some of it. For example:

1. Variable name issue

Previously, when loading files with iris, inspection of an output file (e.g. umnsaa_pvera) showed almost all variable names as simple text (e.g. “air_temperature”).
Now on inspection we only get a list of STASH IDs (e.g. “STASH_m01s03i236”) as variable names. This makes it difficult to navigate the data, especially for new users.

2. Dimension issue

Previously, when loading a cube with iris and converting to xarray, metadata would come across clearly (e.g. the array would be called “air_temperature”, with dimensions [“time”, “latitude”, “longitude”]).
Now metadata is converted with non-standard names (e.g. time is “VT1HR”, latitude is “grid_latitude_t” and longitude is “grid_longitude_t”).
Perhaps there are good reasons not to simply rename everything as “time” if the profiles are defined differently (e.g. instantaneous on the hour, hour average, or max in the hour), but this is a change to previous functionality, so it should be documented. Personally, I would prefer “time” as a standard coordinate name, with the method detail included in attributes.
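As a sketch of what I mean (synthetic data; the `um_time_profile` attribute name is my own invention, not part of the release):

```python
import numpy as np
import xarray as xr

# Minimal stand-in for one variable from the new netCDF output; the data is
# fabricated, only the names follow the release's conventions.
da = xr.DataArray(
    np.zeros((2, 3, 3)),
    dims=["VT1HR", "grid_latitude_t", "grid_longitude_t"],
    name="STASH_m01s03i236_2",
    attrs={"standard_name": "air_temperature", "cell_methods": "VT1HR: point"},
)

# Rename to standard coordinate names, keeping the time-profile detail in
# attributes rather than in the dimension name.
da = da.rename(
    {"VT1HR": "time", "grid_latitude_t": "latitude", "grid_longitude_t": "longitude"}
)
da.attrs["um_time_profile"] = "VT1HR"     # hypothetical attribute
da.attrs["cell_methods"] = "time: point"  # method detail kept, now on "time"

print(da.dims)  # ('time', 'latitude', 'longitude')
```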

3. Time profile issue

Previously, where multiple time profiles were defined for a single variable (e.g. by default for air temperature in STASHPACK1 there are VTS1 and VT1HR: for the first model timestep and instantaneous on the hour), iris would concatenate these into a single cube.
Now, with multiple time profiles for a particular variable, multiple variables are created (e.g. for air temperature there are “STASH_m01s03i236” and “STASH_m01s03i236_2”), making it more difficult again to navigate. In general it is the _2 ones that I want to use (instantaneous on the hour, rather than the first-timestep output).

I know there are no quick solutions to the above, and the UM systems don’t make it easy. For example, I don’t know why the standard UM STASHPACK1 (for as long as I can remember) has output a single value at the first timestep for each variable. In free-running mode this means every cycle has to have that first timestep removed (or there are two values at the same time). Perhaps the Met Office included this because of some verification (i.e. with RES). But we could consider removing those single-timestep outputs, seeing as they cause issues with the new netCDF functionality.
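For example, the duplicated value at each cycle boundary can be stripped when joining cycles (synthetic data standing in for two cycles; `drop_duplicates` keeps the first occurrence of each timestamp):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two synthetic "cycles" whose time axes overlap at the cycle boundary,
# mimicking the first-timestep output duplicating the previous cycle's value.
times1 = pd.date_range("2022-02-26T00:00", periods=3, freq="h")
times2 = pd.date_range("2022-02-26T02:00", periods=3, freq="h")  # overlap at 02:00
cycle1 = xr.DataArray(np.arange(3.0), dims="time", coords={"time": times1})
cycle2 = xr.DataArray(np.arange(3.0) + 10, dims="time", coords={"time": times2})

merged = xr.concat([cycle1, cycle2], dim="time")
# Keep only the first value at each timestamp, dropping the boundary duplicate.
merged = merged.drop_duplicates(dim="time", keep="first")

print(merged.time.size)  # 5
```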

I also know that designing output workflows (e.g. for AUS2200 or BARRA or ESM) takes a huge amount of development effort (so thank you). I don’t know whether the standard STASHPACK1 is useful for the ACCESS community (would need discussion), but I think future efforts should focus on adapting the work from AUS2200 to make a useful set of outputs commonly used by researchers (as opposed to the NWP validation STASHPACKS that come from the Met Office).

Perhaps agreeing on better output workflows and/or stashpacks could be part of an ongoing discussion at the Atmosphere working groups, and a breakout session at the upcoming ACCESS annual workshop?

Examples of what I’m describing are included below:

import iris
import xarray as xr
import matplotlib.pyplot as plt

xr.set_options(display_max_rows=50)
datapath = '/scratch/fy29/mjl561/cylc-run/u-dg768/share/cycle/20220226T0000Z/Lismore/d0198/RAL3P2/um'

#### 1. VARIABLE NAME ISSUE

# PREVIOUSLY with iris
cbs = iris.load(f'{datapath}/umnsaa_pvera000')
print(cbs)
 
'''
0: m01s01i202 / (unknown)              (time: 25; latitude: 450; longitude: 450)
1: Turbulent mixing height after boundary layer / (m) (time: 25; latitude: 450; longitude: 450)
2: m01s03i253 / (unknown)              (time: 24; latitude: 450; longitude: 450)
3: Cumulus capped boundary layer indicator / (1) (time: 24; latitude: 450; longitude: 450)
4: air_temperature / (K)               (time: 25; latitude: 450; longitude: 450)
5: atmosphere_boundary_layer_thickness / (m) (time: 8; latitude: 450; longitude: 450)
6: dew_point_temperature / (K)         (time: 25; latitude: 450; longitude: 450)
7: fog_area_fraction / (1)             (time: 25; latitude: 450; longitude: 450)
8: land_binary_mask / (1)              (latitude: 450; longitude: 450)
9: relative_humidity / (%)             (time: 25; latitude: 450; longitude: 450)
10: sea_ice_area_fraction / (1)         (time: 24; latitude: 450; longitude: 450)
11: surface_air_pressure / (Pa)         (time: 8; latitude: 450; longitude: 450)
12: surface_altitude / (m)              (latitude: 450; longitude: 450)
13: surface_downwelling_longwave_flux_in_air / (W m-2) (time: 24; latitude: 450; longitude: 450)
14: surface_downwelling_shortwave_flux_in_air / (W m-2) (time: 24; latitude: 450; longitude: 450)
15: surface_net_downward_longwave_flux / (W m-2) (time: 24; latitude: 450; longitude: 450)
16: surface_snow_amount / (kg m-2)      (time: 24; latitude: 450; longitude: 450)
17: surface_temperature / (K)           (time: 25; latitude: 450; longitude: 450)
18: surface_upward_latent_heat_flux / (W m-2) (time: 24; latitude: 450; longitude: 450)
19: surface_upward_sensible_heat_flux / (W m-2) (time: 24; latitude: 450; longitude: 450)
20: toa_incoming_shortwave_flux / (W m-2) (time: 24; latitude: 450; longitude: 450)
21: toa_outgoing_longwave_flux / (W m-2) (time: 24; latitude: 450; longitude: 450)
22: toa_outgoing_shortwave_flux / (W m-2) (time: 25; latitude: 450; longitude: 450)
23: toa_outgoing_shortwave_flux / (W m-2) (time: 24; latitude: 450; longitude: 450)
24: visibility_in_air / (m)             (time: 25; latitude: 450; longitude: 450)
25: wind_speed_of_gust / (m s-1)        (time: 24; latitude: 450; longitude: 450)
26: x_wind / (m s-1)                    (time: 25; latitude: 451; longitude: 450)
27: y_wind / (m s-1)                    (time: 25; latitude: 451; longitude: 450)
'''


# NOW with netcdf
ds = xr.open_dataset(f'{datapath}/nc/umnsaa_pvera000.nc')
print(ds)

'''
<xarray.Dataset> Size: 506MB
Dimensions:                     (grid_longitude_t: 450, grid_latitude_t: 450,
                                 bounds2: 2, VT1HR: 12, VTS0_0: 1, VT3HR: 4,
                                 VTS1: 1, VTS1_rad_diag: 1, VT1HR_rad_diag: 11,
                                 grid_longitude_uv: 450, grid_latitude_uv: 451,
                                 height_10m: 1, height_1_5m: 1, VT1HRMAX: 12)
Coordinates:
  * grid_longitude_t            (grid_longitude_t) float64 4kB 148.8 ... 157.7
    longitude_t                 (grid_latitude_t, grid_longitude_t) float64 2MB ...
  * grid_latitude_t             (grid_latitude_t) float64 4kB -32.95 ... -24.06
    latitude_t                  (grid_latitude_t, grid_longitude_t) float64 2MB ...
  * VT1HR                       (VT1HR) datetime64[ns] 96B 2022-02-26T01:00:0...
  * VTS0_0                      (VTS0_0) datetime64[ns] 8B 2022-02-26
  * VT3HR                       (VT3HR) datetime64[ns] 32B 2022-02-26T03:00:0...
  * VTS1                        (VTS1) datetime64[ns] 8B 2022-02-26T00:01:00
  * VTS1_rad_diag               (VTS1_rad_diag) datetime64[ns] 8B 2022-02-26T...
  * VT1HR_rad_diag              (VT1HR_rad_diag) datetime64[ns] 88B 2022-02-2...
  * grid_longitude_uv           (grid_longitude_uv) float64 4kB 148.8 ... 157.7
    longitude_uv                (grid_latitude_uv, grid_longitude_uv) float64 2MB ...
  * grid_latitude_uv            (grid_latitude_uv) float64 4kB -32.96 ... -24.05
    latitude_uv                 (grid_latitude_uv, grid_longitude_uv) float64 2MB ...
  * height_10m                  (height_10m) float64 8B 10.0
  * height_1_5m                 (height_1_5m) float64 8B 1.5
  * VT1HRMAX                    (VT1HRMAX) datetime64[ns] 96B 2022-02-26T01:0...
Dimensions without coordinates: bounds2
Data variables:
    rotated_latitude_longitude  |S1 1B ...
    grid_longitude_t_bounds     (grid_longitude_t, bounds2) float64 7kB ...
    grid_latitude_t_bounds      (grid_latitude_t, bounds2) float64 7kB ...
    STASH_m01s00i023            (VT1HR, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s00i024            (VTS0_0, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s00i024_2          (VT1HR, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s00i025            (VT3HR, grid_latitude_t, grid_longitude_t) float64 6MB ...
    STASH_m01s00i030            (VTS0_0, grid_latitude_t, grid_longitude_t) int32 810kB ...
    STASH_m01s00i031            (VT1HR, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s00i033            (VTS0_0, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s00i409            (VT3HR, grid_latitude_t, grid_longitude_t) float64 6MB ...
    STASH_m01s01i202            (VTS1, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s01i202_2          (VT1HR, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s01i205            (VTS1, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s01i205_2          (VT1HR, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s01i207_2          (VT1HR, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s01i208            (VTS1_rad_diag, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s01i208_2          (VT1HR_rad_diag, grid_latitude_t, grid_longitude_t) float64 18MB ...
    STASH_m01s01i235            (VTS1_rad_diag, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s01i235_2          (VT1HR_rad_diag, grid_latitude_t, grid_longitude_t) float64 18MB ...
    STASH_m01s02i201            (VTS1_rad_diag, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s02i201_2          (VT1HR_rad_diag, grid_latitude_t, grid_longitude_t) float64 18MB ...
    STASH_m01s02i205            (VTS1_rad_diag, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s02i205_2          (VT1HR_rad_diag, grid_latitude_t, grid_longitude_t) float64 18MB ...
    STASH_m01s02i207            (VTS1_rad_diag, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s02i207_2          (VT1HR_rad_diag, grid_latitude_t, grid_longitude_t) float64 18MB ...
    STASH_m01s03i217            (VT1HR, grid_latitude_t, grid_longitude_t) float64 19MB ...
    grid_longitude_uv_bounds    (grid_longitude_uv, bounds2) float64 7kB ...
    grid_latitude_uv_bounds     (grid_latitude_uv, bounds2) float64 7kB ...
    STASH_m01s03i225            (VTS1, height_10m, grid_latitude_uv, grid_longitude_uv) float64 2MB ...
    STASH_m01s03i225_2          (VT1HR, height_10m, grid_latitude_uv, grid_longitude_uv) float64 19MB ...
    STASH_m01s03i226            (VTS1, height_10m, grid_latitude_uv, grid_longitude_uv) float64 2MB ...
    STASH_m01s03i226_2          (VT1HR, height_10m, grid_latitude_uv, grid_longitude_uv) float64 19MB ...
    STASH_m01s03i234            (VT1HR, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s03i236            (VTS1, height_1_5m, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s03i236_2          (VT1HR, height_1_5m, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s03i245            (VTS1, height_1_5m, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s03i245_2          (VT1HR, height_1_5m, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s03i248            (VTS1, height_1_5m, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s03i248_2          (VT1HR, height_1_5m, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s03i250            (VTS1, height_1_5m, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s03i250_2          (VT1HR, height_1_5m, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s03i253            (VT1HR, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s03i281            (VTS1, height_1_5m, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s03i281_2          (VT1HR, height_1_5m, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s03i304            (VTS1, grid_latitude_t, grid_longitude_t) float64 2MB ...
    STASH_m01s03i304_2          (VT1HR, grid_latitude_t, grid_longitude_t) float64 19MB ...
    STASH_m01s03i310            (VT1HR, grid_latitude_t, grid_longitude_t) float64 19MB ...
    VT1HRMAX_bounds             (VT1HRMAX, bounds2) datetime64[ns] 192B ...
    STASH_m01s03i463            (VT1HRMAX, grid_latitude_t, grid_longitude_t) float64 19MB ...
Attributes:
    Conventions:  CF-1.6
    source:       Met Office Unified Model v13.0

'''        

##### 2. DIMENSION ISSUE #####
# PREVIOUSLY with iris

cb = iris.load_cube(f'{datapath}/umnsaa_pvera000', 'air_temperature')
da = xr.DataArray.from_iris(cb)  # from_iris is a classmethod; no empty DataArray needed
print(da)

'''
<xarray.DataArray 'air_temperature' (time: 25, latitude: 450, longitude: 450)> Size: 20MB
dask.array<filled, shape=(25, 450, 450), dtype=float32, chunksize=(1, 450, 450), chunktype=numpy.ndarray>
Coordinates:
  * time                     (time) datetime64[ns] 200B 2022-02-26T00:01:00 ....
  * latitude                 (latitude) float32 2kB -32.96 -32.94 ... -24.06
  * longitude                (longitude) float32 2kB 148.8 148.9 ... 157.7 157.7
    forecast_reference_time  datetime64[ns] 8B ...
    height                   float64 8B ...
    forecast_period          (time) timedelta64[ns] 200B ...
Attributes:
    standard_name:  air_temperature
    units:          K
    source:         Data from Met Office Unified Model
    um_version:     13.0
    STASH:          m01s03i236
'''

# NOW with netcdf
ds = xr.open_dataset(f'{datapath}/nc/umnsaa_pvera000.nc')
da = ds['STASH_m01s03i236_2'] # note the underscore because of two clashing time profiles
print(da)

'''
<xarray.DataArray 'STASH_m01s03i236_2' (VT1HR: 12, height_1_5m: 1,
                                        grid_latitude_t: 450,
                                        grid_longitude_t: 450)> Size: 19MB
[2430000 values with dtype=float64]
Coordinates:
  * grid_longitude_t  (grid_longitude_t) float64 4kB 148.8 148.9 ... 157.7 157.7
    longitude_t       (grid_latitude_t, grid_longitude_t) float64 2MB ...
  * grid_latitude_t   (grid_latitude_t) float64 4kB -32.95 -32.94 ... -24.06
    latitude_t        (grid_latitude_t, grid_longitude_t) float64 2MB ...
  * VT1HR             (VT1HR) datetime64[ns] 96B 2022-02-26T01:00:00 ... 2022...
  * height_1_5m       (height_1_5m) float64 8B 1.5
Attributes:
    long_name:          TEMPERATURE AT 1.5M
    standard_name:      air_temperature
    units:              K
    cell_methods:       VT1HR: point
    grid_mapping:       rotated_latitude_longitude
    um_version:         13.0
    um_stash_source:    m01s03i236
    packing_method:     quantization
    precision_measure:  binary
    precision_value:    -6
'''

##### 3. TIME PROFILE ISSUE #####

# PREVIOUSLY with iris, which concatenates the time profiles in a cycle with glob
cb = iris.load_cube(f'{datapath}/umnsaa_pvera*', 'air_temperature')

# NOW with netcdf, we need to know the stash name, and whether there is a clash in time profiles

ds1 = xr.open_dataset(f'{datapath}/nc/umnsaa_pvera000.nc')['STASH_m01s03i236_2']
ds2 = xr.open_dataset(f'{datapath}/nc/umnsaa_pvera012.nc')['STASH_m01s03i236_2']
ds = xr.concat([ds1, ds2], dim='VT1HR')

# and then to simplify, rename dimensions and drop the unimportant ones
ds = ds.rename({'VT1HR': 'time', 'grid_latitude_t': 'latitude', 'grid_longitude_t': 'longitude'})
ds = ds.drop_vars(['height_1_5m', 'longitude_t', 'latitude_t'])

Hi Mathew,
Thanks for the feedback on UM NetCDF. The NetCDF support is currently based on UM NetCDF as documented at UMDP C11. This uses a translation from STASH to NetCDF CF that was based on Metarelate, as documented in UM Trac ticket #3370. In that Trac ticket I comment on the fact that the mapping is out of date, since Iris is now independent of Metarelate. This is one reason that we were asking for feedback on the NetCDF output. I will create a UM Trac ticket asking for the mappings in STASH_to_CF.txt to depend directly on Iris.

This, of course, will not directly address the variable name issue that you have reported. The problem is caused by the variable name being mapped to the STASH source um_stash_source rather than the standard_name. If you look at output such as share/cycle/20220227T0000Z/Lismore/d1000/GAL9/um/umnsaa_pvera000.nc using ncdump -h, you will see variables like:

	double STASH_m01s03i236_2(VT1HR, height_1_5m, grid_latitude_t, grid_longitude_t) ;
		STASH_m01s03i236_2:long_name = "TEMPERATURE AT 1.5M" ;
		STASH_m01s03i236_2:standard_name = "air_temperature" ;
		STASH_m01s03i236_2:units = "K" ;
		STASH_m01s03i236_2:coordinates = "longitude_t latitude_t" ;
		STASH_m01s03i236_2:cell_methods = "VT1HR: point" ;
		STASH_m01s03i236_2:grid_mapping = "rotated_latitude_longitude" ;
		STASH_m01s03i236_2:_FillValue = -1073741824. ;
		STASH_m01s03i236_2:um_version = "13.0" ;
		STASH_m01s03i236_2:um_stash_source = "m01s03i236" ;
		STASH_m01s03i236_2:packing_method = "quantization" ;
		STASH_m01s03i236_2:precision_measure = "binary" ;
		STASH_m01s03i236_2:precision_value = -6 ;

One problem with using the standard_name as the variable name is that it may not be unique in the file, so the variable name would need to be modified to distinguish variables. In this case, both STASH_m01s03i236 and STASH_m01s03i236_2 have standard_name "air_temperature". If standard_name were used instead, you might have (e.g.) variable names air_temperature and air_temperature_2.
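For what it’s worth, that suffixing scheme can be sketched as follows (an illustrative helper, not the actual UM NetCDF code):

```python
from collections import Counter

def dedup_names(standard_names):
    """Append _2, _3, ... to repeated standard_names, in encounter order."""
    counts = Counter()
    out = []
    for name in standard_names:
        counts[name] += 1
        out.append(name if counts[name] == 1 else f"{name}_{counts[name]}")
    return out

print(dedup_names(["air_temperature", "air_temperature", "relative_humidity"]))
# ['air_temperature', 'air_temperature_2', 'relative_humidity']
```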

Perhaps you could suggest a better specification for the mapping from STASH fields to NetCDF CF variables than the one that is currently being used by UM NetCDF?

I would also be interested to know what happens when you use Iris instead of Xarray to load the NetCDF output files. Would that help?

In any case, should we also provide optional STASHpacks that revert to using the previous format?


See MOSRS UM ticket #7930 for the UM to NetCDF CF translation tables.

@rbeucher How would existing MED tools help users of ACCESS-rAM3 to evaluate the UM NetCDF output of the u-dg768 Regional Nesting Suite?

CSET is the latest version of the regional evaluation suite I believe - CSET documentation


I’m not very familiar with the model’s outputs, but I suppose this is close to ESM1.6. I’d expect ILAMB to be able to handle data from the UM. We can also explore some ESMValTool diagnostics once the data has been CMORised. If you can share an example output, we can start drafting a few examples. The main thing we’ll need is a clear mapping of the variables to their CMIP equivalents. Tagging @RhaegarZeng

I agree with Romain: a sample output along with a clear variable mapping would be very helpful for further development.

Thanks Paul for your reply.

I would also be interested to know what happens when you use Iris instead of Xarray to load the NetCDF output files. Would that help?

Iris does load the netCDF with readable variable names. I do currently use iris in my workflow, but I assumed a big draw of netCDF outputs would be that people can use tools they are much more familiar with, like xarray and ncview. This solution would still require most users to convert from iris to xarray and then to netCDF, which negates a big reason for converting to netCDF in the suite.

Perhaps you could suggest a better specification for the mapping from STASH fields to NetCDF CF variables than the one that is currently being used by UM NetCDF?

um2nc.py renamed known variables to CMIP short names. Again, not ideal, just different. But CMIP short names are somewhat more understandable than STASH codes.

Another suggestion for mapping to variable names (and my preference) is simply human-readable forms as you suggested: air_temperature and air_temperature_2 etc. based on the standard_name, along with a simple way for users to add additional mappings (because we typically need to customise stashpacks). But I’m just one voice, and this output discussion is a big change that affects all users and workflows, so it would need a greater amount of consultation, testing and feedback (which I know you already called for). In the meantime I would suggest reverting to the previous format until we have broader community agreement (perhaps through the atmosphere WG?).

@mlipson could you copy a representative data file to /scratch/public that we could use as a common reference, then we can look at how the data might be accessed in a convenient manner, e.g. the cf-xarray package allows for data to be accessed by standard_name

Using cf-xarray is arguably a good idea anyway, as it means you can write much more model agnostic analyses, e.g.

Hi @Aidan, a high-level readable netCDF format is needed for end-users, not just an in-analysis solution. In my opinion, if files can’t be accessed quickly and easily in a human-readable way, e.g. inspected from the command line with ncdump or opened with ncview, then it is not a sufficient solution, with the risk being that people won’t use the model.


Hi Aidan,

I’ve copied representative output data from an unmodified ACCESS-rAM suite to /scratch/public/mjl561. This is a single day’s output for one type of file (umnsaa_pvera), both netcdf (current) and pp (previous) versions.

Doing this I also see the netCDF output is approximately twice the size of the pp, indicating the netCDF files could be optimised with compression/dtype selection.
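A rough illustration of the dtype part (synthetic data; the actual savings would depend on the suite’s quantisation and compression settings):

```python
import numpy as np
import xarray as xr

# The current output stores variables as float64; simply downcasting to
# float32 halves the size. Compression could also be tuned via
# to_netcdf(..., encoding={var: {"zlib": True, "complevel": 4}}) -- the
# settings here are illustrative, not the suite's actual configuration.
ds = xr.Dataset(
    {"STASH_m01s03i236_2": (("time", "y", "x"), np.zeros((4, 50, 50)))}
)
ds32 = ds.astype("float32")

print(ds["STASH_m01s03i236_2"].nbytes)    # 80000 bytes (float64)
print(ds32["STASH_m01s03i236_2"].nbytes)  # 40000 bytes (float32)
```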


Thanks for the feedback @bethanwhite. We’ll discuss this and let you know how we can address this in future releases.

In the meantime I’ll try to provide some suggestions about how the current output format can be used most effectively.


Thanks. I’ve made a copy.

The netCDF files are compressed, and the chunk sizes seem reasonable (from a compression point of view). They do have metadata about quantisation and precision; I suppose these are inherited from the pp files, as the variables are type double and don’t appear to have any packing applied.

        double STASH_m01s02i207_2(VT1HR_rad_diag, grid_latitude_t, grid_longitude_t) ;
                STASH_m01s02i207_2:long_name = "DOWNWARD LW RAD FLUX: SURFACE" ;
                STASH_m01s02i207_2:standard_name = "surface_downwelling_longwave_flux_in_air" ;
                STASH_m01s02i207_2:units = "W m-2" ;
                STASH_m01s02i207_2:coordinates = "longitude_t latitude_t" ;
                STASH_m01s02i207_2:cell_methods = "VT1HR_rad_diag: point" ;
                STASH_m01s02i207_2:grid_mapping = "rotated_latitude_longitude" ;
                STASH_m01s02i207_2:_FillValue = -1073741824. ;
                STASH_m01s02i207_2:um_version = "13.0" ;
                STASH_m01s02i207_2:um_stash_source = "m01s02i207" ;
                STASH_m01s02i207_2:packing_method = "quantization" ;
                STASH_m01s02i207_2:precision_measure = "binary" ;
                STASH_m01s02i207_2:precision_value = -6 ;
                STASH_m01s02i207_2:_Storage = "chunked" ;
                STASH_m01s02i207_2:_ChunkSizes = 1, 450, 450 ;
                STASH_m01s02i207_2:_DeflateLevel = 1 ;
                STASH_m01s02i207_2:_Endianness = "little" ;

Is this what you were referring to?

Hi rAM3 team.

My rAM3 development suite has a mismatch between the Open MPI versions used to compile and run the UM executable. The suite threw a random MPI error last week, along the lines of

(error message)
[gadi-cpu-clx-1971:2726498:0:2726498] ib_mlx5_log.c:177 Remote OP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
[gadi-cpu-clx-1971:2726498:0:2726498] ib_mlx5_log.c:177 DCI QP 0x1c476 wqe[254]: RDMA_READ s-- [rqpn 0x1b85c rlid 3656] [rva 0x145a8ca221c0 rkey 0x9b864] [va 0x1519ee3dd000 len 409568 lkey 0x7692f]
[gadi-cpu-clx-1975:1818122:0:1818122] ib_mlx5_log.c:177 Remote OP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
[gadi-cpu-clx-1975:1818122:0:1818122] ib_mlx5_log.c:177 DCI QP 0xa343 wqe[493]: RDMA_READ s-- [rqpn 0x50c3 rlid 1798] [rva 0x1533f2e30900 rkey 0x11ad7f] [va 0x152529902000 len 530464 lkey 0xfcdac]
==== backtrace (tid:1818122) ====
0 0x0000000000025da7 uct_ib_mlx5_completion_with_err() ???:0
1 0x00000000000555ea uct_dc_mlx5_ep_handle_failure() ???:0
2 0x0000000000027a85 uct_ib_mlx5_check_completion() ???:0
3 0x0000000000058157 uct_dc_mlx5_iface_progress_ll() :0
4 0x000000000004f7da ucp_worker_progress() ???:0
5 0x00000000000aadcd mca_pml_ucx_send_nbr() /jobfs/78105093.gadi-pbs/0/openmpi/4.1.5/source/openmpi-4.1.5/ompi/mca/pml/ucx/pml_ucx.c:928
6 0x00000000000aadcd mca_pml_ucx_send_nbr() /jobfs/78105093.gadi-pbs/0/openmpi/4.1.5/source/openmpi-4.1.5/ompi/mca/pml/ucx/pml_ucx.c:928
7 0x00000000000aadcd mca_pml_ucx_send() /jobfs/78105093.gadi-pbs/0/openmpi/4.1.5/source/openmpi-4.1.5/ompi/mca/pml/ucx/pml_ucx.c:949
8 0x0000000000206783 PMPI_Send() /jobfs/78105093.gadi-pbs/0/openmpi/4.1.5/build/gcc/ompi/psend.c:81
9 0x0000000000051a38 ompi_send_f() /jobfs/78105093.gadi-pbs/0/openmpi/4.1.5/build/intel/ompi/mpi/fortran/mpif-h/profile/psend_f.c:78
10 0x000000000338acc6 mpl_send_() /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-gcom-7.8-fqtyhulqaxiec6xgueujomciqzivyv3d/spack-src/preprocess/src/gcom/mpl/mpl_send.F90:63
11 0x000000000072acfd multiple_variables_halo_exchange_mp_swap_bounds_mv_() /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um-13.0-6bjk4pka3bpysrmvwh2obd2gezhejfvw/spack-src/../spack-build/preprocess-atmos/src/um/src/control/mpp/multiple_variables_halo_exchange.F90:599
12 0x00000000013732e2 eg_diff_ctl_mod_mp_eg_diff_ctl_() /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um-13.0-6bjk4pka3bpysrmvwh2obd2gezhejfvw/spack-src/../spack-build/preprocess-atmos/src/um/src/atmosphere/diffusion_and_filtering/eg_diff_ctl.F90:367
13 0x0000000000c56a2b atm_step_4a_mod_mp_atm_step_4a_() /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um-13.0-6bjk4pka3bpysrmvwh2obd2gezhejfvw/spack-src/../spack-build/preprocess-atmos/src/um/src/control/top_level/atm_step_4A.F90:2995
14 0x00000000004f0894 u_model_4a_mod_mp_u_model_4a_() /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um-13.0-6bjk4pka3bpysrmvwh2obd2gezhejfvw/spack-src/../spack-build/preprocess-atmos/src/um/src/control/top_level/u_model_4A.F90:386
15 0x000000000040ca38 um_shell_mod_mp_um_shell_() /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um-13.0-6bjk4pka3bpysrmvwh2obd2gezhejfvw/spack-src/../spack-build/preprocess-atmos/src/um/src/control/top_level/um_shell.F90:748
16 0x00000000004093f8 MAIN__() /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um-13.0-6bjk4pka3bpysrmvwh2obd2gezhejfvw/spack-src/../spack-build/preprocess-atmos/src/um/src/control/top_level/um_main.F90:60
17 0x00000000004093a2 main() ???:0
18 0x000000000003a7e5 __libc_start_main() ???:0
19 0x00000000004092ae _start() ???:0
=================================

NCI provided the following feedback.

In the /home/548/pag548/log.20250703T070947Z/job/20220227T0000Z/Flagship_ERA5to1km_12km_GAL9_um_fcst_003/NN/job.out file, I can see the following:
… [INFO] mpirun -n 960 --map-by node:PE=2 --rank-by core um-atmos.exe [1751533088.386940] [gadi-cpu-clx-2123:2724181:0] ucc_context.c:399 UCC ERROR failed to create tl context for nccl …

When one version of Open MPI is used for creating the binary and another version of Open MPI is used for executing the binary, this type of error may happen. We have seen this type of error before.

We can see at /home/548/pag548/log.20250703T070947Z/job/20220227T0000Z/Flagship_ERA5to1km_12km_GAL9_um_fcst_003/NN/job.err that you are using openmpi/4.1.4 for executing the binary.

ldd /g/data/vk83/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v4/intel-19.0.3.199/um-13.0-6bjk4pka3bpysrmvwh2obd2gezhejfvw/build-atmos/bin/um-atmos.exe | grep mpi command shows that you have created the binary with openmpi/4.1.5.

I’m not sure if this mismatch between the Open MPI versions is present in the official release (I’m using a branch that doesn’t use netCDF outputs), but if it is, it might be worth investigating for a future release.


Thanks for reporting this @Paul.Gregory. We’ll look into it and report back.


Hi @Paul.Gregory . I would like to try to reproduce your problem. Which suite are you using? Will I need to join any projects to be able to read input files (e.g. gb02)?

Hi @paulleopardi

The suite is u-dq126. @cbengel created the branch ‘ff_output’ for me. This is the closest revision that generated the error: https://code.metoffice.gov.uk/trac/roses-u/browser/d/q/1/2/6/ff_output?rev=325393

You will only need gb02 membership to access the ROMIO_HINTS file
/g/data/gb02/public/io_hints.txt which contains

### Striping to match input file striping for next run
striping_factor 8
striping_unit 5242880
cb_nodes 8
cb_buffer_size 8388608

I can grant you gb02 membership, or you can edit the site/nci-gadi/suite-adds.rc and put the io_hints.txt file in another directory.

I have copied the log files of the jobs in question to my home directory at ~pag548/log.20250703T070947Z. I can copy them to a common location (/scratch/vk83?) if you want to read them.

Note that revision 325393 has lower values of NPROCY/X for the 12 km and 5 km domains than when the error was generated, but the 1 km domain values (where the error occurred) are the same.

Here are the contents of the NCI help request on July 3.

Hello NCI

I’ve generated an interesting error while running a Unified Model atmospheric forecast.

The suite is u-dq126 installed at ~pag548/roses/u-dq126.

This is a variant of the ACCESS rAM3 suite, a regional atmospheric model. This variant runs in three domains:

Resolution  Model   Vert. levels  Size       NPROCY  NPROCX
12 km       GAL9    L70_80km      580x780    32      30
5 km        RAL3P2  L90_40km      976x1446   54      48
1 km        RAL3P2  L90_40km      2112x2000  54      48

I recently added the following MPI-I/O environment variables to the jobs which run the 5 km and 1 km domains (see ~pag548/roses/u-dq126/site/nci-gadi/suite-adds.rc)

ROMIO_HINTS = /g/data/gb02/public/io_hints.txt
OMPI_MCA_io = romio321

where io_hints.txt contains:

### Striping to match input file striping for next run
striping_factor 8
striping_unit 5242880
cb_nodes 8
cb_buffer_size 8388608

Four forecast tasks in the 5 km domain ran successfully using these MPI-I/O variables, providing some reduction in run time and SU usage.
One forecast task ran successfully in the 1 km domain, but the second task failed within 41 seconds, producing the error below. I’ve re-submitted the job a few times, and it has failed each time with the same error.

I assume the error is related to the MPI-I/O settings. I’m currently reading the Unified Model documentation on I/O tuning, as well as various older Cray presentations and journal articles discussing Unified Model I/O optimisation.

If NCI have any suggestions on how to interpret this error, it would be greatly appreciated.

(error message)
[gadi-cpu-clx-1971:2726498:0:2726498] ib_mlx5_log.c:177 Remote OP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
[gadi-cpu-clx-1971:2726498:0:2726498] ib_mlx5_log.c:177 DCI QP 0x1c476 wqe[254]: RDMA_READ s-- [rqpn 0x1b85c rlid 3656] [rva 0x145a8ca221c0 rkey 0x9b864] [va 0x1519ee3dd000 len 409568 lkey 0x7692f]
[gadi-cpu-clx-1975:1818122:0:1818122] ib_mlx5_log.c:177 Remote OP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
[gadi-cpu-clx-1975:1818122:0:1818122] ib_mlx5_log.c:177 DCI QP 0xa343 wqe[493]: RDMA_READ s-- [rqpn 0x50c3 rlid 1798] [rva 0x1533f2e30900 rkey 0x11ad7f] [va 0x152529902000 len 530464 lkey 0xfcdac]
==== backtrace (tid:1818122) ====
0 0x0000000000025da7 uct_ib_mlx5_completion_with_err() ???:0
1 0x00000000000555ea uct_dc_mlx5_ep_handle_failure() ???:0
2 0x0000000000027a85 uct_ib_mlx5_check_completion() ???:0
3 0x0000000000058157 uct_dc_mlx5_iface_progress_ll() :0
4 0x000000000004f7da ucp_worker_progress() ???:0

etc.

I have archived the full logs for these failures.

Ben Menadue replied:

Hi Paul,

Just to add, failure messages like this are usually just artefacts of something else: if another MPI process in the job is killed mid-communication, then this completion error is exactly what you would expect (i.e. you’re trying to read the memory of a remote process, but that process and its memory space have gone away).

So having the job ID and the path to the output files for the job (e.g. the .o and .e files, unless you’ve redirected the stderr of your program elsewhere) would allow us to investigate this possibility.

I followed up with

Thanks for the quick follow up.

You can find all the rose task log and job files at:

/home/548/pag548/cylc-run/u-dq126/log.20250703T070947Z/job/20220227T0000Z/Flagship_ERA5to1km_1km_RAL3P2_um_fcst_001

The first time the job ran, the UM iterated from timestep 360 to 477 with no dramas and then suddenly failed. (see ./01/job.out)

The task failed at 04:56. The corresponding stderr file is ./01/job.err

The files in subdirectories 02-06 correspond to the automatic and manual resubmission attempts.

They all failed almost instantly, with the same error message.

Now funnily enough, I ran the suite again today with the same configuration and it worked!

See the corresponding files in /home/548/pag548/cylc-run/u-dq126/log.20250704T011034Z

NCI then replied, highlighting the mismatch between the Open MPI versions.

Hi Paul,

In the first instance switching to

module load openmpi/4.1.5

in the nci-gadi/suite-adds.rc UM_ENV section should get over that error with a mismatch between the openMPI version it is built and run with.

This would be inconsistent with these other module loads:

module load gcom/7.8_ompi.4.1.4
module load drhook/1.1_ompi.4.1.4

but we don’t believe these are used in the spack-built executable.

Can you give that a crack and see if it fixes your issue?