Setting build options for OM3

The just-released 25 km global ACCESS-OM3 configuration runs on normalsr and has the magic incantation

Currently you have to manually specify those platform numbers.


My understanding is that it runs on the normal queue too (the hardware is quite similar), but won’t run on Broadwell.

@minghangli can confirm!

That said, there is no reason not to use Sapphire Rapids: it’s faster and the cost is the same.


So this must be run on the normalsr queue, correct?

Not necessarily - it can be run on both normalsr (SR) and normal (CL) queues, but it runs faster on SR. We’re working on optimising the compiler flags and plan to dedicate it to SR long-term for maximum speedup. Since SR is newer, faster, and priced the same as CL, it makes the most sense to use SR moving forward.


Hi Angus.

The quest for a debuggable executable goes back to these tests on a small test domain.

I could run a small test case at two resolutions and multiple decompositions, but when I ran with a 10x10 decomposition at 0.05 degree resolution, it generated this error.

FATAL from PE    86: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) =      306         1         0

FATAL from PE    82: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) =       81         1         0
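For reference, the message means that at mesh element n the ESMF mesh mask is 1 (ocean) while the MOM6 domain mask is 0 (land). The same cross-check can be sketched in Python (the arrays and names here are hypothetical, not the model's actual data structures):

```python
import numpy as np

# Hypothetical flattened masks: 1 = ocean, 0 = land
mask_mesh = np.array([1, 1, 0, 1, 1])   # from the ESMF mesh file
mask_mom  = np.array([1, 1, 0, 0, 1])   # from the MOM6 domain

# Report every element where the two masks disagree, as the mediator
# does before aborting with the FATAL message above.
bad = np.flatnonzero(mask_mesh != mask_mom)
for n in bad:
    print(f"inconsistent at n={n}: maskMesh={mask_mesh[n]}, mask={mask_mom[n]}")
```

Here element 3 would be reported, since the mesh calls it ocean and the domain calls it land.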

I did some tests on the masks here:

I’ve just run some tests to ensure the bathymetry exists for every grid point outside of the land (i.e. no NaNs in the ocean).

import xarray as xr

# ocean_mask, ESMF_mask and input_dir are defined earlier
bathymetry = xr.open_dataset(f'{input_dir}/bathymetry.nc')
bathymetry_grid = xr.DataArray(
    bathymetry.depth.values.reshape(249, 140),
    dims=['ny', 'nx'],
    coords={'ny': ocean_mask.ny,
            'nx': ocean_mask.nx},
)

# Count ocean points (per the ESMF mask) with NaN bathymetry
(bathymetry_grid.isnull() & ESMF_mask.astype(bool)).sum()

The sum of this quantity is zero. The bathymetry is consistent with the mask.


If I set bathymetry_grid[100,101] = np.nan, the sum is non-zero.
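The same logic can be verified on a self-contained toy grid (the shapes and values here are made up, not the real 0.05° domain):

```python
import numpy as np
import xarray as xr

# Toy 4x5 grid: bathymetry with no NaNs, and an ESMF mask that is
# ocean (1) everywhere except one land point.
depth = np.full((4, 5), 100.0)
esmf_mask = np.ones((4, 5), dtype=int)
esmf_mask[0, 0] = 0  # land

bathy = xr.DataArray(depth, dims=['ny', 'nx'])
mask = xr.DataArray(esmf_mask, dims=['ny', 'nx'])

# Consistent case: no NaN bathymetry inside the ocean mask
assert int((bathy.isnull() & mask.astype(bool)).sum()) == 0

# Inject a NaN at an ocean point: the sum becomes non-zero
bathy[2, 3] = np.nan
assert int((bathy.isnull() & mask.astype(bool)).sum()) == 1
```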

Anyway, I still generate that inconsistent mask error using an older executable. But if I run the same configuration with the debug executable, I generate the error stated above, i.e.

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source             
libpthread-2.28.s  0000149D19D07990  Unknown               Unknown  Unknown
access-om3-MOM6-C  00000000010E58EF  limit_topography          422  MOM_shared_initialization.F90
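The forrtl error 65 is a floating-point exception trap: the debug build aborts on the first invalid floating-point operation (e.g. the square root of a negative value, or 0/0), which an optimised build would silently turn into a NaN and propagate. As a conceptual sketch only (limit_topography itself is Fortran), the two behaviours can be mimicked with numpy's error-state handling:

```python
import numpy as np

# Optimised-build behaviour: the invalid operation quietly yields NaN
# and execution continues, so the error surfaces much later (if at all).
with np.errstate(invalid='ignore'):
    val = np.sqrt(np.float64(-1.0))
assert np.isnan(val)

# Debug-style trapping: the same operation aborts immediately,
# pointing at the exact offending line, like the traceback above.
trapped = False
with np.errstate(invalid='raise'):
    try:
        np.sqrt(np.float64(-1.0))
    except FloatingPointError:
        trapped = True
assert trapped
```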

So the plan was to create an OM3 executable with debug flags to try to understand the interaction between MOM6 and NUOPC better.

If this isn’t possible, we may have to revert to good ol’ print statements and ASCII output files.

Anyway, I tried to load the debug executable access-om3/pr120-19 access-om3-MOM6-CICE6 into TotalView at NCI.

I also tried to load a core dump. TotalView exits in both cases; it’s unable to read all the symbols. Some comments from NCI:

It looks like it has been built against your own copies of a number of system libraries – including things like zlib, bzip2, zstd, which are likely innocuous here, but I also see HDF5, the Intel runtime libraries, the GCC runtime libraries, and the C++ standard library. The latter two are the ones that are most concerning to me as those are not always ABI-compatible between versions. Specifically, I can see that you’re linking the MPI C++ interface library from our installation, which was compiled against the system version of those libraries, so may not work with yours.

Is it possible to build that binary outside of your Spack system? That will make it easier to see what’s going on and hopefully ensure that you’re using compatible libraries. Also, if you’re not actually using the C++ interface to MPI (and you shouldn’t be, since it was removed from the standard over a decade ago), consider linking with mpicc instead of mpicxx so it doesn’t link that interface library in.

That said, I don’t think any of that would explain the error you’re seeing. Unfortunately it doesn’t really tell us much about what went wrong. Is TotalView logging anything else anywhere, e.g. to stdout or stderr?

Is there a chance that you can modify your Spack setup to use more of the system libraries? You can do that by loading the modules you want to consume and running

spack external find --all

Then, before building the model, you can run spack spec … to check that it is actually using the system libraries that were picked up. That will make it easier to debug what the issue might be.

A further question: can you test TotalView on a simpler MPI program? Just a hello world like:

program mpi_hello_world
  use mpi_f08
  use, intrinsic :: iso_fortran_env, only: int32
  implicit none

  integer(int32) :: rank, size, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)

  print *, 'Hello from rank', rank, 'of', size

  call MPI_Finalize(ierr)
end program mpi_hello_world


Then

mpifort hello.f90
mpirun -np 4 --map-by ppr:2:node ./a.out

If this works, then the problem is likely the extra Spack-built dependencies for the model colliding with the system ones.

I’m off to test TotalView with that simple program. I’ll also test DDT on Gadi.

fwiw I was saying that you can’t have a NaN anywhere in the domain, which is why it’s easier to set a non-NaN _FillValue attribute…

Ah. That sounds like an important point.
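One way to follow that advice when writing the bathymetry file with xarray is to replace NaNs with a sentinel before writing and record that sentinel as the variable's _FillValue (the -9999.0 value and variable names here are illustrative, not the configuration's actual conventions):

```python
import numpy as np
import xarray as xr

FILL = -9999.0  # illustrative sentinel; any non-NaN value works

depth = xr.DataArray([[100.0, np.nan], [50.0, 25.0]],
                     dims=['ny', 'nx'], name='depth')

# Replace NaNs so no NaN appears anywhere in the written domain
depth_filled = depth.fillna(FILL)
assert not bool(depth_filled.isnull().any())

# Record the sentinel as _FillValue when writing the file, e.g.:
# depth_filled.to_dataset().to_netcdf('bathymetry.nc',
#                                     encoding={'depth': {'_FillValue': FILL}})
```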

Some further comments from NCI: the debug version of OM3 is built against the Nvidia CUDA libraries, and it’s possible TotalView is struggling to parse the GPU symbols.

I’m going to try the DDT debugger on that executable to see if it can load it.

EDIT: DDT can load and run the executable (and reproduce the error), but it can only load symbols (i.e. view source code) for the hdf5, netcdf4 and parallelio sources.

So I’m going to make a branch and follow some of NCI’s suggestions to get a fully working version in a debugger.