The just-released 25km global ACCESS-OM3 configuration runs on normalsr and has the magic incantation
Currently you have to manually specify those platform numbers.
My understanding is that it runs on the normal queue too (the hardware is quite similar), but won't run on Broadwell.
@minghangli can confirm!
That said, there is no reason not to use Sapphire Rapids. It's faster and the cost is the same.
So this must be run on the normalsr queue, correct?
Not necessarily - it can be run on both normalsr (SR) and normal (CL) queues, but it runs faster on SR. We're working on optimising the compiler flags and plan to dedicate it to SR long-term for maximum speedup. Since SR is newer, faster, and priced the same as CL, it makes the most sense to use SR moving forward.
Hi Angus.
The quest for a debuggable executable goes back to these tests on a small test domain.
I could run a small test case at two resolutions and multiple decompositions, but when I ran with a 10x10 decomposition at 0.05 degree resolution, it generated this error.
FATAL from PE 86: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) = 306 1 0
FATAL from PE 82: ERROR: ESMF mesh and MOM6 domain masks are inconsistent! - MOM n, maskMesh(n), mask(n) = 81 1 0
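To unpack what that message is reporting: at each flattened grid point n, the mask read from the ESMF mesh file and the land mask MOM6 builds from its own domain must agree, and here a point is ocean (1) in the mesh but land (0) in MOM6. A tiny Python sketch of that kind of check (the function name and data are invented for illustration; the real check lives in the MOM6 cap's Fortran):

```python
# Hypothetical sketch of the consistency check behind the FATAL message:
# compare the ESMF mesh mask with the MOM6 domain mask at every flattened
# grid point n. Indices here are 0-based, unlike the Fortran output.

def find_mask_mismatches(mask_mesh, mask_mom):
    """Return (n, maskMesh(n), mask(n)) wherever the two masks disagree."""
    return [(n, m, o) for n, (m, o) in enumerate(zip(mask_mesh, mask_mom)) if m != o]

# Point 3 is ocean (1) in the mesh but land (0) in MOM6 -- the same shape
# of inconsistency as the "= 306 1 0" report above.
mask_mesh = [1, 1, 0, 1, 0]
mask_mom  = [1, 1, 0, 0, 0]
print(find_mask_mismatches(mask_mesh, mask_mom))  # → [(3, 1, 0)]
```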
I did some tests on the masks here:
Iāve just run some tests to ensure the bathymetry exists for every grid point outside of the land (i.e. no NaNs in the ocean).
import xarray as xr

# input_dir, ocean_mask and ESMF_mask come from earlier setup (not shown)
bathymetry = xr.open_dataset(f'{input_dir}/bathymetry.nc')
bathymetry_grid = xr.DataArray(bathymetry.depth.values.reshape(249, 140),
                               dims=['ny', 'nx'],
                               coords={'ny': ocean_mask.ny,
                                       'nx': ocean_mask.nx})
(bathymetry_grid.isnull() & ESMF_mask.astype(bool)).sum()
The sum of this quantity is zero. The bathymetry is consistent with the mask.
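The same check can be exercised end-to-end with purely synthetic arrays (numpy only; the shapes and values here are invented, standing in for bathymetry.nc and the ESMF mask):

```python
import numpy as np

# Synthetic stand-ins for the real bathymetry and ocean mask.
ny, nx = 5, 4
bathy = np.full((ny, nx), 100.0)       # finite depth everywhere
ocean = np.ones((ny, nx), dtype=bool)  # start with all-ocean
ocean[0, :] = False                    # one row of land
bathy[0, :] = np.nan                   # NaN over land is acceptable

# The consistency check: no NaN bathymetry at any ocean point.
print(int((np.isnan(bathy) & ocean).sum()))  # → 0

# Corrupt a single ocean point and the sum becomes non-zero.
bathy[2, 1] = np.nan
print(int((np.isnan(bathy) & ocean).sum()))  # → 1
```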
If I set
bathymetry_grid[100, 101] = np.nan
the sum is non-zero. Anyway, I still generate that inconsistent-mask error using an older executable. But if I run the same configuration with the debug executable, I generate the error stated above, i.e.
forrtl: error (65): floating invalid
Image PC Routine Line Source
libpthread-2.28.s 0000149D19D07990 Unknown Unknown Unknown
access-om3-MOM6-C 00000000010E58EF limit_topography 422 MOM_shared_initialization.F90
So the plan was to create an OM3 executable with debug flags to try to understand the interaction between MOM6 and NUOPC better.
If this isn't possible, we may have to revert to good ol' print statements and ASCII output files.
Anyway, I tried to load the debug executable access-om3/pr120-19 access-om3-MOM6-CICE6 into TotalView at NCI.
I also tried to load a core dump. TotalView exits in both cases. It's unable to read all the symbols. Some comments from NCI:
It looks like it has been built against your own copies of a number of system libraries, including things like zlib, bzip2, zstd, which are likely innocuous here, but I also see HDF5, the Intel runtime libraries, the GCC runtime libraries, and the C++ standard library. The latter two are the ones that are most concerning to me, as those are not always ABI-compatible between versions. Specifically, I can see that you're linking the MPI C++ interface library from our installation, which was compiled against the system version of those libraries, so may not work with yours.
Is it possible to build that binary outside of your Spack system? That will make it easier to see what's going on and hopefully ensure that you're using compatible libraries. Also, if you're not actually using the C++ interface to MPI (and you shouldn't be, since it was removed from the standard over a decade ago), consider linking with mpicc instead of mpicxx so it doesn't link that interface library in.
That said, I don't think any of that would explain the error you're seeing. Unfortunately it doesn't really tell us much about what went wrong. Is TotalView logging anything else anywhere, e.g. to stdout or stderr?
Is there a chance that you can modify your Spack setup to use more of the system libraries? You can do that by loading the modules you want to consume and doing
spack external find --all
then before building the model you can run spack spec … to see that it is actually using the system libraries that were picked up. That will make it easier to debug what the issue might be.
A further question, can you test TotalView on a simpler MPI program? Just a hello world like:
program mpi_hello_world
use mpi_f08
use, intrinsic :: iso_fortran_env, only: int32
implicit none
integer(int32) :: rank, size, ierr
call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
print *, 'Hello from rank', rank, 'of', size
call MPI_Finalize(ierr)
end program mpi_hello_world
Then
mpifort hello.f90
mpirun -np 4 --map-by ppr:2:node ./a.out
If this works well then the problem might be because of the extra spack dependencies built for the model that are colliding with system ones.
I'm off to test TotalView with that simple program. I'll also test DDT on Gadi.
fwiw I was saying that you can't have a NaN anywhere in the domain, which is why it's easier to set a non-NaN _FillValue attribute…
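A minimal sketch of that approach, assuming a sentinel of -1.0e20 (the value is my choice, not from this thread): replace NaNs with a finite fill value before the data is written, so a debug build compiled with floating-point traps never touches a NaN.

```python
import numpy as np

FILL = -1.0e20  # assumed sentinel; any finite value outside the depth range works

# Toy bathymetry with NaN over a land point.
bathy = np.array([[np.nan,  50.0],
                  [100.0, 200.0]])

# Swap NaN for the finite fill value before writing the file
# (with xarray/netCDF you would also record FILL as the _FillValue).
bathy_filled = np.where(np.isnan(bathy), FILL, bathy)
print(bool(np.isnan(bathy_filled).any()))  # → False
```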
Ah. That sounds like an important point.
Some further comments from NCI. The debug version of OM3 is built using the Nvidia CUDA libraries. It's possible TotalView is struggling to parse the GPU symbols.
I'm going to try the DDT debugger on that executable to see if it can load it.
EDIT: DDT can load and run the executable (and reproduce the error), but it can only load symbols (i.e. view source code) for the hdf5, netcdf4 and parallelio source.
So I'm going to take a branch and follow some of NCI's suggestions to have a fully working version in a debugger.