Installing ACCESS-OM2 on NeSI (New Zealand supercomputer)

Thanks Aidan. I am trying one more time then… FYI the edit to the spack.yaml is this:

   netcdf-c:
      require:
      - '@4.9.2~blosc'
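For anyone copying this, a sketch of where that fragment sits in the file — the surrounding keys here are the standard spack.yaml layout, not copied from my actual build:

```yaml
spack:
  packages:
    netcdf-c:
      require:
      - '@4.9.2~blosc'
```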

For what it’s worth, one thing I’ve found useful when running the spack install is to do this:

spack -d -v install > install_log.txt 2> error_log.txt

Because that gives me some information about what is going on with the cmake and compiler steps. It’s all a bit too hidden otherwise.
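On that note, a small helper to pull the first real failure out of those logs — just a sketch, assuming the log file names from the command above (the function name is my own):

```shell
# first_error LOGFILE
# Print the first few lines that look like a compiler/cmake failure,
# with line numbers, so you can jump straight to the broken step.
first_error() {
    grep -n -m 5 -iE 'error:|fatal' "$1"
}

# Usage, assuming the redirections from the spack command above:
#   first_error error_log.txt
```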


Update on this. Unfortunately my attempted spack build crashed again later. Something went wrong in the access-generic-tracers part of it, and like I said, I’m finding it too difficult to figure out what’s going wrong with my Spack attempts, so I’m leaving that for now.

I have managed to build ACCESS-OM2 on NeSI using the old COSIMA setup, so that’s a relief! I’m cautious about declaring victory too soon, but the model has now run for ~3 months without crashing.


Ok, the model seems to run fine for one year, and even outputs diagnostic and restart files. At the end of the run, however, CICE crashes with some kind of MPI problem:

MPI traceback error handler called by rank:           65
Image              PC                Routine            Line        Source
cice_auscom_360x3  000000000097A917  Unknown               Unknown  Unknown
cice_auscom_360x3  000000000081FEB2  Unknown               Unknown  Unknown
libmpi.so.12.0.0   0000152877DF4E9B  MPIR_Err_return_c     Unknown  Unknown
libmpi.so.12.0.0   000015287826B7EA  PMPI_Send             Unknown  Unknown
libmpifort.so.12.  00001528773720E3  PMPI_SEND             Unknown  Unknown
cice_auscom_360x3  0000000000792A1B  accessom2_mod_mp_         833  accessom2.F90
cice_auscom_360x3  0000000000411BFB  cice_finalmod_mp_          85  CICE_FinalMod.f90
cice_auscom_360x3  000000000041189E  MAIN__                     76  CICE.f90
cice_auscom_360x3  0000000000411822  Unknown               Unknown  Unknown
libc.so.6          0000152876E29590  Unknown               Unknown  Unknown
libc.so.6          0000152876E29640  __libc_start_main     Unknown  Unknown
cice_auscom_360x3  0000000000411725  Unknown               Unknown  Unknown

I will fight that battle another day…


Hi David,

How does the submission line from payu look?

I have a modified version of payu for SLURM job submission that works on Leonardo.

Natalia


It took me a while to find out that in my case (I don’t know why) the MPI ranks were not communicating properly over InfiniBand until these system-wide variables were set before job submission:

export OMPI_MCA_btl_tcp_if_include=ib0
export MCA_IO=ompio
export MCA_IO_OMPIO_NUM_AGGREGATORS=1
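For anyone wanting to replicate this, a sketch of one place those exports could sit — a SLURM batch script preamble. The #SBATCH values here are placeholders, not Leonardo’s real settings:

```shell
#!/bin/bash
#SBATCH --job-name=access-om2     # placeholder job settings
#SBATCH --nodes=2
#SBATCH --time=03:00:00

# Route OpenMPI's TCP BTL over the InfiniBand interface, and select
# the OMPIO MPI-IO component with a single aggregator.
export OMPI_MCA_btl_tcp_if_include=ib0
export MCA_IO=ompio
export MCA_IO_OMPIO_NUM_AGGREGATORS=1

# ... then launch the model (srun / mpirun / payu) as usual.
```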

Hi Natalia!
Thanks for your suggestion. The fun part here is that I’m not using OpenMPI, I’m using Intel MPI, so the OMPI flags are not directly applicable here. I don’t know for sure yet whether it’s an mpirun problem or something I did wrong with the compilation. I will probably try re-running my compile of CICE5 just to make sure I didn’t accidentally mess something up. For what it’s worth, my actual mpirun submission from within payu looks like this:

mpirun  -wdir /home/david.hutchin6926/00_nesi_projects/vuw04597_nobackup/david.hutchin6926/access-om2/work/1deg_COS_1m-c9062aa4/atmosphere -np 1  /home/david.hutchin6926/00_nesi_projects/vuw04597_nobackup/david.hutchin6926/access-om2/work/1deg_COS_1m-c9062aa4/atmosphere/yatm.exe : -wdir /home/david.hutchin6926/00_nesi_projects/vuw04597_nobackup/david.hutchin6926/access-om2/work/1deg_COS_1m-c9062aa4/ocean -np 64  /home/david.hutchin6926/00_nesi_projects/vuw04597_nobackup/david.hutchin6926/access-om2/work/1deg_COS_1m-c9062aa4/ocean/fms_ACCESS-OM.x : -wdir /home/david.hutchin6926/00_nesi_projects/vuw04597_nobackup/david.hutchin6926/access-om2/work/1deg_COS_1m-c9062aa4/ice -np 24  /home/david.hutchin6926/00_nesi_projects/vuw04597_nobackup/david.hutchin6926/access-om2/work/1deg_COS_1m-c9062aa4/ice/cice_auscom_360x300_24p.exe

I.e. no special flags at all. I’ll probably have to revisit this next week. I will also be happy to share my modifications to payu (next week)… as yet I haven’t created a new GitHub fork for it.

This line looks quite innocuous:

It’s trying to send the CICE model time to matm to confirm that the time is the same across the model components.

It might be helpful to add a call to MPI_Error_string and write the result to stderr,

something like



integer :: ierr, resultlen
character(len=MPI_MAX_ERROR_STRING) :: errstr

...

    else
        if (self%my_local_pe == 0) then
            buf(1) = checksum
            ! Return error codes instead of aborting, so we can report them.
            call MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN, ierr)
            call MPI_Send(buf, 1, MPI_INTEGER, self%atm_ic_root, tag, &
                          MPI_COMM_WORLD, ierr)
            if (ierr /= MPI_SUCCESS) then
                call MPI_Error_string(ierr, errstr, resultlen)
                write(stderr, *) 'MPI error: ', trim(errstr)
            endif
        endif
    endif


But it might also be crashing before returning the error code.


We have an unfinished project to add HPCpy to payu, one of the drivers of which is SLURM support and better cross-platform support in general. The plan is for that to be revived in the next month or so.

We’d welcome testers once we have something working, if you’re interested.


Hi @dkhutch !

Do you allocate resources before launching mpirun within payu?

I would try using srun, so that the scheduler distributes the processes properly. I might be wrong, but could it be that mpirun launches the whole thing on the login node? Is it 126 cores per node on NeSI? Hardware - Support Documentation
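One quick way to check whether ranks are actually landing on the compute nodes rather than the login node — a sketch, assuming you are inside an sbatch/salloc allocation:

```shell
# Print which host each task runs on, then count tasks per host.
# If every task reports the login node's hostname, placement is wrong.
# (srun only exists inside a SLURM environment, hence the guard.)
if command -v srun >/dev/null; then
    srun --ntasks=4 hostname | sort | uniq -c
fi
```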

On Leonardo I have 112 cores per node and my submission line looks as (truncated):

srun -A ICT25_MHPC --time=00:01:00 --partition=dcgp_usr_prod --chdir ./ntilinin/access-om2/work/1deg_jra55_ryf-expt-b559b2a8/atmosphere --nodes=1 --ntasks=1 --ntasks-per-node=1 /leonardo_scratch/large/userexternal/ntilinin/access-om2/1deg_jra55_ryf/ntilinin/access-om2/work/1deg_jra55_ryf-expt-b559b2a8/atmosphere/yatm.exe : --partition=dcgp_usr_prod --chdir ./ntilinin/access-om2/work/1deg_jra55_ryf-expt-b559b2a8/ocean --nodes=8 --ntasks=256 --ntasks-per-node=32 /leonardo_scratch/large/userexternal/ntilinin/access-om2/1deg_jra55_ryf/ntilinin/access-om2/work/1deg_jra55_ryf-expt-b559b2a8/ocean/fms_ACCESS-OM.x : --partition=dcgp_usr_prod --chdir ./ntilinin/access-om2/work/1deg_jra55_ryf-expt-b559b2a8/ice --nodes=2 --ntasks=64 --ntasks-per-node=32 /leonardo_scratch/large/userexternal/ntilinin/access-om2/1deg_jra55_ryf/ntilinin/access-om2/work/1deg_jra55_ryf-expt-b559b2a8/ice/cice_auscom_360x300_24x1_24p.exe etc

While in config.yaml:

scheduler: slurm
#queue: do not set for slurm
walltime: 03:00:00
jobname: 1deg_jra55_ryf_bench_cice5
mem: 200G
account: ICT25_MHPC
partition: dcgp_usr_prod
#project: do not set for slurm, creates separate folder
ncpus: 672
nnodes: 6

Hi Natalia,
I’m submitting the job through sbatch, and I can see it’s making it to the compute nodes OK. It then runs the payu-run script from the compute nodes, which calls either mpirun or mpiexec (I’ve tried both… they seem to do the same thing).
But, I am going to follow up with NeSI support just to make sure I’m not missing some important feature of the job submission.

It’s curious that your method uses srun instead of mpirun. I don’t know if that would work on mine, as it needs to find the correct mpirun wrapper for my compiler, but I could check.

To update you on my payu hacks, here is my forked version of payu/1.1 that I am using on NeSI:

Currently it’s pretty messy. The reason I chose v1.1 is that it was the last version I could find that still had the bin folder with the payu-run and payu shell commands. After that, I think it goes into singularity to run the shell commands instead… which I couldn’t make sense of, so I skipped it.

Ok… so I couldn’t figure out how to fix my deprecated build that was crashing at the end of CICE5.

I did however go back and try again with the spack installation based on Harshula’s instructions:

And I got it to run successfully with no errors! I think I must have stuffed up the mpirun call earlier; the PMI and SLURM issues are resolved by calling the correct path for mpirun. So I may have just run around in a giant circle with my alternative build, but never mind.

So, thank you to @harshula for providing this build recipe. In the end spack wins.

I did however use a bunch of spack develop steps to download the code packages and edit the AVX2 vectorization flags. My spack.yaml is attached:

spack.yaml (1.9 KB)

In each of my “develop” packages, I just did a recursive grep for -xCORE-AVX2 or -axCORE-AVX2 and then updated that flag to -mavx2 to avoid the Intel/AMD problem. Mostly this involved updating CMakeLists.txt.
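For the record, that substitution can be done in one pass — a sketch assuming GNU grep/sed (the \? quantifier and xargs -r are GNU extensions), run from a package’s develop directory after backing it up:

```shell
# Find every file carrying the Intel-only vectorisation flags
# (-xCORE-AVX2 or -axCORE-AVX2) and rewrite them to the portable -mavx2.
# '--' stops option parsing, since the pattern starts with a dash.
grep -rl -- '-a\?xCORE-AVX2' . | xargs -r sed -i 's/-a\?xCORE-AVX2/-mavx2/g'
```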

Strangely, some of my edits (e.g. in the CICE5 build) were ignored, and the -axCORE-AVX2 flag persisted in the build. This didn’t seem to matter (I don’t know why). It did matter for the MOM5 build, where my edits to CMakeLists.txt did change the compilation to avoid the -xCORE-AVX2 flag.

Regards,
David


Hi @dkhutch , That’s great news, well done! Can you please create PRs for ACCESS-NRI repositories that you had to modify so that we can merge your changes? e.g.

  develop:
    access-fms:
      spec: access-fms@git.mom5-2025.05.000=mom5
    access-generic-tracers:
      spec: access-generic-tracers@=2025.09.000
    access-mocsy:
      spec: access-mocsy@=2025.07.002
    libaccessom2:
      spec: libaccessom2@git.2025.05.001=access-om2
    mom5:
      spec: mom5@git.2025.08.000=access-om2
    cice5:
      spec: cice5@git.2025.03.001=access-om2
    oasis3-mct:
      spec: oasis3-mct@git.2025.03.001=2025.03.001

Thanks! Alternatively, if you provide us the diffs, we can incorporate the changes.

Ok, so I made a bunch of pull requests for the changes I implemented. Apologies if I did anything silly in my PRs… I am not so familiar with doing those.


@dkhutch, just a note that the latest release of ACCESS-OM2 (2026.02.001) includes fixes to MOM5 and libaccessom2 to allow compiling and running with GCC.

Note that this version of ACCESS-OM2 does not reproduce answers from the previous version. The latent heat of vapourisation used by MOM was changed slightly to be consistent with what is used in CICE.
