Update on this. Unfortunately my attempted Spack build crashed again later. Something went wrong in the access-generic-tracers part of it, and as I said, I’m finding it too difficult to figure out what’s going wrong with my Spack attempts, so I’m leaving that for now.
I have managed to build ACCESS-OM2 on NeSI using the old COSIMA setup, so that’s a relief! I’m cautious about declaring victory too soon, but the model has now run for ~3 months without crashing.
Ok, the model seems to run fine for one year, and even outputs diagnostic and restart files. However, CICE then crashes due to some kind of MPI problem at the end of the run:
It took me a while to find out that in my case (I don’t know why) the MPI ranks were not communicating properly through InfiniBand until these system-wide variables were set before job submission:
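(The exact variable list from the original post isn’t reproduced above. Purely as an illustration, OpenMPI installations are often steered onto InfiniBand with MCA environment variables of this flavour — the values here are assumptions, not a prescription for any particular cluster:)

```shell
# Illustrative only - not the exact variables from the original post.
# Typical OpenMPI settings to force traffic onto InfiniBand via UCX:
export OMPI_MCA_pml=ucx          # use the UCX point-to-point layer
export OMPI_MCA_btl="^tcp"       # disable the plain TCP transport
export UCX_NET_DEVICES=mlx5_0:1  # bind UCX to a specific IB device (system-dependent)
```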
Hi Natalia!
Thanks for your suggestion. The fun part here is that I’m not using OpenMPI — I’m using Intel MPI — so the OMPI flags are not directly applicable. I don’t know for sure yet whether it’s an mpirun problem or something I did wrong in the compilation. I will probably re-run my compile of CICE5 just to make sure I didn’t accidentally mess something up. For what it’s worth, my actual mpirun submission from within payu looks like this:
I.e. no special flags at all. I’ll probably have to revisit this next week. I will also be happy to share my modifications to payu (next week)… as yet I haven’t created a new github fork for it.
But it might also be crashing before returning the error code.
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
We have an unfinished project to add HPCpy to payu, one of the drivers of which is slurm support and better cross platform support in general. The plan is for that to get revived in the next month or so.
We’d welcome testers when we have something working if you’re interested.
Do you allocate resources before launching mpirun within payu?
I would try using srun, so that the scheduler distributes the processes properly. I might be wrong, but could it be that mpirun launches the whole thing on the login node? Is it 126 cores per node on NeSI? Hardware - Support Documentation
On Leonardo I have 112 cores per node and my submission config looks like this (truncated):
```yaml
scheduler: slurm          # queue: do not set for slurm
walltime: 03:00:00
jobname: 1deg_jra55_ryf_bench_cice5
mem: 200G
account: ICT25_MHPC
partition: dcgp_usr_prod  # project: do not set for slurm, creates separate folder
ncpus: 672
nnodes: 6
```
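(For readers unfamiliar with the srun route: the batch job that this config feeds into then launches the MPI ranks with srun, so slurm itself places and binds the processes across the nodes. A rough sketch using the values above — the executable name is a placeholder, not the actual script:)

```shell
#!/bin/bash
#SBATCH --job-name=1deg_jra55_ryf_bench_cice5
#SBATCH --nodes=6
#SBATCH --ntasks=672
#SBATCH --time=03:00:00
#SBATCH --mem=200G
#SBATCH --account=ICT25_MHPC
#SBATCH --partition=dcgp_usr_prod

# srun lets slurm distribute and bind the ranks across the 6 nodes,
# instead of relying on mpirun's own host discovery.
srun ./access-om2.exe   # placeholder executable name
```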
Hi Natalia,
I’m submitting the job through sbatch and I can see it’s making it to the compute nodes ok. It then runs the payu-run script from the compute nodes, which calls either mpirun or mpiexec (I’ve tried both; they seem to do the same thing).
But, I am going to follow up with NeSI support just to make sure I’m not missing some important feature of the job submission.
It’s curious that your method uses srun instead of mpirun. I don’t know if that would work on mine as it needs to find the correct mpirun wrapper for my compiler but I could check.
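(A hedged note on that point: Intel MPI can generally be launched by srun directly if it is pointed at slurm’s PMI library, so no compiler-specific mpirun wrapper is needed. A sketch — the library path and executable name are assumptions that vary by system:)

```shell
# Hypothetical path - locate the real one with e.g.:  find /usr/lib64 -name 'libpmi*'
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
srun --mpi=pmi2 -n 672 ./access-om2.exe   # placeholder executable name
```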
To update you on my payu hacks, here is my forked version of payu/1.1 that I am using on Nesi:
Currently it’s pretty messy. The reason I chose v1.1 is that it was the last version I could find that still had the bin folder with the payu-run and payu shell commands. After that, I think it goes into Singularity to run the shell commands instead, which I couldn’t make sense of, so I skipped it.
Ok… so I couldn’t figure out how to fix my deprecated build that was crashing at the end of the CICE5 run.
I did however go back and try again with the spack installation based on Harshula’s instructions:
And I got it to run successfully with no errors! I think I must have stuffed up the mpirun call earlier; the PMI and SLURM issues are resolved by calling mpirun via its correct full path. So I may have just run around in a giant circle with my alternative build, but never mind.
So, thank you to @harshula for providing this build recipe. In the end spack wins.
I did however use a bunch of spack develop steps to download the code packages and edit the AVX2 vectorization flags. My spack.yaml is attached:
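(For anyone unfamiliar with the mechanism: `spack develop` checks a package’s source out into the environment directory so it can be edited by hand before building, and records it in spack.yaml. A minimal sketch — the package name, version, and path here are illustrative, not the attached file:)

```yaml
spack:
  specs:
    - access-om2
  develop:
    mom5:
      spec: mom5@git.master   # version is illustrative
      path: mom5              # checked-out source, edited by hand before building
```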
In each of my “develop” packages, I just did a recursive grep for -xCORE-AVX2 or -axCORE-AVX2 and updated that flag to -mavx2 to avoid the Intel / AMD problem. Mostly this meant updating CMakeLists.txt files.
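(The grep-and-replace step can be scripted; a minimal sketch to run inside each develop checkout — the demo file created below just stands in for a real CMakeLists.txt so the example is self-contained:)

```shell
# Create a stand-in file so the example is self-contained.
mkdir -p demo-src
printf 'target_compile_options(mom5 PRIVATE -xCORE-AVX2)\n' > demo-src/CMakeLists.txt

# Replace both Intel-only AVX2 flags with the portable -mavx2.
# (-axCORE-AVX2 is substituted first, since -xCORE-AVX2 is a substring of it.)
grep -rl -e '-axCORE-AVX2' -e '-xCORE-AVX2' demo-src \
  | xargs sed -i 's/-axCORE-AVX2/-mavx2/g; s/-xCORE-AVX2/-mavx2/g'

grep -- '-mavx2' demo-src/CMakeLists.txt
```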
Strangely, some of my edits (e.g. in the CICE5 build) got ignored, and the -axCORE-AVX2 flag persisted in the build. This didn’t seem to matter (I don’t know why). It did matter for the MOM5 build, where my edits to CMakeLists.txt did change the compilation to avoid the -xCORE-AVX2 flag.
Hi @dkhutch , That’s great news, well done! Can you please create PRs for ACCESS-NRI repositories that you had to modify so that we can merge your changes? e.g.
Ok, so I made a bunch of pull requests for the changes I implemented. Apologies if I did anything silly in my PRs… I am not so familiar with doing those.
@dkhutch, just a note that the latest release of ACCESS-OM2 (2026.02.001) includes fixes to MOM5 and libaccessom2 to allow compiling and running with GCC.
Note that this version of ACCESS-OM2 does not reproduce answers from the previous version. The latent heat of vapourisation used by MOM was changed slightly to be consistent with what is used in CICE.