I finally got it running.
I added the --oversubscribe flag to mpirun, so it launches executables even when there are not enough slots (not a working configuration, but it made things clearer).
The story is that while Slurm allocates the proper resources (--nodes=8 --ntasks=256 --ntasks-per-node=32):
[ntilinin@login02 test]$ squeue -u ntilinin
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12680859 boost_usr mpi_scri ntilinin R 21:07 8 lrdn[0257,0287,0863,0977,2113,2373,2550,2587]
mpirun only uses 1 node (tested with an MPI hello-world), because OpenMPI 4.1.4 from the ACCESS NRI package was built with the '--without-slurm' option. Neither mpirun nor srun works properly in this case. The ORTE error was due to ssh communication between the nodes not being configured/working.
@Aidan, this might contribute to the future plans for Slurm adaptation you mentioned: install OpenMPI with Slurm support (i.e. without the '--without-slurm' option) so that the srun command works.
It now runs on one node only (32 cores) and fails with:
==> NOTE from ocean_model_init: reading maskmap information from INPUT/ocean_mask_table
parse_mask_table: Number of domain regions masked in ocean model = 24
FATAL from PE 0: fms_io(parse_mask_table_2d): mpp_npes() .NE. layout(1)*layout(2) - nmask for ocean model
Probably this has something to do with the ocean mask table: the number of MPI processes required by the layout does not fit on 32 cores and requires more.
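For what it's worth, the check in that FATAL message is just arithmetic: FMS expects the number of MPI ranks to equal layout(1)*layout(2) - nmask, i.e. the total tiles in the processor layout minus the masked (land-only) regions listed in ocean_mask_table. A quick sketch with hypothetical numbers (only nmask=24 comes from the log above; the 16x16 layout is a made-up example):

```shell
# FMS check: MPI ranks must equal layout(1)*layout(2) - nmask.
# nmask=24 is from the log; the 16x16 layout is an assumed example.
layout_x=16
layout_y=16
nmask=24
echo $(( layout_x * layout_y - nmask ))   # ranks required by this layout
```

So with a layout like that, 32 ranks on a single node can never satisfy the check, which is consistent with the job having requested 256 tasks across 8 nodes.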
@angus-g, you gave the right hint, thank you!
@anton, thank you for the advice! This was the beginning of my journey with ACCESS OM2 last year: I compiled it with the system-provided OpenMPI and netCDF from the COSIMA repo. That worked somehow, but then the COSIMA repo was no longer supported and was migrated to ACCESS NRI, and I was advised to use Spack. We installed everything online with Harshula from the ACCESS NRI package using Spack (model *.exe files, OpenMPI, etc.)
I'm aware (I read it somewhere on the forum) that a Spack-built installation runs slower than executables compiled with the system-provided OpenMPI, but at this point I just want to get it working one way or another.
So now I should probably go back to the beginning and compile the model with the system-wide OpenMPI and netCDF again. Leonardo supports Spack, and I can probably install OpenMPI and netCDF versions that would be compatible with the ACCESS OM2 source code, so I will try to work around this with Spack now (the helpdesk only works with tickets/e-mails and is located in another city; I might use that option later too).
Many thanks again for the help!