I finally got it running. I added the --oversubscribe flag to mpirun, so it launches the executables even when there are not enough resources (not how it should work in the end, but it made things clearer).
The story is that while Slurm allocates the proper resources (--nodes=8 --ntasks=256 --ntasks-per-node=32):
[ntilinin@login02 test]$ squeue -u ntilinin
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12680859 boost_usr mpi_scri ntilinin R 21:07 8 lrdn[0257,0287,0863,0977,2113,2373,2550,2587]
mpirun only uses one node (tested with an MPI hello-world) because OpenMPI 4.1.4 from the ACCESS NRI package was built with the --without-slurm option. Neither mpirun nor srun works properly in this case. The ORTE error was due to ssh communication between the nodes not being configured/working.
@Aidan, this might feed into the future plans for Slurm adaptation you mentioned: install OpenMPI with Slurm support (i.e. without the --without-slurm option) so that the srun command works.
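For reference, one way to check whether a given Open MPI build was configured with Slurm support is to look at the ompi_info output. This is a sketch, not something from the thread: a Slurm-enabled build lists slurm MCA components (plm/ras) and shows it in the recorded configure line.

```shell
# Sketch: check whether this Open MPI build knows about Slurm.
# A build configured with Slurm support lists "slurm" MCA components
# and mentions it in the configure command line.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i slurm || echo "no Slurm components found in this build"
else
    echo "ompi_info not found on this machine"
fi
```

On a login node this takes a second and saves recompiling just to find out how the package was configured.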
It now runs on one node only (32 cores) and fails with:
==> NOTE from ocean_model_init: reading maskmap information from INPUT/ocean_mask_table
parse_mask_table: Number of domain regions masked in ocean model = 24
FATAL from PE 0: fms_io(parse_mask_table_2d): mpp_npes() .NE. layout(1)*layout(2) - nmask for ocean model
This probably has to do with the ocean mask table: the processor layout it describes requires more MPI processes than the 32 cores the job actually gets.
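If I read the FMS error right, the required PE count can be computed from the mask table itself: the first line is nmask (24 here), the second is the layout, and mpp_npes() must equal layout(1)*layout(2) - nmask. A sketch of that arithmetic, with a made-up 16x16 layout (the real values are in INPUT/ocean_mask_table):

```shell
# Sketch of the consistency check fms_io performs.
# mask_table format (as I understand it): line 1 = nmask,
# line 2 = layout(1) layout(2), then the masked-region indices.
# The 16x16 layout below is hypothetical.
cat > mask_table.sample <<'EOF'
24
16 16
EOF
# Required MPI processes = layout(1)*layout(2) - nmask
awk 'NR==1 {nmask=$1} NR==2 {l1=$1; l2=$2} END {print l1*l2 - nmask}' mask_table.sample
```

With these made-up numbers that prints 232, which is the mpp_npes() the model would insist on; 32 PEs can never satisfy it, hence the FATAL.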
@angus-g, you gave the right hint, thank you!
@anton, thank you for the advice! This was the beginning of my journey with ACCESS-OM2 last year: I compiled it with the system-provided OpenMPI and netCDF from the COSIMA repo. That worked, more or less, but then the COSIMA repo was no longer supported and was migrated to ACCESS NRI, and I was advised to use spack. We then installed everything from the ACCESS NRI package with Harshula online using spack (the model *.exe files, OpenMPI, etc.)
I'm aware (read it somewhere on the forum) that a spack-built installation runs slower than executables compiled against the system-provided OpenMPI, but at this point I just want to get it working one way or another.
So now I should probably go back to the beginning and compile the model with the system-wide OpenMPI and netCDF again. Leonardo supports spack, and I can probably install OpenMPI and netCDF versions compatible with the ACCESS-OM2 source code, so I will try to work around this with spack for now (the helpdesk only works via tickets/e-mails and is located in another city; I might use that option later too).
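For the spack route, the openmpi package has a schedulers variant, so a Slurm-aware build can be requested in the spec. A sketch only; the available variant names and values should be checked with spack info openmpi on Leonardo first:

```shell
# Sketch: request an Open MPI build with Slurm support via spack.
# The "schedulers=slurm" variant is from my reading of the openmpi
# package; verify with `spack info openmpi` before installing.
if command -v spack >/dev/null 2>&1; then
    spack info openmpi | grep -i slurm || echo "no slurm variant listed"
    # then e.g.: spack install openmpi@4.1.4 schedulers=slurm
else
    echo "spack not available in this shell"
fi
```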
Many thanks again for the help!