PAYU issues on Leonardo

I finally got it running.

I’ve added the --oversubscribe flag to mpirun, so it launches the executables even when there are not enough resources (not a working configuration, but it made things clearer).

The story is that while Slurm allocates the proper resources (--nodes=8 --ntasks=256 --ntasks-per-node=32):

[ntilinin@login02 test]$ squeue -u ntilinin
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          12680859 boost_usr mpi_scri ntilinin  R       21:07      8 lrdn[0257,0287,0863,0977,2113,2373,2550,2587]

mpirun only uses one node (tested with an MPI hello-world program), because OpenMPI 4.1.4 from the ACCESS-NRI package was built with the `--without-slurm` option. Neither mpirun nor srun works properly in this case. The ORTE error was due to SSH communication between nodes not being configured/working.
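One quick way to see whether a given Open MPI build was configured with Slurm support is to look for the Slurm components in the `ompi_info` output (a sketch; the first branch only runs if `ompi_info` is on your PATH):

```shell
# Count mentions of the Slurm components (plm/ras) in this Open MPI build;
# a build configured with --without-slurm reports none of them.
if command -v ompi_info >/dev/null 2>&1; then
    slurm_components=$(ompi_info 2>/dev/null | grep -ci slurm || true)
else
    slurm_components="unknown (ompi_info not found)"
fi
echo "Slurm components: $slurm_components"
```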

@Aidan, this might feed into the future plans for Slurm adaptation you mentioned - install OpenMPI with Slurm support so that the srun command works (i.e. with the `--without-slurm` option disabled).

It now runs on one node only (32 cores) and fails with:

==> NOTE from ocean_model_init: reading maskmap information from > INPUT/ocean_mask_table
parse_mask_table: Number of domain regions masked in ocean model = 24

FATAL from PE 0: fms_io(parse_mask_table_2d): mpp_npes() .NE. layout(1)*layout(2) - nmask for ocean model

Probably this has something to do with the ocean mask: the number of MPI processes the ocean layout requires (layout(1)*layout(2) minus the masked regions) does not fit on 32 cores and needs more.
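The FMS check in the error compares the available ranks with the layout minus the masked regions. A small sketch of that arithmetic (the 16 x 16 layout is hypothetical; the 24 masked regions come from the parse_mask_table line in the log above):

```python
def required_ocean_pes(layout, nmask):
    """Ranks FMS expects: total layout tiles minus land-masked (inactive) tiles."""
    nx, ny = layout
    return nx * ny - nmask

# Hypothetical 16 x 16 ocean layout with the 24 masked regions
# reported by parse_mask_table above.
needed = required_ocean_pes((16, 16), nmask=24)
print(needed)  # 232 ranks needed - far more than the 32 cores of one node
```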

@angus-g, you gave the right hint, thank you!

@anton, thank you for the advice! This was the beginning of my journey with ACCESS-OM2 last year: I compiled it with the system-provided OpenMPI and netCDF from the COSIMA repo. That worked somehow, but then the COSIMA repo was no longer supported and migrated to ACCESS-NRI, and I was advised to use Spack. We installed everything online with Harshula from the ACCESS-NRI package using Spack (model *.exe files, OpenMPI, etc.).
I’m aware (I read it somewhere on the forum) that a Spack-built installation runs slower than executables compiled with the system-provided OpenMPI, but at this point I just want to get it working one way or another.

So now I should probably go back to the beginning and compile the model with the system-wide OpenMPI and netCDF again. Leonardo supports Spack, and I can probably install OpenMPI and netCDF versions that are compatible with the ACCESS-OM2 source code, so I will try to work around this with Spack now (the helpdesk only works via tickets/e-mails and is located in another city; I might use that option later on too).

Many thanks again for help!

Glad to hear you are making progress!

You can use the system OpenMPI by setting mpi as an external non-buildable package, e.g.
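For example, a Spack `packages.yaml` fragment along these lines (the version and prefix here are hypothetical; adjust them to the module Leonardo actually provides) marks the system OpenMPI as external and stops Spack from building its own:

```yaml
# ~/.spack/packages.yaml (version and prefix are hypothetical)
packages:
  openmpi:
    externals:
    - spec: openmpi@4.1.4
      prefix: /opt/openmpi/4.1.4   # path of the system installation
    buildable: false               # never build OpenMPI from source
  mpi:
    buildable: false               # force an external provider for the mpi virtual
```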

Note that this configuration contains a number of versions, but you would not need to specify that many, or your HPC system might provide a similar package file you could refer to or use.

netCDF is straightforward to compile, so I would just build that with spack if you’re happy to do so.

It is running on Leonardo. The whole adjustment of OpenMPI and other things was quite painful :sweat_smile:
The key issues were related to forcing OpenMPI (built with Spack, as Leonardo doesn’t have an OpenMPI module compiled with Intel compilers, only gcc and nvhpc).
The very specific thing was forcing the Spack-built OpenMPI to use InfiniBand for inter-node communication; by default it was picking TCP over Ethernet. Also srun with heterogeneous jobs.
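For reference, the kind of MCA settings I mean look roughly like this (a sketch: the device name and the executable names are placeholders; list your HCAs with `ibv_devices`):

```shell
# Hypothetical settings to steer a Spack-built Open MPI onto InfiniBand (UCX)
# rather than the default TCP-over-Ethernet path.
export OMPI_MCA_pml=ucx           # use the UCX point-to-point layer
export OMPI_MCA_btl="^tcp"        # exclude the TCP byte-transfer layer
export UCX_NET_DEVICES=mlx5_0:1   # hypothetical HCA/port; check with ibv_devices

# A heterogeneous srun launch matching the task counts in the sacct output
# below might look like (placeholder executable names):
#   srun --ntasks=1 ./atm.exe : --ntasks=216 ./ocean.exe : --ntasks=24 ./ice.exe
```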
Now I have this whole thing running:

6:53 $ sacct -j 13143246 --format=JobID,JobName,State,Start,End,Elapsed,NCPUS,ReqTRES,AllocTRES
JobID           JobName      State               Start                 End    Elapsed      NCPUS    ReqTRES  AllocTRES 
------------ ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- ---------- 
13143246           wrap    RUNNING 2025-02-28T06:11:30             Unknown   00:47:41        560 billing=5+ billing=5+ 
13143246.ba+      batch    RUNNING 2025-02-28T06:11:30             Unknown   00:47:41        112            cpu=112,g+ 
13143246.ex+     extern    RUNNING 2025-02-28T06:11:30             Unknown   00:47:41        560            billing=5+ 
13143246.0+0   yatm.exe    RUNNING 2025-02-28T06:11:44             Unknown   00:47:27          1            cpu=1,gre+ 
13143246.0+1 fms_ACCES+    RUNNING 2025-02-28T06:11:44             Unknown   00:47:27        216            cpu=216,g+ 
13143246.0+2 cice_ausc+    RUNNING 2025-02-28T06:11:44             Unknown   00:47:27         24            cpu=24,gr+

It produces the .err and .out files, but MOM keeps producing this warning for various ocean 2-D and 3-D fields:

WARNING from PE     0: diag_util_mod::opening_file: one axis has auxiliary but the corresponding field is NOT found in file ocean-3d-ty_trans-1-monthly-mean-ym%4yr%2mo

WARNING from PE     0: diag_util_mod::opening_file: one axis has auxiliary but the corresponding field is NOT found in file ocean-2d-ty_trans_int_z-1-monthly-mean-ym%4yr%2mo

WARNING from PE     0: diag_util_mod::opening_file: one axis has auxiliary but the corresponding field is NOT found in file ocean-2d-mld-1-monthly-mean-ym%4yr%2mo

However, it keeps going and does not fail.

Maybe this warning has been seen before? I would be grateful for any hint on it.

Thanks as usual!

Good to hear it’s working!

Has a model run through to completion, and can you then restart the model for a second run?

Are your Payu changes on GitHub? There may be some lessons for us to integrate into Payu :slight_smile:

I think this warning is OK - we get it too. It just means that the metadata for a variable refers to another variable that is not in that file. This is fine because the variable it references is in a different output file.

Thanks!

Yes, I’ve run a 2-year-long experiment and then restarted it from the last saved restarts for another 25 years. All works well, can’t believe it :boom:

I’ll soon create forks of all the repos I modified and push my changes.
I’ll also summarise all the adjustments in a description file, and will post all the links here.

@anton, am I right that ESMValTool is probably the easiest way to evaluate the model on Leonardo, since all the other software is adapted to Gadi and uses the access-nri-intake-catalog?

If you’re happy to do so I’d suggest making a new topic for this question. Then we can close this one and get some of the model evaluation team involved to assist.

This is a somewhat tricky question - it depends on what you are trying to evaluate and what the goals are… in many ways it also depends on what your colleagues use.

ESMValTool has recipes for CMIP-style inter-comparison of model results. I think it is most targeted at repeating the same analysis across different modelling centres, where the analysis targets the robustness of and confidence in the model results, evaluating the performance of models against observations or against predecessor versions of the same models (Righi et al. 2020).

The access-nri-intake-catalog is built on the intake-esm framework. This is a tool for making catalogues - i.e. a catalogue for searching and finding data and variables across experiments (and observational data). The data can then be loaded (typically into an xarray Dataset, but it could be something else), and the actual analysis is up to the individual. cosima-recipes has examples of using intake-esm to load data into an xarray object. You could use the builders from the access-nri-intake-catalog to make small intake-esm datastores for your experiments, or you could load your data directly into xarray objects using their file paths and xarray’s open_mfdataset.
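As a minimal sketch of that last option (the function name and the example path are my own; it assumes xarray and dask are installed when the function is actually called):

```python
from typing import Iterable, Union

def load_experiment(paths: Union[str, Iterable[str]], **kwargs):
    """Open a collection of model output files as a single dataset.

    `paths` can be a glob pattern or an explicit list of netCDF files.
    """
    import xarray as xr  # imported lazily so the sketch stands alone
    # combine="by_coords" stitches the files together along their
    # coordinate values (e.g. time) rather than by file order.
    return xr.open_mfdataset(paths, combine="by_coords", **kwargs)

# Usage (hypothetical path and variable name):
# ds = load_experiment("archive/output*/ocean/ocean_month.nc")
# ds["temp"].mean("time")
```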

Sure! Apologies for overposting.

Not at all! We’re very excited to have an international collaborator!

It’s just that if we’re pivoting to a new problem a new topic would be a good move.

Tag it with help to make sure the triage team picks it up.

Cheers

Aidan

I’ll close this now, but please do open a new topic if you need further assistance @Natalia
