Payu-generated symlinks don't work with the ParallelIO library

I am posting here because I don’t know where it fits better!

I am trying to configure CICE in OM3 to use Parallel IO, which supports NetCDF-4 rather than the current NetCDF classic format (Parallel IO in CICE · Issue #81 · COSIMA/access-om3 · GitHub). This is configurable in NUOPC, by changing nuopc.runconfig here.
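
For context, the PIO settings in nuopc.runconfig live in the per-component MODELIO sections. A rough illustration of the relevant entries (attribute names are from the CMEPS driver; the values here are only illustrative):

ICE_modelio::
     pio_numiotasks = 12
     pio_root = 1
     pio_stride = 48
     pio_typename = netcdf4p
::

Switching pio_typename from netcdf (serial NetCDF classic) to netcdf4p is what selects parallel NetCDF-4.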

When an attempt is made to read a restart file in parallel, we get this error:

get_stripe failed: 61 (No data available)
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832

This get_stripe failure happens because payu provides a symlink to the restart file rather than the file itself. The payu ‘work’ directory has an ‘input’ folder with symlinks (e.g. input/iced.1900-01-01-10800.nc -> /g/data/ik11/inputs/access-om3/0.x.0/1deg/cice/iced.1900-01-01-10800.nc).

Running this in bash shows what is going on:

$ lfs getstripe input/iced.1900-01-01-10800.nc 
input/iced.1900-01-01-10800.nc has no stripe info

whereas the expected result is:

$ lfs getstripe /g/data/ik11/inputs/access-om3/0.x.0/1deg/cice/iced.1900-01-01-10800.nc
/g/data/ik11/inputs/access-om3/0.x.0/1deg/cice/iced.1900-01-01-10800.nc
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 6
	obdidx		 objid		 objid		 group
	     6	     138480481	    0x8410b61	   0x3c0000400
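
A quick way to confirm that the symlink indirection is the problem (rather than the file itself) is to resolve the link first, which does return the stripe info above:

$ lfs getstripe "$(readlink -f input/iced.1900-01-01-10800.nc)"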

I have always thought of symlinks as pretty robust!

Possible options here are:

  • Update payu to point directly to the file instead of symlinking (would this need a copy of the restart files in the work directory? A rough one-off version is sketched after this list)
  • Raise the issue with the developers of the ParallelIO library
  • Something else?
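
For the first option, a stop-gap that can be done by hand on an existing work directory is to replace each symlinked input with a real copy. A minimal sketch, assuming the symlinks all sit under input/ as above:

# replace every symlink under input/ with a copy of its target
for f in input/*.nc; do
    if [ -L "$f" ]; then
        target=$(readlink -f "$f")   # resolve to the real file on /g/data
        rm "$f" && cp "$target" "$f"
    fi
done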

I haven’t yet thought through how updating payu would work. We wouldn’t want to have to make a copy of every restart file every time a model component run by payu is initialised.

I have focussed on testing this with CICE & NUOPC, but every model component run by payu will have the same issue. To move to NetCDF-4 and parallel reads for any other component, some change will need to be made (I tested with the data-atmosphere component, but all would be impacted).

I agree this is a decent place to talk about it, as it involves a lot more than just payu.

So why does PIO in CICE5 work with ACCESS-OM2 with symlinks? Is NUOPC doing some other step that involves interrogating the striping?

This would constitute a pretty large change in the logic of payu and would be an option of last resort.

I’d plump for figuring out if you can turn off this lfs interrogation step. My recollection from Nic Hannah’s testing was that he didn’t get much IO improvement when he changed the default PIO configuration.

@rui.yang is the one who knows about optimising Lustre striping though.

Yeah, this is curious. I'll keep investigating.

The big change was probably just from switching to Parallel IO at all; the exact configuration is probably less important as long as it is reasonable (e.g. 12 vs 24 threads for IO might not change much in the final result, given the limitations in disk access).

Similarly, I expect the default configuration for striping will be fine at our file sizes. It just needs to work.
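
(For reference, striping can be set explicitly on a directory if it ever does matter; files created there afterwards inherit the layout:

$ lfs setstripe -c 4 -S 4m input/

but as above, I expect the defaults are fine at these file sizes.)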

To address something you mentioned in your original post: IIRC there was no discernible difference in performance at 1 deg, and even at 0.25 deg it wasn’t much. It was the tenth-degree where it really made a difference, but @aekiss might recall, or have some actual numbers.

There is some useful stuff in the TWG minutes, search for PIO and start sifting for nuggets of wisdom:

https://cosima.org.au/index.php/category/minutes/

(Feeling vindicated for the time it took me to write those damn things)

I am so full of it. CICE5 already does this, because the driver wanted to modify restarts (and inputs, I guess) in place.

So if you needed a work-around it might be your ticket. Would be good not to need it though.

I should really have remembered this, since I wrote it. I sort of did, but thought it didn’t apply, so apologies if this has held you up.

(Just want to say, how nice are those one-box code snippet previews? Nice one, Discourse!)

Ok, great! I was starting to get mighty confused about where the difference from OM2 to OM3 was.

Do I need to build my own payu to test that? (Are there instructions for building payu?)

This step is buried deep in the open-mpi library:

That makes it kind of challenging to do anything about. Fortran also doesn’t have a good way to just resolve the path a symlink points to.

We could:

  • Update the rpointer files in Payu to use full paths for the restart and input files, instead of the symlinks (illustrated after this list).
  • Copy the restart files (as above) in Payu.
  • Patch CICE to read using serial input and only write using parallel output. (IO is controlled in nuopc.runconfig, so it’s messy.) This would mean all model components would have to use serial input and have a similar patch applied if we wanted parallel output. (I don’t know if that would be slow; the datm files are a lot bigger than the CICE ones, but still less than 2 GB each.)
  • Possibly raise an issue with open-mpi, and make the case that the behaviour with Lustre is wrong and they should be checking for symlinks when opening files.
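
To illustrate the first option: an rpointer file is just a one-line pointer to the next restart, so the change would be from the symlinked path to the resolved one (the file name and contents here are illustrative, using the paths from above):

$ cat rpointer.ice
input/iced.1900-01-01-10800.nc

would become

$ cat rpointer.ice
/g/data/ik11/inputs/access-om3/0.x.0/1deg/cice/iced.1900-01-01-10800.nc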

I don’t think there are. I usually load a vanilla conda environment (one without payu):

module use /g/data/hh5/public/modules
module load conda/python3

and then

pip install -e . --user

in the payu source directory.

That installs into ~/.local/bin/payu.
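
To check that the development install is the one being picked up (rather than a module-provided payu):

$ export PATH="$HOME/.local/bin:$PATH"
$ which payu    # should report ~/.local/bin/payu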

Wow. Good find.

That would break the manifests as they’re currently implemented.

That’s a decent work-around for the time being. Does require a patched payu however.

I think this is worth doing regardless.

I raised it here:

Just for general information, the Open MPI libraries used, even in Spack builds, are the ones provided by NCI.

You can get more (A LOT MORE) information about the MPI installation with ompi_info, in case that is required:

$ ompi_info                        
                 Package: Open MPI apps@gadi-cpu-clx-2915.gadi.nci.org.au    
                          Distribution                                  
                Open MPI: 4.1.4                                          
  Open MPI repo revision: v4.1.4                                          
   Open MPI release date: May 26, 2022                                   
                Open RTE: 4.1.4                                         
  Open RTE repo revision: v4.1.4                                           
   Open RTE release date: May 26, 2022                                      
                    OPAL: 4.1.4                                              
      OPAL repo revision: v4.1.4                                          
       OPAL release date: May 26, 2022                                     
                 MPI API: 3.1.0                                              
            Ident string: 4.1.4                                            
                  Prefix: /apps/openmpi-mofed5.6-pbs2021.1/4.1.4          
 Configured architecture: x86_64-pc-linux-gnu                             
          Configure host: gadi-cpu-clx-2915.gadi.nci.org.au               
           Configured by: apps                                             
           Configured on: Mon Aug  1 04:47:24 UTC 2022                       
          Configure host: gadi-cpu-clx-2915.gadi.nci.org.au                 
  Configure command line: '--prefix=/apps/openmpi-mofed5.6-pbs2021.1/4.1.4' 
                          '--disable-dependency-tracking'                 
                          '--disable-heterogeneous' '--disable-ipv6'     
                          '--enable-orterun-prefix-by-default'            
                          '--enable-sparse-groups' '--enable-mpi-fortran'
                          '--enable-mpi-cxx' '--enable-mpi1-compatibility'
                          '--enable-shared' '--disable-static'          
                          '--disable-wrapper-rpath'                        
                          '--disable-wrapper-runpath' '--disable-mpi-java' 
                          '--enable-mca-static' '--enable-hwloc-pci'    
                          '--enable-visibility' '--with-zlib'             
                          '--with-cuda=/apps/cuda/11.7.0' '--without-pmi'    
                          '--with-ucx=/apps/ucx/1.13.0' '--without-verbs'
                          '--without-verbs-usnic' '--without-portals4'  
                          '--without-ugni' '--without-usnic' '--without-ofi' 
                          '--without-cray-xpmem' '--with-xpmem'            
                          '--with-knem=/opt/knem-1.1.4.90mlnx1' '--with-cma'
                          '--without-x' '--without-memkind'              
                          '--without-cray-pmi' '--without-alps'         
                          '--without-flux-pmi' '--without-udreg'        
                          '--without-lsf' '--without-slurm'                
                          '--with-tm=/opt/pbs/default' '--without-sge'   
                          '--without-moab' '--without-singularity'        
                          '--without-fca' '--with-hcoll=/apps/hcoll/4.7.3208'
                          '--with-ucc=/apps/ucc/1.0.0' '--without-ime'      
                          '--without-pvfs2' '--with-lustre'                
                          '--with-io-romio-flags=--with-file-system=lustre+ufs'
                          '--without-psm' '--without-psm2' '--without-mxm'   
                          '--disable-mem-debug' '--disable-mem-profile' 
                          '--disable-picky' '--disable-debug'          
                          '--disable-timing' '--disable-event-debug'  
                          '--disable-memchecker' '--disable-pmix-timing'
                          '--with-mpi-param-check=runtime'             
                          '--with-oshmem-param-check=never'             
                          '--without-valgrind'
                Built by: apps
                Built on: Mon Aug  1 05:04:18 UTC 2022
              Built host: gadi-cpu-clx-2915.gadi.nci.org.au
              C bindings: yes
            C++ bindings: yes
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the gfortran -march=broadwell
                          compiler and/or Open MPI, does not support the
                          following: array subsections, direct passthru
                          (where possible) to underlying Open MPI's C
                          functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: disabled
              C compiler: gcc -march=broadwell
     C compiler absolute: /opt/nci/bin/gcc
  C compiler family name: GNU
      C compiler version: 8.5.0
            C++ compiler: g++ -march=broadwell
   C++ compiler absolute: /opt/nci/bin/g++
           Fort compiler: gfortran -march=broadwell
       Fort compiler abs: /opt/nci/bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
          Fort PROTECTED: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
           C++ profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
           Sparse Groups: yes
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: yes
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
      MPI1 compatibility: yes
          MPI extensions: affinity, cuda, pcollreq
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64                                                                                                                                                                           
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256                                                                                                                                                                          
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.4)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.4)
                 MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.4)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.4)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.4)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.4)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA event: libevent2022 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.4)
               MCA hwloc: external (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.4)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.4)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.4)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.4)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.4)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v4.1.4)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.4)
              MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.1.4)
              MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.1.4)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.4)
           MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component
                          v4.1.4)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component
                          v4.1.4)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component
                          v4.1.4)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component
                          v4.1.4)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component
                          v4.1.4)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: tm (MCA v2.1.0, API v3.0.0, Component v4.1.4)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.4)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA plm: tm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
                          v4.1.4)
                 MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component
                          v4.1.4)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component
                          v4.1.4)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component
                          v4.1.4)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: hcoll (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.4)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.1.4)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component
                          v4.1.4)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA fs: lustre (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.4)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.1.4)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v4.1.4)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.4)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v4.1.4)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v4.1.4)
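
If you only want the Lustre file-system component (the part that makes the striping calls), it can be queried directly rather than grepping the full dump, e.g.:

$ ompi_info --param fs lustre --level 9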

Looks like Open MPI will fix this, so I'll mark this as closed :slight_smile:
