Payu-generated symlinks don’t work with the ParallelIO library

I am posting here because I don’t know where it fits better!

I am trying to configure CICE in OM3 to use Parallel IO, which supports NetCDF-4 rather than the current NetCDF Classic (see Parallel IO in CICE · Issue #81 · COSIMA/access-om3 · GitHub). This is configurable in NUOPC by changing nuopc.runconfig.

When CICE attempts a parallel read of a restart file, we get this error:

get_stripe failed: 61 (No data available)
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832

The get_stripe failure occurs because payu provides a symlink to the restart file rather than the file itself. The payu ‘work’ directory has an ‘input’ folder containing symlinks (e.g. input/iced.1900-01-01-10800.nc -> /g/data/ik11/inputs/access-om3/0.x.0/1deg/cice/iced.1900-01-01-10800.nc).

Running this in bash shows what is going on:

$ lfs getstripe input/iced.1900-01-01-10800.nc 
input/iced.1900-01-01-10800.nc has no stripe info

when the expected result is

$ lfs getstripe /g/data/ik11/inputs/access-om3/0.x.0/1deg/cice/iced.1900-01-01-10800.nc
/g/data/ik11/inputs/access-om3/0.x.0/1deg/cice/iced.1900-01-01-10800.nc
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 6
	obdidx		 objid		 objid		 group
	     6	     138480481	    0x8410b61	   0x3c0000400

I have always thought of symlinks as pretty robust!
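To make the difference between the two lfs getstripe calls concrete, here is a minimal Python sketch that reports whether a path is a symlink and which real file it resolves to. The function name describe_input is hypothetical and is not part of payu. It only illustrates resolving the link; whether handing the resolved path to the model actually avoids the get_stripe error is an assumption that has not been tested here.

```python
import os


def describe_input(path):
    """Report whether `path` is a symlink and which real file it resolves to.

    Hypothetical helper for illustration only; not part of payu.
    """
    return {
        "path": path,
        "is_symlink": os.path.islink(path),
        # realpath follows the whole chain of symlinks to the final target
        "target": os.path.realpath(path),
    }
```

Calling describe_input("input/iced.1900-01-01-10800.nc") from the work directory should give the /g/data/... path as the target, which is the path that lfs getstripe can actually interrogate.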

Options here are possibly:

  • Update payu to point directly to the file instead of symlinking (would this need a copy of the restart files in the work directory?)
  • Raise the issue with the developers of the ParallelIO library
  • Something else ?

I haven’t yet thought through how updating payu would work. We wouldn’t want to have to make a copy of every restart file, every time a model component run by payu is initialised.

I have focussed on testing this with CICE and NUOPC, but every model component run by payu will have the same issue. Some change will be needed before any other component can move to NetCDF-4 and parallel reads (I tested with the data atmosphere, but all would be affected).

I agree this is a decent place to talk about it, as it involves a lot more than just payu.

So why does PIO in CICE5 work with symlinks in ACCESS-OM2? Is NUOPC doing some other step that involves interrogating the striping?

This would constitute a pretty large change in the logic of payu and would be an option of last resort.

I’d plump for figuring out if you can turn off this lfs interrogation step. My recollection from Nic Hannah’s testing was that he didn’t get much IO improvement when he changed the default PIO configuration.

@rui.yang is the one who knows about optimising Lustre striping though.

Yeah, this is curious. I’ll keep investigating.

The big change probably came from switching to Parallel IO; its configuration is probably less important as long as it is reasonable (e.g. 12 vs 24 threads for IO might not change the final result much, given the limitations in disk access).

Similarly, I expect the default configuration for striping will be fine at our file sizes. It just needs to work.

To address something you mentioned in your original post, IIRC there was no discernible difference in performance for 1 deg, and even at 0.25 deg there wasn’t much. It was at a tenth of a degree that it really made a difference, but @aekiss might recall, or have some actual numbers.

There is some useful stuff in the TWG minutes, search for PIO and start sifting for nuggets of wisdom:

https://cosima.org.au/index.php/category/minutes/

(Feeling vindication for the time it took me to write those damn things)

I am so full of it. cice5 already does this because the driver wanted to modify restarts (and inputs, I guess) in place.

So if you needed a work-around it might be your ticket. Would be good not to need it though.
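As a rough illustration of the copy-based work-around, the sketch below replaces a symlink with a real copy of the file it points to. This is not the cice5 driver code from payu; the function name materialise_symlink is hypothetical, and it assumes the link target is a regular file that is small enough to copy.

```python
import os
import shutil


def materialise_symlink(link_path):
    """Replace a symlink with a real copy of the file it points to.

    Hypothetical helper for illustration only; not taken from payu.
    Returns True if a copy was made, False if `link_path` was not a symlink.
    """
    if not os.path.islink(link_path):
        return False
    target = os.path.realpath(link_path)
    tmp_path = link_path + ".tmp"
    # Copy the data (and metadata) next to the link first, then swap it in,
    # so a failed copy never leaves the work directory without the entry.
    shutil.copy2(target, tmp_path)
    os.replace(tmp_path, link_path)
    return True
```

The obvious cost is the one raised earlier in the thread: every materialised restart is a full extra copy on disk.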

I should really have remembered this, I wrote it. I sort of did, but thought it didn’t apply, so apologies if this has held you up.

(Just want to say, how nice are those one-box code snippet previews. Nice one discourse!)

Ok, great! I was starting to get mighty confused about where the difference from OM2 to OM3 was.

Do I need to build my own payu to test that? (Are there instructions for building payu?)

This step is buried deep in the Open MPI library:

That makes it kind of challenging to do anything about. Fortran also doesn’t have a good way to just read the path that the symlink points to.

We could:

  • Update the rpointer files in payu to use full paths for the restart and input files (instead of using the symlinks).
  • Copy the restart files (as above) in payu.
  • Patch CICE to read using serial input and only write using parallel output. (IO is controlled in nuopc.runconfig, so it’s messy.) This would mean all model components would have to use serial input, with a similar patch applied, if we wanted parallel output. (I don’t know if that would be slow; the datm files are a lot bigger than the CICE ones but still less than 2GB each.)
  • Possibly raise an issue with Open MPI, making the case that the behaviour with Lustre is wrong and that it should check for symlinks when opening files.
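The first option could look something like the sketch below, which rewrites a pointer file so that each entry is the fully resolved path rather than a path through a symlink. The function name absolutise_pointer is hypothetical, and the file layout is an assumption: one path per line, with relative entries taken relative to a given base directory. The real rpointer handling in payu may differ.

```python
import os


def absolutise_pointer(pointer_path, base_dir):
    """Rewrite each path in a pointer file as its fully resolved real path.

    Hypothetical helper for illustration only. Assumes one path per line,
    with relative entries interpreted relative to `base_dir`.
    """
    with open(pointer_path) as f:
        lines = f.read().splitlines()

    resolved = []
    for line in lines:
        entry = line.strip()
        if not entry:
            resolved.append(line)
            continue
        candidate = entry if os.path.isabs(entry) else os.path.join(base_dir, entry)
        if os.path.exists(candidate):
            # realpath follows symlinks, so the model sees the real file
            resolved.append(os.path.realpath(candidate))
        else:
            # Leave anything that does not exist on disk untouched
            resolved.append(line)

    with open(pointer_path, "w") as f:
        f.write("\n".join(resolved) + "\n")
    return resolved
```

This avoids copying any data, but it does mean the pointer file no longer refers to anything inside the work directory.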

I don’t think there are. I usually load a vanilla conda environment (one without payu):

module use /g/data/hh5/public/modules
module load conda/python3

and then

pip install -e . --user

in the payu source directory.

That installs into ~/.local/bin/payu.

Wow. Good find.

That would break the manifests as they’re currently implemented.

That’s a decent work-around for the time being. Does require a patched payu however.

I think this is worth doing regardless.