Payu collate tiles

I am running mom6_panan-005 with the parallel I/O layout, so I need to collate the tiles afterwards, but the collation is not working correctly. The tiles are collated into a single file, but the individual tile files are not deleted, so I am now using up unnecessary disk space.

I use this in config.yaml:

collate:
    restart: true
    mpi: true
    walltime: 6:00:00
    mem: 190GB
    ncpus: 4
    queue: normal
    exe: /g/data/ik11/inputs/access-om2/bin/mppnccombine-fast

I also tried

exe: /g/data/ik11/inputs/access-om2/bin/mppnccombine

but that didn’t work either.

This is one of the error logs:

Currently Loaded Modulefiles:
 1) openmpi/4.1.4(default)   2) pbs  
payu: error: Thread 1 crashed with error code 255.
 Error message:

Copying non-collated variables

Copying contiguous variables

Copying chunked variables
[rank 003] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month.nc.0008
[rank 002] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month.nc.0009
[rank 001] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month.nc.0010
[rank 003] Unaligned or compression change (slow) copy of umo_2d from 19920901.ocean_month.nc.0008
[rank 001] Unaligned or compression change (slow) copy of umo_2d from 19920901.ocean_month.nc.0010
[rank 002] Unaligned or compression change (slow) copy of umo_2d from 19920901.ocean_month.nc.0009
[rank 003] Unaligned or compression change (slow) copy of tauuo from 19920901.ocean_month.nc.0008
[rank 002] Unaligned or compression change (slow) copy of tauuo from 19920901.ocean_month.nc.0009
[rank 001] Unaligned or compression change (slow) copy of tauuo from 19920901.ocean_month.nc.0010
[rank 001] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month.nc.0011
[rank 002] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month.nc.0012
[rank 003] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month.nc.0022
[rank 001] Unaligned or compression change (slow) copy of umo_2d from 19920901.ocean_month.nc.0011
[rank 002] Unaligned or compression change (slow) copy of umo_2d from 19920901.ocean_month.nc.0012
[rank 003] Unaligned or compression change (slow) copy of umo_2d from 19920901.ocean_month.nc.0022
[rank 002] Unaligned or compression change (slow) copy of tauuo from 19920901.ocean_month.nc.0012
[rank 001] Unaligned or compression change (slow) copy of tauuo from 19920901.ocean_month.nc.0011
[rank 003] Unaligned or compression change (slow) copy of tauuo from 19920901.ocean_month.nc.0022
[rank 002] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month.nc.0024
[rank 003] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month.nc.0023
[rank 001] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month.nc.0044
[rank 001] Unaligned or compression change (slow) copy of yq from 19920901.ocean_month.nc.0044
[rank 000] var 3 yq from 3 dims 1 [0,4225648,72057594037928206]

[rank 000] ERROR in HDF5 /home/502/aph502/code/c/mppnccombine-fast/async.c:446

HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
  #000: ../../src/H5Dio.c line 404 in H5Dwrite_chunk(): can't write unprocessed chunk data
    major: Dataset
    minor: Write failed
  #001: ../../src/H5Dchunk.c line 461 in H5D__chunk_direct_write(): unable to allocate chunk
    major: Dataset
    minor: Can't allocate space
  #002: ../../src/H5Dchunk.c line 6564 in H5D__chunk_file_alloc(): unable to free chunk
    major: Dataset
    minor: Unable to free object
  #003: ../../src/H5MF.c line 1216 in H5MF_xfree(): can't add section to file free space
    major: Resource unavailable
    minor: Unable to insert object
  #004: ../../src/H5MF.c line 665 in H5MF__add_sect(): can't re-add section to file free space
    major: Resource unavailable
    minor: Unable to insert object
  #005: ../../src/H5FSsection.c line 1409 in H5FS_sect_add(): can't insert free space section into skip list
    major: Free Space Manager
    minor: Unable to insert object
  #006: ../../src/H5FSsection.c line 1124 in H5FS_sect_link(): can't add section to non-size tracking data structures
    major: Free Space Manager
    minor: Unable to insert object
  #007: ../../src/H5FSsection.c line 1069 in H5FS_sect_link_rest(): can't insert free space node into merging skip list
    major: Free Space Manager
    minor: Unable to insert object
  #008: ../../src/H5SL.c line 1122 in H5SL_insert(): can't create new skip list node
    major: Skip Lists
    minor: Unable to insert object
  #009: ../../src/H5SL.c line 783 in H5SL_insert_common(): can't insert duplicate key
    major: Skip Lists
    minor: Unable to insert object
/g/data/ik11/inputs/access-om2/bin/mppnccombine-fast[0x4095a1]
/g/data/ik11/inputs/access-om2/bin/mppnccombine-fast[0x409127]
/g/data/ik11/inputs/access-om2/bin/mppnccombine-fast[0x404311]
/lib64/libc.so.6(__libc_start_main+0xe5)[0x1549e63abd85]
/g/data/ik11/inputs/access-om2/bin/mppnccombine-fast[0x4030ee]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

payu: error: Thread 4 crashed with error code 255.
 Error message:

Copying non-collated variables

Copying contiguous variables

Copying chunked variables
[rank 003] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month_rho2.nc.0008
[rank 001] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month_rho2.nc.0010
[rank 002] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month_rho2.nc.0009
[rank 001] Unaligned or compression change (slow) copy of umo from 19920901.ocean_month_rho2.nc.0010
[rank 003] Unaligned or compression change (slow) copy of umo from 19920901.ocean_month_rho2.nc.0008
[rank 002] Unaligned or compression change (slow) copy of umo from 19920901.ocean_month_rho2.nc.0009
[rank 001] Unaligned or compression change (slow) copy of vmo from 19920901.ocean_month_rho2.nc.0010
[rank 003] Unaligned or compression change (slow) copy of vmo from 19920901.ocean_month_rho2.nc.0008
[rank 002] Unaligned or compression change (slow) copy of vmo from 19920901.ocean_month_rho2.nc.0009
[rank 002] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month_rho2.nc.0022
[rank 001] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month_rho2.nc.0012
[rank 003] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month_rho2.nc.0011
[rank 001] Unaligned or compression change (slow) copy of umo from 19920901.ocean_month_rho2.nc.0012
[rank 002] Unaligned or compression change (slow) copy of umo from 19920901.ocean_month_rho2.nc.0022
[rank 003] Unaligned or compression change (slow) copy of umo from 19920901.ocean_month_rho2.nc.0011
[rank 002] Unaligned or compression change (slow) copy of vmo from 19920901.ocean_month_rho2.nc.0022
[rank 001] Unaligned or compression change (slow) copy of vmo from 19920901.ocean_month_rho2.nc.0012
[rank 003] Unaligned or compression change (slow) copy of vmo from 19920901.ocean_month_rho2.nc.0011
[rank 001] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month_rho2.nc.0024
[rank 003] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month_rho2.nc.0044
[rank 002] Unaligned or compression change (slow) copy of xq from 19920901.ocean_month_rho2.nc.0023
[rank 003] Unaligned or compression change (slow) copy of yq from 19920901.ocean_month_rho2.nc.0044
[rank 000] var 3 yq from 2 dims 1 [0,4225648,72057594037928206]

[rank 000] ERROR in HDF5 /home/502/aph502/code/c/mppnccombine-fast/async.c:446

HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
  #000: ../../src/H5Dio.c line 404 in H5Dwrite_chunk(): can't write unprocessed chunk data
    major: Dataset
    minor: Write failed
  #001: ../../src/H5Dchunk.c line 461 in H5D__chunk_direct_write(): unable to allocate chunk
    major: Dataset
    minor: Can't allocate space
  #002: ../../src/H5Dchunk.c line 6564 in H5D__chunk_file_alloc(): unable to free chunk
    major: Dataset
    minor: Unable to free object
  #003: ../../src/H5MF.c line 1216 in H5MF_xfree(): can't add section to file free space
    major: Resource unavailable
    minor: Unable to insert object
  #004: ../../src/H5MF.c line 665 in H5MF__add_sect(): can't re-add section to file free space
    major: Resource unavailable
    minor: Unable to insert object
  #005: ../../src/H5FSsection.c line 1409 in H5FS_sect_add(): can't insert free space section into skip list
    major: Free Space Manager
    minor: Unable to insert object
  #006: ../../src/H5FSsection.c line 1124 in H5FS_sect_link(): can't add section to non-size tracking data structures
    major: Free Space Manager
    minor: Unable to insert object
  #007: ../../src/H5FSsection.c line 1069 in H5FS_sect_link_rest(): can't insert free space node into merging skip list
    major: Free Space Manager
    minor: Unable to insert object
  #008: ../../src/H5SL.c line 1122 in H5SL_insert(): can't create new skip list node
    major: Skip Lists
    minor: Unable to insert object
  #009: ../../src/H5SL.c line 783 in H5SL_insert_common(): can't insert duplicate key
    major: Skip Lists
    minor: Unable to insert object
/g/data/ik11/inputs/access-om2/bin/mppnccombine-fast[0x4095a1]
/g/data/ik11/inputs/access-om2/bin/mppnccombine-fast[0x409127]
/g/data/ik11/inputs/access-om2/bin/mppnccombine-fast[0x404311]
/lib64/libc.so.6(__libc_start_main+0xe5)[0x149b3d138d85]
/g/data/ik11/inputs/access-om2/bin/mppnccombine-fast[0x4030ee]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

My run directory is

/home/142/cs6673/payu/panan_005deg_jra55_ryf_2023_05_17

Update: I checked some of the merged files (e.g. 19*.ocean_month_rho.nc) and something has gone seriously wrong with the coordinates. Some variables don’t have any lon or lat coordinates, or the lat coordinate (e.g. yq) has NaNs at the beginning. I have already run 4.5 years of panan-005 and really need to check the output to confirm it looks OK, so any help on how to collate the tiles correctly is much appreciated.
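
For reference, this is the kind of scan that flags the affected files (a minimal sketch using xarray; the glob pattern and the list of coordinate names are illustrative, not specific to this run):

import glob
import numpy as np
import xarray as xr

# Flag any collated file whose coordinate variables contain NaNs,
# i.e. files where the collation has corrupted the coordinates.
for path in sorted(glob.glob("output*/19*.ocean_month*.nc")):
    with xr.open_dataset(path, decode_times=False) as ds:
        for coord in ("xq", "yq", "xh", "yh"):
            if coord in ds and np.isnan(ds[coord].values).any():
                print(f"{path}: NaNs in coordinate {coord}")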

We have the same problem in the 1/40th panan simulation.

I’m looking into this now. When you say the non-fast version of mppnccombine “didn’t work”, were there any specific errors?

This is the beginning of the error message from /home/142/cs6673/payu/panan_005deg_jra55_ryf_2023_05_17/archive/pbs_logs/mom6_panan-00_c.e84150733

Currently Loaded Modulefiles:
 1) pbs   2) openmpi/4.1.4(default)
payu: error: Thread 0 crashed with error code 1.
 Error message:
Illegal option -o

mppnccombine 2.2.5 - (written by Hans.Vahlenkamp)

Usage:  mppnccombine [-v] [-V] [-M] [-a] [-r] [-n #] [-k #] [-e #] [-h #] [-64] [-n4] [-m]
                     output.nc [input ...]

mppnccombine-fast was developed by @Scott to alleviate the very slow and memory-intensive collation of mppnccombine for high-resolution compressed outputs.

Put simply, it avoids uncompressing and recompressing tiled data by reading and writing the compressed chunks directly, and so can be very fast indeed. However, it requires consistent chunking across the tiles for this to work.
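
One way to check whether your tiles satisfy this is to compare the chunking that each tile reports for every variable (a minimal sketch using netCDF4; the file pattern is illustrative):

import glob
import netCDF4

# Collect the chunk shape of every variable across all tiles; the fast
# chunk-copy path needs these to be identical from tile to tile.
chunking = {}
for path in sorted(glob.glob("19920901.ocean_month.nc.*")):
    with netCDF4.Dataset(path) as ds:
        for name, var in ds.variables.items():
            chunks = var.chunking()  # per-dimension chunk sizes, or 'contiguous'
            key = tuple(chunks) if isinstance(chunks, list) else chunks
            chunking.setdefault(name, set()).add(key)

for name, seen in sorted(chunking.items()):
    if len(seen) > 1:
        print(f"{name}: chunking differs across tiles: {seen}")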

There is a CLEX CMS blog post outlining how to use mppnccombine-fast, as well as technical documentation.

In any case, mppnccombine-fast should still use significantly less memory than mppnccombine even when it is operating in “slow” mode, which it seems to be in your case.

If you change to mppnccombine you can’t use the mpi: true option. Regardless, it is likely to take a very long time and use horrendous amounts of memory.

I’m not too across all the specifics, but I think there’s some kind of parallel gremlin lurking. I can reliably trigger the HDF5 abort with 4 or 6 ranks, but I’ve been successful with 2, 3, 7, and 8. My guess is that it’s some combination of the masking on tiles and how the files are distributed over the ranks. I’ll dig deeper to get a definitive answer though!

I have been using 8 CPUs now and haven’t had any HDF5 errors.

However, it doesn’t work completely reliably. While the tiles are always collated, the coordinate yq is half empty in some files for some months. I had to fix the coordinates in the following files, which look fairly random to me (a sketch of the fix follows the list):

output078/*ocean_month.nc
output093/*ocean_month.nc
output093/*ocean_month_z.nc
output069/*ocean_static.nc
output072/*ocean_static.nc
output080/*ocean_static.nc
output083/*ocean_static.nc
output087/*ocean_static.nc
output091/*ocean_static.nc
output095/*ocean_static.nc
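
For the record, this is roughly how I patched the broken files (a sketch using netCDF4, assuming a file with an intact yq on the same grid is available; the paths are illustrative):

import netCDF4

# Overwrite the corrupted yq coordinate with the values from a file where
# yq is intact; the grid is identical across outputs, so the coordinate
# values are the same. Paths are illustrative.
with netCDF4.Dataset("output077/19920101.ocean_month.nc") as good, \
        netCDF4.Dataset("output078/19920301.ocean_month.nc", "r+") as bad:
    bad["yq"][:] = good["yq"][:]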