Outputting a region using MOM6 diag_table - rather than a single file, I am getting many (one per processor?)

Hello - I am trying the option of outputting a “region” in diag_table. I’ve selected a 5x3 degree box and am outputting daily 3d fields. I seem to be getting multiple files (one per processor?) per variable, when I expect one, e.g.,

access-om3.mom6.3d.region.uo.z.1day.mean._2170.nc.0250
access-om3.mom6.3d.region.uo.z.1day.mean._2170.nc.0251
access-om3.mom6.3d.region.uo.z.1day.mean._2170.nc.0252

Is this expected? My diag_table is here se-aus-regional-om3-ryf/diag_table at expt · mmr0/se-aus-regional-om3-ryf · GitHub
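For reference, the region is selected via the seventh column of a diag_table field line (lon_min lon_max lat_min lat_max vert_min vert_max), so the entries are of this general form, with illustrative bounds here rather than my exact values:

"ocean_model_z", "uo", "uo", "access-om3.mom6.3d.region.uo.z.1day", "all", "mean", "147 152 -45 -42 -1 -1", 2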

Hm, looks like they aren’t being collated automatically by payu.

What’s in the “collation” section of your config.yaml?

There may also be some clues in the *_c.o* and *_c.e* files.

Annoyingly, yes. Setting IO_LAYOUT = 1, 1 (as you have) effectively collates your output when diagnostics are posted. However, your subregion doesn’t include the root processor, so FMS doesn’t have anywhere to gather all the outputs and each rank will output its tile separately! As @aekiss alluded to, you’ll need to enable collation with a section something like the following (adjust as necessary) in your config.yaml:

collate:
  mpi: true
  walltime: 1:00:00
  mem: 30GB
  ncpus: 4
  queue: expresssr
  exe: /g/data/vk83/apps/mppnccombine-fast/0.2/bin/mppnccombine-fast
A small note that there is now an environment module for mppnccombine-fast in /g/data/vk83/modules. We’ve yet to update all our configs to use the module, but you could instead use the following in your config:

modules:
    use:
        - /g/data/vk83/modules
    load:
        - model-tools/mppnccombine-fast/2025.07.000

collate:
  mpi: true
  walltime: 1:00:00
  mem: 30GB
  ncpus: 4
  queue: expresssr
  exe: mppnccombine-fast
Thanks @aekiss @angus-g @dougiesquire for explaining what was happening! I will try this :smiley:

I think I have a related issue.

I have added the collate section to my config file and it is working fine when:

  1. I am outputting the data over the entire model domain, e.g. “ocean_model_z”, “uo”, “uo”, “access-om3.mom6.h.z%4yr-%2mo”, “all”, “mean”, “none”, 2

  2. I am outputting the data over a very small region, all within the same tile, e.g. “ocean_model_z”, “uo”, “uo”, “access-om3.mom6.h.nd.hourly.z%4yr-%2mo”, “all”, “mean”, “113.45 113.75 -22.55 -22.35 -1 -1”, 2

However, if I have chosen a large subregion (one that covers multiple tiles but not all the tiles of the domain), e.g. “ocean_model_z”, “uo”, “uo”, “access-om3.mom6.h.nd.hourly.z%4yr-%2mo”, “all”, “mean”, “111 124 -23 -13 -1 -1”, 2

I get this error from the collate post-processing. I think it’s because only some tiles are present. If I load individual tiles the data in them looks fine, but I haven’t figured out how to combine the tiles into one file myself.

Loading access-om3/2025.08.001-tracers-from-file

  Loading requirement: access3/tracers-from-file-ycbexjc

Currently Loaded Modulefiles:

 1) access3/tracers-from-file-ycbexjc           5) openmpi/4.1.7-ushgfj4 

 2) access-om3/2025.08.001-tracers-from-file    6) pbs                   

 3) nco/5.0.5                                 

 4) model-tools/mppnccombine-fast/2025.07.000 

payu: error: Thread 0 crashed with error code 255.

 Error message:

 

Copying non-collated variables

 

Copying contiguous variables

 

Copying chunked variables

[rank 000] ERROR in HDF5 /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/spack-src/async.c:490

 

HDF5-DIAG: Error detected in HDF5 (1.14.2) MPI-process 0:

  #000: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5D.c line 403 in H5Dopen2(): unable to synchronously open dataset

    major: Dataset

    minor: Can't open object

  #001: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5D.c line 364 in H5D__open_api_common(): unable to open dataset

    major: Dataset

    minor: Can't open object

  #002: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5VLcallback.c line 1980 in H5VL_dataset_open(): dataset open failed

    major: Virtual Object Layer

    minor: Can't open object

  #003: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5VLcallback.c line 1947 in H5VL__dataset_open(): dataset open failed

    major: Virtual Object Layer

    minor: Can't open object

  #004: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5VLnative_dataset.c line 321 in H5VL__native_dataset_open(): unable to open dataset

    major: Dataset

    minor: Can't open object

  #005: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Dint.c line 1418 in H5D__open_name(): not found

    major: Dataset

    minor: Object not found

  #006: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Gloc.c line 421 in H5G_loc_find(): can't find object

    major: Symbol table

    minor: Object not found

  #007: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Gtraverse.c line 816 in H5G_traverse(): internal path traversal failed

    major: Symbol table

    minor: Object not found

  #008: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Gtraverse.c line 596 in H5G__traverse_real(): traversal operator failed

    major: Symbol table

    minor: Callback failed

  #009: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Gloc.c line 381 in H5G__loc_find_cb(): object 'xh_sub04' doesn't exist

    major: Symbol table

    minor: Object not found

/g/data/vk83/apps/spack/0.22/release/linux-rocky8-x86_64_v4/intel-2021.10.0/mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/bin/mppnccombine-fast[0x40a841]

/g/data/vk83/apps/spack/0.22/release/linux-rocky8-x86_64_v4/intel-2021.10.0/mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/bin/mppnccombine-fast[0x40a522]

/g/data/vk83/apps/spack/0.22/release/linux-rocky8-x86_64_v4/intel-2021.10.0/mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/bin/mppnccombine-fast[0x405dff]

/half-root/usr/lib64/libc.so.6(__libc_start_main+0xe5)[0x148bb3cd07e5]

/g/data/vk83/apps/spack/0.22/release/linux-rocky8-x86_64_v4/intel-2021.10.0/mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/bin/mppnccombine-fast[0x404c0e]

--------------------------------------------------------------------------

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD

  Proc: [[26850,1],0]

  Errorcode: -1

 

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

You may or may not see output from other processes, depending on

exactly when Open MPI kills them.

-------------------------------------------------------------------------- 

Thanks for the info Lizzie! I didn’t think it was an issue that tiles are missing, since that would happen on masked processors anyway. Although your subregion is somewhat more restricted than “most of the domain minus masking”, so perhaps something is going weird there. The errors being HDF5-related almost points to a file/filesystem issue? I wonder if the non-fast version of mppnccombine would treat the files any differently? I haven’t seen this specific error!

I was thinking the same thing. Definitely worth trying.

See this recent post for how to do this:

Thanks @angus-g and @Aidan

I can give that a go.
After adding

load:
    - fre-nctools/2024.05-1

would I also need to change or delete this section?

collate:
    restart: true
    mpi: true
    walltime: 2:00:00
    mem: 30GB
    ncpus: 4
    queue: expresssr
    exe: mppnccombine-fast

Yes.

You need to set mpi: false and change the name of exe, e.g.

collate:
    exe: mppnccombine
    restart: true
    mpi: false
    walltime: 2:00:00
    mem: 30GB
    ncpus: 4
    queue: expresssr

You may need to fiddle with the memory and ncpus settings depending on how much you’re collating.

The high resolution models use mppnccombine-fast because it is much faster, and IIRC it also uses a lot less memory.

I would test this on just the subregional outputs in the first instance to see if it works. If it does, then you might need to fiddle to get something that works for production if you’re running a high resolution model over a large domain, e.g. don’t automatically collate but create a post-processing script that uses mppnccombine-fast for the normal diagnostics and mppnccombine for the sub-regional ones.
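As a very rough sketch of what such a script could look like (the output-directory path, file patterns and the -o/--output convention for mppnccombine-fast are assumptions on my part, so check each tool’s help before relying on this):

#!/usr/bin/env bash
# Sketch only: collate sub-regional tiles with plain mppnccombine and
# everything else with mppnccombine-fast. Paths and patterns are illustrative.
module use /g/data/vk83/modules
module load model-tools/mppnccombine-fast/2025.07.000 fre-nctools/2024.05-1

cd archive/output000   # example payu output directory

# Sub-regional diagnostics (one tile file per PE) -> plain mppnccombine
for base in $(ls *region*.nc.???? | sed 's/\.nc\..*$/.nc/' | sort -u); do
    mppnccombine "$base" "$base".????
done

# All other distributed diagnostics -> mppnccombine-fast (assumed -o usage)
for base in $(ls *.nc.???? | grep -v region | sed 's/\.nc\..*$/.nc/' | sort -u); do
    mpirun -np 2 mppnccombine-fast -o "$base" "$base".????
done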

Note: you can specify a specific directory to collate
e.g.

payu collate -d archive/restart000

Also this related issue popped up … would @angus-g’s fix work here too?

It is simple to specify a different number of CPUs, so maybe worth a try?

Could be worth trying at least. That error is the one that was nominally fixed by just bumping the HDF5 version.

Is this bug fix documented somewhere? I searched but couldn’t find anything.

There was a discussion about it on Zulip but because the solution just involved updating dependencies I don’t think it was documented otherwise…

I have tried both the non-fast and fast versions of mppnccombine with different numbers of CPUs. Changing the number of CPUs didn’t alter the error message for me.

When I used the non-fast version, the error I got started like this, which is different to the error I got using the fast version.

payu: error: Thread 0 crashed with error code 9.
Error message:
ERROR: missing at least -m from the input file set.  Exiting.

payu: error: Thread 1 crashed with error code 9.
Error message:
ERROR: missing at least -m from the input file set.  Exiting.

payu: error: Thread 2 crashed with error code 9.
Error message:
ERROR: missing at least -m from the input file set.  Exiting.

payu: error: Thread 3 crashed with error code 9.
Error message:
ERROR: missing at least -m from the input file set.  Exiting.

payu: error: Thread 4 crashed with error code 9.
Error message:
ERROR: missing at least -m from the input file set.  Exiting.

Hi Lizzie, I think

might solve the problem with the non-fast version (mppnccombine)

Thanks Anton

That seems to give a new error!

payu: error: Thread 0 crashed with error code -6.
Error message:
[gadi-cpu-spr-0182:952508:0:952508] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
corrupted size vs. prev_size

Thanks for sharing your files that cause the collation error @Lizzie. I am able to collate the single-variable file with mppnccombine, but I get the following for the multi-variable file:

[gadi-login-04:1395861:0:1395861] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
BFD: Dwarf Error: Invalid abstract instance DIE ref.
BFD: Dwarf Error: Invalid abstract instance DIE ref.
BFD: Dwarf Error: Invalid abstract instance DIE ref.
BFD: Dwarf Error: Invalid abstract instance DIE ref.
BFD: Dwarf Error: Invalid abstract instance DIE ref.
BFD: Dwarf Error: Invalid abstract instance DIE ref.
BFD: Dwarf Error: Invalid abstract instance DIE ref.
BFD: Dwarf Error: Invalid abstract instance DIE ref.
BFD: Dwarf Error: Invalid abstract instance DIE ref.
BFD: Dwarf Error: Invalid abstract instance DIE ref.
BFD: Dwarf Error: Invalid abstract instance DIE ref.
==== backtrace (tid:1395861) ====
 0 0x000000000004e5b0 killpg()  ???:0
 1 0x00000000001aeb81 H5FL_blk_malloc()  ???:0
 2 0x00000000000930b9 H5B__cache_deserialize()  H5Bcache.c:0
 3 0x00000000000af94c H5C__load_entry()  H5Centry.c:0
 4 0x00000000000ae22b H5C_protect()  ???:0
 5 0x0000000000089527 H5AC_protect()  ???:0
 6 0x000000000008dd5b H5B_find()  ???:0
 7 0x00000000000d5308 H5D__btree_idx_get_addr()  H5Dbtree.c:0
 8 0x00000000000e625a H5D__chunk_lookup()  ???:0
 9 0x00000000000dc5a2 H5D__chunk_read()  H5Dchunk.c:0
10 0x000000000010c3c7 H5D__read()  ???:0
11 0x00000000003d702f H5VL__native_dataset_read()  ???:0
12 0x00000000003c0980 H5VL_dataset_read_direct()  ???:0
13 0x00000000000cdbf4 H5Dread()  ???:0
14 0x00000000000c2a2e NC4_get_vars()  ???:0
15 0x00000000000c1c26 NC4_get_vara()  ???:0
16 0x0000000000032a7f NC_get_vara()  ???:0
17 0x0000000000030949 nc_get_vara()  ???:0
18 0x0000000000029a36 ncvarget()  ???:0
19 0x0000000000406ce7 process_vars()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-fre-nctools-2024.05-1-b7s7yifeqhlj3ydibuwfp3ee6zgag2vl/spack-src/src/mpp-nccombine/mppnccombine.c:1478
20 0x0000000000405d0d process_file()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-fre-nctools-2024.05-1-b7s7yifeqhlj3ydibuwfp3ee6zgag2vl/spack-src/src/mpp-nccombine/mppnccombine.c:1251
21 0x00000000004044e7 main()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-fre-nctools-2024.05-1-b7s7yifeqhlj3ydibuwfp3ee6zgag2vl/spack-src/src/mpp-nccombine/mppnccombine.c:698
22 0x000000000003a7e5 __libc_start_main()  ???:0
23 0x000000000040388e _start()  ???:0
=================================
Segmentation fault

I just wanted to check that was also your experience, because I don’t recall you mentioning that some were successful.

Hi Dougie, thanks for looking.

No, I didn’t even get the single-variable mld file to collate, and I haven’t seen that error message before!

To use mppnccombine I set it up like this, but I tried with and without the flags and with some different ncpus values.

modules:
    use:
        - /g/data/vk83/modules
        - /g/data/x77/ahg157/modules
    load:
        - access-om3/2025.08.001-tracers-from-file
        - nco/5.0.5
        - fre-nctools/2024.05-1

collate:
    flags: -n4 -m -r
    exe: mppnccombine
    restart: true
    mpi: false
    walltime: 0:30:00
    mem: 30GB
    ncpus: 1
    queue: expresssr

Getting a bit desperate, but can you avoid the error by modifying the MOM LAYOUT? I think I recall you are using AUTO_MASKTABLE = True so you would need to turn that off and follow these instructions to generate the computational mask.

The check_mask tool is available in the fre-nctools environment module:

$ module use /g/data/vk83/modules
$ module load fre-nctools/2024.05-1
$ which check_mask
/g/data/vk83/apps/spack/0.22/release/linux-rocky8-x86_64_v4/intel-2021.10.0/fre-nctools-2024.05-1-b7s7yifeqhlj3ydibuwfp3ee6zgag2vl/bin/check_mask
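From memory the invocation is something like the following (the layout and file names are placeholders and the exact flags may differ, so check check_mask --help and the linked instructions):

$ check_mask --grid_file ocean_mosaic.nc --ocean_topog topog.nc --layout 16,15

and the generated mask_table then goes into MOM_input with the fixed layout, something like:

AUTO_MASKTABLE = False
LAYOUT = 16, 15
MASKTABLE = "mask_table.3.16x15"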

It could also be worth opening an issue with the people who write mppnccombine to see if they have any suggestions.