Outputting a region using MOM6 diag_table - rather than a single file, I am getting many (one per processor?)

Hello - I am trying the option of outputting a “region” in diag_table. I’ve selected a 5x3 degree box and am outputting daily 3d fields. I seem to be getting multiple files (one per processor?) per variable, when I expect one, e.g.,

access-om3.mom6.3d.region.uo.z.1day.mean._2170.nc.0250
access-om3.mom6.3d.region.uo.z.1day.mean._2170.nc.0251
access-om3.mom6.3d.region.uo.z.1day.mean._2170.nc.0252

Is this expected? My diag_table is here: se-aus-regional-om3-ryf/diag_table at expt · mmr0/se-aus-regional-om3-ryf · GitHub
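
For reference, a regional field entry in my diag_table looks something like the line below (the coordinates and file label here are illustrative rather than copied verbatim; the region string is "lon_min lon_max lat_min lat_max vert_min vert_max", with "-1 -1" spanning the full depth range):

"ocean_model_z", "uo", "uo", "access-om3.mom6.3d.region.uo.z.1day.mean", "all", "mean", "110.0 115.0 -25.0 -22.0 -1 -1", 2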

Hm, looks like they aren’t being collated automatically by payu.

What’s in the “collate” section of your config.yaml?

There may also be some clues in the *_c.o* and *_c.e* files.

Annoyingly, yes. Setting IO_LAYOUT = 1, 1 (as you have) normally collates your output as diagnostics are posted. However, your subregion doesn’t include the root processor, so FMS has nowhere to gather all the outputs and each rank writes its tile separately! As @aekiss alluded to, you’ll need to enable collation with a section something like the following (adjust as necessary) in your config.yaml:

collate:
  mpi: true
  walltime: 1:00:00
  mem: 30GB
  ncpus: 4
  queue: expresssr
  exe: /g/data/vk83/apps/mppnccombine-fast/0.2/bin/mppnccombine-fast
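
For reference, the IO layout I’m referring to is the MOM6 runtime parameter, typically set in MOM_input or MOM_override, e.g.:

IO_LAYOUT = 1, 1    ! I/O processor layout; 1, 1 collapses diagnostic output to a single file where FMS can gather it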

A small note that there is now an environment module for mppnccombine-fast in /g/data/vk83/modules. We’ve yet to update all our configs to use the module, but you could instead use the following in your config:

modules:
    use:
        - /g/data/vk83/modules
    load:
        - model-tools/mppnccombine-fast/2025.07.000

collate:
    mpi: true
    walltime: 1:00:00
    mem: 30GB
    ncpus: 4
    queue: expresssr
    exe: mppnccombine-fast

Thanks @aekiss @angus-g @dougiesquire for explaining what was happening! I will try this :smiley:

I think I have a related issue.

I have added the collate section to my config file and it is working fine when:

  1. I am outputting the data over the entire model domain, e.g. "ocean_model_z", "uo", "uo", "access-om3.mom6.h.z%4yr-%2mo", "all", "mean", "none", 2

  2. I am outputting the data over a very small region that sits within a single tile, e.g. "ocean_model_z", "uo", "uo", "access-om3.mom6.h.nd.hourly.z%4yr-%2mo", "all", "mean", "113.45 113.75 -22.55 -22.35 -1 -1", 2

However, if I have chosen a large subregion (one that covers multiple tiles but not all the tiles of the domain), e.g. "ocean_model_z", "uo", "uo", "access-om3.mom6.h.nd.hourly.z%4yr-%2mo", "all", "mean", "111 124 -23 -13 -1 -1", 2

I get this error from the collate post-processing. I think it is because only some tiles are present. If I load individual tiles the data in them looks fine, but I haven’t figured out how to combine the tiles into one file myself.

Loading access-om3/2025.08.001-tracers-from-file

  Loading requirement: access3/tracers-from-file-ycbexjc

Currently Loaded Modulefiles:

 1) access3/tracers-from-file-ycbexjc           5) openmpi/4.1.7-ushgfj4 

 2) access-om3/2025.08.001-tracers-from-file    6) pbs                   

 3) nco/5.0.5                                 

 4) model-tools/mppnccombine-fast/2025.07.000 

payu: error: Thread 0 crashed with error code 255.

 Error message:

 

Copying non-collated variables

 

Copying contiguous variables

 

Copying chunked variables

[rank 000] ERROR in HDF5 /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/spack-src/async.c:490

 

HDF5-DIAG: Error detected in HDF5 (1.14.2) MPI-process 0:

  #000: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5D.c line 403 in H5Dopen2(): unable to synchronously open dataset

    major: Dataset

    minor: Can't open object

  #001: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5D.c line 364 in H5D__open_api_common(): unable to open dataset

    major: Dataset

    minor: Can't open object

  #002: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5VLcallback.c line 1980 in H5VL_dataset_open(): dataset open failed

    major: Virtual Object Layer

    minor: Can't open object

  #003: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5VLcallback.c line 1947 in H5VL__dataset_open(): dataset open failed

    major: Virtual Object Layer

    minor: Can't open object

  #004: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5VLnative_dataset.c line 321 in H5VL__native_dataset_open(): unable to open dataset

    major: Dataset

    minor: Can't open object

  #005: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Dint.c line 1418 in H5D__open_name(): not found

    major: Dataset

    minor: Object not found

  #006: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Gloc.c line 421 in H5G_loc_find(): can't find object

    major: Symbol table

    minor: Object not found

  #007: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Gtraverse.c line 816 in H5G_traverse(): internal path traversal failed

    major: Symbol table

    minor: Object not found

  #008: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Gtraverse.c line 596 in H5G__traverse_real(): traversal operator failed

    major: Symbol table

    minor: Callback failed

  #009: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Gloc.c line 381 in H5G__loc_find_cb(): object 'xh_sub04' doesn't exist

    major: Symbol table

    minor: Object not found

/g/data/vk83/apps/spack/0.22/release/linux-rocky8-x86_64_v4/intel-2021.10.0/mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/bin/mppnccombine-fast[0x40a841]

/g/data/vk83/apps/spack/0.22/release/linux-rocky8-x86_64_v4/intel-2021.10.0/mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/bin/mppnccombine-fast[0x40a522]

/g/data/vk83/apps/spack/0.22/release/linux-rocky8-x86_64_v4/intel-2021.10.0/mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/bin/mppnccombine-fast[0x405dff]

/half-root/usr/lib64/libc.so.6(__libc_start_main+0xe5)[0x148bb3cd07e5]

/g/data/vk83/apps/spack/0.22/release/linux-rocky8-x86_64_v4/intel-2021.10.0/mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/bin/mppnccombine-fast[0x404c0e]

--------------------------------------------------------------------------

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD

  Proc: [[26850,1],0]

  Errorcode: -1

 

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

You may or may not see output from other processes, depending on

exactly when Open MPI kills them.

-------------------------------------------------------------------------- 

Thanks for the info Lizzie! I didn’t think missing tiles would be an issue, since that happens on masked processors anyway. Although your subregion is rather more restricted than “most of the domain minus masking”, so perhaps something is going weird there. The fact that the errors are HDF5-related almost points to a file/filesystem issue? I wonder whether the non-fast version of mppnccombine would treat the files any differently? I haven’t seen this specific error!
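
If you want a quick manual test before touching the payu config, something like this might work on one of the regional files (a sketch only: the module providing mppnccombine and the filenames are placeholders, so adjust them to your setup):

module load fre-nctools/2024.05-1   # assumed to provide mppnccombine; adjust the module name/path as needed
cd archive/output000                # the payu output directory containing the per-rank tiles
# combine every tile of one regional file (YYYY-MM stands in for the real date stamp)
mppnccombine access-om3.mom6.h.nd.hourly.zYYYY-MM.nc access-om3.mom6.h.nd.hourly.zYYYY-MM.nc.????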

I was thinking the same thing. Definitely worth trying.

See this recent post for how to do this:

Thanks @angus-g and @Aidan

I can give that a go. After adding

load:
    - fre-nctools/2024.05-1

would I also need to change or delete this section?

collate:
    restart: true
    mpi: true
    walltime: 2:00:00
    mem: 30GB
    ncpus: 4
    queue: expresssr
    exe: mppnccombine-fast

Yes.

You need to set mpi: false and change the name of exe, e.g.

collate:
    exe: mppnccombine
    restart: true
    mpi: false
    walltime: 2:00:00
    mem: 30GB
    ncpus: 4
    queue: expresssr

You may need to fiddle with the memory and ncpus settings depending on how much you’re collating.

The reason the high-resolution models use mppnccombine-fast is that it is much faster, and IIRC it also uses a lot less memory.

I would test this on just the subregional outputs in the first instance to see if it works. If it does, then you might need to fiddle to get something that works for production if you’re running a high-resolution model over a large domain, e.g. don’t collate automatically but create a post-processing script that uses mppnccombine-fast for the normal diagnostics and mppnccombine for the sub-regional ones, as sketched below.
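
A very rough sketch of what such a script could look like follows. Treat it as a starting point only: the module names, file patterns and tool flags are assumptions that need checking against your own run.

#!/usr/bin/env bash
# Hypothetical post-processing sketch: combine full-domain diagnostics with
# mppnccombine-fast and the sub-regional ones with plain mppnccombine.
# Module names, file patterns and flags are assumptions - adjust before use.
set -eu
shopt -s nullglob

module use /g/data/vk83/modules
module load model-tools/mppnccombine-fast/2025.07.000
module load fre-nctools/2024.05-1    # assumed to provide mppnccombine

cd archive/output000                 # payu output directory to collate

# Full-domain diagnostics: fast, MPI-enabled combiner (check --help for the exact flags).
for f in access-om3.mom6.h.z*.nc.0000; do
    base=${f%.0000}
    mpirun -n 4 mppnccombine-fast -o "$base" "$base".[0-9][0-9][0-9][0-9]
done

# Sub-regional diagnostics: the lowest-numbered tile may not be .0000, so build
# the list of file bases from whatever tiles exist, then combine each with the
# serial mppnccombine.
for f in access-om3.mom6.h.nd.hourly.z*.nc.[0-9][0-9][0-9][0-9]; do
    echo "${f%.*}"
done | sort -u | while read -r base; do
    mppnccombine "$base" "$base".[0-9][0-9][0-9][0-9]
done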

Note: you can specify a specific directory to collate, e.g.

payu collate -d archive/restart000

Also, this related issue popped up… would @angus-g’s fix work here too?

It is simple to specify a different number of CPUs, so maybe worth a try?
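
i.e. just change the ncpus value in the existing collate section of config.yaml, e.g. (the value here is purely illustrative):

collate:
    ncpus: 2    # try a different value; leave the rest of the collate section as it is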

Could be worth trying at least. That error is the one that was nominally fixed by just bumping the HDF5 version.

Is this bug fix documented somewhere? I searched but couldn’t find anything.

There was a discussion about it on Zulip, but because the solution just involved updating dependencies I don’t think it was documented otherwise…
