Hello - I am trying the option of outputting a “region” in diag_table. I’ve selected a 5x3 degree box and am outputting daily 3D fields. I seem to be getting multiple files per variable (one per processor?), when I expect one.
Annoyingly, yes. Setting IO_LAYOUT = 1, 1 (as you have) effectively collates your output when diagnostics are posted. However, your subregion doesn’t include the root processor, so FMS has nowhere to gather all the outputs, and each rank writes its tile separately! As @aekiss alluded to, you’ll need to enable collation with a section something like the following (adjust as necessary) in your config.yaml:
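For example (a sketch only; the resource requests are placeholders, and exe assumes mppnccombine-fast can be found on your PATH):

```yaml
# Sketch: tune walltime/mem/ncpus to your run size
collate:
  restart: true           # collate restart files too
  mpi: true               # mppnccombine-fast is MPI-enabled
  walltime: 2:00:00
  mem: 30GB
  ncpus: 4
  exe: mppnccombine-fast  # assumes the executable is on your PATH
```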
A small note that there is now an environment module for mppnccombine-fast in /g/data/vk83/modules. We’ve yet to update all our configs to use the module, but could you instead use the following in your config:
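Something along these lines (the version shown is just an example):

```yaml
# Load mppnccombine-fast from the vk83 module tree
modules:
  use:
    - /g/data/vk83/modules
  load:
    - model-tools/mppnccombine-fast/2025.07.000
```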
I have added the collate section to my config file, and it works fine when:
- I output the data over the entire model domain, e.g.

  ```
  "ocean_model_z", "uo", "uo", "access-om3.mom6.h.z%4yr-%2mo", "all", "mean", "none", 2
  ```

- I output the data over a very small region that sits entirely within one tile, e.g.

  ```
  "ocean_model_z", "uo", "uo", "access-om3.mom6.h.nd.hourly.z%4yr-%2mo", "all", "mean", "113.45 113.75 -22.55 -22.35 -1 -1", 2
  ```

However, if I choose a large subregion (one that covers multiple tiles, but not all the tiles of the domain), e.g.

```
"ocean_model_z", "uo", "uo", "access-om3.mom6.h.nd.hourly.z%4yr-%2mo", "all", "mean", "111 124 -23 -13 -1 -1", 2
```
I get this error from the collate post-processing. I think it is because only some tiles are present. If I load individual tiles, the data in them looks fine, but I haven’t figured out how to combine the tiles into one file myself.
```text
Loading access-om3/2025.08.001-tracers-from-file
Loading requirement: access3/tracers-from-file-ycbexjc
Currently Loaded Modulefiles:
1) access3/tracers-from-file-ycbexjc 5) openmpi/4.1.7-ushgfj4
2) access-om3/2025.08.001-tracers-from-file 6) pbs
3) nco/5.0.5
4) model-tools/mppnccombine-fast/2025.07.000
payu: error: Thread 0 crashed with error code 255.
Error message:
Copying non-collated variables
Copying contiguous variables
Copying chunked variables
[rank 000] ERROR in HDF5 /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/spack-src/async.c:490
HDF5-DIAG: Error detected in HDF5 (1.14.2) MPI-process 0:
#000: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5D.c line 403 in H5Dopen2(): unable to synchronously open dataset
major: Dataset
minor: Can't open object
#001: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5D.c line 364 in H5D__open_api_common(): unable to open dataset
major: Dataset
minor: Can't open object
#002: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5VLcallback.c line 1980 in H5VL_dataset_open(): dataset open failed
major: Virtual Object Layer
minor: Can't open object
#003: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5VLcallback.c line 1947 in H5VL__dataset_open(): dataset open failed
major: Virtual Object Layer
minor: Can't open object
#004: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5VLnative_dataset.c line 321 in H5VL__native_dataset_open(): unable to open dataset
major: Dataset
minor: Can't open object
#005: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Dint.c line 1418 in H5D__open_name(): not found
major: Dataset
minor: Object not found
#006: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Gloc.c line 421 in H5G_loc_find(): can't find object
major: Symbol table
minor: Object not found
#007: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Gtraverse.c line 816 in H5G_traverse(): internal path traversal failed
major: Symbol table
minor: Object not found
#008: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Gtraverse.c line 596 in H5G__traverse_real(): traversal operator failed
major: Symbol table
minor: Callback failed
#009: /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-hdf5-1.14.2-oum3uh3ht4on2t6x6wpicwhhvn3kciuj/spack-src/src/H5Gloc.c line 381 in H5G__loc_find_cb(): object 'xh_sub04' doesn't exist
major: Symbol table
minor: Object not found
/g/data/vk83/apps/spack/0.22/release/linux-rocky8-x86_64_v4/intel-2021.10.0/mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/bin/mppnccombine-fast[0x40a841]
/g/data/vk83/apps/spack/0.22/release/linux-rocky8-x86_64_v4/intel-2021.10.0/mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/bin/mppnccombine-fast[0x40a522]
/g/data/vk83/apps/spack/0.22/release/linux-rocky8-x86_64_v4/intel-2021.10.0/mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/bin/mppnccombine-fast[0x405dff]
/half-root/usr/lib64/libc.so.6(__libc_start_main+0xe5)[0x148bb3cd07e5]
/g/data/vk83/apps/spack/0.22/release/linux-rocky8-x86_64_v4/intel-2021.10.0/mppnccombine-fast-2025.07.000-6kyw6dwnfpnz3k74v3dkjmzkglpdxvq5/bin/mppnccombine-fast[0x404c0e]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[26850,1],0]
Errorcode: -1
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
```
Thanks for the info Lizzie! I didn’t think missing tiles would be an issue, since that happens on masked processors anyway. Although your subregion is rather more restricted than “most of the domain minus masking”, so perhaps something is going weird there. The errors being HDF5-related almost point to a file/filesystem issue? I wonder if the non-fast version of mppnccombine would treat the files any differently? I haven’t seen this specific error!
Aidan (Aidan Heerdegen, ACCESS-NRI Release Team Lead):
I was thinking the same thing. Definitely worth trying.
I can give that a go.
After adding fre-nctools/2024.05-1 to the load: list, would I also need to change or delete this section?

```yaml
collate:
  restart: true
  mpi: true
  walltime: 2:00:00
  mem: 30GB
  ncpus: 4
  queue: expresssr
  exe: mppnccombine-fast
```
Aidan (Aidan Heerdegen, ACCESS-NRI Release Team Lead):
Yes.
You need to set mpi: false and change the name of exe, e.g.
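For example, keeping your other settings (this assumes the classic mppnccombine that ships with fre-nctools):

```yaml
collate:
  restart: true
  mpi: false          # the classic tool runs serially
  walltime: 2:00:00
  mem: 30GB
  ncpus: 4            # see the note below about memory and ncpus
  queue: expresssr
  exe: mppnccombine   # from the fre-nctools module
```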
You may need to fiddle with the memory and ncpus settings depending on how much you’re collating.
The reason the high-resolution models use mppnccombine-fast is that it is much faster, but IIRC it also uses a lot less memory.
I would test this on just the subregional outputs in the first instance to see if it works. If it does, you might then need to fiddle to get something that works for production if you’re running a high-resolution model over a large domain, e.g. don’t automatically collate, but create a post-processing script that uses mppnccombine-fast for the normal diagnostics and mppnccombine for the sub-regional ones.
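A rough sketch of the kind of script I mean (everything here is hypothetical: the filenames follow your diag_table templates, the .nc.NNNN tile suffixes are the usual FMS convention, and you should check mppnccombine-fast --help for the exact flags):

```sh
#!/usr/bin/env bash
# Hypothetical post-run collation: fast tool for full-domain files,
# classic mppnccombine for the sub-regional ones.
set -euo pipefail
outdir=$1    # e.g. archive/output000

# Full-domain diagnostics: use the fast, MPI-enabled tool
for first in "$outdir"/access-om3.mom6.h.z*.nc.0000; do
    [ -e "$first" ] || continue
    base=${first%.0000}
    mpirun -n 4 mppnccombine-fast --output "$base" "$base".????
done

# Sub-regional diagnostics: try the classic mppnccombine,
# to see whether it treats the missing tiles differently
for first in "$outdir"/access-om3.mom6.h.nd.hourly.z*.nc.0000; do
    [ -e "$first" ] || continue
    base=${first%.0000}
    mppnccombine "$base" "$base".????
done
```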
Note: you can collate a specific directory, e.g.

```
payu collate -d archive/restart000
```
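You can also run the tool by hand on one set of tiles to test it (filenames hypothetical, following your file template):

```sh
# Hypothetical: combine the per-rank tiles of one month's sub-regional file
mppnccombine access-om3.mom6.h.nd.hourly.z1900-01.nc access-om3.mom6.h.nd.hourly.z1900-01.nc.????
```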
Aidan (Aidan Heerdegen, ACCESS-NRI Release Team Lead):
Also, this related issue popped up… would @angus-g’s fix work here too? It is simple to specify a different number of CPUs, so it may be worth a try?
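If that fix amounts to running the collation with a different CPU count, it is a one-line change in the collate section, e.g. (sketch only; whether one CPU is the right number here is an assumption):

```yaml
collate:
  ncpus: 1    # hypothetical: try a single CPU for the collation job
```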