I’m having a repeated error running the MOM6 regional model on the Setonix supercomputer.
MPICH ERROR: All nodes in this job do not contain the same number of NICs. Aborting job as performance may be affected.
Set MPICH_OFI_SKIP_NIC_SYMMETRY_TEST=1 to bypass this check.
MPICH ERROR [Rank 0] [job id 4893671.0] [Tue Oct 17 07:37:12 2023] [nid002025] - Abort(-1) (rank 0 in comm 0): Inconsistent number of NICs across the job (Other MPI error)
The error says that you can “Set MPICH_OFI_SKIP_NIC_SYMMETRY_TEST=1 to bypass this check”. Can anyone let me know how to do this?
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
You can set and pass any environment variable using an option in config.yaml:
Enable any environment variables required by mpirun during execution, such as OMPI_MCA_coll. The following example below disables “matching transport layer” and “collective algorithm” components:
Ok, scratch that. The model is complaining about the same error.
Interestingly enough, within the MOM_override file, there is a line:
#override MPICH_OFI_SKIP_NIC_SYMMETRY_TEST = 1
However, on a successful run, within the mom.out file, I found:
WARNING from PE 0: Unused line in MOM_override : override MPICH_OFI_SKIP_NIC_SYMMETRY_TEST = 1
Is there an obvious way to turn this flag on within the MOM_override file that I’m missing? The MOM6 documentation here is somewhat incomplete.
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
payu dumps the environment as a dictionary to a yaml file (env.yaml). I’d suggest checking that when you try the various approaches to see if it has managed to propagate to the PBS job.
This isn’t a MOM6 thing, it’s a Setonix thing. I think it’s a harmless warning, but you could probably raise it with their support if it’s concerning you.
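The env.yaml check suggested above can be sketched in a few lines of Python. This is only an illustration: it assumes env.yaml is a flat `KEY: value` dump of the environment, the `env.yaml` path is a placeholder for wherever payu wrote the file in your run, and `lookup_env_var` is a hypothetical helper, not part of payu.

```python
from pathlib import Path

VAR = "MPICH_OFI_SKIP_NIC_SYMMETRY_TEST"


def lookup_env_var(env_yaml_text, name):
    """Return the value recorded for `name` in a flat `KEY: value` dump, or None.

    Assumption: each environment variable sits on its own unindented line.
    """
    for line in env_yaml_text.splitlines():
        # Skip blank lines and indented continuation lines of multi-line values.
        if line[:1].isspace() or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if key.strip() == name:
            # Drop surrounding whitespace and any YAML quoting.
            return value.strip().strip("'\"")
    return None


if __name__ == "__main__":
    path = Path("env.yaml")  # placeholder path; adjust to your run directory
    if path.exists():
        value = lookup_env_var(path.read_text(), VAR)
        if value is None:
            print(f"{VAR} not found in {path}")
        else:
            print(f"{VAR} = {value!r}")
    else:
        print(f"{path} not found")
```

If the variable is reported as not found, the setting never reached the job environment, whichever way it was set.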
I’ve checked through the env.yaml files and it seems that the MPICH_OFI_SKIP_NIC_SYMMETRY_TEST environment variable has NOT been propagated through, either from the MOM_override file or from directly setting it in the terminal.
The error is somewhat random: sometimes I get it, sometimes not. I assume that, by chance, I sometimes land on a node configuration that the model and/or Setonix is happy with. I’ll chase it up with Pawsey support. Thanks.
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
Does it cause your job to fail when it is tripped?