I keep hitting the same error when running the MOM6 regional model on the Setonix supercomputer:
MPICH ERROR: All nodes in this job do not contain the same number of NICs. Aborting job as performance may be affected.
Set MPICH_OFI_SKIP_NIC_SYMMETRY_TEST=1 to bypass this check.
MPICH ERROR [Rank 0] [job id 4893671.0] [Tue Oct 17 07:37:12 2023] [nid002025] - Abort(-1) (rank 0 in comm 0): Inconsistent number of NICs across the job (Other MPI error)
The error says that you can “Set MPICH_OFI_SKIP_NIC_SYMMETRY_TEST=1 to bypass this check”. Can anyone let me know how to do this?
You can set, and pass, any environment variable using an option in
> Enable any environment variables required by mpirun during execution, such as OMPI_MCA_coll. The following example disables “matching transport layer” and “collective algorithm” components:
So in this case, I’d pass something like:
Yes, but be careful with indentation in YAML.
FYI you can check your yaml files with this website
Unfortunately, I couldn’t seem to get this working, despite trying every permutation of the call above within the config.yaml file.
What does seem to work is simply setting the environment variable within the shell:
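For anyone following along, a minimal sketch of setting the variable in the shell is below. It assumes a POSIX-style shell such as bash (the post does not say which shell was used) and takes the variable name and value straight from the MPICH error message. It does not show whether the setting reaches the batch job.

```shell
# Set the variable named in the MPICH error message for this shell session.
# Assumption: a POSIX-style shell (bash, zsh, sh); csh/tcsh would use setenv.
export MPICH_OFI_SKIP_NIC_SYMMETRY_TEST=1

# Confirm it is exported, i.e. visible to child processes of this shell.
env | grep MPICH_OFI_SKIP_NIC_SYMMETRY_TEST
```

Note that an exported variable is only inherited by processes started from this shell; a job submitted to a scheduler may or may not receive it, depending on how the submission is configured.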
Cheating? Inelegant? Yes. But the model is now chugging along quite happily, so I’ll take it!
In any case, thanks again for the pointers.
Good to hear you solved your problem, but what shell did you set this in?
I’m asking in case someone else wants to solve this problem, but also to understand why the other approach didn’t work.
Ok, scratch that. The model is complaining about the same error.
Interestingly enough, within the MOM_override file, there is a line:
#override MPICH_OFI_SKIP_NIC_SYMMETRY_TEST = 1
However, on a successful run, within the mom.out file, I found:
WARNING from PE 0: Unused line in MOM_override : override MPICH_OFI_SKIP_NIC_SYMMETRY_TEST = 1
Is there an obvious way to turn this flag on within the MOM_override file that I’m missing? The MOM6 documentation here is somewhat incomplete.
payu dumps the environment as a dictionary to a YAML file (env.yaml). I’d suggest checking that when you try the various approaches to see if it has managed to propagate to the PBS job.
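One way to do that check from the command line is sketched below. The helper name `check_env_dump` is hypothetical, and the sketch assumes the dump stores each variable as a top-level `NAME: value` entry; the location of env.yaml depends on your experiment layout, so pass the path explicitly.

```shell
# Hypothetical helper: report whether a variable appears as a key in a
# YAML environment dump. Assumes one "NAME: value" entry per line.
#   $1 = variable name, $2 = path to the env.yaml file
check_env_dump() {
    if grep -q "^[[:space:]]*$1:" "$2"; then
        echo "$1 found in $2"
    else
        echo "$1 NOT found in $2"
        return 1
    fi
}

# Example call (the path is a placeholder, adjust to your run directory):
# check_env_dump MPICH_OFI_SKIP_NIC_SYMMETRY_TEST path/to/env.yaml
```

If the variable is reported as not found, the setting did not reach the job environment, whichever method was used to set it.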
This isn’t a MOM6 thing, it’s a Setonix thing. I think it’s a harmless warning, but you could probably raise it with their support if it’s concerning you.
I’ve checked through the env.yaml files and it seems that the MPICH_OFI_SKIP_NIC_SYMMETRY_TEST environment variable has NOT been propagated through, either from the MOM_override file or from setting it directly in the terminal.
The error is somewhat random: sometimes I get it, sometimes not. I assume that, by chance, I sometimes land on a node configuration that the model and/or Setonix is happy with. I’ll chase it up with Pawsey support. Thanks.
Does it cause your job to fail when it is tripped?
Yes… or more precisely, it stops the job from starting.