Recent test runs of ACCESS-ESM1.6 encountered an openmpi bug that may be worth knowing about if you are running openmpi v4-based applications.
The model would occasionally fail with a file-not-found error in the CICE submodel:
CICE: ERROR failed to open input_ice.nml. Error code: 29 -
file not found, unit 11, file <laboratory-path>/work/ocean/input_ice.nml
Confusingly, the error message pointed to the work/ocean directory, while the mpirun command from payu specified the work/ice directory for CICE:
mpirun -wdir <work>/atmosphere -np 208 <work>/atmosphere/um_hg3.exe :
-wdir <work>/ocean -np 196 <work>/ocean/mom5_access_cm :
-wdir <work>/ice -np 12 <work>/ice/cice_access-esm1.6_360x300_12x1_12p.exe
Printing the PWD and OMPI_MCA_initial_wdir environment variables during model initialisation showed that CICE used the wrong working directory in the failed runs:
Rank: 7 , pwd: <work>/ocean
CICE Rank: 7 , OMPI_MCA_initial_wdir: <work>/ocean
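If you want to run the same kind of check in your own application, a minimal Python sketch could look like the one below. This is not the code used in CICE (which is Fortran). The function name describe_workdir is hypothetical, and the output format only loosely mirrors the lines above. It assumes the process was launched by Open MPI's mpirun, which sets OMPI_COMM_WORLD_RANK for each rank.

```python
import os


def describe_workdir(environ=None, cwd=None):
    """Return a one-line summary of where this MPI rank thinks it is running.

    `environ` and `cwd` can be passed in explicitly, which makes the
    function easy to exercise without launching under mpirun.
    """
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd
    # OMPI_COMM_WORLD_RANK is set by Open MPI for each launched process.
    rank = environ.get("OMPI_COMM_WORLD_RANK", "unknown")
    # OMPI_MCA_initial_wdir is the variable inspected in the runs above.
    initial_wdir = environ.get("OMPI_MCA_initial_wdir", "unset")
    return f"Rank: {rank} , pwd: {cwd} , OMPI_MCA_initial_wdir: {initial_wdir}"


if __name__ == "__main__":
    print(describe_workdir())
```

Calling this once early in start-up on every rank, and comparing the printed directories against the -wdir arguments given to mpirun, should be enough to spot an affected run.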
From discussion with NCI, there appears to be a bug in the openmpi v4 series, where “occasionally one context gets the working directory of the previous context.” Testing with openmpi 5.0.5 still hit a similar issue, but much less often.
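Since the problem is intermittent, one defensive option is to have each executable (or a small launcher around it) verify its working directory before reading any input files. The sketch below is only an illustration of that idea, not something tested with ACCESS-ESM1.6 or payu. The names ensure_workdir and run_in are hypothetical, and it assumes the expected directory is passed to the process explicitly rather than taken from mpirun.

```python
import os
import sys


def ensure_workdir(expected):
    """Switch to `expected` if the process started somewhere else.

    Returns True if a correction was needed, False otherwise.
    """
    expected = os.path.realpath(expected)
    actual = os.path.realpath(os.getcwd())
    if actual == expected:
        return False
    # Report the mismatch so affected runs can still be counted afterwards.
    print(f"workdir mismatch: got {actual}, expected {expected}", file=sys.stderr)
    os.chdir(expected)
    # os.chdir does not update PWD, so keep the environment consistent
    # for anything that reads the variable instead of calling getcwd().
    os.environ["PWD"] = expected
    return True


def run_in(expected, argv):
    """Hypothetical launcher: fix the directory, then replace this process.

    `argv` is the executable path followed by its arguments.
    """
    ensure_workdir(expected)
    os.execv(argv[0], argv)
```

Whether silently correcting the directory or aborting the run is preferable depends on the workflow. Logging the mismatch either way keeps the failure rate visible.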