Working directory bug in openmpi version 4

Recent test runs of ACCESS ESM1.6 encountered an openmpi bug that may be worth knowing about if you are running openmpi v4-based applications.

The model would occasionally fail with a file-not-found error in the CICE submodel:

CICE: ERROR failed to open input_ice.nml. Error code:  29  - 
file not found, unit 11, file <laboratory-path>/work/ocean/input_ice.nml

Confusingly, the error message pointed to the work/ocean directory, while the mpirun command from payu specified the work/ice directory for CICE:

mpirun  -wdir <work>/atmosphere -np 208 <work>/atmosphere/um_hg3.exe : 
-wdir <work>/ocean -np 196 <work>/ocean/mom5_access_cm : 
-wdir <work>/ice -np 12  <work>/ice/cice_access-esm1.6_360x300_12x1_12p.exe

Printing the PWD and OMPI_MCA_initial_wdir environment variables during model initialisation showed that CICE used the wrong working directory in the failed runs:

Rank:            7 , pwd: <work>/ocean
CICE Rank:       7 , OMPI_MCA_initial_wdir:  <work>/ocean
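
For anyone wanting to run a similar check in their own application, here is a minimal sketch of that kind of diagnostic. It is written in Python purely for illustration and is not the code that was added to CICE. The function name `describe_working_directory` and its `expected_dir` and `label` parameters are hypothetical. The only name taken from the runs above is the OMPI_MCA_initial_wdir environment variable, which may be unset outside an mpirun launch.

```python
import os


def describe_working_directory(expected_dir, label="model"):
    """Compare the process's actual working directory with the expected one.

    Returns (ok, message). `expected_dir` and `label` are hypothetical
    parameters for this sketch, not part of any model or openmpi API.
    """
    # Resolve symlinks so two spellings of the same directory compare equal.
    actual = os.path.realpath(os.getcwd())
    expected = os.path.realpath(expected_dir)

    # Variable printed in the failed runs; fall back to a placeholder
    # because it may not be set outside an mpirun launch.
    initial_wdir = os.environ.get("OMPI_MCA_initial_wdir", "<unset>")

    ok = actual == expected
    message = (
        f"{label}: pwd: {actual}, "
        f"OMPI_MCA_initial_wdir: {initial_wdir}, "
        f"expected: {expected}, match: {ok}"
    )
    return ok, message
```

Calling `describe_working_directory("<work>/ice", label="CICE")` early in start-up and aborting when `ok` is false would turn the confusing file-not-found error into an explicit working directory mismatch.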

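From discussion with NCI, there appears to be a bug in the openmpi v4 series, where “occasionally one context gets the working directory of the previous context.” That matches the failure above: the ice context follows the ocean context in the mpirun command, and CICE ended up in work/ocean. Testing with openmpi 5.0.5 still encountered a similar issue, but much less often.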
