Hi Atmos community,
My suite failed at the ‘glm_um_recon1’ step while running the UM RNS.
The job.err indicates:
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here’s some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "Error" (-1) instead of "Success" (0)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
and job.out reports:
Could not find PE0 output file: pe_output/umgla.fort6.pe000
I previously encountered this bug during the hh5 to xp65 transition, where reverting to an older version of conda/analysis3 (25.05) solved it. However, that fix is no longer working.
Does anyone have suggestions on how to resolve this?
Thanks,
Zhangcheng
Hi Zhangcheng, I’ve been running with analysis3 24.09, which seems to be working (having previously had that error).
Hi Matt,
Thanks for the suggestions! It looks like I don’t have version 24.09 available in my conda environment—only 24.07, 24.11, and 25.**. I’ve tested both 24.07 and 24.11, but unfortunately, neither resolved the issue.
Would you mind sharing your branch id? I’d like to compare our setups and see if I can spot any key differences.
Cheers,
Zhangcheng
Hi Zhangcheng,
I have updated to xp65 and analysis3-24.09 in a nesting suite simulation that is currently running. I’m also still using cylc7.
I’ve updated the following basis suites to match my running suite, but haven’t tested them. They also include changes to the emissions files and boundary layer nucleation options, changes which you might also like to consider.
u-df869 - glm only nesting suite to generate start dumps
u-df510 - ancillary suite
u-df403 - nesting suite
Matt
Correction: I think I meant analysis3 24.11, not 24.09.
That said, I now seem to be getting the same error myself, even though a suite had just completed successfully.
It seems like something in the environment has changed, specifically regarding Open MPI. Is anyone familiar with this issue?
Is there a reason you’re loading the conda/analysis module to run the RNS?
Hi Lachlan,
It appears the RNS requires the ‘pytz’ module to enable model cycling, as Bec mentioned in the post “Using xp65 in UM suites”. I tested this by not loading the conda/analysis environment, which resulted in:
ModuleNotFoundError: No module named ‘pytz’
The xp65 Conda environment overrides the Open MPI installation that is loaded with module load openmpi/x.y.z. You should be able to see this by running echo $OPAL_PREFIX with the conda/analysis module loaded. You might be able to get around this by unsetting $OPAL_PREFIX (or setting it to an empty string) after loading the Conda environment, but I’m not sure.
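A minimal sketch of that check and workaround, untested against the RNS itself. It assumes it runs in the task environment after the conda/analysis module has been loaded (the module load line is left out because the exact version is site-specific), and show_opal_prefix is a hypothetical helper name used only here. In the shell, unsetting a variable and setting it to an empty string are different things; this sketch uses unset.

```shell
# Sketch only: run after the conda/analysis module has been loaded.
# show_opal_prefix is a hypothetical helper, not part of any suite.
show_opal_prefix() {
    # Print the current value, or a marker if the variable is not set at all.
    echo "OPAL_PREFIX=${OPAL_PREFIX-<unset>}"
}

show_opal_prefix    # with conda loaded, this may point inside the conda environment
unset OPAL_PREFIX   # remove the variable entirely rather than setting it to ""
show_opal_prefix    # now prints OPAL_PREFIX=<unset>
```

Whether Open MPI then picks up the installation from module load openmpi/x.y.z is the part that would still need testing.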
Update: I now load python3 instead of the default python2 in PRE_COMMAND and have removed the conda/analysis module. The RNS is now running successfully! Thank you for the suggestions @Matt_Woodhouse and @lachlanswhyborn .
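For anyone trying the same change, a quick way to confirm that the python3 picked up by the task can import pytz before removing conda/analysis might look like the sketch below. The helper name can_import is hypothetical, and the check assumes it is run in the same environment the task uses.

```shell
# Return 0 if the given interpreter can import the given module, non-zero otherwise.
# can_import is a hypothetical helper name used only for this sketch.
can_import() {
    "$1" -c "import $2" >/dev/null 2>&1
}

if can_import python3 pytz; then
    echo "pytz is available to python3"
else
    echo "pytz is NOT available to python3 (or python3 is not on PATH)"
fi
```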