Using xp65 in UM suites

Hi everyone,

I’m having trouble using the xp65 analysis3 environments in the UM RNS (Regional Nesting Suite), specifically using iris. My suite (u-de207) includes some extra python scripts to post-process the model output, which I’m trying to update to use xp65 instead of hh5 (a bit late, I know!).

I’m running the suite with cylc8 (v8.3.3) in compatibility mode rather than cylc7, as xp65 analysis3 environments newer than 25.05 aren’t compatible with cylc7, but environments older than 25.08 don’t have all of the packages I need (iris, mule, xesmf).

iris.load_cube(file_path) works fine outside of the suite but fails with the error below when the script is run from the suite. It looks like it might be a problem with loading the netCDF4 package, but I’m not sure.
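
For context, the standalone check that works looks roughly like this (a minimal sketch; the file path is a placeholder):

module use /g/data/xp65/public/modules
module load conda/analysis3-25.08
python -c "import iris; print(iris.load_cube('/path/to/ancil_file.nc'))"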

I’ve updated all references to hh5 to xp65 in site/nci-gadi/suite-adds.rc and suite-runtime/lams.rc but I’m not having much luck. Any help would be amazing!

Thank you,
Bec

From job.err:

Using the cylc session localhost

Loading cylc/8.3.3
  Loading requirement: mosrs-setup/2.0.1
Loading conda/analysis3-25.08
  Loading requirement: singularity
Traceback (most recent call last):
  File "/home/578/rj9627/cylc-run/u-de207/run1/app/setup_ereefs/bin/get_daily_sst_ancil.py", line 131, in <module>
    main(args)    
    ^^^^^^^^^^
  File "/home/578/rj9627/cylc-run/u-de207/run1/app/setup_ereefs/bin/get_daily_sst_ancil.py", line 42, in main
    ancil_cube = iris.load_cube(ancil_dir+ancil_file).extract(constrain_date)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/iris/loading.py", line 201, in load_cube
    cubes = _load_collection(uris, constraints, callback).combined().cubes()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/iris/loading.py", line 125, in _load_collection
    from iris.fileformats.rules import _MULTIREF_DETECTION
  File "/g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/iris/fileformats/__init__.py", line 17, in <module>
    from . import name, netcdf, nimrod, pp, um
  File "/g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/iris/fileformats/netcdf/__init__.py", line 22, in <module>
    from .._nc_load_rules.helpers import UnknownCellMethodWarning, parse_cell_methods
  File "/g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/iris/fileformats/_nc_load_rules/helpers.py", line 38, in <module>
    import iris.fileformats.cf as cf
  File "/g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/iris/fileformats/cf.py", line 27, in <module>
    from iris.fileformats.netcdf import _thread_safe_nc
  File "/g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/iris/fileformats/netcdf/_thread_safe_nc.py", line 15, in <module>
    import netCDF4
  File "/g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/netCDF4/__init__.py", line 3, in <module>
    from ._netCDF4 import *
ImportError: /g/data/xp65/public/apps/med_conda/envs/analysis3-25.08/lib/python3.11/site-packages/netCDF4/../../.././libmpi.so.40: undefined symbol: mca_common_sm_fini

Hi Bec, I’m the triager today. It looks like there is a conflict between the iris package in the module environment and some MPI libraries. I’ll find someone who can help with this case.

Hi Bec @BecJackson,

Checking out u-de207, it still has references to hh5, so I believe you are referring to a local version that hasn’t been checked in to MOSRS yet.
Would you be able to check in the version you are currently trying to run, so I can inspect it and try to run it myself?

Thank you
Davide

Hi Davide,

The hh5 → xp65 changes are checked in now.

The extra apps in the suite process the atmospheric model output for use by the eReefs coastal ocean model, which is also run from the suite (to automate the steps to create forcing files and run the ocean model each day). The first task (setup_ereefs) is where the suite is now failing when trying to use iris.load_cube.

Thank you!
Bec

Hi @BecJackson,

The issue is related to the openmpi libraries that are being loaded when running import netCDF4.

Error reproduction

I was able to reproduce the error outside the suite:

module purge
module use /g/data/xp65/public/modules
module load conda/analysis3-25.08
python -c "import netCDF4" # no error
module load openmpi/4.0.1
python -c "import netCDF4" # error

How it happens in the suite

The problem is that the setup_ereefs task inherits from HOST_HPC, whose init-script is PRE_COMMAND().
Within the PRE_COMMAND macro, openmpi/4.0.1 is loaded and conda/analysis3-25.08 is loaded a few lines down.

This generates the error reproduced in the Error reproduction paragraph above.
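
The relevant part of the macro looks roughly like this (a minimal sketch, not a verbatim quote of the suite):

module use /g/data/xp65/public/modules
module load openmpi/4.0.1          # puts the 4.0.1 libraries on LD_LIBRARY_PATH
# ... other setup ...
module load conda/analysis3-25.08  # whose netCDF4 is linked against openmpi 4.1.6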

Why it happens

The netCDF4 python package in the conda/analysis3 environments is built with MPI support and linked against specific MPI libraries.
From the outputs of:

module purge
module load conda/analysis3-25.08
LD_DEBUG=libs python -c "import netCDF4" 2>&1 | grep openmpi

we can see these are in /g/data/xp65/public/apps/openmpi/4.1.6, i.e. openmpi version 4.1.6.

This newer version (4.1.6) most likely defines symbols that do not exist in the openmpi version loaded by the suite (4.0.1); mixing the two library sets means the dynamic linker looks up mca_common_sm_fini and cannot find it. Hence the error “undefined symbol: mca_common_sm_fini”, which basically means the mca_common_sm_fini symbol cannot be resolved.
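
If you want to check this yourself, something like the following should show which installation actually defines the symbol (a hedged sketch; the 4.0.1 path is an assumption, adjust it to wherever the openmpi/4.0.1 module installs its libraries):

# search both openmpi installations for the symbol the linker cannot resolve
for lib in /apps/openmpi/4.0.1/lib/*.so* /g/data/xp65/public/apps/openmpi/4.1.6/lib/*.so*; do
    nm -D "$lib" 2>/dev/null | grep -q ' T mca_common_sm_fini' && echo "defined in $lib"
done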

Solution

A general solution would be to load a newer version of openmpi instead of 4.0.1 in the PRE_COMMAND macro.
There is no openmpi/4.1.6 module on Gadi, but openmpi/4.1.7 should do the job.
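
In the PRE_COMMAND macro, the change would look roughly like this (sketch only):

-   module load openmpi/4.0.1
+   module load openmpi/4.1.7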

This will solve the problems with the setup_ereefs task, but there might be other tasks (those that inherit from HOST_HPC) that need MPI support with a specific version.

If loading openmpi/4.1.7 in the PRE_COMMAND macro generates MPI-related errors in other tasks, then the easiest solution would be to unset the LD_LIBRARY_PATH environment variable only for the setup_ereefs task (loading the openmpi module sets LD_LIBRARY_PATH to a specific path).
This can be done by modifying the setup_ereefs [[[environment]]] section:

[[[environment]]]
    ancil_dir = {{coupled_ancil_dir}}
    ancil_fname = {{ancil_templates}}
    ancil_dout = $CYLC_SUITE_RUN_DIR/share/ereefs_data/ancils
+   LD_LIBRARY_PATH =

Hope this helps

Cheers
Davide

Thanks for taking a look @atteggiani!

Both solutions worked great for setup_ereefs. It looks like the UM tasks require openmpi/4.0.2 (loaded in the UM_ENV macro), so I will go with unsetting LD_LIBRARY_PATH in the eReefs tasks.

However, the suite is now running into an error at um_recon, with a segmentation fault related to mpirun. Any ideas what the “Address not mapped” error message might be referring to?

From job.err:

[gadi-cpu-clx-1447:1830174] *** Process received signal ***
[gadi-cpu-clx-1447:1830174] Signal: Segmentation fault (11)
[gadi-cpu-clx-1447:1830174] Signal code: Address not mapped (1)
[gadi-cpu-clx-1447:1830174] Failing at address: (nil)
[gadi-cpu-clx-1447:1830174] [ 0] /lib64/libc.so.6(+0x4e5b0)[0x14e1bad575b0]
[gadi-cpu-clx-1447:1830174] *** End of error message ***
/g/data/xp65/public/apps/med_conda_scripts/analysis3-25.08.d/bin/mpirun: line 165: 1830145 
Segmentation fault      "$SINGULARITY_BINARY_PATH" -s exec --bind "${bind_str}" ${overlay_args} "${CONTAINER_PATH}" "${cmd_to_run[@]}"
[FAIL] um-recon # return-code=139
2025-09-23T05:01:54Z CRITICAL - failed/EXIT

Thank you!!

I suspect that you will need to recompile the UM executables with the new OpenMPI libraries.

Reading the post does not help me determine whether you build the executables in your experiment (but I expect you would). I recall getting segfaults when dealing with a similar issue for ACCESS-S2.

I apologise if this is not helpful.

Hi @BecJackson,

I’m not entirely sure what might have caused this error.

However, it might be connected to a memory pointer to a file not being found (“Address not mapped”). Looking at the output files for the failed task (glm_um_recon1), I can see the last line of the STDOUT log (/scratch/p66/rj9627/cylc-run/u-de207/run1/log/job/20211231T0000Z/glm_um_recon1/01/job.out) being:

227 Could not find PE0 output file: pe_output/umgla.fort6.pe000

This could be the reason for the segmentation fault: the code had set a memory pointer to a file that, when it was expected to be read, could not be found. (I’m not completely sure, but from the logs this is what might have happened.)

This might be connected to Griff’s answer, or it might not; I’m honestly not sure.

What I would try is rerunning the suite from scratch (deleting the /scratch/$PROJECT/$USER/cylc-run/u-de207 directory first) to see if you still get this error.
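
Something along these lines (a sketch; use whatever start command you normally use for this suite):

rm -rf /scratch/$PROJECT/$USER/cylc-run/u-de207   # remove the old run directory
# then restart the suite with your usual run command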

Unfortunately I cannot test it directly because I’m not a member of ih54.

Cheers
Davide

Hi @griff,

Thanks for the advice! BUILD_MODE is set to ‘Build new executable’ in the suite, so I think the executable is re-compiled when I start a new run.

Following @atteggiani’s suggestion, I tested unsetting LD_LIBRARY_PATH in the recon_resources macro, which is inherited by the um_recon tasks. This seems to have fixed the seg fault issue and the suite is now running fine… but I’m not sure if this is the best way to solve the problem?
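
For reference, the change mirrors the earlier setup_ereefs fix, i.e. adding an empty LD_LIBRARY_PATH to the environment the um_recon tasks inherit (sketch only; the exact layout of the recon_resources macro may differ):

[[[environment]]]
+   LD_LIBRARY_PATH =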

Cheers,
Bec

Thanks Davide, I did try re-running the suite from scratch but still got the segmentation fault.

I’ve since tried unsetting LD_LIBRARY_PATH in the recon_resources macro (above post) which seems to have worked but I’m not sure if this is the best approach?

The um_fcst tasks are running but the suite might still fail at another point… will leave it running and see how it goes!

You can use different environments for the model and the post-processing Cylc tasks; it doesn’t need to be the same environment for everything. Just override (I think it’s the init-script) in the post-processing task’s Cylc config.
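
Something like this, for example (a sketch only; the task name and module version are taken from earlier in this thread, and the surrounding runtime tree will differ in the real suite):

[runtime]
    [[setup_ereefs]]
        init-script = """
            module use /g/data/xp65/public/modules
            module load conda/analysis3-25.08
        """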

Thanks @Scott 🙂 that’s good to know!

The extra processing tasks are running fine now (thanks @atteggiani!). My new problem was getting the rest of the UM tasks to run after updating the site/nci-gadi/suite-adds.rc to use the xp65 analysis3/25.08 env instead of hh5… but I think we may have a workaround for now!

Good to know!

I was wondering: why do the UM tasks need the analysis3 environment in the first place?

I think it was needed to enable the model cycling in the Regional Nesting Suite - @Matt_Woodhouse would know more than me though!

Hi Team, I am now also facing the same error as reported here:

[INFO] mpirun /home/563/slf563/cylc-run/u-di850/share/fcm_make_um/build-recon/bin/um-recon.exe

Could not find PE0 output file: pe_output/umgla.fort6.pe000

I have tried unsetting the LD_LIBRARY_PATH in the recon_resources macro, but that has not solved the issue…

Also note that I am running on cylc7 with analysis3-24.07, so slightly different to Bec, but my suite has the same origin as hers.

You can see my changes here: https://code.metoffice.gov.uk/trac/roses-u/changeset?reponame=&new=334416%40d%2Fi%2F8%2F5%2F0%2Ftrunk&old=304840%40d%2Fi%2F8%2F5%2F0%2Ftrunk

Any ideas would be appreciated!
Thanks!

Hi @sonyafiddes,

Your error (and most likely the similar error above too) is not related to LD_LIBRARY_PATH.

How to find out the cause of your error

Untarring your suite’s first log archive /home/563/slf563/cylc-run/u-di850/log.20251021T033307Z.tar.gz and checking the general suite log file /home/563/slf563/cylc-run/u-di850/log.20251021T033307Z/suite/log, we can see the first error occurred in the glm_um_recon1 task. An easy way to spot it is to search for CRITICAL, for example:
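
cd /home/563/slf563/cylc-run/u-di850
tar -xzf log.20251021T033307Z.tar.gz
grep -n CRITICAL log.20251021T033307Z/suite/log

which will show this at line 50: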

2025-10-21T03:53:58Z CRITICAL - [glm_um_recon1.20240113T0000Z] status=running: (received)failed/EXIT at 2025-10-21T03:53:58Z for job    (01)

Then, if we look at the first glm_um_recon1 STDERR log for the 20240113T0000Z cycle (file /home/563/slf563/cylc-run/u-di850/log.20251021T033307Z/job/20240113T0000Z/glm_um_recon1/01/job.err), we see the reason for the error at line 25:

/g/data/xp65/public/apps/med_conda_scripts/analysis3-24.07.d/bin/mpirun: line 121: /g/data/xp65/public/apps/med_conda/envs/analysis3-24.07/bin/mpirun: No such file or directory

So, /g/data/xp65/public/apps/med_conda_scripts/analysis3-24.07.d/bin/mpirun (at its line 121) is complaining that it cannot find /g/data/xp65/public/apps/med_conda/envs/analysis3-24.07/bin/mpirun.
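
To see what the wrapper does around the failing line, you can print the surrounding lines (the line number is taken from the error above):

sed -n '115,125p' /g/data/xp65/public/apps/med_conda_scripts/analysis3-24.07.d/bin/mpirun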

Reason of the error

The actual reason is complex and involves the containerisation of the conda/analysis3 environments.
The short version is that the conda/analysis3 environment you load (analysis3-24.07) doesn’t mount its own directory inside its container, hence the “No such file or directory” error.

Solution

If using a different conda/analysis3 version would still be suitable for you, could you please try loading analysis3-25.09 instead? It should work with that version.
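
That is, wherever the suite loads the environment:

module use /g/data/xp65/public/modules
module load conda/analysis3-25.09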

In any case, this is a bug that should not happen. I am going to investigate further with other people at ACCESS-NRI.

Thank you and @BecJackson for raising this!

Hello! Thanks so much for this.
I tried updating to 25.09, but then ran into similar issues with the install_cold_hpc task to those documented here: Analysis3-25.06 onwards incompatible with cylc7 (as I am running with cylc7). I am now running with 25.05 instead and it seems to be working! Thanks again!
