Error with payu and loading modules

I’m trying to run CABLE with payu using an experiment @clairecarouge had made a while back: GitHub - ccarouge/cbl_bench_carbon: Inputs to run CABLE spatial benchmarking.

When running the payu run command, the following error was produced (in /home/189/sb8430/cbl_bench_carbon/benchmark.e79641801):

Currently Loaded Modulefiles:
 1) pbs   2) openmpi/4.1.4(default)  
MODULE ERROR DETECTED: GLOBALERR intel-mpi/2019.5.281 cannot be loaded due to a conflict.
(Detailed error information and backtrace has been suppressed, set $MODULES_ERROR_BACKTRACE to unsuppress.)

Loading intel-mpi/2019.5.281
  ERROR: intel-mpi/2019.5.281 cannot be loaded due to a conflict.
    HINT: Might try "module unload openmpi" first.
payu: Model exited with error code 127; aborting.

Does anyone know what is causing the error?

Steps taken:

  1. Clone the payu repository and check out the CABLE model driver branch, then install payu locally.
  2. Check out the experiment and edit the exe, input and restart paths in the config file. This is what the config file looks like:
# PBS flags
queue: normal
walltime: 1:00:00
ncpus: 16
# mem: 64GB

# mpirun:
#    - --bind-to none

jobname: benchmark

# Model config
model: cable
exe: /home/189/sb8430/cable/trunk/offline/cable-mpi
input:
    - /g/data/tm70/ccc561/MetFiles/CRUNCEP
    - /home/189/sb8430/cbl_bench_carbon/inputs


collate: False

runspersub: 2

restart: /home/189/sb8430/cbl_bench_carbon/inputs

# Required option! Ensure the same run is repeated every time the benchmark is done
repeat: True
  3. Make a cable.res.yaml file in the cbl_bench_carbon/inputs directory and copy the following contents into cable.res.yaml:
year: 1904
  4. Run payu setup from the experiment directory.
  5. Run payu sweep and then payu run. (A condensed shell sketch of these steps is shown after this list.)
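For anyone wanting to reproduce these steps, here is a condensed shell sketch. The payu branch name is a placeholder and the repository URLs are assumptions; substitute the actual CABLE driver branch and your own paths:

# Install payu from the CABLE model driver branch (branch name is a placeholder)
git clone https://github.com/payu-org/payu.git
cd payu
git checkout <cable-driver-branch>
pip install --user .

# Set up and run the experiment
cd ~
git clone https://github.com/ccarouge/cbl_bench_carbon.git
cd cbl_bench_carbon
# ... edit config.yaml paths and create inputs/cable.res.yaml ("year: 1904") ...
payu setup
payu sweep
payu run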

@Aidan do you know what is causing the error?

payu inspects your executable to try to guess which MPI library is required to run it. Effectively it runs ldd:

$ ldd /home/189/sb8430/cable/trunk/offline/cable-mpi | grep libmpi
	libmpifort.so.12 => /apps/intel-mpi/2019.5.281/intel64/lib/libmpifort.so.12 (0x000014ce80833000)
	libmpi.so.12 => /apps/intel-mpi/2019.5.281/intel64/lib/release/libmpi.so.12 (0x000014ce7f82e000)
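
From those library paths payu infers that intel-mpi/2019.5.281 is needed. If you want to double-check which module provides those paths, something like this should show the matching prepend-path entries (illustrative; module show writes to stderr, hence the redirect):

$ module show intel-mpi/2019.5.281 2>&1 | grep prepend-path
# expect entries under /apps/intel-mpi/2019.5.281/...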

The error says:

Currently Loaded Modulefiles:
 1) pbs   2) openmpi/4.1.4(default)  

and

ERROR: intel-mpi/2019.5.281 cannot be loaded due to a conflict.

so it seems that it doesn’t like that you’ve already got openmpi loaded. Can you try unloading it first? If that doesn’t work, check you aren’t doing any module load calls in your ~/.bashrc or equivalent.
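
A quick way to check for stray module calls in your shell start-up files (a minimal sketch; add any other start-up files you use):

$ grep -nE "module (load|use)" ~/.bashrc ~/.bash_profile ~/.profile 2>/dev/null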

I’m not sure how openmpi/4.1.4 is getting loaded beforehand. I’ve purged my modules and I don’t have any module load calls in my ~/.bashrc file. Everything in my ~/.bashrc seems harmless:

# .bashrc

# Source global definitions (Required for modules)
if [ -f /etc/bashrc ]; then
	. /etc/bashrc
fi

if in_interactive_shell; then
    # This is where you put settings that you'd like in
    # interactive shells. E.g. prompts, or aliases
    # The 'module' command offers path manipulation that
    # will only modify the path if the entry to be added
    # is not already present. Use these functions instead of e.g.
    # PATH=${HOME}/bin:$PATH

    prepend_path PATH ${HOME}/bin
    prepend_path PATH ${HOME}/.local/bin
    
    if in_login_shell; then
	# This is where you place things that should only
	# run when you login. If you'd like to run a
	# command that displays the status of something, or
	# load a module, or change directory, this is the
	# place to put it
	# module load pbs
    module use /g/data/hh5/public/modules
    export SVN_EDITOR=vim
    prepend_path PATH /g/data/hh5/public/apps/nci_scripts # to use scripts such as uqstat
	# cd /scratch/${PROJECT}/${USER}
    fi

fi

# Anything here will run whenever a new shell is launched, which
# includes when running commands like 'less'. Commands that
# produce output should not be placed in this section.
#
# If you need different behaviour depending on what machine you're
# using to connect to Gadi, you can use the following test:
#
# if [[ $SSH_CLIENT =~ 11.22.33.44 ]]; then
#     Do something when I connect from the IP 11.22.33.44
# fi
#
# If you want different behaviour when entering a PBS job (e.g.
# a default set of modules), test on the $in_pbs_job variable.
# This will run when any new shell is launched in a PBS job,
# so it should not produce output
#
# if in_pbs_job; then
#     module purge
# fi

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/g/data/hh5/public/apps/miniconda3/envs/analysis3-22.07/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/g/data/hh5/public/apps/miniconda3/envs/analysis3-22.07/etc/profile.d/conda.sh" ]; then
        . "/g/data/hh5/public/apps/miniconda3/envs/analysis3-22.07/etc/profile.d/conda.sh"
    else
        export PATH="/g/data/hh5/public/apps/miniconda3/envs/analysis3-22.07/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

I have also tried adding a module purge line to my ~/.bashrc by uncommenting:

# if in_pbs_job; then
#     module purge
# fi

but that doesn’t prevent the issue.

Try payu-run rather than payu run and report back what happens. This will run payu directly on the login node, which at least tests it without submitting to the queue and spawning a whole new process. It will probably run OK seeing as you’re only asking for 16 CPUs, but don’t let it run to completion if the runtime is more than a few minutes.

In addition, you could try setting the MODULES_ERROR_BACKTRACE environment variable to get more verbose output.
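
If I remember the Environment Modules behaviour correctly (worth verifying against the modules version on Gadi), that variable is treated as a boolean, so something like this before the run should enable the backtrace:

export MODULES_ERROR_BACKTRACE=1
payu-run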

I’ve tried setting the MODULES_ERROR_BACKTRACE environment variable by running:

export MODULES_ERROR_BACKTRACE=unsuppress

But this didn’t seem to produce a more verbose error message when running payu-run. Here is the output:

$ payu-run
laboratory path:  /scratch/tm70/sb8430/cable
binary path:  /scratch/tm70/sb8430/cable/bin
input path:  /scratch/tm70/sb8430/cable/input
work path:  /scratch/tm70/sb8430/cable/work
archive path:  /scratch/tm70/sb8430/cable/archive
nruns: 1 nruns_per_submit: 2 subrun: 1
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
Setting up cable
Checking exe and input manifests
Updating full hashes for 1 files in manifests/exe.yaml
Creating restart manifest
Updating full hashes for 11 files in manifests/restart.yaml
Writing manifests/restart.yaml
Writing manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
mod conda/analysis3-22.10
ERROR: Multiple (2) conda environments have been loaded, cannot unload with module
ERROR: Try 'conda deactivate' first

Unloading conda/analysis3-22.10
  ERROR: Module evaluation aborted
Currently Loaded Modulefiles:
 1) conda/analysis3-22.10(analysis:analysis3:default)   2) pbs   3) openmpi/4.1.4(default)
MODULE ERROR DETECTED: GLOBALERR intel-mpi/2019.5.281 cannot be loaded due to a conflict.
(Detailed error information and backtrace has been suppressed, set $MODULES_ERROR_BACKTRACE to unsuppress.)

Loading intel-mpi/2019.5.281
  ERROR: intel-mpi/2019.5.281 cannot be loaded due to a conflict.
    HINT: Might try "module unload openmpi" first.
git add /home/189/sb8430/cbl_bench_carbon/config.yaml
git add /home/189/sb8430/cbl_bench_carbon/cable.nml
git add manifests/input.yaml
git add manifests/restart.yaml
git add manifests/exe.yaml
git commit -am "2023-04-18 14:26:48: Run 0"
TODO: Check if commit is unchanged
mpirun  -np 16  /scratch/tm70/sb8430/cable/work/cbl_bench_carbon/cable-mpi
/home/189/sb8430/cbl_bench_carbon/cable.err /scratch/tm70/sb8430/cable/archive/cbl_bench_carbon/error_logs/cable.3494a6.err
/home/189/sb8430/cbl_bench_carbon/cable.out /scratch/tm70/sb8430/cable/archive/cbl_bench_carbon/error_logs/cable.3494a6.out
/home/189/sb8430/cbl_bench_carbon/job.yaml /scratch/tm70/sb8430/cable/archive/cbl_bench_carbon/error_logs/job.3494a6.yaml
/home/189/sb8430/cbl_bench_carbon/env.yaml /scratch/tm70/sb8430/cable/archive/cbl_bench_carbon/error_logs/env.3494a6.yaml
payu: Model exited with error code 127; aborting.

Is this a new error?

Regardless, I reproduced your error and fixed it by adding this to config.yaml:

mpi:
   module: intel-mpi

I’ve not tried using intel-mpi (we usually use openmpi). payu does some work to “normalise” the module setup: it assumes openmpi is the MPI implementation and so loads it. Some of that behaviour needs looking into, I’d say, but for the moment this is the work-around.
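
For what it’s worth, the underlying conflict is easy to reproduce by hand on a login node, which is a quick way to check whether a given combination of MPI modules will load (illustrative only):

$ module purge
$ module load openmpi/4.1.4
$ module load intel-mpi/2019.5.281
# fails with the same "cannot be loaded due to a conflict" error;
# per the hint in the error message, "module unload openmpi" first lets it load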

Created a payu issue about this


After revisiting this issue, it turns out that the Intel MPI library is not supported by payu, as payu passes a number of argument flags that are specific to the OpenMPI implementation (this is stated here).

The solution was to recompile the executable using the OpenMPI implementation.
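
For completeness, a rough sketch of rebuilding against OpenMPI and checking the result. The build script name and module versions are assumptions; check the offline/ directory of your CABLE checkout for the actual build script and match the modules available on Gadi:

$ module purge
$ module load intel-compiler netcdf openmpi/4.1.4
$ cd ~/cable/trunk/offline
$ ./build_mpi.ksh                     # assumed build script name
$ ldd cable-mpi | grep libmpi
# the libmpi paths should now point under /apps/openmpi/... rather than /apps/intel-mpi/...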