PAYU issues on Setonix

john_reilly · 7 September 2023 04:55

Hi @Aidan and others,

As discussed yesterday - @ChrisC28 and I are having a few PAYU issues since the Setonix update on Tuesday this week.

A bit of a summary of the problem is:

Tried payu sweep and got error that “can’t find payu module”
Realised modules such as python, netcdf, hdf5, gcc all needed to be updated - e.g. Lmod has detected the following error: The following module(s) are unknown: "rclone/1.59.2"
Used module spider <modulename>, then loaded the new versions of each module. These are now in my ~/.bashrc file.
With up-to-date filesystem modules, was able to pip install payu. Path to payu: /software/projects/pawsey0410/jreilly/setonix/python/payu/
After that, payu sweep worked. However, when trying to run the model, I received a Segmentation Fault that was related to errors within the payu Python scripts, e.g. something like “missing argument to yaml.load(config.yaml)”, which expected a loader, so I changed this to yaml.load(config.yaml, SafeLoader). That fixed that line, but then others popped up.

Apologies for the rough explanation but hopefully this will be enough to start the discussion again.

Cheers,
John

Aidan · 7 September 2023 05:07

Hey @john_reilly

Can you paste the most recent error in a reply? You can use the code formatting to make it more legible.

Can you fork payu and push up the version of payu you’re using?

Can you also copy and paste the module load commands you have in your ~/.bashrc?

Thanks!

john_reilly · 7 September 2023 06:07

Below is the error output when trying payu run -n 1.

Looks like there are a few issues happening…

payu: warning: MODULESHOME does not exist; disabling environment modules.
payu: warning: Environment modules unavailable; aborting reversion.
payu: warning: Job request includes 3 unused CPUs.
payu: warning: CPU request increased from 589 to 592
Traceback (most recent call last):
  File "/software/projects/pawsey0410/jreilly/setonix/python/bin/payu", line 8, in <module>
    cli.parse()
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/cli.py", line 62, in parse
    run_cmd(**args)
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/subcommands/run_cmd.py", line 97, in runcmd
    cli.submit_job('payu-run', pbs_config, pbs_vars)
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/cli.py", line 214, in submit_job
    for k, v in pbs_vars.iteritems())
AttributeError: 'dict' object has no attribute 'iteritems'

The changes Chris and I have made should be in this fork: GitHub - reillyja/payu: A workflow management tool for numerical models on the NCI computing systems

And finally, the bashrc is:


module load PrgEnv-gnu

module load netcdf-c/4.9.0  netcdf-fortran/4.6.0
module load gcc/12.2.0  cray-hdf5/1.12.2.3 cray-netcdf/4.9.0.3
module load python/3.10.10 py-pip/23.1.2-py3.10.10 py-setuptools/68.0.0-py3.10.10
module load cray-mpich/8.1.19 cray-hdf5-parallel/1.12.2.3

Thanks Aidan. Hopefully the above helps

Aidan · 7 September 2023 11:55

That is very useful, thanks.

This error is because the version of payu you’re using isn’t compatible with Python 3:

AttributeError: 'dict' object has no attribute 'iteritems'
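
For anyone following along, here is a minimal illustration of that incompatibility. The dictionary contents are made up for the example and are not payu's real variables; the point is only that Python 3 dictionaries have no `iteritems()` method, while `items()` works on both Python 2 and Python 3.

```python
# Hypothetical stand-in for the variables passed to the scheduler.
pbs_vars = {'PAYU_PATH': '/some/path', 'PAYU_N_RUNS': '1'}

# Python 2 spelling: on Python 3 this raises AttributeError.
try:
    pairs = list(pbs_vars.iteritems())
except AttributeError as err:
    print('Python 3:', err)

# Portable spelling that works on both major versions.
env_args = ','.join(
    '{0}={1}'.format(k, v) for k, v in sorted(pbs_vars.items())
)
print(env_args)
```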

This is weird, as payu was updated for Python 3 in this commit:

github.com/payu-org/payu

Removal of "reversion"; Python 2.6 and 3.x support

committed 05:35AM - 10 Oct 18 UTC

marshallward

+167 -156

This patch eliminates the explicit "reversioning" of the Python executable by means of environment module manipulation followed by an `execl()` call. We now present a more consistent environment on the initial execution primarily by doing the following:

* Explicit paths to the python executable used at submission
* Construction of relevant LD_LIBRARY_PATH and PYTHONPATH

This produces a sufficient environment for running Payu in the submitted session without an explicit invocation of environment modules of Python and Payu.

We have also made additional changes to support Python 2.6 and Python 3.x, which should enable users to use any desired version of Python. It is hoped that these changes will enable use of Conda for installing and running Payu.

Specific changes below:

- `reversion.py` was removed, along with the `reversion.repython` function. We no longer declare an explicit version of Python for submitted jobs, and use whatever one the user used to submit the job.
- We now initialise the environment module system for `payu run`, primarily to support PBS (qsub) and MPI execution (mpirun). Other sections may also require environment module initialisation, so this may need to be addressed later.
- An error in environment module initialisation related to reading the Module initialisation file (`.modulespath`) has been fixed.
- The explicit path to the Python executable is now used in the job submission. NOTE: This implicitly requires a common filesystem across login and compute nodes, which may not be the case in the future.
- LD_LIBRARY_PATH and PYTHONPATH are constructed from information available in `sys` and `sysconfig`, rather than from the submission session of the user.

Python 2.6 support:

- `format` strings now have explicit (rather than implicit) arguments for formatting.
- Relevant dict comprehensions have been removed (where required)
- `iteritems()` iteration has been replaced with `items()`
- A backport of `subprocess.check_output` has been provided for the 2.6 module of `subprocess`. Thanks to Eduardo Felipe for providing this on Gist.

Python 3.x support:

- Bytestream output from `subprocess.check_output` is now decoded to ASCII format. We also do this in Python 2.x although it's a null operation.

That was five years ago.

That is a seriously old version of payu you’re using. This is odd, because this is the version I hacked for Pawsey last year, and it isn’t that old:

github.com/payu-org/payu

Porting to pawsey

payu-org:master ← payu-org:pawsey

opened 07:14AM - 08 Apr 22 UTC

aidanheerdegen

+26 -13

Wrap ldd in try/except as executables on pawsey seem to be statically linked. Also for the same reason don't assume LD_LIBRARY_PATH is set. Commented out call to load_modules. Pawsey has a lot of default modules that it relies on, so can't reliably monkey with that. Removed a couple of the bespoke flags Marshall added to the slurm scheduler, and also explicitly pass through the PAYU environment variables. Also set in the current environment, but that didn't seem to make it through to the submitted job.

So I’m a bit confused how you ended up with such an old version.

micael · 11 September 2023 01:32

Aidan: “So I’m a bit confused how you ended up with such an old version.”

john_reilly: “With up-to-date filesystem modules, was able to pip install payu.”

If you install directly from PyPI, you’ll end up with a very old version of payu (see here).
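
A quick way to confirm which payu a given Python environment actually picked up is to ask the package metadata. This is only a sketch: `installed_version` is a helper name made up here, while `importlib.metadata` is part of the standard library from Python 3.8 onwards.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints None if payu is not installed in the interpreter running this.
print('payu version:', installed_version('payu'))
```

Running this with the same `python` that the `payu` script uses tells you whether you are getting the PyPI release or a local install.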

Aidan · 11 September 2023 04:43

Ah ha! Of course, thank you @micael

I had assumed (assumptions are terrible for debugging) that because @john_reilly was using a modified version of the payu code that he would be installing from a local path.

@john_reilly you need to navigate to your modified payu code directory and do

pip install .

I’m guessing this will install it somewhere useful that you have write access to, if the previous pip install payu worked. Otherwise use ~~--local~~ --user to install it into your $HOME/.local directory.
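
If it is unclear where a --user install would land on a particular system, Python can report it. A small sketch using only the standard `site` module; the printed paths depend on the machine, and the PYTHONUSERBASE environment variable overrides the default base directory if it is set.

```python
import site

# Base directory used by `pip install --user` (typically ~/.local on Linux).
user_base = site.getuserbase()
# Directory from which user-installed packages are imported.
user_site = site.getusersitepackages()

print('user base:', user_base)
print('user site-packages:', user_site)
```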

Edit: fix error in option. Thanks @angus-g

angus-g · 11 September 2023 06:33

I think you mean --user?

Aidan · 11 September 2023 08:18

Yes, I do. Thanks, and sorry for any misdirection

john_reilly · 14 September 2023 02:16

Thanks for clearing that up.

After doing a pip install . in the newly cloned directory ($MYSOFTWARE/payu_new), the new libraries were set up in the $MYSOFTWARE/setonix/python/lib/python3.10/site-packages/payu directory and the binary file in the $MYSOFTWARE/conda_install/bin/ directory (I think that’s what happened at least).

I’ve modified a few things now to get the payu sweep working, but now I’m stuck on trying to get payu-run working. The slurm batch flags were causing errors, e.g., it didn’t recognise cluster=c4, but I think we had the same issues with the old version so I just copied what I had under the python3.9/site-packages/payu/schedulers/slurm.py to the python3.10/site-packages/payu/schedulers/slurm.py file. That fixed the job-submission issue.

Now, I’m stuck on the environment modules (something to do with env.py). The error in my slurm.out file is:

...
Writing manifests/exe.yaml
payu: Found modules in /opt/cray/pe/lmod/lmod
mod craype-x86-milan
Traceback (most recent call last):
  File "/software/projects/pawsey0410/jreilly/conda_install/bin/payu-run", line 33, in <module>
    sys.exit(load_entry_point('payu==1.0.19', 'console_scripts', 'payu-run')())
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/subcommands/run_cmd.py", line 132, in runscript
    expt.run()
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/experiment.py", line 457, in run
    self.load_modules()
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/experiment.py", line 246, in load_modules
    envmod.module('unload', mod)
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/envmod.py", line 90, in module
    envs, _ = subprocess.Popen(shlex.split(cmd),
  File "/software/projects/pawsey0410/jreilly/conda_install/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/software/projects/pawsey0410/jreilly/conda_install/lib/python3.10/subprocess.py", line 1847, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/opt/cray/pe/lmod/lmod/bin/modulecmd'

It’s right, in that the file/directory doesn’t exist. There’s no bin directory on the path /opt/cray/pe/lmod/lmod/.

Any suggestions here? Thanks for the help!

Aidan · 14 September 2023 07:20

Hi @john_reilly

Sorry, didn’t get a chance to look at this today, and I’m off until next week. If anyone else has some ideas please chip in.

dale.roberts · 15 September 2023 01:04

Hi @john_reilly

It looks like payu is hard-coded to only work with Tcl environment-modules, whereas Setonix uses Lmod. I don’t know off the top of my head if there is a way to ‘load’ a module in Python using Lmod in the same way that you can with Tcl modules on Gadi, but I’ll have a look and see what I can see.

In case you’re interested, CLEX CMS has been granted a small allocation on Setonix to test out the viability of installing and maintaining the containerised version of the hh5 conda analysis environments. This is sort of a pilot project to see what kind of work CMS is able to support outside of NCI. The reason I mention this is that payu is installed within that environment, and if the updates developed here to get payu up and running on Setonix can be fed back into the conda-forge package, it can be incorporated into the environment on Setonix and hopefully save others a bunch of time in getting these kinds of things, as well as important analysis packages like dask, xarray and jupyterlab, set up over there. Please let me know if you’re interested in being part of the testing.

Dale

dale.roberts · 15 September 2023 01:32

Hi @john_reilly

OK, I think I have a fix. I’ve saved it here: payu/envmod.py for Lmod on setonix · GitHub
There are two changes: the module executable has been changed from $MODULESHOME/bin/modulecmd to $MODULESHOME/libexec/lmod, and the system software path has been changed from /apps to /software/setonix. I’ve only tested this small part of payu, so I don’t know if it’ll do exactly what it’s meant to, but I can confirm that it loads modules correctly within the Python environment.
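
To illustrate the first change, here is a rough sketch of looking for the module executable in both locations instead of hard-coding one. The two relative paths are the ones mentioned in this thread; the function name and the throwaway directory are invented for the demonstration, and this is not the code in the gist.

```python
import os
import tempfile


def find_module_command(moduleshome):
    """Return the first module executable found under `moduleshome`, else None."""
    candidates = [
        os.path.join(moduleshome, 'bin', 'modulecmd'),  # Tcl environment-modules
        os.path.join(moduleshome, 'libexec', 'lmod'),   # Lmod
    ]
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


# Demonstration with a temporary directory laid out like an Lmod install.
with tempfile.TemporaryDirectory() as fake_home:
    os.makedirs(os.path.join(fake_home, 'libexec'))
    lmod = os.path.join(fake_home, 'libexec', 'lmod')
    with open(lmod, 'w') as f:
        f.write('#!/bin/sh\n')
    os.chmod(lmod, 0o755)
    found = find_module_command(fake_home)
    print(os.path.relpath(found, fake_home))
```

In real use the argument would come from the MODULESHOME environment variable, and a `None` result is the case where payu currently fails with FileNotFoundError.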

EDIT: After looking more closely at what lib_update() does, it’s attempting to derive a module name from a library path, so there is no way that what’s there will work with Spack’s module directory hierarchy. This may be quite tricky, as there are several versions of e.g. netcdf that depend on compilers and whatnot, so if I derive that the module payu is meant to load is netcdf-c/4.9.0, how can I ensure that payu will load the ‘right’ netcdf-c/4.9.0 of the five that are available? Any ideas @Aidan or anyone else with Spack experience?
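
A toy version of the problem, to make it concrete. This is not payu's lib_update(); it is a sketch that assumes a layout of `<prefix>/<name>/<version>/...`, and both example paths (including the Spack-style one with its made-up hash) are hypothetical.

```python
def module_from_lib_path(lib_path, prefix='/apps'):
    """Guess 'name/version' from a path like <prefix>/<name>/<version>/lib/..."""
    prefix = prefix.rstrip('/')
    if not lib_path.startswith(prefix + '/'):
        return None
    parts = lib_path[len(prefix) + 1:].split('/')
    if len(parts) < 3:
        return None
    return parts[0] + '/' + parts[1]


# Flat layout: the guess is a plausible module name.
flat = module_from_lib_path('/apps/netcdf/4.9.0/lib/libnetcdf.so.19')
print(flat)

# Spack-style layout (hypothetical path): the same rule picks the wrong pieces.
spack = module_from_lib_path(
    '/software/setonix/2023.08/software/linux-sles15-zen3/gcc-12.2.0/'
    'netcdf-c-4.9.0-abcdefg/lib/libnetcdf.so.19',
    prefix='/software/setonix',
)
print(spack)
```

Even a smarter parser that recovered netcdf-c/4.9.0 from the second path would still not know which compiler variant of that module to load, which is the open question above.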

EDIT AGAIN: gist update with some dodgy assumptions about finding module names

Dale

john_reilly · 15 September 2023 05:42

Thanks Dale!

I’m just waiting in the queue to test the updated version.

I tested it before your ‘EDIT’ and it at least got past the first problem. It’s currently sitting in the queue, but I’ll update when it runs.

EDIT: I imagine the error I got from the first attempt might’ve been fixed with your latest updates to envmod.py - the error I had was along the lines of: error while loading shared libraries: libnetcdf.so.19: cannot open shared object file: No such file or directory

EDIT AGAIN: I still get the error above about libnetcdf.so.19.

I know this particular error is something @ChrisC28 and I have seen before but I can’t remember how we fixed it. Any ideas?

dale.roberts · 15 September 2023 06:45

OK, so I’ve dug a bit further into Payu and it looks like it only ever uses lib_update to find MPI modules. ~~This means that the run scripts it generates rely on the applications used in payu run to already know where their libraries are.~~

All Linux applications know the names of the libraries they need at runtime, but not necessarily where to find them. The usual method of dealing with this is to set LD_LIBRARY_PATH, which is generally handled by modules. In contrast, NCI uses compiler wrappers to set some data in the applications (known as the RPATH) that sets library search paths (Spack also does this). Cray’s solution to this problem is (or was, I’m not sure if they still do it) to build statically (essentially copying libraries into the application). But whatever you’re running here has not been built statically, so you will need to load the modules used at build time whenever you’re running it. If payu has some configurable ‘pre-script’ setting, that’s where your module load commands should go.
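
As a rough picture of the LD_LIBRARY_PATH part only, here is a sketch that mimics the lookup with a temporary directory. The real dynamic loader also consults the RPATH, the loader cache and the default system directories, none of which are modelled here, and the function name is invented for the example.

```python
import os
import tempfile


def find_on_library_path(lib_name, ld_library_path):
    """Return the first directory in a colon-separated path containing lib_name."""
    for directory in ld_library_path.split(':'):
        if directory and os.path.isfile(os.path.join(directory, lib_name)):
            return directory
    return None


with tempfile.TemporaryDirectory() as libdir:
    # Empty placeholder file standing in for the real shared library.
    open(os.path.join(libdir, 'libnetcdf.so.19'), 'w').close()
    # Without the directory on the path, the library is not found ...
    missing = find_on_library_path('libnetcdf.so.19', '/nonexistent/lib')
    # ... and adding it, as a module load would, makes it visible.
    hit = find_on_library_path('libnetcdf.so.19', '/nonexistent/lib:' + libdir)
    print(missing, hit == libdir)
```

The first lookup corresponds to the "cannot open shared object file" error: the name is known, but no directory on the search path contains it.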

EDIT: I found the run part of Payu. The first thing it does is unload all modules. This is fine for Gadi, but I suspect will break things on Setonix. To prevent this, comment out these lines in payu/experiment.py:

        # Unload non-essential modules
        loaded_mods = os.environ.get('LOADEDMODULES', '').split(':')

        for mod in loaded_mods:
            if len(mod) > 0:
                print('mod '+mod)
                mod_base = mod.split('/')[0]
                if mod_base not in core_modules:
                    envmod.module('unload', mod)

Ignore all of that stuff above; it turns out payu loads the same modules at runtime as it does at build time. I think it’s the ‘unload everything’ step that’s breaking everything. See if getting rid of that alone fixes your netcdf problem.
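
To see how much that loop would remove without actually unloading anything, the same selection logic can be run as a dry run. This is a sketch: the function name is made up, and the module list and `core` values are example inputs rather than output from a real Setonix session.

```python
def modules_to_unload(loaded_modules, core_modules):
    """Return loaded modules whose base name is not in core_modules."""
    unload = []
    for mod in loaded_modules.split(':'):
        if mod and mod.split('/')[0] not in core_modules:
            unload.append(mod)
    return unload


# Example values only, in the colon-separated format of LOADEDMODULES.
loaded = 'craype-x86-milan:gcc/12.2.0:netcdf-c/4.9.0:python/3.10.10'
core = ['python', 'payu']
to_unload = modules_to_unload(loaded, core)
print(to_unload)
```

With these inputs everything except python is dropped, including the Cray architecture module and the netcdf module that the executable needs at runtime.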

Aidan · 15 September 2023 11:36

Thanks for looking into this @dale.roberts

There is an existing issue (and a related but stale PR).

It would be great to capture some of the technical detail there, or in a new issue.

There have been some recent changes to the module introspection logic, which was tripped up by Spack: Spack builds didn’t require loading of modules because of the included RPATH information, and the linked MPI modules also didn’t conform to the expected format:

github.com/payu-org/payu

missing -wdir arguments

opened 03:13AM - 02 Jun 23 UTC

closed 02:34AM - 02 Aug 23 UTC

harshula

When testing a Spack build of `access-om2` using Payu, I was receiving the following errors:

```
ice: error reading coupling_nml ... assertion failed: Input atm.nml does not exist.
```

I noticed that `-wdir` is missing from the arguments given to `mpirun`:

`` mpirun --mca io ompio --mca io_ompio_num_aggregators 1 -np 1 $SCRATCH/access-om2/work/1deg_jra55_ryf_spackv1.git/atmosphere/yatm.exe : -np 216 $SCRATCH//access-om2/work/1deg_jra55_ryf_spackv1.git/ocean/fms_ACCESS-OM.x : -np 24 $SCRATCH/access-om2/work/1deg_jra55_ryf_spackv1.git/ice/cice_auscom_360x300_24x1_24p.exe ``

This seems related to the Pawsey issues as it also uses spack.

john_reilly · 20 September 2023 02:46

Hi @dale.roberts

I’ve just added a summary of the Setonix issues in this GitHub issue. It’d be great to get your input there if you have a chance.

Also I’d be very happy to be a part of the testing on Setonix. Software environments are something I definitely need to become more comfortable with so I’m very keen to get involved in this.

dale.roberts · 21 September 2023 23:59

Hi @john_reilly , I’ve put some ideas in the github issue. I’m not sure how to go about sharing software on Setonix. NCI has their software + writers group setup, which works well enough for their project directory system. I’ll have a think about how to give you access and get back to you.

dale.roberts · 26 September 2023 05:29

Hi @john_reilly, following up on this: it seems like Pawsey’s preferred way of sharing software is to package it in a container and distribute it that way, so that everyone has a copy of it in their own software directories. I’m not a fan of this. It means that if there is some issue discovered in the container and later fixed, you have no way of knowing how many copies of the buggy container are out there, and you end up writing “This is a known issue and was fixed in the last release. Please re-download the software and try again” quite a lot.

In any case, this distribution method won’t work for the analysis3 environments on Setonix, as the way we’ve set these up is to containerise the individual environments rather than the entire installation. This means that there are some hard-coded paths to /software in the base conda installation that aren’t trivial to change. Ideally I’d make my project’s /software directory world readable like hh5 used to be, or add ACLs to the top level /software/<project> directory for more fine-grained access. I’ll get in contact with Pawsey and see what they’re comfortable with; I may end up having to invite you to my project.
