PAYU issues on Setonix

Hi @Aidan and others,

As discussed yesterday - @ChrisC28 and I are having a few PAYU issues since the Setonix update on Tuesday this week.

A bit of a summary of the problem is:

  • Tried payu sweep and got error that “can’t find payu module”
  • Realised modules such as python, netcdf, hdf5, gcc all needed to be updated - e.g. Lmod has detected the following error: The following module(s) are unknown: "rclone/1.59.2"
  • Used module spider <modulename>, then loaded the new versions of each module. These are now in my ~/.bashrc file.
  • With up-to-date filesystem modules, was able to pip install payu. Path to payu: /software/projects/pawsey0410/jreilly/setonix/python/payu/
  • After that, payu sweep worked. However, when trying to run the model I received a Segmentation Fault that was related to errors within the payu python scripts, e.g. something like “missing argument to yaml.load(config.yaml)”, which expected a loader, so I changed this to “yaml.load(config.yaml, SafeLoader)”. That fixed that line, but then others popped up.

Apologies for the rough explanation but hopefully this will be enough to start the discussion again.

Cheers,
John

Hey @john_reilly

Can you paste the most recent error in a reply? You can use the code formatting to make it more legible.

Can you fork payu and push up the version of payu you’re using?

Can you also copy and paste the module load commands you have in your ~/.bashrc?

Thanks!

Below is the error output when trying payu run -n 1.

Looks like there’s a few issues happening…

payu: warning: MODULESHOME does not exist; disabling environment modules.
payu: warning: Environment modules unavailable; aborting reversion.
payu: warning: Job request includes 3 unused CPUs.
payu: warning: CPU request increased from 589 to 592
Traceback (most recent call last):
  File "/software/projects/pawsey0410/jreilly/setonix/python/bin/payu", line 8, in <module>
    cli.parse()
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/cli.py", line 62, in parse
    run_cmd(**args)
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/subcommands/run_cmd.py", line 97, in runcmd
    cli.submit_job('payu-run', pbs_config, pbs_vars)
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/cli.py", line 214, in submit_job
    for k, v in pbs_vars.iteritems())
AttributeError: 'dict' object has no attribute 'iteritems'

The changes Chris and I have made should be in this fork: GitHub - reillyja/payu: A workflow management tool for numerical models on the NCI computing systems

And finally, the bashrc is:


module load PrgEnv-gnu

module load netcdf-c/4.9.0  netcdf-fortran/4.6.0
module load gcc/12.2.0  cray-hdf5/1.12.2.3 cray-netcdf/4.9.0.3
module load python/3.10.10 py-pip/23.1.2-py3.10.10 py-setuptools/68.0.0-py3.10.10
module load cray-mpich/8.1.19 cray-hdf5-parallel/1.12.2.3

Thanks Aidan. Hopefully the above helps

That is very useful, thanks.

This error is because the version of payu you’re using isn’t compatible with python3:

AttributeError: 'dict' object has no attribute 'iteritems'
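
For anyone hitting the same traceback: Python 3 removed dict.iteritems(), and items() is the direct replacement. Below is a minimal sketch of the kind of key=value join the failing line appears to perform. The helper name format_job_vars and the sample variables are made up for illustration and are not payu's actual code.

```python
def format_job_vars(job_vars):
    """Join a dict into 'KEY=value,KEY=value' (hypothetical helper, not payu code)."""
    # Python 2 code would call job_vars.iteritems() here; Python 3 only has items().
    return ','.join('{0}={1}'.format(k, v) for k, v in job_vars.items())


# Illustrative values only.
example_vars = {'PAYU_N_RUNS': 1, 'PAYU_CURRENT_RUN': 0}
print(format_job_vars(example_vars))
```

If a file in site-packages still contains .iteritems(), that is a strong hint the installed copy predates the python3 port.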

This is weird, as payu was updated for python3 in this commit, which was 5 years ago.

That is a seriously old version of payu you’re using. This is odd, because this is the version I hacked for Pawsey last year, and it isn’t that old

So I’m a bit confused how you ended up with such an old version.

“So I’m a bit confused how you ended up with such an old version.”

“With up-to-date filesystem modules, was able to pip install payu.”

If you install directly from pypi, you’ll end up with a very old version of payu (see here).
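
A quick way to see which copy Python is actually picking up is to ask the standard library for the installed version and the module location. This is only a sketch: describe_install is a made-up helper, and it assumes the distribution name and the import name are the same (as they are for payu).

```python
import importlib.util
from importlib import metadata


def describe_install(name):
    """Return (version, module_path) for an installed package, or None for each if absent."""
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None
    spec = importlib.util.find_spec(name)
    module_path = spec.origin if spec is not None else None
    return version, module_path


# Prints (None, None) if payu is not installed in the active environment.
print(describe_install('payu'))
```

Running this with the same python that the payu script uses shows whether the package came from the PyPI release or from the local clone.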

Ah ha! Of course, thank you @micael

I had assumed (assumptions are terrible for debugging) that because @john_reilly was using a modified version of the payu code that he would be installing from a local path.

@john_reilly you need to navigate to your modified payu code directory and do

pip install .

I’m guessing this will install it somewhere useful that you have write access to, if the previous pip install payu worked. Otherwise use --user (not --local, as I first wrote) to install it into your $HOME/.local directory.

Edit: fix error in option. Thanks @angus-g

I think you mean --user?

Yes, I do. Thanks, and sorry for any misdirection

Thanks for clearing that up.

After doing a pip install . in the new cloned directory (/$MYSOFTWARE/payu_new), the new libraries were set up in the $MYSOFTWARE/setonix/python/lib/python3.10/site-packages/payu directory and the binary file in the $MYSOFTWARE/conda_install/bin/ directory (I think that’s what happened at least).

I’ve modified a few things now to get the payu sweep working, but now I’m stuck on trying to get payu-run working. The slurm batch flags were causing errors, e.g., it didn’t recognise cluster=c4, but I think we had the same issues with the old version so I just copied what I had under the python3.9/site-packages/payu/schedulers/slurm.py to the python3.10/site-packages/payu/schedulers/slurm.py file. That fixed the job-submission issue.

Now, I’m stuck on the environment modules (something to do with env.py). The error in my slurm.out file is:

...
Writing manifests/exe.yaml
payu: Found modules in /opt/cray/pe/lmod/lmod
mod craype-x86-milan
Traceback (most recent call last):
  File "/software/projects/pawsey0410/jreilly/conda_install/bin/payu-run", line 33, in <module>
    sys.exit(load_entry_point('payu==1.0.19', 'console_scripts', 'payu-run')())
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/subcommands/run_cmd.py", line 132, in runscript
    expt.run()
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/experiment.py", line 457, in run
    self.load_modules()
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/experiment.py", line 246, in load_modules
    envmod.module('unload', mod)
  File "/software/projects/pawsey0410/jreilly/setonix/python/lib/python3.10/site-packages/payu/envmod.py", line 90, in module
    envs, _ = subprocess.Popen(shlex.split(cmd),
  File "/software/projects/pawsey0410/jreilly/conda_install/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/software/projects/pawsey0410/jreilly/conda_install/lib/python3.10/subprocess.py", line 1847, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/opt/cray/pe/lmod/lmod/bin/modulecmd'

It’s right, in that the file/directory doesn’t exist. There’s no bin directory on the path /opt/cray/pe/lmod/lmod/.

Any suggestions here? Thanks for the help!

Hi @john_reilly

Sorry, didn’t get a chance to look at this today, and I’m off until next week. If anyone else has some ideas please chip in.

Hi @john_reilly

It looks like payu is hard-coded to only work with Tcl environment-modules, whereas Setonix uses Lmod. I don’t know off the top of my head if there is a way to ‘load’ a module in python using Lmod in the same way that you can with Tcl modules on Gadi, but I’ll have a look and see what I can find.

In case you’re interested, CLEX CMS has been granted a small allocation on Setonix to test out the viability of installing and maintaining the containerised version of the hh5 conda analysis environments. This is sort of a pilot project to see what kind of work CMS is able to support outside of NCI. The reason I mention this is that payu is installed within that environment. If the updates developed here to get payu up and running on Setonix can be fed back into the conda-forge package, it can be incorporated into the environment on Setonix and hopefully save others a bunch of time in getting these kinds of things set up over there, as well as important analysis packages like dask, xarray and jupyterlab. Please let me know if you’re interested in being part of the testing.

Dale

Hi @john_reilly

OK, I think I have a fix. I’ve saved it here: payu/envmod.py for Lmod on setonix · GitHub
There are two changes: the actual module executable has been changed from $MODULESHOME/bin/modulecmd to $MODULESHOME/libexec/lmod, and the system software path has been changed from /apps to /software/setonix. I’ve only tested this small part of Payu, so I don’t know if it’ll do exactly what it’s meant to, but I can confirm that it loads modules correctly within the python environment.
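
As a rough illustration of the first change (this is not the code in the gist), one could probe both locations under MODULESHOME and use whichever exists. The function name find_module_executable is hypothetical; the two relative paths are the ones mentioned above.

```python
import os


def find_module_executable(moduleshome):
    """Return the first module command found under moduleshome, else None."""
    candidates = [
        os.path.join(moduleshome, 'bin', 'modulecmd'),  # Tcl environment-modules
        os.path.join(moduleshome, 'libexec', 'lmod'),   # Lmod
    ]
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


moduleshome = os.environ.get('MODULESHOME', '')
if moduleshome:
    print(find_module_executable(moduleshome))
else:
    print('MODULESHOME is not set')
```

Probing like this would avoid hard-coding one module system, at the cost of assuming the two layouts above are the only ones that matter.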

EDIT: After looking more closely at what lib_update() does, it’s attempting to derive a module name from a library path, so there is no way that what’s there will work with spack’s module directory hierarchy. This may be quite tricky, as there are several versions of e.g. netcdf that depend on compilers and whatnot, so if I derive that the module payu is meant to load is netcdf-c/4.9.0, how can I ensure that payu will load the ‘right’ netcdf-c/4.9.0 of the five that are available? Any ideas @Aidan or anyone else with spack experience?
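
To make the difficulty concrete, here is a sketch of deriving a module name from a library path. It assumes a flat <prefix>/<name>/<version>/... layout, which is an assumption for illustration and not a copy of payu’s lib_update(); module_from_lib_path and the example paths are hypothetical.

```python
def module_from_lib_path(lib_path, prefix='/apps'):
    """Guess 'name/version' from '<prefix>/<name>/<version>/...', else None."""
    prefix = prefix.rstrip('/') + '/'
    if not lib_path.startswith(prefix):
        return None
    parts = lib_path[len(prefix):].split('/')
    # Need at least name, version and one more path component.
    if len(parts) < 3 or not parts[0] or not parts[1]:
        return None
    return '{0}/{1}'.format(parts[0], parts[1])


# Hypothetical paths for illustration only.
print(module_from_lib_path('/apps/openmpi/4.1.4/lib/libmpi.so.40'))
print(module_from_lib_path('/usr/lib64/libm.so.6'))
```

A name/version pair recovered this way says nothing about which compiler or MPI a given build depends on, which is exactly the ambiguity when several netcdf-c/4.9.0 builds exist side by side.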

EDIT AGAIN: gist update with some dodgy assumptions about finding module names

Dale

Thanks Dale!

I’m just waiting in the queue to test the updated version.

I tested it before your ‘EDIT’ and it at least got past the first problem. It’s currently sitting in the queue, but I’ll update when it runs.

EDIT: I imagine the error I got from the first attempt might’ve been fixed with your latest updates to envmod.py - the error I had was along the lines of: error while loading shared libraries: libnetcdf.so.19: cannot open shared object file: No such file or directory

EDIT AGAIN: I still get the error above about libnetcdf.so.19.

I know this particular error is something @ChrisC28 and I have seen before but I can’t remember how we fixed it. Any ideas?

OK, so I’ve dug a bit further into Payu and it looks like it only ever uses lib_update to find MPI modules. This means that the run scripts it generates rely on the applications used in payu run to already know where their libraries are.

All Linux applications know the names of the libraries they need at runtime, but not necessarily where to find them. The usual method of dealing with this is to set LD_LIBRARY_PATH, which is generally handled by modules. In contrast, NCI uses compiler wrappers to set some data in the applications (known as the RPATH) that sets library search paths (Spack also does this). Cray’s solution to this problem is (or was, I’m not sure if they still do it) to build statically (essentially copying libraries into the application). But whatever you’re running here has not been built statically, so you will need to load the modules used at build time whenever you’re running it. If payu has some configurable ‘pre-script’ setting, that’s where your module load commands should go.
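
A small sketch for checking whether a given library name is visible through LD_LIBRARY_PATH in the current environment. find_in_ld_library_path is a made-up helper. It only looks at LD_LIBRARY_PATH, so a None result does not prove the loader will fail, since RPATH entries and the system default directories are also searched.

```python
import os


def find_in_ld_library_path(lib_name, ld_library_path):
    """Return the first directory on the search path containing lib_name, else None."""
    for directory in ld_library_path.split(':'):
        if directory and os.path.isfile(os.path.join(directory, lib_name)):
            return directory
    return None


current = os.environ.get('LD_LIBRARY_PATH', '')
print(find_in_ld_library_path('libnetcdf.so.19', current))
```

Running this inside the job script, after the modules have been loaded or unloaded, shows whether the netcdf module’s library directory is still on the path at the point the model starts.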

EDIT: I found the run part of Payu. The first thing it does is unload all modules. This is fine for Gadi, but I suspect will break things on Setonix. To prevent this, comment out these lines in payu/experiment.py:

        # Unload non-essential modules
        loaded_mods = os.environ.get('LOADEDMODULES', '').split(':')

        for mod in loaded_mods:
            if len(mod) > 0:
                print('mod '+mod)
                mod_base = mod.split('/')[0]
                if mod_base not in core_modules:
                    envmod.module('unload', mod)

Ignore all of that stuff above, turns out payu loads the same modules at runtime as it does at build time. I think it’s the ‘unload everything’ step that’s breaking everything. See if getting rid of that alone fixes your netcdf problem.
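
Before commenting the loop out, it may help to see what it would remove. The sketch below repeats the same filtering logic as a dry run, without calling envmod. The function name modules_to_unload is hypothetical, the sample module names are taken from the ~/.bashrc earlier in the thread, and the core_modules list is an assumption rather than payu’s actual list.

```python
def modules_to_unload(loadedmodules, core_modules):
    """List loaded modules whose base name is not in core_modules."""
    to_unload = []
    for mod in loadedmodules.split(':'):
        # Same test as the loop above: skip empty entries, compare base names.
        if mod and mod.split('/')[0] not in core_modules:
            to_unload.append(mod)
    return to_unload


# Hypothetical LOADEDMODULES value and core list, for illustration only.
loaded = 'PrgEnv-gnu:gcc/12.2.0:cray-netcdf/4.9.0.3:python/3.10.10'
print(modules_to_unload(loaded, core_modules=['python']))
```

With these assumed inputs everything except python would be unloaded, including the compiler and netcdf modules that the executable needs at runtime.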

Thanks for looking into this @dale.roberts

There is an existing issue (and a related but stale PR).

It would be great to capture some of the technical detail there, or in a new issue.

There have been some recent changes to the module introspection logic which tripped up spack which didn’t require loading of modules because of the included RPATH information, and the linked MPI modules also didn’t conform to the expected format:

This seems related to the Pawsey issues as it also uses spack.

Hi @dale.roberts

I’ve just added a summary of the Setonix issues in this github issue. It’d be great to get your input there if you have a chance.

Also I’d be very happy to be a part of the testing on Setonix. Software environments are something I definitely need to become more comfortable with so I’m very keen to get involved in this.

Hi @john_reilly, I’ve put some ideas in the github issue. I’m not sure how to go about sharing software on Setonix. NCI has their software + writers group set up, which works well enough for their project directory system. I’ll have a think about how to give you access and get back to you.

Hi @john_reilly, following up on this, it seems like Pawsey’s preferred way of sharing software is to package it in a container and distribute it that way, so that everyone has a copy of it in their own software directories. I’m not a fan of this. It means that if there is some issue discovered in the container and later fixed, you have no way of knowing how many copies of the buggy container are out there, and you end up writing “This is a known issue and was fixed in the last release. Please re-download the software and try again” quite a lot.

In any case, this distribution method won’t work for the analysis3 environments on Setonix, as the way we’ve set these up is to containerise the individual environments rather than the entire installation. This means that there are some hard-coded paths to /software in the base conda installation that aren’t trivial to change. Ideally I’d make my project’s /software directory world readable like hh5 used to be, or add ACLs to the top level /software/<project> directory for more fine-grained access. I’ll get in contact with Pawsey and see what they’re comfortable with, I may end up having to invite you to my project.