SCM (u-cs845 suite) can't run

Hello ACCESS-HIVE team,
I checked out the suite u-cs845, made a local copy and ran it. It ran successfully some months ago, but when I reran it today it failed with the error attached below:

I checked the file named ‘shumlib_trunk’ and it looks like it was modified on 20 June 2023, and the module no longer seems to use the ‘setenv var val’ form.
As for the mkdir error, I made another copy of u-cs845, making sure there was no pre-existing ‘share’ or ‘work’ file, but it fails with the same error.

Thanks in advance for any advice,
Qinuo

Hi @qinuo

I had a quick look at the module file, and it appears to have been installed by a new automated procedure. This procedure has only run successfully once before, and it’s been stuck since the 21st of June.

From what I can tell, there is a mistake in the script that Jenkins runs to create the module file:

cat <<EOF > $MODULE_DIR/shumlib_trunk
#%Module

set help            \"Unified Model Shared Libraries\"
set install-contact  \"accesstester.gadi - https://accessdev.nci.org.au/jenkins/job/BOM/job/shumlib_trunk/\"
set contact          \"Christopher Down - crd548\"
set install-date    \"$DATE_NOW\"
set url             \"https://code.metoffice.gov.uk/trac/utils/wiki/shumlib\"
set version         \"shumlib_trunk\"
set prefix          \"$APPS_DIR/\\\$version\"

conflict shumlib

setenv SHUMLIB_ROOT $prefix
source ~access/modules/common
EOF

All of the module variables are correctly escaped except $prefix on the setenv line (third from last). The shell expands this variable while writing the file, so the module file ends up with a blank in that position. I think @scott is the best person to look into this. Once this module has been fixed, we can see if the other errors persist.
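
A minimal sketch of the effect, assuming the generating shell has no prefix variable of its own. It writes to a throwaway temp file rather than the real module path, and the two setenv lines are only there to contrast the unescaped and escaped forms:

```shell
#!/bin/sh
# Sketch: how an unquoted heredoc treats $prefix versus \$prefix.
# The temp file and the unset shell variable are assumptions for illustration.
unset prefix                 # the generating shell has no $prefix defined
out=$(mktemp)

cat <<EOF > "$out"
setenv SHUMLIB_ROOT $prefix
setenv SHUMLIB_ROOT \$prefix
EOF

unescaped_line=$(sed -n '1p' "$out")   # shell expanded $prefix to nothing
escaped_line=$(sed -n '2p' "$out")     # literal $prefix is left for the module system

# Brackets make the trailing blank on the first line visible.
printf '[%s]\n' "$unescaped_line" "$escaped_line"
rm -f "$out"
```

The first line comes out as setenv with a variable name and no value, which would be consistent with the module system complaining about ‘setenv var val’.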

Actually, I have sufficient access to fix this myself. I’ve done that and the module loads successfully now. Turns out this isn’t a new process, but as far as I can tell it’s always been broken?

Thanks Dale!
It ran successfully at the beginning when Martin suggested this suite.
I tried running the u-cs845 suite again, but I still get the error (attached) saying the directories can’t be created because they already exist, even though I cleared all my existing suites and copies before checking out, copying and running u-cs845.

What do you get if you run readlink -f ~/cylc-run/u-cy048/share? It’s possible your task is missing a storage flag.

Hello Scott,
Thanks! It returns: /home/565/qh5472/cylc-run/u-cy048/share

Thanks,
Qinuo

Is that the case on gadi as well as accessdev? I’d expect it to print a path under /scratch

On gadi, it returns: /scratch/k10/qh5472/cylc-run/u-cy048/share
On accessdev, it returns: /home/565/qh5472/cylc-run/u-cy048/share

Thanks!

Ah great, if you look in the ‘job’ file for the task there should be a line like

#PBS -l storage=scratch/access+gdata/access+gdata/ab12

this tells the queue system which project disks need to be loaded for this task.
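
As a rough way to check this from the command line, here is a sketch of a helper that tests whether a job file's storage directive lists a given flag. The function name has_storage and the sample job file are made up for illustration; point it at the real ‘job’ file for your task instead:

```shell
#!/bin/sh
# has_storage JOBFILE FLAG: succeed if the PBS storage directive in JOBFILE lists FLAG.
# Hypothetical helper, not part of cylc, rose or PBS.
has_storage() {
    grep -E '^#PBS -l storage=' "$1" \
        | sed 's/^#PBS -l storage=//' \
        | tr '+' '\n' \
        | grep -qxF "$2"
}

# Stand-in job file containing the example directive from above.
sample_job=$(mktemp)
printf '%s\n' '#!/bin/bash' \
    '#PBS -l storage=scratch/access+gdata/access+gdata/ab12' > "$sample_job"

if has_storage "$sample_job" scratch/k10; then
    k10_status=present
else
    k10_status=missing
fi
echo "scratch/k10 is $k10_status"
```

If the flag is reported missing while readlink shows the run directory living under that project, the job cannot see that disk.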

In the suite itself this is set up in site/nci-gadi.rc at line 38

        [[[ directives ]]]
            -P = {{ NCI_PROJECT | default(environ['PROJECT']) }}
            -q = {{ NCI_QUEUE | default('express') }}
            -l ncpus = 1
            -l mem = 1gb
            -l walltime = 0:10:00
            -l jobfs = 1gb
            -W umask = 0022
            -l storage = {{ storage_projects | join('+') }}

You’ll want to add scratch/k10 to the flags, e.g. by updating the variable storage_projects to

{% set storage_projects = ['scratch/access', 'gdata/access', 'gdata/'+environ['PROJECT'], 'scratch/k10'] %}
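
To see what the directive should look like once the list is joined, here is a small shell sketch that mimics the join('+') filter. It is not how the suite builds the value, and ab12 is only a stand-in for whatever PROJECT is set to in your environment:

```shell
#!/bin/sh
# Mimic Jinja2's join('+') on the storage_projects list (illustration only).
project=ab12   # stand-in for environ['PROJECT']
storage_projects="scratch/access gdata/access gdata/$project scratch/k10"

# Replace the separating spaces with '+' to form the directive value.
storage=$(printf '%s' "$storage_projects" | tr ' ' '+')
echo "#PBS -l storage=$storage"
```

After rerunning the suite, the ‘job’ file for the task should contain a storage line of this shape with scratch/k10 in it.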

Normally the project you submit the job with is added automatically. It’s possible that accessdev and gadi have been set up to use different PROJECT codes; run echo $PROJECT to see what the current default is. You can change the default project on accessdev in the file ~/.rashrc and on gadi in the file ~/.config/gadi-login.conf. You will need to log out and back in again for changes to take effect.

Thank you so much Scott! It is working now! Thanks

All the best,
Qinuo