Payu runlog and git commit signing

I have signing of git commits turned on in my global git config on gadi. As long as I “module load” from vk83 or hh5, the git provided by those modules signs commits without problems. If I try to use the system git, git commits fail, as written up here.

However, I have been trying to use runlog:True in payu, and I get the following error:

$ qcat -e 118782756
Currently Loaded Modulefiles:
 1) pbs   2) openmpi/4.1.4(default)  
error: ssh-keygen -Y sign is needed for ssh signing (available in openssh version 8.2p1+)
error: unknown option -- Y?
usage: ssh-keygen [-q] [-b bits] [-t dsa | ecdsa | ed25519 | rsa] [-m format]
                  [-N new_passphrase] [-C comment] [-f output_keyfile]
       ssh-keygen -p [-P old_passphrase] [-N new_passphrase] [-m format]
                   [-f keyfile]
       ssh-keygen -i [-m key_format] [-f input_keyfile]
       ssh-keygen -e [-m key_format] [-f input_keyfile]
...
fatal: failed to write commit object

This means that the job which payu started to run the model is trying to commit using the system git (rather than the git in vk83). The expected result is that each time “payu run” is executed, one commit is made to the current configuration branch.

Note that in the PBS error log (above) the payu module from vk83 is not loaded inside the job, which may be why the job is not using the vk83 git binary?
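One way to check which binaries a given environment actually resolves is to print them from the login shell and again from inside the job, then compare the two. A minimal sketch, assuming bash; report_signing_tools is a hypothetical helper name and is not part of payu:

```shell
# Print which git and ssh-keygen the current environment resolves,
# plus any ssh-keygen override that git has been configured with.
report_signing_tools() {
    local tool path
    for tool in git ssh-keygen; do
        if path=$(command -v "$tool"); then
            echo "$tool: $path"
        else
            echo "$tool: not found on PATH"
        fi
    done
    if command -v git >/dev/null 2>&1; then
        echo "git version: $(git --version)"
        # gpg.ssh.program is the setting git uses to locate ssh-keygen for signing.
        echo "gpg.ssh.program: $(git config --get gpg.ssh.program || echo '<unset>')"
    fi
    return 0
}
```

If gpg.ssh.program is unset, git falls back to whichever ssh-keygen is first on PATH, so the two ssh-keygen lines are the ones to compare.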

Failing that, there is a long list of environment variables created with the PBS job (below); maybe one of them should point to the vk83 git binary?

qstat -f 118782756
Job Id: 118782756.gadi-pbs
    ...
    substate = 42
    Variable_List = PBS_O_HOME=/home/603/as2285,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=as2285,
        PBS_O_PATH=/g/data/vk83/apps/payu/1.1.3/bin:/home/603/as2285/.vscode-s
        erver/cli/servers/Stable-5437499feb04f7a586f677b155b039bc2b3669eb/serve
        r/bin/remote-cli:/home/603/as2285/.local/bin:/home/603/as2285/bin:/g/da
        ta/hh5/public/apps/miniconda3/condabin:/opt/pbs/default/bin:/opt/nci/bi
        n:/opt/bin:/opt/Modules/v4.3.0/bin:/bin:/usr/bin:/usr/local/sbin:/usr/s
        bin:/opt/pbs/default/bin:/opt/pbs/default/bin:/opt/pbs/default/bin,
        PBS_O_MAIL=/var/spool/mail/as2285,PBS_O_SHELL=/bin/bash,
        PBS_O_TZ=:/etc/localtime,PBS_O_HOST=gadi-login-05.gadi.nci.org.au,
        PBS_O_WORKDIR=/g/data/tm70/as2285/payu/MOM6-CICE6,PBS_O_SYSTEM=Linux,
        PAYU_PATH=/g/data/vk83/apps/payu/1.1.3/bin,PAYU_FORCE=True,
        MODULESHOME=/opt/Modules/v4.3.0,
        MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,
        MODULEPATH=/g/data/vk83/modules:/etc/scl/modulefiles:/etc/scl/modulefi
        les:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/m
        odulefiles:/apps/Modules/modulefiles,PBS_NCI_HT=0,
        PBS_NCI_STORAGE=scratch/tm70+gdata/tm70+gdata/qv56+gdata/ik11+gdata/vk
        83,PBS_NCI_IMAGE=,PBS_NCPUS=48,PBS_NGPUS=0,PBS_NNODES=1,
        PBS_NCI_NCPUS_PER_NODE=48,PBS_NCI_NUMA_PER_NODE=4,
        PBS_NCI_NCPUS_PER_NUMA=12,PROJECT=tm70,PBS_VMEM=206158430208,
        PBS_NCI_WD=1,PBS_NCI_JOBFS=10gb,PBS_NCI_LAUNCH_COMPATIBILITY=0,
        PBS_NCI_FS_GDATA1=0,PBS_NCI_FS_GDATA1A=0,PBS_NCI_FS_GDATA1B=0,
        PBS_NCI_FS_GDATA2=0,PBS_NCI_FS_GDATA3=0,PBS_NCI_FS_GDATA4=0,
        PBS_O_QUEUE=normal,PBS_JOBFS=/jobfs/118782756.gadi-pbs
    comment = Job run at Fri Jun 21 at 14:37 on (gadi-cpu-clx-2507:ncpus=48:mem
        =201326592kb:jobfs=10485760kb)
    etime = Fri Jun 21 11:07:37 2024
    run_count = 1
    Submit_arguments = -q normal -P tm70 -l walltime=172800 -l ncpus=48 -l mem=
        192GB -l jobfs=10GB -N 1deg_jra55do_ia -l wd -j n -v PAYU_PATH=/g/data/
        vk83/apps/payu/1.1.3/bin,PAYU_FORCE=True,
        MODULESHOME=/opt/Modules/v4.3.0,
        MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,
        MODULEPATH=/g/data/vk83/modules:/etc/scl/modulefiles:/etc/scl/modulefi
        les:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/m
        odulefiles:/apps/Modules/modulefiles -l storage=gdata/ik11+gdata/qv56+g
        data/tm70+gdata/vk83 -- /g/data/vk83/apps/payu/1.1.3/bin/python3.9 /g/d
        ata/vk83/apps/payu/1.1.3/bin/payu-run
    executable = <jsdl-hpcpa:Executable>/g/data/vk83/apps/payu/1.1.3/bin/python
        3.9</jsdl-hpcpa:Executable>
    argument_list = <jsdl-hpcpa:Argument>/g/data/vk83/apps/payu/1.1.3/bin/payu-
        run</jsdl-hpcpa:Argument>
    project = tm70
    Submit_Host = gadi-login-05.gadi.nci.org.au

Ping @jo-basevi

Thanks @anton for raising this issue!

The documentation on the git signing issue, which you noted above, gives the following setting to use a later version of the ssh-keygen executable (>= 8.2p1):

git config --global gpg.ssh.program /g/data/hh5/public/apps/miniconda3/envs/analysis3/bin/ssh-keygen

The above requires access to /g/data/hh5 during the payu run submission. This is fine if payu was loaded from hh5’s conda environment; otherwise hh5 will need to be added to the storage mounts in config.yaml. I tested the above with vk83’s payu and its ssh-keygen executable: /g/data/vk83/apps/payu/1.1.3/bin/ssh-keygen

Another temporary fix could be to manually load the payu module in the PBS job via config.yaml (e.g. using vk83’s payu):

modules:
  use:
    - /g/data/vk83/modules
  load:
    - payu/1.1.3

However, testing both of the above in payu run, I ran into the following errors, as the private key was on my laptop:

error: No private key found for public key "/home/189/$USER/.ssh/id_ed25519.pub"?

fatal: failed to write commit object

So I generated a new signing key on gadi and added it to github, but then still ran into errors:

error: Enter passphrase: Load key "/home/189/$USER/.ssh/id_ed25519_sign": incorrect passphrase supplied to decrypt private key?

fatal: failed to write commit object

I then created a passphrase-less key on gadi but I didn’t add it to github, and there were no errors in the payu run pbs submission job. However having a passphrase-less key and a public/private key hosted on gadi does not seem like the best idea. I am no expert when it comes to ssh-keys and don’t know if there’s a way to forward ssh keys to jobs running on compute nodes, so if someone has a better idea, please let me know!

Otherwise, I think adding an environment variable in payu to point to the git binary of the environment loaded could be a good idea. Maybe payu modules could be loaded inside the pbs run job - that might require passing LOADEDMODULES environment variable to the job to be able to find what modules were loaded.

Thank you Jo!

Running this solved my problem:

$ git config --global gpg.ssh.program /g/data/vk83/apps/payu/1.1.3/bin/ssh-keygen

I ended up using different keys for commit signing and for pushing commits. For signing the commits I use one without a password (because having to type a password in for every commit in a rebase was painful.)
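If a passphrase-less key is only wanted for payu’s automatic commits, one option is to scope the signing settings to the experiment repository rather than the global config. A minimal sketch, assuming bash and a git recent enough to support SSH signing; configure_repo_signing is a hypothetical helper, and the key path in the usage comment is a placeholder:

```shell
# Write SSH-signing settings into one repository's local git config.
#   $1: path to the repository
#   $2: path to the signing key to use for this repository
#   $3: (optional) ssh-keygen executable to use for signing
configure_repo_signing() {
    local repo=$1 signing_key=$2 keygen=${3:-}
    git -C "$repo" config gpg.format ssh &&
        git -C "$repo" config user.signingkey "$signing_key" &&
        git -C "$repo" config commit.gpgsign true || return 1
    if [ -n "$keygen" ]; then
        git -C "$repo" config gpg.ssh.program "$keygen"
    fi
}

# Example usage (placeholder key path):
# configure_repo_signing /g/data/tm70/as2285/payu/MOM6-CICE6 \
#     "$HOME/.ssh/<passphrase-less-signing-key>" \
#     /g/data/vk83/apps/payu/1.1.3/bin/ssh-keygen
```

Repository-local settings take precedence over the global ones, so other repositories would keep using the passphrase-protected key.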

I see now that the solution in the documentation will not be good for all people.

I think the error @anton experienced happened because the payu PBS job does not add gdata/hh5 as a storage path (hh5 being the project where the selected ssh-keygen executable resides).
This would also happen if the selected ssh-keygen executable came from the vk83 project but gdata/vk83 was not added to the storage of a PBS job that needs to sign a git commit.

At the moment, I don’t see a general solution for this because I don’t think there is a valid ssh-keygen executable that resides in a project that would always be accessible from any PBS job.
A “good-enough” workaround, for the moment, might be to replace the setting of the ssh-keygen executable with the following logic:

  • Have a list of projects with an ssh-keygen executable valid for signing commits (>= 8.2p1).
  • Loop through that list and for each project test if the ssh-keygen executable is accessible (or more specifically can be run).
  • If it can be run, set it. If it cannot, continue to the next project.

This might solve the issue in most of the cases, but still does not solve it for all of them (for example, if no project path in the list above is added to storage).
But I don’t see any general solution other than updating Gadi’s “base” ssh-keygen executable, which for now doesn’t seem to be happening.
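The list-and-loop logic above could be sketched roughly as follows, assuming bash. The function names are hypothetical, and the probe (generate a throwaway key, then try ssh-keygen -Y sign with it) is just one possible way to test whether a candidate can actually sign:

```shell
# Return 0 if the given ssh-keygen can be run and can produce an SSH signature.
supports_ssh_signing() {
    local keygen=$1 tmp status
    [ -x "$keygen" ] || return 1
    tmp=$(mktemp -d) || return 1
    # Throwaway passphrase-less key, used only for this probe.
    "$keygen" -q -t ed25519 -N '' -f "$tmp/key" >/dev/null 2>&1 &&
        printf 'probe' | "$keygen" -Y sign -n git -f "$tmp/key" >/dev/null 2>&1
    status=$?
    rm -rf "$tmp"
    return "$status"
}

# Print the first candidate that passes the probe; return 1 if none does.
pick_ssh_keygen() {
    local candidate
    for candidate in "$@"; do
        if supports_ssh_signing "$candidate"; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Example usage with the two executables mentioned in this thread:
# keygen=$(pick_ssh_keygen \
#     /g/data/vk83/apps/payu/1.1.3/bin/ssh-keygen \
#     /g/data/hh5/public/apps/miniconda3/envs/analysis3/bin/ssh-keygen) &&
#     git config --global gpg.ssh.program "$keygen"
```

A candidate whose project is not in the job’s storage list simply fails the -x check and is skipped, and if none of the candidates is reachable the function returns 1 and nothing is set.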

Do you have better suggestions? @jo-basevi

An alternative to that (if you sign your commits with SSH, which is recommended) is to add the signing ssh-key to the agent at the start of your Gadi session, so you have to insert the password only once every time you log in to Gadi.
I have the following little bash function called addkeys in my ~/.bashrc:

function addkeys {
    echo Starting ssh-agent
    eval "$(ssh-agent -s)"
    echo Adding ssh-key to agent
    ssh-add <path-to-your-signing-ssh-key>
}

So I run addkeys once whenever I need to add my signing ssh-key to the agent and I don’t need to worry about it anymore for the entire Gadi session.
You could also run the script directly in your ~/.bashrc (I do it), but I would not suggest it unless you know exactly under what constraints the script needs to be run (only interactive session, only login session, etc…).
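A guarded variant that could be called from ~/.bashrc might look like the sketch below. It assumes bash, that skipping non-interactive shells is the constraint you want, and a hypothetical SIGNING_KEY variable pointing at your key; the function names are illustrative too. It relies on ssh-add -l exiting with status 2 when no agent can be contacted and 1 when the agent holds no identities:

```shell
# Hypothetical default; set SIGNING_KEY to the path of your own signing key.
SIGNING_KEY="${SIGNING_KEY:-$HOME/.ssh/id_ed25519_sign}"

# True only for interactive shells, so batch jobs and scp sessions never prompt.
is_interactive() {
    case $- in
        *i*) return 0 ;;
        *)   return 1 ;;
    esac
}

# True unless ssh-add reports that it cannot contact an agent (exit status 2).
agent_reachable() {
    ssh-add -l >/dev/null 2>&1
    [ $? -ne 2 ]
}

addkeys_once() {
    is_interactive || return 0
    if ! agent_reachable; then
        eval "$(ssh-agent -s)" >/dev/null
    fi
    # Only add the key (and prompt for its passphrase) if the agent holds no identities.
    ssh-add -l >/dev/null 2>&1 || ssh-add "$SIGNING_KEY"
}
```

The agent started here runs on the login node, so this is aimed at interactive work such as rebases rather than at commits made inside a PBS job.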

I don’t know if there is any safe alternative (with a password that is not hardcoded anywhere) that avoids entering the password every session and only requires it once.

It’s a moot point at this stage, but I actually just hadn’t set that variable.

Does the git commit need to happen within the PBS job? Is there a point where it could be executed from the login session?

Oh you did not set the variable at all!
Well that’s fine. I think the issue would still persist for some users.

I think @jo-basevi will have to answer this.
In any case, I would still add the line

git config --global gpg.ssh.program /g/data/vk83/apps/payu/1.1.3/bin/ssh-keygen

at the start of the payu workflow so there would never be a problem (even if people had not set an ssh-keygen executable or had set a different one in their ~/.bashrc).
And if there is any git commit in a PBS job run by payu, I would also include the line at the beginning of the job, for the same reason.

Yeah, this would be ideal, but there’s not really a good place to add a commit with how payu runs at this point. Currently payu adds a commit after the model has actually been run. When payu run runs, the only step that really happens on the login node is the generation of the PBS submission command. The manifests, which get added to the git commits, are created/updated at the start of the PBS job. So adding a commit earlier might require running payu setup or an equivalent to get the manifests before the PBS job has been submitted. Running the experiment setup is probably needed anyway to ensure that the model is going to run, as we don’t want to add a commit if there are initial errors…

I think rethinking when/how payu does the commit might be needed to support commit signing, so it can work with signing keys that have been forwarded from the host machine, and keys that have a passphrase. I’ve created a new issue for payu here: Support for signing git commits · Issue #449 · payu-org/payu · GitHub

I think this seems like a good workaround in the meantime to work with pbs jobs, with passphrase-less keys.

I am going to reply with a comment in the issue you created.