ESM1.6 Development using NRI repos and PAYU

@Aidan suggested I document the issue here as it is getting unnecessarily complicated in the general chat.

Background
The source code in NRI’s UM7 repo is a copy of the CMIP6 version of ESM1.5. In UM7.3 the native Land-Surface Model (LSM) was MOSES (which later morphed into JULES), and it was embedded in the UM code base. In ACCESS-ESM1.5 we coupled CABLE to the UM as the LSM, and thus also embedded it in the UM code base. The cable/casa files are scattered through the src/atmosphere/boundary_layer/ directory. It is difficult to pin down exactly, but we call this version of CABLE CABLE2.4. Aside from changes on the ocean side, changes to the LUC scheme, Tammas’ thinning work, etc., the difference between ESM1.5 and ESM1.6 with respect to CABLE is that we want to use CABLE3. We have been running a prototype ESM1.5+CABLE3 for 18 months now. It is being refined at the moment; however, we build and run it within the script-based framework that we used for CMIP6.

Aside from algorithm differences, science developments and refactored code within CABLE itself, the biggest difference from the UM’s perspective is that in ESM1.5+CABLE3 all of the cable/casa files scattered in src/atmosphere/boundary_layer/ are removed and replaced by src/atmosphere/CABLE/, where CABLE is a clone of the CABLE:main branch. This version of CABLE is then the same across applications: offline, JAC and AM3 (although AM3 updates are still being knocked about in PRs). All of this builds and runs in our CMIP6 script-based framework. Pearse has been running the aforementioned ESM1.6 ocean coupled to an executable from here.

Issue
So, hopefully this will un-complicate things. All I am trying to do is reproduce this using NRI’s framework. I CAN check out a plain vanilla pre-industrial PAYU configuration, point to a UM7 tag and run what is essentially ESM1.5.

Branching from main in the UM7 repo, I remove all existing CABLE code and replace it with CABLE3 as described above.

In the ACCESS-ESM1.6 repo I can point to this branch and successfully deploy to Gadi. I can take this executable and run it using our script-based method. BUT when I point PAYU to this module and pick up the same executable, the model crashes in the coupler.

I am going to try establishing a deployed pre-release of an unmodified branch from UM7.
DOH! deployment halted on error

And there we go. payu pointing to ACCESS-ESM1.6 deployed executables where the only change is to a user branch in the UM7 repo. That user branch is just a straight-up copy of main. The run collapses on the first time step. Not necessarily in the coupler but in the top-level UM, talking to mod_prism* and oasis*, which I suspect is coupler related. Any ideas?

Hi @Jhan, would you be happy for me to take a look at the directory that you are running the payu simulation from, in case there are any payu issues that stand out?

Thanks,
Spencer

of course;

/g/data/p66/jxs599/ESM16/PAYU/preindustrial+concentrations

Thanks @Jhan!
I don’t currently have read permission for this directory. Would you be happy to add read permissions for it? Otherwise making a readable copy somewhere accessible on scratch is also an option.

/scratch/public/jxs599/Spencer/

I also just read your earlier post.

I tried just now with my unadulterated branch of main (UM7) and same deal. It’s definitely not something I’ve touched in the code. It could be in my branching, gitting etc. but I don’t even think that’s it. Including tracers is a good culprit. In no way do I require anything but the old ocean for now. It is probably even better, as then we can better compare developments in CABLE alone.

Where/How did you get the error you posted?

Sorry, I didn’t read properly. That particular error isn’t. These processor errors in the UM output are not very reliable, particularly the order in which they occur. I have dozens of FATAL:on PE XX errors. They have initiated many wild goose chases over the years. I’ll get back to you later with another, seemingly intractable problem, but it isn’t related to this one.

Further down from the described FATAL error messages are traceback errors. Again, these are hard to make sense of. No errors anywhere near the LSM. Not that this totally absolves the LSM, BUT in my later run that also failed I didn’t touch the code AND the same executable runs fine outside of payu.

Hi @Jhan, unfortunately it’s still not letting me read /scratch/public/jxs599/Spencer/. Could I get you to try copying it into /scratch/public/sw6175?

done

I forgot you need execute (x) permission to open directories

@Jhan @spencerwong what is missing in the public/ directory are permissions for others: chmod -R o+rX /scratch/public/jxs599/Spencer/. Just mentioning it in case it’s useful here or later.
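In case the capital X is unfamiliar, here is a minimal sketch of what `chmod -R o+rX` does, run on a throwaway directory rather than on anything under /scratch. The names `run` and `config.yaml` are made up for illustration, and `stat -c` assumes GNU coreutils (as on Gadi).

```shell
# Print only the "others" permission bits of a path.
# Characters 8-10 of the symbolic mode (e.g. drwxr-xr-x) belong to "others".
others_bits() {
    stat -c '%A' "$1" | cut -c8-10
}

# Build a throwaway tree, strip all access for others, then apply o+rX.
# The names run/ and config.yaml are hypothetical.
show_rX_effect() {
    demo=$(mktemp -d) || return 1
    mkdir -p "$demo/run"
    touch "$demo/run/config.yaml"
    chmod -R o-rwx "$demo"     # start with nothing for others
    chmod -R o+rX "$demo"      # r everywhere, x only where it makes sense
    echo "directory: $(others_bits "$demo/run")"
    echo "file: $(others_bits "$demo/run/config.yaml")"
    rm -rf "$demo"
}
```

Calling `show_rX_effect` should report `r-x` for the directory and `r--` for the file: the capital X adds execute (search) permission to directories, but leaves plain files non-executable unless they already had an execute bit for someone.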

Thanks @clairecarouge! @Jhan if you are happy to run the above command that would be great, as I’m having difficulty accessing the copied directories.

Longer term, Claire mentioned that it would be a better option for everyone to share payu configurations by pushing them to personal repositories on github. There are instructions for this in the Training Day payu tutorial, which I’ll copy the relevant parts from below:

(I’ve modified the instructions to use SSH rather than https)

Authorise with GitHub

gh is included in payu modules supported by ACCESS-NRI. As long as the payu command is available gh should be also.

The first step is to authorise with GitHub:

gh auth login

This will prompt for a series of responses. Select the responses used below:

? What account do you want to log into? GitHub.com
? What is your preferred protocol for Git operations on this host? SSH
? Upload your SSH public key to your GitHub account? Skip
? How would you like to authenticate GitHub CLI? Login with a web browser

! First copy your one-time code: XXXX-XXXX
Press Enter to open github.com in your browser... 

At this point you will get an error opening a browser on gadi:

! Failed opening a web browser at https://github.com/login/device
  exec: "xdg-open,x-www-browser,www-browser,wslview": executable file not found in $PATH
  Please try entering the URL in your browser manually

So open https://github.com/login/device in your browser, authenticate with GitHub if you’re not already logged in, copy the one-time code from your terminal window and paste it in. Then authentication should complete:

✓ Authentication complete.
- gh config set -h github.com git_protocol ssh
✓ Configured git protocol
! Authentication credentials saved in plain text
✓ Logged in as xxxxxxxxxxx

To check status use gh auth status

$ gh auth status
github.com
  ✓ Logged in to github.com account xxxxxxxxx (/home/XXX/xxxXXX/.config/gh/hosts.yml)
  - Active account: true
  - Git operations protocol: ssh
  - Token: gho_************************************
  - Token scopes: 'gist', 'read:org', 'repo'

Push to repo

Next navigate to the payu control directory that you will be sharing, and make sure to commit any changes that you have made.

Now you can create a repository from your control directory using

gh repo create

Follow the prompts and enter the information requested. Repository name will default to the directory name of your control directory. The repository owner should be your GitHub username used to authenticate. Choose public visibility. The remote name is just an alias to your repository that git uses when doing a push or pull.

? What would you like to do? Push an existing local repository to GitHub
? Path to local repository .
? Repository name XXXXX
? Repository owner xxxxxxxxx
? Description A nice description of the purpose of the repository
? Visibility Public
✓ Created repository xxxxxxxxx/XXXXX on GitHub
  https://github.com/xxxxxxxxx/XXXXX
? Add a remote? Yes
? What should the new remote be called? myrepo
✓ Added remote git@github.com:xxxxxxxxx/XXXXX.git
? Would you like to push commits from the current branch to "myrepo"? Yes

This will create a personal repository containing the configuration on github that can then be shared. Let me know if you have any questions!

Thanks,
Spencer

I’ve made the directories executable so you should be able to see everything now. I’ll have a look at the tutorial a bit later. I have some sorting to do with the runs I have going

Thanks @Jhan, would I be able to get you to adjust the permissions for the parent directory with chmod -R o+rX /scratch/public/jxs599 – I think hopefully I should then be able to access it. Apologies for all the messing around!
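If it helps with tracking down which level is blocking access, below is a rough sketch that walks up from a directory and reports the topmost parent that others cannot enter. The function name `first_blocked_dir` and its optional second argument are my own invention; it only reads the mode bits with GNU `stat`, so it knows nothing about ACLs or group membership.

```shell
# Report the topmost directory between "$2" (default /) and "$1" that lacks
# the execute (search) bit for others. Prints nothing if none is blocked.
# Hypothetical helper: reads mode bits only, ignores ACLs.
first_blocked_dir() {
    path=$1
    top=${2:-/}
    blocked=""
    while [ "$path" != "$top" ] && [ "$path" != "/" ] && [ "$path" != "." ]; do
        if [ -d "$path" ]; then
            # Character 10 of the symbolic mode is the others-execute bit.
            bit=$(stat -c '%A' "$path" | cut -c10)
            case $bit in
                x|t) ;;                 # searchable by others
                *) blocked=$path ;;     # remember it; a higher one may overwrite
            esac
        fi
        path=$(dirname "$path")
    done
    if [ -n "$blocked" ]; then echo "$blocked"; fi
}
```

For example, `first_blocked_dir /scratch/public/jxs599/Spencer` would print the highest directory on that path missing o+x, which is the one to fix first.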

Done. Apology not necessary. I’m sure the mask on public used to already do this.

Thanks @Jhan, that’s working for me now.

I’ve taken a look and think I have an idea of why the issue is occurring. I’ll give a quick summary of the main points:

  • The MOM5 executable deployed by the pull request is called fms_ACCESS-ESM.x, while the config.yaml file specifies the name of the MOM5 executable as fms_ACCESS-CM.x, the name previously used for the ESM1.5 builds.

  • When payu searches for the executable fms_ACCESS-CM.x specified in the config.yaml, it fails to find it and falls back to searching /scratch/$PROJECT/$USER/access-esm/bin/. Normally, this would be empty and payu would exit before starting the simulation. In your case, it has actually found an executable in this directory under the path /scratch/p66/jxs599/access-esm/bin/fms_ACCESS-CM.x, and has tried to run with it. (The manifests/exe.yaml file shows the paths to the executables payu has found).

  • The executable at /scratch/p66/jxs599/access-esm/bin/fms_ACCESS-CM.x predates the latest released ESM1.5 payu configurations and isn’t compatible with it, leading to the model crashing only after it has started running.
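A quick way to spot this kind of fallback is to list the paths recorded in manifests/exe.yaml and flag any that sit outside the place you expect the deployed executables to come from. The sketch below assumes each manifest entry carries a `fullpath: <path>` line, which is my reading of the file rather than a documented guarantee; the function name and the prefix argument are hypothetical.

```shell
# List executable paths from a payu exe manifest and flag any that are not
# under the expected prefix. Assumes one "fullpath: <path>" line per entry.
flag_unexpected_exes() {
    manifest=$1
    prefix=$2
    sed -n 's/^[[:space:]]*fullpath:[[:space:]]*//p' "$manifest" |
    while IFS= read -r exe; do
        case $exe in
            "$prefix"/*) echo "ok     $exe" ;;
            *)           echo "CHECK  $exe" ;;
        esac
    done
}
```

Run from the control directory as `flag_unexpected_exes manifests/exe.yaml <prefix>`, where `<prefix>` is wherever your deployed modules live. Anything marked CHECK, such as a path under /scratch/$PROJECT/$USER/access-esm/bin/, is worth a second look.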

Recommendations:

  • The ESM1.5 configurations in the ACCESS-NRI/access-esm1.5-configs GitHub repository (the standard ACCESS-ESM1.5 configurations released and supported by ACCESS-NRI) aren’t compatible with the ESM1.6 MOM5 executables, which use the generic tracers version of WOMBAT. @Aidan and the model release team are working on a suitable ESM1.6 pre-industrial configuration, which might be the best configuration for developers to branch from when it’s ready.

  • In the meantime, if you would like to run the CABLE3 changes with ESM1.5’s version of the ocean, you can change the MOM5 version to the latest ESM1.5 version in the spack.yaml file, i.e. replace

    mom5:
       require:
         - '@git.dev_2024.08.14=access-esm1.6'
         - '+access-gtracers'
    

    with

    mom5:
       require:
         - '@git.access-esm1.5_2024.08.23=access-esm1.5'
    

    @harshula – let me know if I’m missing anything here!

  • It would be best to also remove any executables from the /scratch/p66/jxs599/access-esm/bin/ directory, as payu can pick them up when it fails to find the correct ones.
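If you would rather not delete the old executables outright, one option is to move them into a dated sibling directory so payu’s fallback search no longer finds them. This is only a sketch: the function name `stash_stale_exes` and the `.stale-YYYYMMDD` naming are hypothetical, and it touches regular files only, leaving any subdirectories in place.

```shell
# Move every regular file out of a bin directory into a sibling directory
# named <bindir>.stale-YYYYMMDD. Hypothetical helper; subdirectories are left alone.
stash_stale_exes() {
    bindir=${1%/}
    [ -d "$bindir" ] || { echo "no such directory: $bindir" >&2; return 1; }
    stash="$bindir.stale-$(date +%Y%m%d)"
    for f in "$bindir"/*; do
        [ -f "$f" ] || continue     # skips directories and the empty-glob case
        mkdir -p "$stash"
        mv "$f" "$stash/" && echo "moved $(basename "$f")"
    done
}
```

For the case above this would be `stash_stale_exes /scratch/p66/jxs599/access-esm/bin`. Once the runs are behaving, the stash directory can be removed by hand.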


Details on the crash in case they are useful:

In the control directory, the access.err error file contains the following:

FATAL from PE   59: MPP_OPEN:INPUT/ocmip2_xkw_monthly_om1p5_bc.nc does not exist.

MOM5 is failing to find the above ocmip2_xkw_monthly_om1p5_bc.nc input file. This file isn’t actually required for any of the ESM1.5 configurations, and so the MOM5 code was recently modified to no longer require it (these changes were brought into this release).

Potentially, the executable used in the run /scratch/p66/jxs599/access-esm/bin/fms_ACCESS-CM.x predates these changes, causing it to crash when it is run with the released ESM1.5 pre-industrial configuration.
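Because the same FATAL message tends to be repeated by many PEs, and in no dependable order, it can be easier to read the error file after collapsing the duplicates. A minimal sketch, assuming the lines look like the `FATAL from PE   59: ...` example above (the function name is made up):

```shell
# Strip the PE number from each "FATAL from PE <n>:" line and count how many
# times each distinct message occurs, most frequent first.
summarise_fatals() {
    grep '^FATAL from PE' "$1" |
        sed -E 's/^FATAL from PE +[0-9]+: *//' |
        sort | uniq -c | sort -rn
}
```

Running `summarise_fatals access.err` in the control directory gives one line per distinct message with a count in front, which makes a missing-input-file error like the one above stand out from the noise.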

Let me know if you have any questions about this!
Cheers,
Spencer


thanks @spencerwong, I’ll hopefully get to this in the afternoon

so what about the lines further down:

access-fms:
  require:
    - '@development'
access-generic-tracers:
  require:
    - '@development'
access-mocsy:
  require:
    - '@mom5'

I had a message from @harshula this morning to merge in recent changes to required packages (or something), but now I can’t find it. I think this is what is stopping the CI build now.

Hi Jhan

I think pull request #13 in ACCESS-NRI/ACCESS-ESM1.6 (“Test run branch um7b” by JhanSrbinovsky) is working now - see the comments in that PR