ACCESS-CM2 persistent session problems

Hi all,

I have been setting up a new experiment (dd in ACCESS-CM2, branching from suite cy286, which ran with no issues on the accessdev machine.

I encounter the error shown below in the fcm_make_drivers process:

[FAIL] config-file= - https://trac.nci.org.au/svn/access_tools/access-cm2-drivers/trunk/fcm_make/drivers.cfg@712
[FAIL] https://trac.nci.org.au/svn/access_tools/access-cm2-drivers/trunk/fcm_make/drivers.cfg@712: cannot load config file
[FAIL] https://trac.nci.org.au/svn/access_tools/access-cm2-drivers/trunk/fcm_make/drivers.cfg@712: not found
[FAIL] svn: E170013: Unable to connect to a repository at URL 'https://trac.nci.org.au/svn/access_tools/access-cm2-drivers/trunk/fcm_make/drivers.cfg'
[FAIL] svn: E215004: No more credentials or we tried too many times.
[FAIL] Authentication failed

[FAIL] fcm make -f /scratch/e14/sm2435/cylc-run/u-dd534/work/09900101/fcm_make_drivers/fcm-make.cfg -C /home/561/sm2435/cylc-run/u-dd534/share/fcm_make_drivers -j 4 mirror.target=gadi.nci.org.au:cylc-run/u-dd534/share/fcm_make_drivers mirror.prop{config-file.name}=2 # return-code=1
2024-02-16T03:36:49Z CRITICAL - failed/EXIT

This was an issue on accessdev if you hadn't run access-auth and saved credentials - see this topic below:

In the persistent session, access-auth is not a script or command that can be run, so how can we save credentials for this to work?
Let me know if this is the issue or if it is being caused by another error.

Thanks!
Sebastian

Hi Sebastian,
Your suite is trying to access https://trac.nci.org.au/svn/access_tools/access-cm2-drivers/trunk/fcm_make/drivers.cfg

In order to do so, your suite needs to supply NCI login credentials to trac.nci.org.au and it cannot do so from within the persistent session.
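For anyone checking whether their own failure is the same one, here is a minimal sketch that scans a job log for the Subversion authentication signature shown in the log above (svn error E215004 or "Authentication failed"). The function name has_svn_auth_failure and the log path in the usage comment are hypothetical, not part of any suite.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a job log contains the svn
# authentication failure reported in the original post.
has_svn_auth_failure() {
    local log_file="$1"
    # -E: extended regex, -q: quiet (exit status only)
    grep -Eq 'svn: E215004|Authentication failed' "$log_file"
}

# Example usage (replace the placeholder path with your own job.err):
# if has_svn_auth_failure path/to/job.err; then
#     echo "svn could not authenticate to the repository server"
# fi
```

If the function reports a match, the build is failing on credentials rather than on the configuration itself.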

The solution is to either migrate the access_tools repository elsewhere, or to use a different repository to obtain access-cm2-drivers/trunk/fcm_make/drivers.cfg

Luckily, such a repository already exists. See the GitHub repository ACCESS-NRI/access-cm2-drivers (Driver scripts for the ACCESS-CM2 coupled model), and see the file suite.rc in roses-u suite u-cy339 for an example of how to use this repository.

Please note that the use of access-cm2-drivers in u-cy339 differs from that in u-cy286. In u-cy339 the git clone command is used instead of fcm_make. This is because the suite is dealing with a Git repository instead of a Subversion repository.

Please also note that ACCESS-CM2 setup and tutorial issues/problems - #2 by MartinDix is now out of date. In particular, both the trac.nci.org.au server and the accessdev.nci.org.au server are being retired, with accessdev due to be switched off by March 2024.

Thanks @paulleopardi

Can I just copy over the suite.rc from cy339 to my suite and it will work?

Or am I better off modifying a new version of cy339 to do what my suite does?

Thanks,
Sebastian

Hi Sebastian,
Here is what I would do. Take a look at the instructions for running CM, then check out u-cy339 and note the differences between it and your u-cy286. Update u-cy286 with whichever of those differences are needed to run the suite without reference to trac.nci.org.au or accessdev.nci.org.au. You may prefer to take a copy of u-cy286 and update the copy instead.
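One way to note the differences is a plain unified diff of the two suite.rc files. The sketch below assumes both suites are already checked out as directories; the function name compare_suite_rc and the ~/roses paths in the usage comment are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a unified diff between the suite.rc files
# of two suite directories.
compare_suite_rc() {
    local old_suite="$1" new_suite="$2"
    local status
    diff -u "$old_suite/suite.rc" "$new_suite/suite.rc"
    status=$?
    # diff exits 1 when the files differ, which is expected here.
    # Only an exit status above 1 (e.g. a missing file) is a failure.
    [ "$status" -le 1 ]
}

# Example usage (paths are assumptions):
# compare_suite_rc ~/roses/u-cy286 ~/roses/u-cy339 | less
```

Lines starting with "-" come from the first suite and lines starting with "+" from the second.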

NCI also has some resources for transitioning from accessdev to persistent sessions, including weekly Zoom help sessions:

https://opus.nci.org.au/display/DAE/Moving+from+Accessdev

Thanks all for the help!

Here is what I did to get the suite working, and likely what you need to do if you have old suites to move over from accessdev.

In suite.rc you need to modify a few lines in different places:

  1. In the [scheduling] block, change fcm_make_drivers and fcm_make2_drivers (if applicable) to make_drivers.
  2. In the [runtime] block, change the `{% if BUILD_DRIVERS %}` sub-block to the following:
{% if BUILD_DRIVERS %}
    [[make_drivers]]
        inherit = BUILD, NCI
        script = """
cd $CYLC_SUITE_SHARE_DIR
if [ -d access-cm2-drivers ] ; then
  rm -rf access-cm2-drivers
fi
git clone https://github.com/ACCESS-NRI/access-cm2-drivers.git
"""
        [[[directives]]]
            -q = copyq
{% endif %}
  3. So that the new driver scripts get read, go to this section in suite.rc:
[runtime]
...
      [[NCI]]

Add this line at the top:

        env-script = "eval $(rose task-env --path=share/access-cm2-drivers/src)"

Still under [runtime], in [[NCI]], change the remote host to the following:

[[[remote]]]
              host = localhost
  4. Under [scheduling], in the 'if BUILD_UM' block, remove "=> fcm_make2_um" so that the lines become:
{% if BUILD_UM %}
fcm_make_um
{% endif %}
  5. Delete references to fcm-make.cfg from the fcm_make_um process in the [runtime] block. The block should now read:
{% if BUILD_UM %}
    [[fcm_make_um]]
        inherit = BUILD, NCI_BUILD, UMBUILD, NCI
        [[[ environment ]]]
            ROSE_TASK_OPTIONS = --archive
        [[[job]]]
            execution time limit = PT40M
        [[[directives]]]
            -l ncpus=6
            -l mem = 24gb
            -l jobfs = 2gb
            -q = {{NCIEXQ}}
{% endif %}
  6. There is also a change to make in app/fcm_make_um/rose-app.conf to ensure that the UM builds correctly:
mirror=preprocess-atmos build-atmos preprocess-recon build-recon
  7. Include lines on SHARE_NODES at the beginning of suite.rc, so that it changes from:
{% set ICE_NPROCS = ((NXBLK_ICE*NYBLK_ICE)/CICE_MAXBK)|round(0,'ceil')|int %}
{% set NNODE_OCNICE = ((MOM_CPUS+ICE_NPROCS)/NSLOTS)|round(0,'ceil')|int %}
{% set NUM_NODES = NNODE_ATM + NNODE_OCNICE %}
# Should allow for undercommitted nodes here

To this:

... (lines above unchanged)
{% set ICE_NPROCS = ((NXBLK_ICE*NYBLK_ICE)/CICE_MAXBK)|round(0,'ceil')|int %}
{% if SHARE_NODES %}
# Allow ocean and ice models to share node
{% set NNODE_OCNICE = ((MOM_CPUS+ICE_NPROCS)/NSLOTS)|round(0,'ceil')|int %}
{% set NUM_NODES = NNODE_ATM + NNODE_OCNICE %}
{% else %}
{% set NUM_NODES = NNODE_ATM + NNODE_OCN + NNODE_ICE %}
{% endif %}
# Should allow for undercommitted nodes here
... (lines below unchanged)

Finally, add a line in the [coupled] block. Find "#For CICE" and add the last line shown here, for SHARE_NODES:

           #For CICE
            ICE_NPROCS={{ICE_NPROCS}}
            NSLOTS = {{NSLOTS}}
            SHARE_NODES = {{SHARE_NODES}}

In rose-suite.conf add:
SHARE_NODES=true
Lastly, change COMPUTE_HOST in rose-suite.conf to "localhost", since the job is now being submitted from Gadi.
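
To see what the SHARE_NODES switch in step 7 does to the node count, here is a small worked example of the same ceiling arithmetic in shell. All the numbers are made up for illustration, and the sketch assumes that NNODE_OCN and NNODE_ICE (which are defined elsewhere in suite.rc and not shown above) are each rounded up separately.

```shell
#!/usr/bin/env bash
# Integer ceiling division, mirroring Jinja2's round(0,'ceil')|int.
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical values, for illustration only.
MOM_CPUS=84
ICE_NPROCS=12
NSLOTS=48

# SHARE_NODES true: ocean and ice processes are packed onto shared nodes.
NNODE_OCNICE=$(ceil_div $((MOM_CPUS + ICE_NPROCS)) "$NSLOTS")

# SHARE_NODES false: assumes each model's node count is rounded up on its own.
NNODE_OCN=$(ceil_div "$MOM_CPUS" "$NSLOTS")
NNODE_ICE=$(ceil_div "$ICE_NPROCS" "$NSLOTS")

echo "shared: $NNODE_OCNICE nodes, separate: $((NNODE_OCN + NNODE_ICE)) nodes"
```

With these made-up numbers, sharing needs 2 ocean/ice nodes and not sharing needs 3, which is the saving the SHARE_NODES branch is there to allow.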

Hopefully these changes can help anyone trying to port a suite that was on accessdev and using fcm_make_drivers.
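
As a quick check after editing, the sketch below searches a suite directory for leftover references to the retired servers or to the old driver build tasks. The function name find_stale_references and the path in the usage comment are hypothetical; the search patterns are taken from the names used in this topic.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list lines under a suite directory that still
# mention trac.nci.org.au, accessdev, or the old fcm_make driver tasks.
# Exits 0 if any such line is found, 1 if none are found.
find_stale_references() {
    local suite_dir="$1"
    # -r: recurse, -n: show line numbers, -E: extended regex.
    # The .svn directory of a working copy is skipped.
    grep -rnE --exclude-dir=.svn \
        'trac\.nci\.org\.au|accessdev|fcm_make2?_drivers' "$suite_dir"
}

# Example usage (path is an assumption):
# if find_stale_references ~/roses/u-xxNNN; then
#     echo "The lines above may still need updating"
# fi
```

Matches in comments are harmless, so treat the output as a list of places to look rather than a list of errors.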

If you have any questions let me know!
Cheers,
Sebastian

The above changes get the model to install, but when running the model I come across a new issue. At the end of a timestep, I get this error:

/local/spool/pbs/mom_priv/jobs/108887102.gadi-pbs.SC: line 146: save_wallclock.sh: command not found
2024-02-22T03:36:20Z CRITICAL - failed/EXIT

The save_wallclock.sh script is referenced in my old suite.rc file, but it is not in the suite.rc for cy339 (the example suite).

This script is also not present in the access-cm2-drivers Git repo, which is why it cannot be run: access-cm2-drivers/src at main · ACCESS-NRI/access-cm2-drivers · GitHub
However, the script can be found in the old drivers repo, which was used by fcm_make_drivers: https://trac.nci.org.au/svn/access_tools/access-cm2-drivers/trunk/

My question now is: is this save_wallclock.sh script needed, what does it do, and is there an alternative in place if it is required for the model to run correctly?
@paulleopardi, maybe you know?
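
For reference, a "command not found" like the one above can be narrowed down with a check along these lines. It is only a sketch: the function name locate_script is hypothetical, and the directory in the usage comment assumes the drivers were cloned into the suite share directory as in the make_drivers task.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a script is on PATH and, if not,
# whether it exists anywhere under a given directory.
locate_script() {
    local name="$1" search_dir="$2"
    local hit
    if command -v "$name" >/dev/null 2>&1; then
        echo "on PATH: $(command -v "$name")"
        return 0
    fi
    # Not on PATH: look for a file of that name under search_dir.
    hit=$(find "$search_dir" -type f -name "$name" 2>/dev/null | head -n 1)
    if [ -n "$hit" ]; then
        echo "not on PATH, but present at: $hit"
        return 0
    fi
    echo "not found: $name"
    return 1
}

# Example usage (directory is an assumption):
# locate_script save_wallclock.sh "$CYLC_SUITE_SHARE_DIR/access-cm2-drivers"
```

A "not on PATH, but present" result would point at the env-script line, while "not found" means the script is simply absent from the directory searched.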

Hi @sebmckenna
Sorry, I don't know. Please repost your new question in a separate topic, so that it can be brought to the attention of others, and so that I can mark this topic as solved and close it. Thanks.

OK thanks.
I was advised that save_wallclock.sh most likely doesn't interact with the suite, so I turned the process off and the full coupled job runs.

I am now having issues with the post-processing, which I think is related to some changes between fcm_make_drivers and access-cm2-drivers. Since this is related to the driver issue, is it better to keep this all in one post or start a new one?

Please start a new topic. Your new question is far enough away from your original one to merit this. Also, this topic has been solved.