Perturbation experiments forked from ACCESS-OM2-025 RYF control run on ik11

Raising this as an independent issue so that we can track it and route it accordingly.

See the OP’s comments here:

@sschroeter, can you please use this topic for further discussion?

Hi @ben, I appear to have been confused by the RYF configurations. I was attempting to use restarts from the rmholmes/025deg_jra55_ryf configuration on GitHub to warm start the 0.25 degree RYF configuration at ACCESS-NRI/access-om2-configs (release-025deg_jra55_ryf branch), not realising the two are not the same. The two tutorials I found (Running Model Experiments with payu and git, and the Tutorials page on the COSIMA/access-om2 GitHub wiki) were extremely helpful, but it was difficult to find information about running experiments from the older versions of the model. I did not realise that the ACCESS-NRI configuration does not take the restarts from the old version, so I have to use the old version to use the restarts. Is this correct? Or is spun-up data from the ACCESS-NRI 0.25 degree RYF configuration available somewhere else?

Basically yes, that is correct.

In some circumstances some of the more recent COSIMA control experiments may be compatible with ACCESS-NRI releases, but that isn’t guaranteed and would need to be tested on a case-by-case basis.

No, I’m afraid not. ACCESS-NRI intends to release control experiments that use our released models, but that hasn’t happened yet.

I think I must have been doing something wrong when doing payu checkout, i.e. trying to check out the branch and restarts together with the git hash, as per the COSIMA tutorial. This morning I started over with a fresh experiment using the original configuration, switching to the correct branch and being very careful not to change anything except the project codes as I checked out my restarts, and now it appears to be working. Thank you for the suggestions @rmholmes and @Dhruv_Bhagtani!

Hey @sschroeter. It would be useful to know what didn’t work. If you can recall or find the commands you used, it would be great if you could share them.

@Aidan sure, no problem. I suspect the problem arose when I attempted to use the git commit identifier, as per the tutorial (Running Model Experiments with payu and git). Specifically, as an example, the following commands:

git clone https://github.com/rmholmes/025deg_jra55_ryf my_new_example

cd my_new_example/

payu checkout -r /g/data/ik11/outputs/access-om2-025/025deg_jra55_ryf9091_gadi/restart250/ -b new_branch_name a29b0e4

This seemed to produce a different config file, pointing to input_rc. I updated this to input_20200530 as per Ryan’s advice, changed the project codes, did the usual creation of the output250 and restart250 folders on scratch, and copied cice_in.nml across. Trying to payu run after this produces an error file with:

Currently Loaded Modulefiles:
 1) openmpi/4.1.4(default)   2) pbs
ERROR: Unable to locate a modulefile for 'openmpi-mofed4.7-pbs19.2/4.0.1'
ERROR: Unable to locate a modulefile for 'openmpi-mofed4.7-pbs19.2/4.0.1'
ERROR: Unable to locate a modulefile for 'openmpi-mofed4.7-pbs19.2/4.0.1'
payu: Model exited with error code 1; aborting.

But the error was not produced when I instead did the following:

git clone https://github.com/rmholmes/025deg_jra55_ryf my_new_example

cd my_new_example/

payu checkout -b new_branch_name ryf9091_gadi -r /g/data/ik11/outputs/access-om2-025/025deg_jra55_ryf9091_gadi/restart250/ 

So I stopped using the git commit identifier. I’m not sure what it did, but it seems to work ok without it? (Please correct me if I’m wrong!)
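
For reference, one way to check what difference the commit identifier made would be to diff the checked-out branch against that older commit. This is just a sketch using standard git commands, with a29b0e4 being the hash from the example above and config.yaml being payu’s configuration file:

git log --oneline -1
git diff a29b0e4 -- config.yaml namcouple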

[I hope it’s ok, I moved the replies from the original post to here]

Thanks for providing that detail @sschroeter.

When I tried to reproduce your error I also found and reported a payu bug when checking out an experiment with an incompatible project/laboratory, so thanks!

I located the problem. payu examines the openmpi libraries that the executables are linked against and tries to load the modules necessary to “find” those important libraries.

The executables in that config are linked against libraries that don’t have a corresponding module that can be loaded:

$ ldd /g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM_97e3429_libaccessom2_1bb8904.x | grep -i libmpi.so    
        libmpi.so.40 => /apps/openmpi-mofed4.7-pbs19.2/4.0.1/lib/libmpi.so.40 (0x000014c292564000) 

So payu reports an error when it runs

module load openmpi-mofed4.7-pbs19.2

as @Dhruv_Bhagtani surmised.
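
A quick way to confirm there is no matching module on the system is to query the module system directly (a sketch using the standard environment modules command):

module avail openmpi-mofed4.7-pbs19.2

which should show no matching module, consistent with the “Unable to locate a modulefile” errors above.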

But I think this is a red herring. When I do as you suggest and use the most recent commit in the ryf9091_gadi branch:

git clone https://github.com/rmholmes/025deg_jra55_ryf my_new_example
cd my_new_example/
payu checkout -b new_branch_name ryf9091_gadi -r /g/data/ik11/outputs/access-om2-025/025deg_jra55_ryf9091_gadi/restart250/

It is still using the same executable that references the problematic library path:

$ ldd /g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM_97e3429_libaccessom2_1bb8904.x | grep libmpi.so
        libmpi.so.40 => /apps/openmpi-mofed4.7-pbs19.2/4.0.1/lib/libmpi.so.40 (0x000014a873abd000)

I can confirm it also runs for me, but I do see the error messages in the output:

Currently Loaded Modulefiles:
 1) pbs   2) openmpi/4.1.4(default)  
ERROR: Unable to locate a modulefile for 'openmpi-mofed4.7-pbs19.2/4.0.1'
ERROR: Unable to locate a modulefile for 'openmpi-mofed4.7-pbs19.2/4.0.1'
ERROR: Unable to locate a modulefile for 'openmpi-mofed4.7-pbs19.2/4.0.1'

Note that these are just messages; the model runs fine.

I think the real difference is that the later commit has changes to the coupling remapping weights:

diff --git a/namcouple b/namcouple
index ec7dd51..5cebfbe 100644
--- a/namcouple
+++ b/namcouple
@@ -94,7 +94,7 @@ P  0  P  0
 #
 LOCTRANS MAPPING SCRIPR
 INSTANT
-../INPUT/rmp_jra55_cice_conserve.nc dst
+../INPUT/rmp_jra55_cice_1st_conserve.nc dst
 CONSERV LR SCALAR LATLON 10 FRACNNEI FIRST
 #########
 # Field 02 : lwflx down
@@ -105,7 +105,7 @@ P  0  P  0
 #
 LOCTRANS MAPPING SCRIPR
 INSTANT
-../INPUT/rmp_jra55_cice_conserve.nc dst
+../INPUT/rmp_jra55_cice_1st_conserve.nc dst
 CONSERV LR SCALAR LATLON 10 FRACNNEI FIRST
 ##########
 # Field 03 : rainfall
@@ -116,7 +116,7 @@ P  0  P  0
 #
 LOCTRANS MAPPING SCRIPR
 INSTANT
-../INPUT/rmp_jra55_cice_conserve.nc dst
+../INPUT/rmp_jra55_cice_1st_conserve.nc dst
 CONSERV LR SCALAR LATLON 10 FRACNNEI FIRST
 ##########
 # Field 04 : snowfall
@@ -127,7 +127,7 @@ P  0  P  0
 #
 LOCTRANS MAPPING SCRIPR
 INSTANT
-../INPUT/rmp_jra55_cice_conserve.nc dst
+../INPUT/rmp_jra55_cice_1st_conserve.nc dst
 CONSERV LR SCALAR LATLON 10 FRACNNEI FIRST
 ##########
 # Field 05 : surface pressure
@@ -138,7 +138,7 @@ P  0  P  0
 #
 LOCTRANS MAPPING SCRIPR
 INSTANT
-../INPUT/rmp_jra55_cice_smooth.nc dst
+../INPUT/rmp_jra55_cice_patch.nc dst
 CONSERV LR SCALAR LATLON 10 FRACNNEI FIRST
 ##########
 # Field 06 : runoff. Runoff is passed on the destination grid.
@@ -158,7 +158,7 @@ P  0  P  0
 #
 LOCTRANS MAPPING SCRIPR
 INSTANT
-../INPUT/rmp_jra55_cice_smooth.nc dst
+../INPUT/rmp_jra55_cice_patch.nc dst
 CONSERV LR SCALAR LATLON 10 FRACNNEI FIRST
 ##########
 # Field 08 : 2m air humidity
@@ -169,7 +169,7 @@ P  0  P  0
 #
 LOCTRANS MAPPING SCRIPR
 INSTANT
-../INPUT/rmp_jra55_cice_smooth.nc dst
+../INPUT/rmp_jra55_cice_patch.nc dst
 CONSERV LR SCALAR LATLON 10 FRACNNEI FIRST
 ##########
 # Field 09 : 10m wind (u)
@@ -179,7 +179,7 @@ jrat cict LAG=0 SEQ=+1
 P  0  P  0
 #
 MAPPING SCRIPR
-../INPUT/rmp_jra55_cice_smooth.nc dst
+../INPUT/rmp_jra55_cice_patch.nc dst
 DISTWGT LR VECTOR LATLON 10 4 vwnd_ai
 ##########
 # Field 10 : 10m wind (v)
@@ -189,7 +189,7 @@ jrat cict LAG=0 SEQ=+1
 P  0  P  0
 #
 MAPPING SCRIPR
-../INPUT/rmp_jra55_cice_smooth.nc dst
+../INPUT/rmp_jra55_cice_patch.nc dst
 DISTWGT LR VECTOR LATLON 10 4 uwnd_ai
 #############################################################################
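
For reference, the diff above can be reproduced in the cloned repository with something like the following, using the commit hash from the earlier example:

git diff a29b0e4 ryf9091_gadi -- namcouple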

I’m doing a run now from restart250, but I would be very surprised if it is bitwise identical to @rmholmes’ control experiment from this point, given the modified remapping weights.

This means a perturbation experiment forked at this point using the later configuration is not a clean comparison, as there will likely be differences due to the changed configuration.

Thanks @aidan. My advice to @sschroeter was to run an extra section of the control as well as the perturbation run, and not to rely on the experiments being a clean comparison. Thanks for your help. I noticed @ben has taken over the topic, so he must have been giving advice too.

If you’re going to do that I’d recommend using the ACCESS-NRI released model versions if possible. That way we can provide some support.

I say “if possible” because I tried restarting the equivalent ACCESS-NRI configuration (release-025deg_jra55_ryf) and got a bunch of floating point exceptions, probably because the restarts aren’t compatible with this version of the model.

==== backtrace (tid:3205509) ====
 0 0x0000000000012d20 __funlockfile()  :0
 1 0x000000000062b3c8 ocean_thickness_mod_mp_thickness_restart_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-mom5-git.2023.11.09=2023.11.09-ewcdbrfukblyjxpkhd3mfkj4yxfolal4/spack-src/src/mom5/ocean_core/ocean_thickness.F90:2257
 2 0x000000000067c415 ocean_thickness_mod_mp_ocean_thickness_init_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-mom5-git.2023.11.09=2023.11.09-ewcdbrfukblyjxpkhd3mfkj4yxfolal4/spack-src/src/mom5/ocean_core/ocean_thickness.F90:633
 3 0x0000000000463ebf ocean_model_mod_mp_ocean_model_init_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-mom5-git.2023.11.09=2023.11.09-ewcdbrfukblyjxpkhd3mfkj4yxfolal4/spack-src/src/mom5/ocean_core/ocean_model.F90:1269
 4 0x0000000000413f83 MAIN__()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-mom5-git.2023.11.09=2023.11.09-ewcdbrfukblyjxpkhd3mfkj4yxfolal4/spack-src/src/accessom_coupler/ocean_solo.F90:360
 5 0x00000000004102a2 main()  ???:0
 6 0x000000000003a7e5 __libc_start_main()  ???:0
 7 0x00000000004101ae _start()  ???:0
=================================
forrtl: error (75): floating point exception
Image              PC                Routine            Line        Source             
fms_ACCESS-OM.x    0000000001D1EF74  Unknown               Unknown  Unknown
libpthread-2.28.s  000014A04C029D20  Unknown               Unknown  Unknown
fms_ACCESS-OM.x    000000000062B3C8  ocean_thickness_m        2257  ocean_thickness.F90
fms_ACCESS-OM.x    000000000067C415  ocean_thickness_m         633  ocean_thickness.F90
fms_ACCESS-OM.x    0000000000463EBF  ocean_model_mod_m        1269  ocean_model.F90
fms_ACCESS-OM.x    0000000000413F83  MAIN__                    360  ocean_solo.F90
fms_ACCESS-OM.x    00000000004102A2  Unknown               Unknown  Unknown
libc-2.28.so       000014A04BC7B7E5  __libc_start_main     Unknown  Unknown
fms_ACCESS-OM.x    00000000004101AE  Unknown               Unknown  Unknown
==== backtrace (tid: 158677) ====

It should be possible to just use the ocean state (temp/salt) as initial conditions and start from rest, but it would take a bit more spinning up and I haven’t investigated this.

Hi everyone,

I am not sure exactly what is going on here. I don’t remember changes to the remapping weights along the way, but I know there were lots of changes due to renaming file paths because of inputs moving around. If I diff rmholmes/025deg_jra55_ryf at ryf9091_gadi against a random much earlier commit (e.g. git diff a810f4aa7c), then I do indeed see an md5 hash change that would be worth further investigation:

However, I’d advise against just throwing this run out as your control run because of the above without further investigation. This simulation has done 650 years of spin-up. That’s 4.6 MSU and about 2 months’ worth of run time, so it would be best to take advantage of it. It might not be quite bitwise reproducible, but I’d be surprised if it’s not close enough for practical purposes.
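
(For scale, that is roughly 4.6 MSU / 650 model years ≈ 7,000 SU per model year of spin-up, so repeating it from scratch would be a substantial investment.)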

I would always advise continuing the control run in parallel as @sofarrell has suggested, so at least the two runs are consistent.

At first glance it looks like the remapping weights change was made and discussed through this issue, moving from second order to first order:

-work/INPUT/rmp_jra55_cice_conserve.nc:
-fullpath:/g/data4/ik11/inputs/access-om2/input_rc/common_025deg_jra55/rmp_jra55_cice_conserve.nc
+work/INPUT/rmp_jra55_cice_1st_conserve.nc:
+fullpath:/g/data/ik11/inputs/access-om2/input_20200530/common_025deg_jra55/rmp_jra55_cice_1st_conserve.nc

Not sure what’s going on with the original version (red) - input_rc was a release candidate, not intended for production use, and has now been removed (or perhaps it was renamed to /g/data4/ik11/inputs/access-om2/input_rc-DELETE, but that doesn’t contain common_025deg_jra55/rmp_jra55_cice_conserve.nc).

Actually, according to this, rmp_jra55_cice_conserve.nc is 2nd order so should be compared to rmp_jra55_cice_2nd_conserve.nc, and from that screenshot it looks like those md5 hashes match (2f76b8…).

But perhaps the newer configurations differ in using 1st order conservative for more of the fields, as discussed here.
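
If it helps, that md5 comparison could be checked locally with something like the command below (a sketch; it assumes the 2nd-order weights file sits alongside the 1st-order one listed in the manifest above):

md5sum /g/data/ik11/inputs/access-om2/input_20200530/common_025deg_jra55/rmp_jra55_cice_2nd_conserve.nc

and comparing the result against the md5 recorded for rmp_jra55_cice_conserve.nc in the old manifest.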

It should be possible to use the spin-up by taking the temp/salt restart (and age?) and starting from rest with the ACCESS-NRI released model versions.

It may well be easier than trying to do all the forensics on a very old run.
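
A rough sketch of what extracting the ocean state might look like, assuming the standard MOM5 restart layout and NCO’s ncks (the exact file and variable names would need checking against the actual restart):

ncks -v temp,salt /g/data/ik11/outputs/access-om2-025/025deg_jra55_ryf9091_gadi/restart250/ocean/ocean_temp_salt.res.nc temp_salt_ic.nc

The age tracer, if wanted, could be pulled out of its restart file in the same way.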

That could be worth trying. But I don’t know what other differences there are between your released versions and the old version. They could be more extensive (physics wise) than a remapping weights file difference?
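
To get a feel for how extensive those differences are, one option would be to compare the two configurations directly with git (a sketch from the cloned experiment directory, using the branch names from earlier in the thread):

cd my_new_example
git remote add access-nri https://github.com/ACCESS-NRI/access-om2-configs
git fetch access-nri
git diff --stat ryf9091_gadi access-nri/release-025deg_jra55_ryf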

This is most likely what we’d like to try for our experiments - with a control alongside perturbed runs (as @sofarrell and @rmholmes said). ACCESS-NRI assistance to do so would be invaluable.

Renaming topic to reflect issue context.

OP has indicated this issue can be closed. They will raise a new ticket in future if needed.
