ACCESS-OM2 Restart Reproducibility: Bitwise Reproducibility Testing

Intro

Restart reproducibility means that a model will produce the same answers regardless of when the model is stopped and restarted (in model time).

The classic test for restart reproducibility is that two otherwise identically configured experiments that start from the same initial conditions and run for 2 days produce identical (bitwise reproducible) outputs, even though one experiment is run as a single 2-day segment and the other as two 1-day segments.

Note that a model can still be bitwise reproducible without restart reproducibility, but it requires that exactly the same run protocol is observed, with exactly the same run segment timing and length.
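
As a rough illustration of the 2-day versus 2 x 1-day test, here is a minimal toy sketch. The update rule, the restart formats and all function names are hypothetical and chosen only for illustration; it makes no claim about why MOM5 itself does or does not reproduce across restarts.

```python
import hashlib
import struct


def step(state: float) -> float:
    """Advance the toy model by one 'day' (hypothetical update rule)."""
    return state * 1.0000001 + 0.1


def run(state: float, days: int) -> float:
    """Run the toy model for a number of days without stopping."""
    for _ in range(days):
        state = step(state)
    return state


def checksum(state: float) -> str:
    """Checksum the exact bit pattern of the state, not a rounded value."""
    return hashlib.md5(struct.pack("<d", state)).hexdigest()


def save_restart_exact(state: float) -> bytes:
    """Write the full double-precision state to the 'restart file'."""
    return struct.pack("<d", state)


def save_restart_lossy(state: float) -> bytes:
    """Write a single-precision copy, losing bits of the state."""
    return struct.pack("<f", state)


def load_restart(blob: bytes) -> float:
    """Read a state back, inferring the precision from the blob size."""
    fmt = "<d" if len(blob) == 8 else "<f"
    return struct.unpack(fmt, blob)[0]


initial = 1.0 / 3.0

# One 2-day segment.
single_segment = run(initial, 2)

# Two 1-day segments, stopping and restarting in between.
two_segments_exact = run(load_restart(save_restart_exact(run(initial, 1))), 1)
two_segments_lossy = run(load_restart(save_restart_lossy(run(initial, 1))), 1)

restart_reproducible = checksum(single_segment) == checksum(two_segments_exact)
lossy_reproducible = checksum(single_segment) == checksum(two_segments_lossy)
```

In this toy, the exact restart passes the test and the lossy one does not, even though both segmented runs would look the same to a few decimal places. That is why the comparison is made on checksums of the full state.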

When restart reproducibility testing was turned back on in CI, the historical checksums no longer seemed to match.

Quick summary

Making +restart_repro default to True breaks bitwise reproducibility with previous experiments.

Making this change will mean future experiments cannot reproduce past experiments.

Investigation

The change is just in the mom5 build, as the other models have bit repro flags on by default.

The change was introduced here:

When @jo-basevi put the restart repro tests back into the CI repro testing, the historical repro tests failed while the restart repro tests passed (as expected).

Detailed below is the attempt to verify these results, and confirm that it was not an artefact of changes to the spack build that caused this.

If a COSIMA build of mom5 with the same restart repro options produced the same outcome (non-reproducible with “historical” runs), and that build was bitwise reproducible with the spack build, then we can confirm that the change is real and a result of adding restart reproducibility to MOM5.
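
The two-part criterion above can be written down as a small check. This is only a sketch: the function name and the checksum values are made up for illustration, and the real comparison is done by the test suite, not by this code.

```python
def change_is_real(historical: dict, spack_repro: dict, cosima_repro: dict) -> bool:
    """Apply the two-part criterion to three sets of checksums.

    historical   -- checksums stored from past experiments
    spack_repro  -- checksums from the spack build with restart repro on
    cosima_repro -- checksums from the COSIMA build with the same options
    """
    # Part 1: the COSIMA build no longer reproduces the historical runs.
    differs_from_history = cosima_repro != historical
    # Part 2: the COSIMA build is bitwise reproducible with the spack build.
    builds_agree = cosima_repro == spack_repro
    return differs_from_history and builds_agree


# Made-up checksum values, for illustration only.
historical = {"temp": ["111"], "salt": ["222"]}
spack_repro = {"temp": ["333"], "salt": ["222"]}
cosima_repro = {"temp": ["333"], "salt": ["222"]}

confirmed = change_is_real(historical, spack_repro, cosima_repro)

# A COSIMA build that disagreed with the spack build would leave the
# question open, since the spack build itself could then be at fault.
inconclusive = change_is_real(
    historical, spack_repro, {"temp": ["444"], "salt": ["222"]}
)
```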

Testing setup

So first I set up a test directory and cloned a config and the test suite:

$ pwd
/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf

$ git clone https://github.com/ACCESS-NRI/access-om2-configs/ test-code
Cloning into 'test-code'...
remote: Enumerating objects: 5559, done.
remote: Counting objects: 100% (3416/3416), done.
remote: Compressing objects: 100% (1313/1313), done.
remote: Total 5559 (delta 2396), reused 3074 (delta 2072), pack-reused 2143
Receiving objects: 100% (5559/5559), 1.45 MiB | 7.58 MiB/s, done.
Resolving deltas: 100% (3824/3824), done.

$ payu clone -B release-1deg_jra55_ryf https://github.com/ACCESS-NRI/access-om2-configs.git 1deg_jra55_ryf

Cloned repository from https://github.com/ACCESS-NRI/access-om2-configs.git to directory: /g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf
Checked out branch: release-1deg_jra55_ryf
laboratory path:  /scratch/tm70/aph502/access-om2
binary path:  /scratch/tm70/aph502/access-om2/bin
input path:  /scratch/tm70/aph502/access-om2/input
work path:  /scratch/tm70/aph502/access-om2/work
archive path:  /scratch/tm70/aph502/access-om2/archive
Updated metadata. Experiment UUID: 8bf3f9b0-9246-4fa8-9cac-5b22f5ced4b5
Added archive symlink to /scratch/tm70/aph502/access-om2/archive/1deg_jra55_ryf-release-1deg_jra55_ryf-8bf3f9b0
To change directory to control directory run:
  cd 1deg_jra55_ryf

Then ran the current tests to confirm that historical bit repro was still passing.

First attempt to run test failed

Then ran this, which specifies a non-default path for the temporary testing files:

pytest -s --output-path ../test-model-repro ../test-code/test -m checksum

Got errors about not being able to access the executable:

--------------------------------------------------------------------------

mpirun was unable to launch the specified application as it could not access or execute an executable:


Executable: ../test-model-repro/lab/work/1deg_jra55_ryf-test_bit_repro_historical/atmosphere/yatm.exe

Node: gadi-cpu-clx-2151

while attempting to start process rank 0.
--------------------------------------------------------------------------

Path to executable seems fine

$ ls -l ../test-model-repro/lab/work/1deg_jra55_ryf-test_bit_repro_historical/atmosphere/yatm.exe

lrwxrwxrwx 1 aph502 tm70 158 Mar 23 10:50 ../test-model-repro/lab/work/1deg_jra55_ryf-test_bit_repro_historical/atmosphere/yatm.exe -> '/g/data/vk83/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/libaccessom2-git.2023.10.26=2023.10.26-ieiy3e7hidn4dzaqly3ly2yu45mecgq4/bin/yatm.exe'

Added storage flags in case that was an issue

$ git diff config.yaml
diff --git a/config.yaml b/config.yaml
index c851966..c44860f 100644
--- a/config.yaml
+++ b/config.yaml
@@ -92,6 +92,12 @@ env:
 platform:
     nodesize: 48
  
+storage:
+    gdata:
+       - tm70
+       - vk83
+
+

# sweep and resubmit on specific errors - see https://github.com/payu-org/payu/issues/241#issuecomment-610739771                 
 userscripts:
     error: tools/resub.sh

Run tests in default location

Next step was to try using the default output location, which worked:

$ pytest -s ../test-code/test -m checksum                                                    
======================================================= test session starts =======================================================
platform linux -- Python 3.9.18, pytest-8.0.1, pluggy-1.4.0
rootdir: /g/data/tm70/aph502/access-om2-release/bitrepro
collected 4 items / 3 deselected / 1 selected                                                                                     

../test-code/test/test_bit_reproducibility.py ['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_bit_repro_historical/1deg_jra55_ryf.o111578873']
.

=========================================== 1 passed, 3 deselected in 210.03s (0:03:30) ===========================================

Run updated tests to replicate error

First step: unchanged model build

Checked out Jo’s restart PR

$ git checkout origin/11-Add-restart-reproducibility-tests -b 11-Add-restart-reproducibility-tests

moved the test output dir to a backup location

$ mv test-model-repro test-model-repro-bkup2

and re-ran. Confirmed we get one fail (restart repro) and one pass (historical repro):

$ pytest -s ../test-code/test -m checksum                                                     
======================================================= test session starts =======================================================
platform linux -- Python 3.9.18, pytest-8.0.1, pluggy-1.4.0                                                                        
rootdir: /g/data/tm70/aph502/access-om2-release/bitrepro  
collected 4 items / 2 deselected / 2 selected                                                                                     
                                                          
../test-code/test/test_bit_reproducibility.py ['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_bit_repro_historical/1deg_jra55_ryf.o111580543']
.['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111580661']
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111580711']               
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2day/1deg_jra55_ryf.o111580819']                 
Unequal checksum: Zonal velocity: -747450584393924602
Unequal checksum: Meridional velocity: 5786533594912504816                                                                        
Unequal checksum: Advection of u: -2427551909895310013          
Unequal checksum: Advection of v: 8414702469715514620                                                                             
Unequal checksum: rho(taup1): -7825281413282575106
Unequal checksum: pressure_at_depth: 1545293312163002545
Unequal checksum: denominator_r: 1030217802578759450                                                                              
Unequal checksum: drhodT: 7040067001143210402
Unequal checksum: drhodS: -7443012524785806309     
Unequal checksum: drhodz_zt: 4102980311895092158
Unequal checksum: temp: 5576915484520666247
Unequal checksum: salt: -5334557539792698373             
Unequal checksum: age_global: 1282193799700320580          
Unequal checksum: pot_temp: -3952700955953607101
Unequal checksum: frazil: -692106203165952893
Unequal checksum: ending agm_array: -4817981991690884677 
Unequal checksum: ending rossby_radius: -313567614221891037
Unequal checksum: ending rossby_radius_raw: 8136091513197178599
Unequal checksum: ending bih_viscosity: -3604182982324579526
Unequal checksum: ending lap_viscosity: 6584646363536773466
Unequal checksum: thickness_sigma: -8986798025844122441
Unequal checksum: eta_t: 3592422156285373700
Unequal checksum: eta_u: 8622206305623958648           
Unequal checksum: deta_dt: -4241427670142365519        
Unequal checksum: eta_t_bar: -7394049172613969207
Unequal checksum: pbot_t: -5235402108008676132          
Unequal checksum: pbot_u: 8416463192834025535
Unequal checksum: anompb: 5973341933699066300
Unequal checksum: ps: 8772801155042670188
Unequal checksum: grad_ps_1: -5830096513531237293
Unequal checksum: grad_ps_2: -8648781909002437020                                                                                 
Unequal checksum: udrho: -2926523761059235683          
Unequal checksum: vdrho: -2962865050298577393                                                                                     
Unequal checksum: conv_rho_ud_t: 491504259505012306
Unequal checksum: source: -24801538497346018       
Unequal checksum: eta smoother: -24801538497346018
Unequal checksum: eta_nonbouss: 2288190392690979502          
Unequal checksum: forcing_u_bt: 1578950095382092263
Unequal checksum: forcing_v_bt: -5668783177147836576             
Unequal checksum: Thickness%rho_dzt(taup1): 6728181715574374999
Unequal checksum: Thickness%rho_dzu(taup1): 6253132191740020748
Unequal checksum: Thickness%mass_u(taup1): -4338831531743958235
Unequal checksum: Thickness%rho_dzten(1): 21437755887756755
Unequal checksum: Thickness%rho_dzten(2): -8637438697470860740
Unequal checksum: Thickness%rho_dztr: 7534179586392920215
Unequal checksum: Thickness%rho_dzur: 8496882432575222725
Unequal checksum: Thickness%rho_dzt_tendency: 4303812782867412364                                                                 
Unequal checksum: Thickness%dzt: 284301941868280081      
Unequal checksum: Thickness%dzten(1): -6271744555322450474                                                                        
Unequal checksum: Thickness%dzten(2): 3494164494412952332
Unequal checksum: Thickness%dztlo: -1575651486725050759                                                                           
Unequal checksum: Thickness%dztup: -1575717921746418697
Unequal checksum: Thickness%dzt_dst: -2181590507461284892
Unequal checksum: Thickness%dzwt(k=0): -3934741698621914372
Unequal checksum: Thickness%dzwt(k=1:nk): -306092990192632628
Unequal checksum: Thickness%dzu: 8895240687531472467
Unequal checksum: Thickness%dzwu(k=0): 6327813631221876380       
Unequal checksum: Thickness%dzwu(k=1:nk): 3309385497261750929
Unequal checksum: Thickness%depth_zt: -3045642215283562736
Unequal checksum: Thickness%geodepth_zt: 8337040943737743546
Unequal checksum: Thickness%depth_zu: -5563661097699173755
Unequal checksum: Thickness%depth_zwt: -762755888625758492
Unequal checksum: Thickness%depth_zwu: -5176151996342750639
Unequal checksum: Thickness%mass_en(1): 759811470748328892
Unequal checksum: Thickness%mass_en(2): 1245147370288381158
F

============================================================ FAILURES =============================================================
____________________________________________ TestBitReproducibility.test_restart_repro ____________________________________________

self = <test_bit_reproducibility.TestBitReproducibility object at 0x14ca4da4f1c0>
output_path = PosixPath('/scratch/tm70/aph502/test-model-repro')
control_path = PosixPath('/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf')

    @pytest.mark.checksum
    def test_restart_repro(self, output_path: Path, control_path: Path):
        """
        Test that a run reproduces across restarts.
        """
        # First do two short (1 day) runs.
        exp_2x1day = setup_exp(control_path, output_path,
                               'test_restart_repro_2x1day')

        # Reconfigure to a 1 day run.
        exp_2x1day.model.set_model_runtime(seconds=86400)

        # Now run twice.
        exp_2x1day.setup_and_run()
        exp_2x1day.force_qsub_run()

        # Now do a single 2 day run
        exp_2day = setup_exp(control_path, output_path,
                             'test_restart_repro_2day')
        # Reconfigure
        exp_2day.model.set_model_runtime(seconds=172800)

        # Run once.
        exp_2day.setup_and_run()

        # Now compare the output between our two short and one long run.
        checksums_1d_0 = exp_2x1day.extract_checksums()
        checksums_1d_1 = exp_2x1day.extract_checksums(exp_2x1day.output001)

        checksums_2d = exp_2day.extract_checksums()

        # Use model specific comparision method for checksums
        model = exp_2day.model
        matching_checksums = model.check_checksums_over_restarts(
            long_run_checksum=checksums_2d,
            short_run_checksum_0=checksums_1d_0,
            short_run_checksum_1=checksums_1d_1
        )

        if not matching_checksums:
            # Write checksums out to file
            with open(output_path / 'restart-1d-0-checksum.json', 'w') as file:
                json.dump(checksums_1d_0, file, indent=2)
            with open(output_path / 'restart-1d-1-checksum.json', 'w') as file:
                json.dump(checksums_1d_1, file, indent=2)
            with open(output_path / 'restart-2d-0-checksum.json', 'w') as file:
                json.dump(checksums_2d, file, indent=2)

>       assert matching_checksums
E       assert False

../test-code/test/test_bit_reproducibility.py:125: AssertionError
===================================================== short test summary info =====================================================
FAILED ../test-code/test/test_bit_reproducibility.py::TestBitReproducibility::test_restart_repro - assert False
====================================== 1 failed, 1 passed, 2 deselected in 646.52s (0:10:46) ======================================

Use +restart_repro build

Updated exe to point to pre-release environment in /g/data/vk83/prerelease/apps/spack/0.20/spack/var/spack/environments/access-om2-2024_03_0-4

from this PR that updates to the newer restart repro

$ grep mom5 /g/data/vk83/prerelease/apps/spack/0.20/spack/var/spack/environments/access-om2-2024_03_0-4/spack.location
mom5@git.2023.11.09=2023.11.09          /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-sg2jw6gpammwdvme5npli7oas7uicj5x

Showing the updated exe paths:

$ git diff HEAD^
diff --git a/config.yaml b/config.yaml
index c851966..5d44f70 100644
--- a/config.yaml
+++ b/config.yaml
@@ -41,7 +41,7 @@ submodels:              
                                              
     - name: ocean                        
       model: mom                                                                                                                 
-      exe: /g/data/vk83/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-ewcdbrfukblyjxpkhd3mfkj4yxfolal4/bin/fms_ACCESS-OM.x                                                                                            
+      exe: /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-sg2jw6gpammwdvme5npli7oas7uicj5x/bin/fms_ACCESS-OM.x

Running again, the historical test fails but the restart repro test passes, reproducing Jo’s finding.

$ pytest -s ../test-code/test -m checksum
======================================================= test session starts =======================================================
platform linux -- Python 3.9.18, pytest-8.0.1, pluggy-1.4.0                                                                        
rootdir: /g/data/tm70/aph502/access-om2-release/bitrepro                                                                           
collected 4 items / 2 deselected / 2 selected                                                                                     
                                 
../test-code/test/test_bit_reproducibility.py ['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_bit_repro_historical/1deg_jra55_ryf.o111582364']
F['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111582399']              
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111582526']               
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2day/1deg_jra55_ryf.o111582819']
.                                                                                                                                  
                                      
============================================================ FAILURES =============================================================
________________________________________ TestBitReproducibility.test_bit_repro_historical _________________________________________

self = <test_bit_reproducibility.TestBitReproducibility object at 0x151923fb6c70>
output_path = PosixPath('/scratch/tm70/aph502/test-model-repro')
control_path = PosixPath('/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf')
checksum_path = PosixPath('/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf/testing/checksum/historical-3hr-checksum.json')

    @pytest.mark.checksum
    def test_bit_repro_historical(self, output_path: Path, control_path: Path,
                                  checksum_path: Path):
        """
        Test that a run reproduces historical checksums
        """
        # Setup checksum output directory
        # NOTE: The checksum output file is used as part of `repro-ci` workflow
        output_dir = output_path / 'checksum'
        output_dir.mkdir(parents=True, exist_ok=True)
        checksum_output_file =  output_dir / 'historical-3hr-checksum.json'
        if checksum_output_file.exists():
            checksum_output_file.unlink()

        # Setup and run experiment
        exp = setup_exp(control_path, output_path, "test_bit_repro_historical")
        exp.model.set_model_runtime()
        exp.setup_and_run()

        assert exp.model.output_exists()

        #Check checksum against historical checksum file
        hist_checksums = None
        hist_checksums_schema_version = None

        if not checksum_path.exists():  # AKA, if the config branch doesn't have a checksum, or the path is misconfigured
            hist_checksums_schema_version = exp.model.default_schema_version
        else:  # we can use the historic-3hr-checksum that is in the testing directory
            with open(checksum_path, 'r') as file:
                hist_checksums = json.load(file)

                # Parse checksums using the same version
                hist_checksums_schema_version = hist_checksums["schema_version"]

        checksums = exp.extract_checksums(schema_version=hist_checksums_schema_version)

        # Write out checksums to output file
        with open(checksum_output_file, 'w') as file:
            json.dump(checksums, file, indent=2)

>       assert hist_checksums == checksums, f"Checksums were not equal. The new checksums have been written to {checksum_output_file}."
E       AssertionError: Checksums were not equal. The new checksums have been written to /scratch/tm70/aph502/test-model-repro/checksum/historical-3hr-checksum.json.
E       assert {'output': {'...ion': '1-0-0'} == {'output': {'...ion': '1-0-0'}
E         
E         Omitting 1 identical items, use -vv to show
E         Differing items:
E         {'output': {'Advection of u': ['0', '-5944066210705683418'], 'Advection of v': ['0', '-3606245701812142045'], 'Meridional velocity': ['9051849634365276068', '7718829051214123787'], 'Thickness%depth_st': ['-436572698594795605'], ...}} != {'output': {'Advection of u': ['0', '-5944066163830149791'], 'Advection of v': ['0', '-3606245664043050147'], 'Meridional velocity': ['9051849634365276068', '7718829052070798169'], 'Thickness%depth_st': ['-436572698594795605'], ...}}
E         Use -v to get more diff

../test-code/test/test_bit_reproducibility.py:51: AssertionError
===================================================== short test summary info =====================================================
FAILED ../test-code/test/test_bit_reproducibility.py::TestBitReproducibility::test_bit_repro_historical - AssertionError: Checksums were not equal. The new checksums have been written to /scratch/tm70/aph502/test-model-repro/checksu...
====================================== 1 failed, 1 passed, 2 deselected in 556.42s (0:09:16) ======================================
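
The pytest diff above is truncated, so it helps to list exactly which fields differ between the stored and the newly written checksum files. Below is a minimal sketch of such a helper. The function name is hypothetical, and it assumes the layout visible in the assertion output: a top-level "output" key mapping field names to lists of checksum strings. The sample values are copied from that assertion output.

```python
import json
from pathlib import Path


def differing_fields(expected: dict, actual: dict) -> dict:
    """Map each field whose checksums differ to an (expected, actual) pair.

    Fields present on only one side are reported with None on the other.
    """
    exp_out = expected.get("output", {})
    act_out = actual.get("output", {})
    diffs = {}
    for field in sorted(set(exp_out) | set(act_out)):
        if exp_out.get(field) != act_out.get(field):
            diffs[field] = (exp_out.get(field), act_out.get(field))
    return diffs


def load_checksums(path: Path) -> dict:
    """Read a checksum JSON file written by the test suite."""
    with open(path, "r") as handle:
        return json.load(handle)


# Two fields taken from the assertion output: one differs, one matches.
hist = {"output": {"Advection of u": ["0", "-5944066210705683418"],
                   "Thickness%depth_st": ["-436572698594795605"]}}
new = {"output": {"Advection of u": ["0", "-5944066163830149791"],
                  "Thickness%depth_st": ["-436572698594795605"]}}

diffs = differing_fields(hist, new)
for field, (old_value, new_value) in diffs.items():
    print(f"{field}: {old_value} -> {new_value}")
```

With real files, the two dictionaries would come from `load_checksums` applied to the historical checksum file in the control directory and to the file the failing test wrote under the output path.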

Naively tried the most recent pre-release build, as nci-openmpi hadn’t been correctly turned off in the previous build:

$ grep mom5 /g/data/vk83/prerelease/apps/spack/0.20/spack/var/spack/environments/access-om2-2024_03_0-5/spack.location 
mom5@git.2023.11.09=2023.11.09          /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-qji4nlmr6utrribaiyhewe4je6mifguz
Config diff
$ git diff
diff --git a/config.yaml b/config.yaml
index 5d44f70..d7d1ac3 100644
--- a/config.yaml
+++ b/config.yaml
@@ -41,7 +41,7 @@ submodels:
 
     - name: ocean
       model: mom
-      exe: /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-sg2jw6gpammwdvme5npli7oas7uicj5x/bin/fms_ACCESS-OM.x
+      exe: /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-qji4nlmr6utrribaiyhewe4je6mifguz/bin/fms_ACCESS-OM.x
       input:
           - /g/data/vk83/experiments/inputs/access-om2/ocean/grids/mosaic/global.1deg/2020.05.30/grid_spec.nc
           - /g/data/vk83/experiments/inputs/access-om2/ocean/grids/mosaic/global.1deg/2020.05.30/ocean_hgrid.nc
diff --git a/manifests/exe.yaml b/manifests/exe.yaml
index 3cf5dd2..f934389 100644
--- a/manifests/exe.yaml
+++ b/manifests/exe.yaml
@@ -12,7 +12,7 @@ work/ice/cice_auscom_360x300_24x1_24p.exe:
     binhash: 6bff005e04c23c579f37b7b2c0189793
     md5: 5e7c7ba864da95cd1329d098f1e47776
 work/ocean/fms_ACCESS-OM.x:
-  fullpath: /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-sg2jw6gpammwdvme5npli7oas7uicj5x/bin/fms_ACCESS-OM.x
+  fullpath: /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-qji4nlmr6utrribaiyhewe4je6mifguz/bin/fms_ACCESS-OM.x
   hashes:
-    binhash: 4f791838e696d241e1839f4a60405083
-    md5: c44d552cb9131f7ceeeaca975254eb46
+    binhash: d088e1384d7449e15b403154525cf894
+    md5: 960c43c8f2cbd0ca6fc4946034b07f3c
[aph502@gadi-login-02 1deg_jra55_ryf]$ git commit -a -m 'Update mom5 exe to access-om2-2024_03_0-5 pre-release'
[release-1deg_jra55_ryf 55cae2e] Update mom5 exe to access-om2-2024_03_0-5 pre-release
 2 files changed, 4 insertions(+), 4 deletions(-)
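
When swapping executables like this, it can be worth confirming that the file on disk is the one recorded in manifests/exe.yaml. The sketch below is one way to do that, assuming the md5 entry in the manifest is a plain MD5 of the file contents, which is not confirmed here. The function names are hypothetical, and the binhash field is not covered.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path: Path, manifest_md5: str) -> bool:
    """Compare a file's MD5 with the value recorded in a manifest."""
    return file_md5(path) == manifest_md5.strip().lower()
```

For example, `matches_manifest(Path("work/ocean/fms_ACCESS-OM.x"), "960c43c8f2cbd0ca6fc4946034b07f3c")` would check the new mom5 executable against the md5 shown in the diff above, provided that assumption about the manifest holds.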

Backed up output directory

$ mv test-model-repro test-model-repro-bkup4

Run again

$ pytest -s ../test-code/test -m checksum

With exactly the same result as above. Doh!

Confirm with COSIMA build

To rule out any issue with the spack build, built MOM5 with the COSIMA build script.

[Had a red herring with a “bad build”, but the details are not important, so deleted]

Current build options

Built a new COSIMA mom5 executable from scratch with a freshly cloned repo (/g/data/tm70/aph502/access-om2-release/bitrepro/access-om2-orig), backed up the previous test dir, replaced the exe path with this new build and ran again.

$ mv /scratch/tm70/aph502/test-model-repro /scratch/tm70/aph502/test-model-repro-bkup6
Config diff
$ git diff
diff --git a/config.yaml b/config.yaml
index 6df99cd..62ade77 100644
--- a/config.yaml
+++ b/config.yaml
@@ -41,7 +41,7 @@ submodels:
 
     - name: ocean
       model: mom
-      exe: /g/data/tm70/aph502/access-om2-release/bitrepro/access-om2/src/mom/exec/nci/ACCESS-OM/fms_ACCESS-OM.x
+      exe: /g/data/tm70/aph502/access-om2-release/bitrepro/access-om2-orig/src/mom/exec/nci/ACCESS-OM/fms_ACCESS-OM.x
       input:
           - /g/data/vk83/experiments/inputs/access-om2/ocean/grids/mosaic/global.1deg/2020.05.30/grid_spec.nc
           - /g/data/vk83/experiments/inputs/access-om2/ocean/grids/mosaic/global.1deg/2020.05.30/ocean_hgrid.nc
[aph502@gadi-login-02 1deg_jra55_ryf]$ git commit -a -m 'Original COSIMA build exe, no --repro flag'
[release-1deg_jra55_ryf 2b6cbf5] Original COSIMA build exe, no --repro flag
 1 file changed, 1 insertion(+), 1 deletion(-)

We expect the historical test to pass and the restart repro test to fail.

$ pytest -s ../test-code/test -m checksum
======================================================= test session starts =======================================================
platform linux -- Python 3.9.18, pytest-8.0.1, pluggy-1.4.0
rootdir: /g/data/tm70/aph502/access-om2-release/bitrepro
collected 4 items / 2 deselected / 2 selected                                                                                          
../test-code/test/test_bit_reproducibility.py ['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_bit_repro_historical/1deg_jra55_ryf.o111593502']
.['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111593610']             
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111593710']
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2day/1deg_jra55_ryf.o111593897']
Unequal checksum: Zonal velocity: -747450584393924602
Unequal checksum: Meridional velocity: 5786533594912504816   
Unequal checksum: Advection of u: -2427551909895310013
Unequal checksum: Advection of v: 8414702469715514620            
Unequal checksum: rho(taup1): -7825281413282575106             
Unequal checksum: pressure_at_depth: 1545293312163002545       
Unequal checksum: denominator_r: 1030217802578759450           
Unequal checksum: drhodT: 7040067001143210402              
Unequal checksum: drhodS: -7443012524785806309                
Unequal checksum: drhodz_zt: 4102980311895092158         
Unequal checksum: temp: 5576915484520666247              
Unequal checksum: salt: -5334557539792698373 
Unequal checksum: age_global: 1282193799700320580        
Unequal checksum: pot_temp: -3952700955953607101
Unequal checksum: frazil: -692106203165952893            
Unequal checksum: ending agm_array: -4817981991690884677
Unequal checksum: ending rossby_radius: -313567614221891037
Unequal checksum: ending rossby_radius_raw: 8136091513197178599
Unequal checksum: ending bih_viscosity: -3604182982324579526
Unequal checksum: ending lap_viscosity: 6584646363536773466  
Unequal checksum: thickness_sigma: -8986798025844122441
Unequal checksum: eta_t: 3592422156285373700                     
Unequal checksum: eta_u: 8622206305623958648  
Unequal checksum: deta_dt: -4241427670142365519
Unequal checksum: eta_t_bar: -7394049172613969207                                
Unequal checksum: pbot_t: -5235402108008676132            
Unequal checksum: pbot_u: 8416463192834025535
Unequal checksum: anompb: 5973341933699066300
Unequal checksum: ps: 8772801155042670188
Unequal checksum: grad_ps_1: -5830096513531237293
Unequal checksum: grad_ps_2: -8648781909002437020
Unequal checksum: udrho: -2926523761059235683
Unequal checksum: vdrho: -2962865050298577393
Unequal checksum: conv_rho_ud_t: 491504259505012306
Unequal checksum: source: -24801538497346018
Unequal checksum: eta smoother: -24801538497346018
Unequal checksum: eta_nonbouss: 2288190392690979502
Unequal checksum: forcing_u_bt: 1578950095382092263
Unequal checksum: forcing_v_bt: -5668783177147836576
Unequal checksum: Thickness%rho_dzt(taup1): 6728181715574374999
Unequal checksum: Thickness%rho_dzu(taup1): 6253132191740020748
Unequal checksum: Thickness%mass_u(taup1): -4338831531743958235
Unequal checksum: Thickness%rho_dzten(1): 21437755887756755
Unequal checksum: Thickness%rho_dzten(2): -8637438697470860740
Unequal checksum: Thickness%rho_dztr: 7534179586392920215
Unequal checksum: Thickness%rho_dzur: 8496882432575222725
Unequal checksum: Thickness%rho_dzt_tendency: 4303812782867412364
Unequal checksum: Thickness%dzt: 284301941868280081
Unequal checksum: Thickness%dzten(1): -6271744555322450474
Unequal checksum: Thickness%dzten(2): 3494164494412952332
Unequal checksum: Thickness%dztlo: -1575651486725050759
Unequal checksum: Thickness%dztup: -1575717921746418697
Unequal checksum: Thickness%dzt_dst: -2181590507461284892
Unequal checksum: Thickness%dzwt(k=0): -3934741698621914372
Unequal checksum: Thickness%dzwt(k=1:nk): -306092990192632628
Unequal checksum: Thickness%dzu: 8895240687531472467
Unequal checksum: Thickness%dzwu(k=0): 6327813631221876380
Unequal checksum: Thickness%dzwu(k=1:nk): 3309385497261750929
Unequal checksum: Thickness%depth_zt: -3045642215283562736
Unequal checksum: Thickness%geodepth_zt: 8337040943737743546
Unequal checksum: Thickness%depth_zu: -5563661097699173755
Unequal checksum: Thickness%depth_zwt: -762755888625758492
Unequal checksum: Thickness%depth_zwu: -5176151996342750639
Unequal checksum: Thickness%mass_en(1): 759811470748328892
Unequal checksum: Thickness%mass_en(2): 1245147370288381158
F

============================================================ FAILURES =============================================================
____________________________________________ TestBitReproducibility.test_restart_repro ____________________________________________

self = <test_bit_reproducibility.TestBitReproducibility object at 0x14ff010ed280>
output_path = PosixPath('/scratch/tm70/aph502/test-model-repro')
control_path = PosixPath('/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf')

    @pytest.mark.checksum
    def test_restart_repro(self, output_path: Path, control_path: Path):
        """
        Test that a run reproduces across restarts.
        """
        # First do two short (1 day) runs.
        exp_2x1day = setup_exp(control_path, output_path,
                               'test_restart_repro_2x1day')

        # Reconfigure to a 1 day run.
        exp_2x1day.model.set_model_runtime(seconds=86400)

        # Now run twice.
        exp_2x1day.setup_and_run()
        exp_2x1day.force_qsub_run()

        # Now do a single 2 day run
        exp_2day = setup_exp(control_path, output_path,
                             'test_restart_repro_2day')
        # Reconfigure
        exp_2day.model.set_model_runtime(seconds=172800)

        # Run once.
        exp_2day.setup_and_run()

        # Now compare the output between our two short and one long run.                                                          
        checksums_1d_0 = exp_2x1day.extract_checksums()
        checksums_1d_1 = exp_2x1day.extract_checksums(exp_2x1day.output001)                                                       

        checksums_2d = exp_2day.extract_checksums()

        # Use model specific comparision method for checksums
        model = exp_2day.model
        matching_checksums = model.check_checksums_over_restarts(                                                                 
            long_run_checksum=checksums_2d,
            short_run_checksum_0=checksums_1d_0,
            short_run_checksum_1=checksums_1d_1
        )

        if not matching_checksums:
            # Write checksums out to file
            with open(output_path / 'restart-1d-0-checksum.json', 'w') as file:                                                   
                json.dump(checksums_1d_0, file, indent=2)
            with open(output_path / 'restart-1d-1-checksum.json', 'w') as file:                                                   
                json.dump(checksums_1d_1, file, indent=2)
            with open(output_path / 'restart-2d-0-checksum.json', 'w') as file:                                                   
                json.dump(checksums_2d, file, indent=2)

>       assert matching_checksums
E       assert False

../test-code/test/test_bit_reproducibility.py:125: AssertionError
===================================================== short test summary info =====================================================
FAILED ../test-code/test/test_bit_reproducibility.py::TestBitReproducibility::test_restart_repro - assert False                   
====================================== 1 failed, 1 passed, 2 deselected in 700.52s (0:11:40) ======================================

Which is what we see. Hooray! Without --repro, the COSIMA and spack builds are bitwise reproducible with each other: both reproduce the same historical checksums.
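
For reference, the restart test that fails above compares the checksums from the single 2 day run against those from the two 1 day runs. Below is a minimal sketch of that kind of comparison. It is an assumption about the logic, not the test suite's actual check_checksums_over_restarts implementation. The function name checksums_match_over_restarts and the toy data are made up for illustration, and the checksum layout (an 'output' mapping from field name to a list of checksum strings) is taken from the pytest assertion output shown in this post.

```python
def checksums_match_over_restarts(long_run: dict,
                                  short_run_0: dict,
                                  short_run_1: dict) -> bool:
    """Return True if every long-run checksum also appears in a short run.

    Hypothetical helper: a sketch of the idea, not the real test suite code.
    """
    # Pool the per-field checksums from both short segments.
    pooled = {}
    for run in (short_run_0, short_run_1):
        for field, values in run["output"].items():
            pooled.setdefault(field, set()).update(values)

    matching = True
    for field, values in long_run["output"].items():
        for value in values:
            if value not in pooled.get(field, set()):
                # Same style of message as the "Unequal checksum" lines above.
                print(f"Unequal checksum: {field}: {value}")
                matching = False
    return matching


# Toy data only, not real model output.
day_1 = {"output": {"Meridional velocity": ["11", "22"]}}
day_2 = {"output": {"Meridional velocity": ["22", "33"]}}
two_day_ok = {"output": {"Meridional velocity": ["11", "33"]}}
two_day_bad = {"output": {"Meridional velocity": ["11", "34"]}}

print(checksums_match_over_restarts(two_day_ok, day_1, day_2))
print(checksums_match_over_restarts(two_day_bad, day_1, day_2))
```

In this sketch a single checksum that appears in the long run but in neither short run is enough to fail the comparison, which is the behaviour you want from a bitwise test.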

Add --repro option

OK, so back up the output directory again:

$ mv /scratch/tm70/aph502/test-model-repro /scratch/tm70/aph502/test-model-repro-bkup7 

Copy in COSIMA build executable with --repro (built from exactly the same directory as the previous one, just added the --repro option in install.sh):

/g/data/tm70/aph502/access-om2-release/bitrepro/access-om2-orig/src/mom/exec/nci/ACCESS-OM/fms_ACCESS-OM.x

and run again!


$ pytest -s ../test-code/test -m checksum
======================================================= test session starts =======================================================
platform linux -- Python 3.9.18, pytest-8.0.1, pluggy-1.4.0
rootdir: /g/data/tm70/aph502/access-om2-release/bitrepro
collected 4 items / 2 deselected / 2 selected
../test-code/test/test_bit_reproducibility.py ['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_bit_repro_historical/1deg_jra55_ryf.o111596226']
F['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111596422']
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111596649']
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2day/1deg_jra55_ryf.o111596899']

============================================================ FAILURES =============================================================
________________________________________ TestBitReproducibility.test_bit_repro_historical _________________________________________

self = <test_bit_reproducibility.TestBitReproducibility object at 0x14efc5b24dc0>                                                 
output_path = PosixPath('/scratch/tm70/aph502/test-model-repro')
control_path = PosixPath('/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf')                                        
checksum_path = PosixPath('/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf/testing/checksum/historical-3hr-checksum.json')

    @pytest.mark.checksum
    def test_bit_repro_historical(self, output_path: Path, control_path: Path,                                                    
                                  checksum_path: Path):
        """
        Test that a run reproduces historical checksums
        """
        # Setup checksum output directory
        # NOTE: The checksum output file is used as part of `repro-ci` workflow                                                   
        output_dir = output_path / 'checksum'
        output_dir.mkdir(parents=True, exist_ok=True)
        checksum_output_file =  output_dir / 'historical-3hr-checksum.json'                                                       
        if checksum_output_file.exists():
            checksum_output_file.unlink()

        # Setup and run experiment
        exp = setup_exp(control_path, output_path, "test_bit_repro_historical")                                                   
        exp.model.set_model_runtime()
        exp.setup_and_run()

        assert exp.model.output_exists()

        #Check checksum against historical checksum file
        hist_checksums = None
        hist_checksums_schema_version = None

        if not checksum_path.exists():  # AKA, if the config branch doesn't have a checksum, or the path is misconfigured         
            hist_checksums_schema_version = exp.model.default_schema_version                                                      
        else:  # we can use the historic-3hr-checksum that is in the testing directory                                            
            with open(checksum_path, 'r') as file:
                hist_checksums = json.load(file)

                # Parse checksums using the same version
                hist_checksums_schema_version = hist_checksums["schema_version"]                                                  

        checksums = exp.extract_checksums(schema_version=hist_checksums_schema_version)                                           

        # Write out checksums to output file
        with open(checksum_output_file, 'w') as file:
            json.dump(checksums, file, indent=2)

>       assert hist_checksums == checksums, f"Checksums were not equal. The new checksums have been written to {checksum_output_file}."
E       AssertionError: Checksums were not equal. The new checksums have been written to /scratch/tm70/aph502/test-model-repro/checksum/historical-3hr-checksum.json.
E       assert {'output': {'...ion': '1-0-0'} == {'output': {'...ion': '1-0-0'}                                                   
E         
E         Omitting 1 identical items, use -vv to show
E         Differing items:
E         {'output': {'Advection of u': ['0', '-5944066210705683418'], 'Advection of v': ['0', '-3606245701812142045'], 'Meridional velocity': ['9051849634365276068', '7718829051214123787'], 'Thickness%depth_st': ['-436572698594795605'], ...}} != {'output': {'Advection of u': ['0', '-5944066163830149791'], 'Advection of v': ['0', '-3606245664043050147'], 'Meridional velocity': ['9051849634365276068', '7718829052070798169'], 'Thickness%depth_st': ['-436572698594795605'], ...}}                                           
E         Use -v to get more diff

../test-code/test/test_bit_reproducibility.py:51: AssertionError
===================================================== short test summary info =====================================================
FAILED ../test-code/test/test_bit_reproducibility.py::TestBitReproducibility::test_bit_repro_historical - AssertionError: Checksums were not equal. The new checksums have been written to /scratch/tm70/aph502/test-model-repro/checksu...                          
====================================== 1 failed, 1 passed, 2 deselected in 684.46s (0:11:24) ======================================

SUCCESS! We have consistent behaviour with the spack +restart_repro build: historical does not reproduce, but restart repro does.

What is more, the checksums now match between the spack and COSIMA builds:

$ diff /scratch/tm70/aph502/test-model-repro/checksum/historical-3hr-checksum.json /scratch/tm70/aph502/test-model-repro-bkup5/checksum/historical-3hr-checksum.json
[aph502@gadi-login-02 access-om2]
$
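
The empty diff output above means the two checksum files are identical. Had they differed, a field-level comparison would be easier to read than a raw diff. Here is a minimal sketch of one, assuming the files use the layout seen in the assertion output earlier (a top-level 'output' mapping from field name to a list of checksum strings). differing_fields is a hypothetical helper, not part of the test suite, and the demo uses made-up values written to temporary files.

```python
import json
import tempfile
from pathlib import Path


def differing_fields(file_a: Path, file_b: Path) -> list:
    """Return sorted names of output fields whose checksums differ.

    Hypothetical helper for illustration only.
    """
    with open(file_a) as fa, open(file_b) as fb:
        out_a = json.load(fa)["output"]
        out_b = json.load(fb)["output"]
    # A field counts as different if it is missing from one file
    # or if its list of checksums is not identical.
    all_fields = set(out_a) | set(out_b)
    return sorted(f for f in all_fields if out_a.get(f) != out_b.get(f))


# Demo with made-up checksums, not real model output.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.json"
    b = Path(tmp) / "b.json"
    a.write_text(json.dumps({
        "schema_version": "1-0-0",
        "output": {"Advection of u": ["0", "1"], "Thickness%depth_st": ["5"]},
    }))
    b.write_text(json.dumps({
        "schema_version": "1-0-0",
        "output": {"Advection of u": ["0", "2"], "Thickness%depth_st": ["5"]},
    }))
    changed = differing_fields(a, b)
    identical = differing_fields(a, a)

print(changed)
print(identical)
```

An empty list plays the same role as the empty diff: no field differs between the two builds.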

Conclusion

Turning on reproducibility options that maintain restart reproducibility breaks historical reproducibility with previous runs that did not have these options turned on.
