Intro
Restart reproducibility means that a model will produce the same answers regardless of when the model is stopped and restarted (in model time).
The classic test for restart reproducibility is that two otherwise identically configured experiments that start from the same initial conditions and run for 2 days produce identical (bitwise reproducible) outputs, despite one experiment being run as a single 2-day segment and the other as two 1-day segments.
Note that a model can still be bitwise reproducible without restart reproducibility, but it requires that exactly the same run protocol is observed, with exactly the same run segment timing and length.
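To make the definition concrete, here is a toy illustration in Python. It is not MOM5 code: the two-time-level `step` function, the `run` helper, and the idea of a restart that drops the older time level are all assumptions, chosen only to show how an incomplete restart can change the answer while a complete one does not.

```python
def step(x, x_prev, dt=0.1):
    """One leapfrog-style step of a toy decay equation dx/dt = -0.5 * x."""
    x_new = x_prev + 2.0 * dt * (-0.5 * x)
    return x_new, x


def run(x, x_prev, nsteps):
    """Advance the toy model by nsteps steps and return both time levels."""
    for _ in range(nsteps):
        x, x_prev = step(x, x_prev)
    return x, x_prev


def is_restart_reproducible(save_both_levels, steps_per_day=10):
    """Compare one 2-day run against two 1-day segments, bit for bit."""
    x0, x_prev0 = 1.0, 1.0

    # Single segment: 2 days in one go.
    single, _ = run(x0, x_prev0, 2 * steps_per_day)

    # Two segments: stop after day 1, write a "restart", then continue.
    x, x_prev = run(x0, x_prev0, steps_per_day)
    if not save_both_levels:
        # Hypothetical incomplete restart: the older time level is lost.
        x_prev = x
    split, _ = run(x, x_prev, steps_per_day)

    # float.hex() is an exact representation, so this is a bitwise comparison.
    return single.hex() == split.hex()


print(is_restart_reproducible(save_both_levels=True))   # True
print(is_restart_reproducible(save_both_levels=False))  # False
```

With the full state saved, the split run performs exactly the same floating point operations as the single run, so the results agree bit for bit. With part of the state missing, the second segment starts from a slightly different state and the answers diverge.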
When restart reproducibility testing was turned back on in CI, the historical checksums no longer seemed to match.
Quick summary
Making +restart_repro default True breaks bitwise reproducibility with previous experiments.
Making this change will mean future experiments cannot reproduce past experiments.
Investigation
The change is just in the mom5 build, as the other models have bit repro flags on by default.
The change was introduced here:
But when @jo-basevi put the restart repro tests back into the CI repro testing, the historical repro tests failed, while the restart repro tests passed (as expected).
Detailed below is the attempt to verify these results, and to confirm that it was not an artefact of changes to the spack build that caused this.
If a COSIMA build of mom5 with the same restart repro options produced the same outcome (non-reproducible with "historical" runs), and that build was bitwise reproducible with the spack build, then we can confirm that the change is real and a result of adding restart reproducibility to MOM5.
Testing setup
So first I set up a test directory and cloned a config and the test suite
$ pwd
/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf
$ git clone https://github.com/ACCESS-NRI/access-om2-configs/ test-code
Cloning into 'test-code'...
remote: Enumerating objects: 5559, done.
remote: Counting objects: 100% (3416/3416), done.
remote: Compressing objects: 100% (1313/1313), done.
remote: Total 5559 (delta 2396), reused 3074 (delta 2072), pack-reused 2143
Receiving objects: 100% (5559/5559), 1.45 MiB | 7.58 MiB/s, done.
Resolving deltas: 100% (3824/3824), done.
$ payu clone -B release-1deg_jra55_ryf https://github.com/ACCESS-NRI/access-om2-configs.git 1deg_jra55_ryf
Cloned repository from https://github.com/ACCESS-NRI/access-om2-configs.git to directory: /g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf
Checked out branch: release-1deg_jra55_ryf
laboratory path: /scratch/tm70/aph502/access-om2
binary path: /scratch/tm70/aph502/access-om2/bin
input path: /scratch/tm70/aph502/access-om2/input
work path: /scratch/tm70/aph502/access-om2/work
archive path: /scratch/tm70/aph502/access-om2/archive
Updated metadata. Experiment UUID: 8bf3f9b0-9246-4fa8-9cac-5b22f5ced4b5
Added archive symlink to /scratch/tm70/aph502/access-om2/archive/1deg_jra55_ryf-release-1deg_jra55_ryf-8bf3f9b0
To change directory to control directory run:
cd 1deg_jra55_ryf
Then ran the current tests to confirm that historical bit repro was still passing.
First attempt to run test failed
Then ran this, which specifies a non-default path for the temporary testing files:
pytest -s --output-path ../test-model-repro ../test-code/test -m checksum
This gave errors about not being able to access the executable:
--------------------------------------------------------------------------
mpirun was unable to launch the specified application as it could not access or execute an executable:
Executable: ../test-model-repro/lab/work/1deg_jra55_ryf-test_bit_repro_historical/atmosphere/yatm.exe
Node: gadi-cpu-clx-2151
while attempting to start process rank 0.
--------------------------------------------------------------------------
Path to executable seems fine
$ ls -l ../test-model-repro/lab/work/1deg_jra55_ryf-test_bit_repro_historical/atmosphere/yatm.exe
lrwxrwxrwx 1 aph502 tm70 158 Mar 23 10:50 ../test-model-repro/lab/work/1deg_jra55_ryf-test_bit_repro_historical/atmosphere/yatm.exe -> '/g/data/vk83/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/libaccessom2-git.2023.10.26=2023.10.26-ieiy3e7hidn4dzaqly3ly2yu45mecgq4/bin/yatm.exe'
Added storage flags in case that was an issue
$ git diff config.yaml
diff --git a/config.yaml b/config.yaml
index c851966..c44860f 100644
--- a/config.yaml
+++ b/config.yaml
@@ -92,6 +92,12 @@ env:
platform:
nodesize: 48
+storage:
+ gdata:
+ - tm70
+ - vk83
+
+
# sweep and resubmit on specific errors - see https://github.com/payu-org/payu/issues/241#issuecomment-610739771
userscripts:
error: tools/resub.sh
Run tests in default location
The next step was to try the default output location, which worked:
$ pytest -s ../test-code/test -m checksum
======================================================= test session starts =======================================================
platform linux -- Python 3.9.18, pytest-8.0.1, pluggy-1.4.0
rootdir: /g/data/tm70/aph502/access-om2-release/bitrepro
collected 4 items / 3 deselected / 1 selected
../test-code/test/test_bit_reproducibility.py ['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_bit_repro_historical/1deg_jra55_ryf.o111578873']
.
=========================================== 1 passed, 3 deselected in 210.03s (0:03:30) ===========================================
Run updated tests to replicate error
First step: unchanged model build
Checked out Jo's restart PR
$ git checkout origin/11-Add-restart-reproducibility-tests -b 11-Add-restart-reproducibility-tests
moved the test output dir to a backup location
$ mv test-model-repro test-model-repro-bkup2
and re-ran. Confirmed we get one fail (restart repro) and one pass (historical repro):
$ pytest -s ../test-code/test -m checksum
======================================================= test session starts =======================================================
platform linux -- Python 3.9.18, pytest-8.0.1, pluggy-1.4.0
rootdir: /g/data/tm70/aph502/access-om2-release/bitrepro
collected 4 items / 2 deselected / 2 selected
../test-code/test/test_bit_reproducibility.py ['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_bit_repro_historical/1deg_jra55_ryf.o111580543']
.['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111580661']
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111580711']
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2day/1deg_jra55_ryf.o111580819']
Unequal checksum: Zonal velocity: -747450584393924602
Unequal checksum: Meridional velocity: 5786533594912504816
Unequal checksum: Advection of u: -2427551909895310013
Unequal checksum: Advection of v: 8414702469715514620
Unequal checksum: rho(taup1): -7825281413282575106
Unequal checksum: pressure_at_depth: 1545293312163002545
Unequal checksum: denominator_r: 1030217802578759450
Unequal checksum: drhodT: 7040067001143210402
Unequal checksum: drhodS: -7443012524785806309
Unequal checksum: drhodz_zt: 4102980311895092158
Unequal checksum: temp: 5576915484520666247
Unequal checksum: salt: -5334557539792698373
Unequal checksum: age_global: 1282193799700320580
Unequal checksum: pot_temp: -3952700955953607101
Unequal checksum: frazil: -692106203165952893
Unequal checksum: ending agm_array: -4817981991690884677
Unequal checksum: ending rossby_radius: -313567614221891037
Unequal checksum: ending rossby_radius_raw: 8136091513197178599
Unequal checksum: ending bih_viscosity: -3604182982324579526
Unequal checksum: ending lap_viscosity: 6584646363536773466
Unequal checksum: thickness_sigma: -8986798025844122441
Unequal checksum: eta_t: 3592422156285373700
Unequal checksum: eta_u: 8622206305623958648
Unequal checksum: deta_dt: -4241427670142365519
Unequal checksum: eta_t_bar: -7394049172613969207
Unequal checksum: pbot_t: -5235402108008676132
Unequal checksum: pbot_u: 8416463192834025535
Unequal checksum: anompb: 5973341933699066300
Unequal checksum: ps: 8772801155042670188
Unequal checksum: grad_ps_1: -5830096513531237293
Unequal checksum: grad_ps_2: -8648781909002437020
Unequal checksum: udrho: -2926523761059235683
Unequal checksum: vdrho: -2962865050298577393
Unequal checksum: conv_rho_ud_t: 491504259505012306
Unequal checksum: source: -24801538497346018
Unequal checksum: eta smoother: -24801538497346018
Unequal checksum: eta_nonbouss: 2288190392690979502
Unequal checksum: forcing_u_bt: 1578950095382092263
Unequal checksum: forcing_v_bt: -5668783177147836576
Unequal checksum: Thickness%rho_dzt(taup1): 6728181715574374999
Unequal checksum: Thickness%rho_dzu(taup1): 6253132191740020748
Unequal checksum: Thickness%mass_u(taup1): -4338831531743958235
Unequal checksum: Thickness%rho_dzten(1): 21437755887756755
Unequal checksum: Thickness%rho_dzten(2): -8637438697470860740
Unequal checksum: Thickness%rho_dztr: 7534179586392920215
Unequal checksum: Thickness%rho_dzur: 8496882432575222725
Unequal checksum: Thickness%rho_dzt_tendency: 4303812782867412364
Unequal checksum: Thickness%dzt: 284301941868280081
Unequal checksum: Thickness%dzten(1): -6271744555322450474
Unequal checksum: Thickness%dzten(2): 3494164494412952332
Unequal checksum: Thickness%dztlo: -1575651486725050759
Unequal checksum: Thickness%dztup: -1575717921746418697
Unequal checksum: Thickness%dzt_dst: -2181590507461284892
Unequal checksum: Thickness%dzwt(k=0): -3934741698621914372
Unequal checksum: Thickness%dzwt(k=1:nk): -306092990192632628
Unequal checksum: Thickness%dzu: 8895240687531472467
Unequal checksum: Thickness%dzwu(k=0): 6327813631221876380
Unequal checksum: Thickness%dzwu(k=1:nk): 3309385497261750929
Unequal checksum: Thickness%depth_zt: -3045642215283562736
Unequal checksum: Thickness%geodepth_zt: 8337040943737743546
Unequal checksum: Thickness%depth_zu: -5563661097699173755
Unequal checksum: Thickness%depth_zwt: -762755888625758492
Unequal checksum: Thickness%depth_zwu: -5176151996342750639
Unequal checksum: Thickness%mass_en(1): 759811470748328892
Unequal checksum: Thickness%mass_en(2): 1245147370288381158
F
============================================================ FAILURES =============================================================
____________________________________________ TestBitReproducibility.test_restart_repro ____________________________________________
self = <test_bit_reproducibility.TestBitReproducibility object at 0x14ca4da4f1c0>
output_path = PosixPath('/scratch/tm70/aph502/test-model-repro')
control_path = PosixPath('/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf')
@pytest.mark.checksum
def test_restart_repro(self, output_path: Path, control_path: Path):
"""
Test that a run reproduces across restarts.
"""
# First do two short (1 day) runs.
exp_2x1day = setup_exp(control_path, output_path,
'test_restart_repro_2x1day')
# Reconfigure to a 1 day run.
exp_2x1day.model.set_model_runtime(seconds=86400)
# Now run twice.
exp_2x1day.setup_and_run()
exp_2x1day.force_qsub_run()
# Now do a single 2 day run
exp_2day = setup_exp(control_path, output_path,
'test_restart_repro_2day')
# Reconfigure
exp_2day.model.set_model_runtime(seconds=172800)
# Run once.
exp_2day.setup_and_run()
# Now compare the output between our two short and one long run.
checksums_1d_0 = exp_2x1day.extract_checksums()
checksums_1d_1 = exp_2x1day.extract_checksums(exp_2x1day.output001)
checksums_2d = exp_2day.extract_checksums()
# Use model specific comparision method for checksums
model = exp_2day.model
matching_checksums = model.check_checksums_over_restarts(
long_run_checksum=checksums_2d,
short_run_checksum_0=checksums_1d_0,
short_run_checksum_1=checksums_1d_1
)
if not matching_checksums:
# Write checksums out to file
with open(output_path / 'restart-1d-0-checksum.json', 'w') as file:
json.dump(checksums_1d_0, file, indent=2)
with open(output_path / 'restart-1d-1-checksum.json', 'w') as file:
json.dump(checksums_1d_1, file, indent=2)
with open(output_path / 'restart-2d-0-checksum.json', 'w') as file:
json.dump(checksums_2d, file, indent=2)
> assert matching_checksums
E assert False
../test-code/test/test_bit_reproducibility.py:125: AssertionError
===================================================== short test summary info =====================================================
FAILED ../test-code/test/test_bit_reproducibility.py::TestBitReproducibility::test_restart_repro - assert False
====================================== 1 failed, 1 passed, 2 deselected in 646.52s (0:10:46) ======================================
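The test above delegates the comparison to model.check_checksums_over_restarts, whose body is not shown in the traceback. As a rough sketch of what such a comparison could look like, the function below pools the checksums from the two 1-day segments and checks that every checksum from the 2-day run appears among them. The function name, the {"output": {field: [checksum, ...]}} layout, and the sample values are assumptions for illustration; the real implementation in the test suite may differ.

```python
def checksums_match_over_restarts(long_run, short_run_0, short_run_1):
    """Return True if every long-run checksum appears in one of the short runs.

    Each argument is assumed to look like {"output": {field: [checksum, ...]}}.
    """
    # Pool the checksums of both short segments, per field.
    pooled = {}
    for short_run in (short_run_0, short_run_1):
        for field, checksums in short_run["output"].items():
            pooled.setdefault(field, set()).update(checksums)

    matching = True
    for field, checksums in long_run["output"].items():
        for checksum in checksums:
            if checksum not in pooled.get(field, set()):
                # Report each mismatch, in the style of the output above.
                print(f"Unequal checksum: {field}: {checksum}")
                matching = False
    return matching


# Made-up checksum values, purely for illustration.
two_day = {"output": {"temp": ["11", "22"], "salt": ["33", "44"]}}
day_1 = {"output": {"temp": ["11"], "salt": ["33"]}}
day_2_good = {"output": {"temp": ["22"], "salt": ["44"]}}
day_2_bad = {"output": {"temp": ["99"], "salt": ["44"]}}

print(checksums_match_over_restarts(two_day, day_1, day_2_good))
print(checksums_match_over_restarts(two_day, day_1, day_2_bad))
```

Reporting every mismatching field rather than stopping at the first one makes it easier to see whether a failure is confined to a few diagnostics or affects the whole model state, as it does in the output above.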
Use +restart_repro build
Updated exe to point to the pre-release environment in /g/data/vk83/prerelease/apps/spack/0.20/spack/var/spack/environments/access-om2-2024_03_0-4 from this PR that updates to the newer restart repro
$ grep mom5 /g/data/vk83/prerelease/apps/spack/0.20/spack/var/spack/environments/access-om2-2024_03_0-4/spack.location
mom5@git.2023.11.09=2023.11.09 /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-sg2jw6gpammwdvme5npli7oas7uicj5x
Showing the updated exe paths:
$ git diff HEAD^
diff --git a/config.yaml b/config.yaml
index c851966..5d44f70 100644
--- a/config.yaml
+++ b/config.yaml
@@ -41,7 +41,7 @@ submodels:
- name: ocean
model: mom
- exe: /g/data/vk83/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-ewcdbrfukblyjxpkhd3mfkj4yxfolal4/bin/fms_ACCESS-OM.x
+ exe: /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-sg2jw6gpammwdvme5npli7oas7uicj5x/bin/fms_ACCESS-OM.x
Running again, the historical test fails but the restart repro test passes, reproducing Jo's finding.
$ pytest -s ../test-code/test -m checksum
======================================================= test session starts =======================================================
platform linux -- Python 3.9.18, pytest-8.0.1, pluggy-1.4.0
rootdir: /g/data/tm70/aph502/access-om2-release/bitrepro
collected 4 items / 2 deselected / 2 selected
../test-code/test/test_bit_reproducibility.py ['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_bit_repro_historical/1deg_jra55_ryf.o111582364']
F['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111582399']
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111582526']
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2day/1deg_jra55_ryf.o111582819']
.
============================================================ FAILURES =============================================================
________________________________________ TestBitReproducibility.test_bit_repro_historical _________________________________________
self = <test_bit_reproducibility.TestBitReproducibility object at 0x151923fb6c70>
output_path = PosixPath('/scratch/tm70/aph502/test-model-repro')
control_path = PosixPath('/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf')
checksum_path = PosixPath('/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf/testing/checksum/historical-3hr-checksum.json')
@pytest.mark.checksum
def test_bit_repro_historical(self, output_path: Path, control_path: Path,
checksum_path: Path):
"""
Test that a run reproduces historical checksums
"""
# Setup checksum output directory
# NOTE: The checksum output file is used as part of `repro-ci` workflow
output_dir = output_path / 'checksum'
output_dir.mkdir(parents=True, exist_ok=True)
checksum_output_file = output_dir / 'historical-3hr-checksum.json'
if checksum_output_file.exists():
checksum_output_file.unlink()
# Setup and run experiment
exp = setup_exp(control_path, output_path, "test_bit_repro_historical")
exp.model.set_model_runtime()
exp.setup_and_run()
assert exp.model.output_exists()
#Check checksum against historical checksum file
hist_checksums = None
hist_checksums_schema_version = None
if not checksum_path.exists(): # AKA, if the config branch doesn't have a checksum, or the path is misconfigured
hist_checksums_schema_version = exp.model.default_schema_version
else: # we can use the historic-3hr-checksum that is in the testing directory
with open(checksum_path, 'r') as file:
hist_checksums = json.load(file)
# Parse checksums using the same version
hist_checksums_schema_version = hist_checksums["schema_version"]
checksums = exp.extract_checksums(schema_version=hist_checksums_schema_version)
# Write out checksums to output file
with open(checksum_output_file, 'w') as file:
json.dump(checksums, file, indent=2)
> assert hist_checksums == checksums, f"Checksums were not equal. The new checksums have been written to {checksum_output_file}."
E AssertionError: Checksums were not equal. The new checksums have been written to /scratch/tm70/aph502/test-model-repro/checksum/historical-3hr-checksum.json.
E assert {'output': {'...ion': '1-0-0'} == {'output': {'...ion': '1-0-0'}
E
E Omitting 1 identical items, use -vv to show
E Differing items:
E {'output': {'Advection of u': ['0', '-5944066210705683418'], 'Advection of v': ['0', '-3606245701812142045'], 'Meridional velocity': ['9051849634365276068', '7718829051214123787'], 'Thickness%depth_st': ['-436572698594795605'], ...}} != {'output': {'Advection of u': ['0', '-5944066163830149791'], 'Advection of v': ['0', '-3606245664043050147'], 'Meridional velocity': ['9051849634365276068', '7718829052070798169'], 'Thickness%depth_st': ['-436572698594795605'], ...}}
E Use -v to get more diff
../test-code/test/test_bit_reproducibility.py:51: AssertionError
===================================================== short test summary info =====================================================
FAILED ../test-code/test/test_bit_reproducibility.py::TestBitReproducibility::test_bit_repro_historical - AssertionError: Checksums were not equal. The new checksums have been written to /scratch/tm70/aph502/test-model-repro/checksu...
====================================== 1 failed, 1 passed, 2 deselected in 556.42s (0:09:16) ======================================
Just naively try the most recent pre-release build, as nci-openmpi hadn't been correctly turned off in the previous build.
$ grep mom5 /g/data/vk83/prerelease/apps/spack/0.20/spack/var/spack/environments/access-om2-2024_03_0-5/spack.location
mom5@git.2023.11.09=2023.11.09 /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-qji4nlmr6utrribaiyhewe4je6mifguz
Show details of config diff
$ git diff
diff --git a/config.yaml b/config.yaml
index 5d44f70..d7d1ac3 100644
--- a/config.yaml
+++ b/config.yaml
@@ -41,7 +41,7 @@ submodels:
- name: ocean
model: mom
- exe: /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-sg2jw6gpammwdvme5npli7oas7uicj5x/bin/fms_ACCESS-OM.x
+ exe: /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-qji4nlmr6utrribaiyhewe4je6mifguz/bin/fms_ACCESS-OM.x
input:
- /g/data/vk83/experiments/inputs/access-om2/ocean/grids/mosaic/global.1deg/2020.05.30/grid_spec.nc
- /g/data/vk83/experiments/inputs/access-om2/ocean/grids/mosaic/global.1deg/2020.05.30/ocean_hgrid.nc
diff --git a/manifests/exe.yaml b/manifests/exe.yaml
index 3cf5dd2..f934389 100644
--- a/manifests/exe.yaml
+++ b/manifests/exe.yaml
@@ -12,7 +12,7 @@ work/ice/cice_auscom_360x300_24x1_24p.exe:
binhash: 6bff005e04c23c579f37b7b2c0189793
md5: 5e7c7ba864da95cd1329d098f1e47776
work/ocean/fms_ACCESS-OM.x:
- fullpath: /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-sg2jw6gpammwdvme5npli7oas7uicj5x/bin/fms_ACCESS-OM.x
+ fullpath: /g/data/vk83/prerelease/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-qji4nlmr6utrribaiyhewe4je6mifguz/bin/fms_ACCESS-OM.x
hashes:
- binhash: 4f791838e696d241e1839f4a60405083
- md5: c44d552cb9131f7ceeeaca975254eb46
+ binhash: d088e1384d7449e15b403154525cf894
+ md5: 960c43c8f2cbd0ca6fc4946034b07f3c
[aph502@gadi-login-02 1deg_jra55_ryf]$ git commit -a -m 'Update mom5 exe to access-om2-2024_03_0-5 pre-release'
[release-1deg_jra55_ryf 55cae2e] Update mom5 exe to access-om2-2024_03_0-5 pre-release
2 files changed, 4 insertions(+), 4 deletions(-)
Backed up output directory
$ mv test-model-repro test-model-repro-bkup4
Run again
$ pytest -s ../test-code/test -m checksum
With exactly the same result as above. Doh!
Confirm with COSIMA build
To rule out any issue with the spack build, built MOM5 with the COSIMA build script.
[Had a red herring with a "bad build" but the details are not important, so deleted]
Current build options
Built a new COSIMA mom5 executable from scratch with a freshly cloned repo (/g/data/tm70/aph502/access-om2-release/bitrepro/access-om2-orig), backed up the previous test dir, replaced the exe path with this new build and ran again.
$ mv /scratch/tm70/aph502/test-model-repro /scratch/tm70/aph502/test-model-repro-bkup6
Config diff
$ git diff
diff --git a/config.yaml b/config.yaml
index 6df99cd..62ade77 100644
--- a/config.yaml
+++ b/config.yaml
@@ -41,7 +41,7 @@ submodels:
- name: ocean
model: mom
- exe: /g/data/tm70/aph502/access-om2-release/bitrepro/access-om2/src/mom/exec/nci/ACCESS-OM/fms_ACCESS-OM.x
+ exe: /g/data/tm70/aph502/access-om2-release/bitrepro/access-om2-orig/src/mom/exec/nci/ACCESS-OM/fms_ACCESS-OM.x
input:
- /g/data/vk83/experiments/inputs/access-om2/ocean/grids/mosaic/global.1deg/2020.05.30/grid_spec.nc
- /g/data/vk83/experiments/inputs/access-om2/ocean/grids/mosaic/global.1deg/2020.05.30/ocean_hgrid.nc
[aph502@gadi-login-02 1deg_jra55_ryf]$ git commit -a -m 'Original COSIMA build exe, no --repro flag'
[release-1deg_jra55_ryf 2b6cbf5] Original COSIMA build exe, no --repro flag
1 file changed, 1 insertion(+), 1 deletion(-)
We expect the historical test to pass and the restart repro test to fail.
$ pytest -s ../test-code/test -m checksum
======================================================= test session starts =======================================================
platform linux -- Python 3.9.18, pytest-8.0.1, pluggy-1.4.0
rootdir: /g/data/tm70/aph502/access-om2-release/bitrepro
collected 4 items / 2 deselected / 2 selected
../test-code/test/test_bit_reproducibility.py ['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_bit_repro_historical/1deg_jra55_ryf.o111593502']
.['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111593610']
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111593710']
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2day/1deg_jra55_ryf.o111593897']
Unequal checksum: Zonal velocity: -747450584393924602
Unequal checksum: Meridional velocity: 5786533594912504816
Unequal checksum: Advection of u: -2427551909895310013
Unequal checksum: Advection of v: 8414702469715514620
Unequal checksum: rho(taup1): -7825281413282575106
Unequal checksum: pressure_at_depth: 1545293312163002545
Unequal checksum: denominator_r: 1030217802578759450
Unequal checksum: drhodT: 7040067001143210402
Unequal checksum: drhodS: -7443012524785806309
Unequal checksum: drhodz_zt: 4102980311895092158
Unequal checksum: temp: 5576915484520666247
Unequal checksum: salt: -5334557539792698373
Unequal checksum: age_global: 1282193799700320580
Unequal checksum: pot_temp: -3952700955953607101
Unequal checksum: frazil: -692106203165952893
Unequal checksum: ending agm_array: -4817981991690884677
Unequal checksum: ending rossby_radius: -313567614221891037
Unequal checksum: ending rossby_radius_raw: 8136091513197178599
Unequal checksum: ending bih_viscosity: -3604182982324579526
Unequal checksum: ending lap_viscosity: 6584646363536773466
Unequal checksum: thickness_sigma: -8986798025844122441
Unequal checksum: eta_t: 3592422156285373700
Unequal checksum: eta_u: 8622206305623958648
Unequal checksum: deta_dt: -4241427670142365519
Unequal checksum: eta_t_bar: -7394049172613969207
Unequal checksum: pbot_t: -5235402108008676132
Unequal checksum: pbot_u: 8416463192834025535
Unequal checksum: anompb: 5973341933699066300
Unequal checksum: ps: 8772801155042670188
Unequal checksum: grad_ps_1: -5830096513531237293
Unequal checksum: grad_ps_2: -8648781909002437020
Unequal checksum: udrho: -2926523761059235683
Unequal checksum: vdrho: -2962865050298577393
Unequal checksum: conv_rho_ud_t: 491504259505012306
Unequal checksum: source: -24801538497346018
Unequal checksum: eta smoother: -24801538497346018
Unequal checksum: eta_nonbouss: 2288190392690979502
Unequal checksum: forcing_u_bt: 1578950095382092263
Unequal checksum: forcing_v_bt: -5668783177147836576
Unequal checksum: Thickness%rho_dzt(taup1): 6728181715574374999
Unequal checksum: Thickness%rho_dzu(taup1): 6253132191740020748
Unequal checksum: Thickness%mass_u(taup1): -4338831531743958235
Unequal checksum: Thickness%rho_dzten(1): 21437755887756755
Unequal checksum: Thickness%rho_dzten(2): -8637438697470860740
Unequal checksum: Thickness%rho_dztr: 7534179586392920215
Unequal checksum: Thickness%rho_dzur: 8496882432575222725
Unequal checksum: Thickness%rho_dzt_tendency: 4303812782867412364
Unequal checksum: Thickness%dzt: 284301941868280081
Unequal checksum: Thickness%dzten(1): -6271744555322450474
Unequal checksum: Thickness%dzten(2): 3494164494412952332
Unequal checksum: Thickness%dztlo: -1575651486725050759
Unequal checksum: Thickness%dztup: -1575717921746418697
Unequal checksum: Thickness%dzt_dst: -2181590507461284892
Unequal checksum: Thickness%dzwt(k=0): -3934741698621914372
Unequal checksum: Thickness%dzwt(k=1:nk): -306092990192632628
Unequal checksum: Thickness%dzu: 8895240687531472467
Unequal checksum: Thickness%dzwu(k=0): 6327813631221876380
Unequal checksum: Thickness%dzwu(k=1:nk): 3309385497261750929
Unequal checksum: Thickness%depth_zt: -3045642215283562736
Unequal checksum: Thickness%geodepth_zt: 8337040943737743546
Unequal checksum: Thickness%depth_zu: -5563661097699173755
Unequal checksum: Thickness%depth_zwt: -762755888625758492
Unequal checksum: Thickness%depth_zwu: -5176151996342750639
Unequal checksum: Thickness%mass_en(1): 759811470748328892
Unequal checksum: Thickness%mass_en(2): 1245147370288381158
F
============================================================ FAILURES =============================================================
____________________________________________ TestBitReproducibility.test_restart_repro ____________________________________________
self = <test_bit_reproducibility.TestBitReproducibility object at 0x14ff010ed280>
output_path = PosixPath('/scratch/tm70/aph502/test-model-repro')
control_path = PosixPath('/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf')
@pytest.mark.checksum
def test_restart_repro(self, output_path: Path, control_path: Path):
"""
Test that a run reproduces across restarts.
"""
# First do two short (1 day) runs.
exp_2x1day = setup_exp(control_path, output_path,
'test_restart_repro_2x1day')
# Reconfigure to a 1 day run.
exp_2x1day.model.set_model_runtime(seconds=86400)
# Now run twice.
exp_2x1day.setup_and_run()
exp_2x1day.force_qsub_run()
# Now do a single 2 day run
exp_2day = setup_exp(control_path, output_path,
'test_restart_repro_2day')
# Reconfigure
exp_2day.model.set_model_runtime(seconds=172800)
# Run once.
exp_2day.setup_and_run()
# Now compare the output between our two short and one long run.
checksums_1d_0 = exp_2x1day.extract_checksums()
checksums_1d_1 = exp_2x1day.extract_checksums(exp_2x1day.output001)
checksums_2d = exp_2day.extract_checksums()
# Use model specific comparision method for checksums
model = exp_2day.model
matching_checksums = model.check_checksums_over_restarts(
long_run_checksum=checksums_2d,
short_run_checksum_0=checksums_1d_0,
short_run_checksum_1=checksums_1d_1
)
if not matching_checksums:
# Write checksums out to file
with open(output_path / 'restart-1d-0-checksum.json', 'w') as file:
json.dump(checksums_1d_0, file, indent=2)
with open(output_path / 'restart-1d-1-checksum.json', 'w') as file:
json.dump(checksums_1d_1, file, indent=2)
with open(output_path / 'restart-2d-0-checksum.json', 'w') as file:
json.dump(checksums_2d, file, indent=2)
> assert matching_checksums
E assert False
../test-code/test/test_bit_reproducibility.py:125: AssertionError
===================================================== short test summary info =====================================================
FAILED ../test-code/test/test_bit_reproducibility.py::TestBitReproducibility::test_restart_repro - assert False
====================================== 1 failed, 1 passed, 2 deselected in 700.52s (0:11:40) ======================================
Which is what we see. Hooray! Without --repro, the COSIMA and spack builds are bit repro: both reproduce the same historical checksums.
Add --repro option
Ok, so backup again
$ mv /scratch/tm70/aph502/test-model-repro /scratch/tm70/aph502/test-model-repro-bkup7
Copy in the COSIMA build executable with --repro (built from exactly the same directory as the previous one, just added the --repro option in install.sh):
/g/data/tm70/aph502/access-om2-release/bitrepro/access-om2-orig/src/mom/exec/nci/ACCESS-OM/fms_ACCESS-OM.x
and run again!
$ pytest -s ../test-code/test -m checksum
======================================================= test session starts =======================================================
platform linux -- Python 3.9.18, pytest-8.0.1, pluggy-1.4.0
rootdir: /g/data/tm70/aph502/access-om2-release/bitrepro
collected 4 items / 2 deselected / 2 selected
../test-code/test/test_bit_reproducibility.py ['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_bit_repro_historical/1deg_jra55_ryf.o111596226']
F['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111596422']
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2x1day/1deg_jra55_ryf.o111596649']
['/scratch/tm70/aph502/test-model-repro/control/1deg_jra55_ryf-test_restart_repro_2day/1deg_jra55_ryf.o111596899']
============================================================ FAILURES =============================================================
________________________________________ TestBitReproducibility.test_bit_repro_historical _________________________________________
self = <test_bit_reproducibility.TestBitReproducibility object at 0x14efc5b24dc0>
output_path = PosixPath('/scratch/tm70/aph502/test-model-repro')
control_path = PosixPath('/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf')
checksum_path = PosixPath('/g/data/tm70/aph502/access-om2-release/bitrepro/1deg_jra55_ryf/testing/checksum/historical-3hr-checksum.json')
@pytest.mark.checksum
def test_bit_repro_historical(self, output_path: Path, control_path: Path,
checksum_path: Path):
"""
Test that a run reproduces historical checksums
"""
# Setup checksum output directory
# NOTE: The checksum output file is used as part of `repro-ci` workflow
output_dir = output_path / 'checksum'
output_dir.mkdir(parents=True, exist_ok=True)
checksum_output_file = output_dir / 'historical-3hr-checksum.json'
if checksum_output_file.exists():
checksum_output_file.unlink()
# Setup and run experiment
exp = setup_exp(control_path, output_path, "test_bit_repro_historical")
exp.model.set_model_runtime()
exp.setup_and_run()
assert exp.model.output_exists()
#Check checksum against historical checksum file
hist_checksums = None
hist_checksums_schema_version = None
if not checksum_path.exists(): # AKA, if the config branch doesn't have a checksum, or the path is misconfigured
hist_checksums_schema_version = exp.model.default_schema_version
else: # we can use the historic-3hr-checksum that is in the testing directory
with open(checksum_path, 'r') as file:
hist_checksums = json.load(file)
# Parse checksums using the same version
hist_checksums_schema_version = hist_checksums["schema_version"]
checksums = exp.extract_checksums(schema_version=hist_checksums_schema_version)
# Write out checksums to output file
with open(checksum_output_file, 'w') as file:
json.dump(checksums, file, indent=2)
> assert hist_checksums == checksums, f"Checksums were not equal. The new checksums have been written to {checksum_output_file}."
E AssertionError: Checksums were not equal. The new checksums have been written to /scratch/tm70/aph502/test-model-repro/checksum/historical-3hr-checksum.json.
E assert {'output': {'...ion': '1-0-0'} == {'output': {'...ion': '1-0-0'}
E
E Omitting 1 identical items, use -vv to show
E Differing items:
E {'output': {'Advection of u': ['0', '-5944066210705683418'], 'Advection of v': ['0', '-3606245701812142045'], 'Meridional velocity': ['9051849634365276068', '7718829051214123787'], 'Thickness%depth_st': ['-436572698594795605'], ...}} != {'output': {'Advection of u': ['0', '-5944066163830149791'], 'Advection of v': ['0', '-3606245664043050147'], 'Meridional velocity': ['9051849634365276068', '7718829052070798169'], 'Thickness%depth_st': ['-436572698594795605'], ...}}
E Use -v to get more diff
../test-code/test/test_bit_reproducibility.py:51: AssertionError
===================================================== short test summary info =====================================================
FAILED ../test-code/test/test_bit_reproducibility.py::TestBitReproducibility::test_bit_repro_historical - AssertionError: Checksums were not equal. The new checksums have been written to /scratch/tm70/aph502/test-model-repro/checksu...
====================================== 1 failed, 1 passed, 2 deselected in 684.46s (0:11:24) ======================================
SUCCESS! We have consistent behaviour with the spack +restart_repro build: historical does not reproduce, but restart repro does.
What is more, the checksums now match between the spack and COSIMA builds:
$ diff /scratch/tm70/aph502/test-model-repro/checksum/historical-3hr-checksum.json /scratch/tm70/aph502/test-model-repro-bkup5/checksum/historical-3hr-checksum.json
[aph502@gadi-login-02 access-om2]$
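The empty diff output means the two checksum files are identical. When two files do differ, a field-by-field comparison can be easier to read than a raw diff. The following is a minimal sketch, assuming the {"schema_version": ..., "output": {field: [checksum, ...]}} layout visible in the pytest output earlier; the function name differing_fields is hypothetical, and the sample values are a small subset copied from the failing historical comparison.

```python
def differing_fields(checksums_a, checksums_b):
    """Return the sorted names of fields whose checksum lists differ."""
    out_a = checksums_a["output"]
    out_b = checksums_b["output"]
    # Include fields present in only one of the two files.
    all_fields = set(out_a) | set(out_b)
    return sorted(f for f in all_fields if out_a.get(f) != out_b.get(f))


# Subset of the values reported by the failing historical test.
historical = {
    "schema_version": "1-0-0",
    "output": {
        "Advection of u": ["0", "-5944066210705683418"],
        "Thickness%depth_st": ["-436572698594795605"],
    },
}
restart_repro_build = {
    "schema_version": "1-0-0",
    "output": {
        "Advection of u": ["0", "-5944066163830149791"],
        "Thickness%depth_st": ["-436572698594795605"],
    },
}

print(differing_fields(historical, restart_repro_build))
print(differing_fields(restart_repro_build, restart_repro_build))
```

To compare files on disk, each dictionary could be loaded with json.load from the corresponding historical-3hr-checksum.json before calling the function.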
Conclusion
Turning on reproducibility options that maintain restart reproducibility breaks historical reproducibility with previous runs that did not have these options turned on.
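The pass and fail outcomes reported in the test runs above can be tabulated to check the argument mechanically. This is only a bookkeeping sketch: the dictionary layout and the helper names are made up for illustration, and the entries are transcribed from the four test runs described in this post.

```python
# (build, restart repro option enabled) -> (historical test passed, restart test passed)
results = {
    ("spack", False): (True, False),
    ("spack", True): (False, True),
    ("COSIMA", False): (True, False),
    ("COSIMA", True): (False, True),
}


def builds_agree(results):
    """True if spack and COSIMA builds behave the same for each option setting."""
    return all(
        results[("spack", enabled)] == results[("COSIMA", enabled)]
        for enabled in (False, True)
    )


def option_breaks_historical(results):
    """True if, for every build, enabling the option flips historical repro to a failure."""
    return all(
        results[(build, False)][0] and not results[(build, True)][0]
        for build in ("spack", "COSIMA")
    )


print(builds_agree(results))
print(option_breaks_historical(results))
```

Both checks hold for the recorded outcomes, which is the basis for attributing the change to the restart reproducibility option rather than to the spack build.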