How to preserve reproducibility when applying perturbations in the event of a numerical instability crash

I’ve run an ACCESS-ESM1.5 simulation that has crashed. It’s a pre-industrial run with interactive carbon cycle enabled, except I’ve altered the land-cover such that there are no crops. It ran for 26 years before failing, so I doubt my modifications to the UM restart file are the cause.

The backtrace is as follows:

[gadi-cpu-clx-1378:253750:0:253750] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 253750) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000bf1fbf interpolation_()  /scratch/p66/txz599/UM/UM_ACCESS-ESM1p5_r343/submodels/UM/ummodel_hg3/ppsrc/atmosphere/dynamics_advection/interpolation.f90:1010
 2 0x0000000000cc9d2b ritchie_()  /scratch/p66/txz599/UM/UM_ACCESS-ESM1p5_r343/submodels/UM/ummodel_hg3/ppsrc/atmosphere/dynamics_advection/ritchie.f90:2626
 3 0x00000000009dfb7d departure_point_()  /scratch/p66/txz599/UM/UM_ACCESS-ESM1p5_r343/submodels/UM/ummodel_hg3/ppsrc/atmosphere/dynamics_advection/departure_point.f90:382
 4 0x00000000008aa46f sl_thermo_()  /scratch/p66/txz599/UM/UM_ACCESS-ESM1p5_r343/submodels/UM/ummodel_hg3/ppsrc/atmosphere/dynamics_advection/sl_thermo.f90:681
 5 0x00000000006ec376 ni_sl_thermo_()  /scratch/p66/txz599/UM/UM_ACCESS-ESM1p5_r343/submodels/UM/ummodel_hg3/ppsrc/atmosphere/dynamics_advection/ni_sl_thermo.f90:778
 6 0x00000000004bd235 atm_step_()  /scratch/p66/txz599/UM/UM_ACCESS-ESM1p5_r343/submodels/UM/ummodel_hg3/ppsrc/control/top_level/atm_step.f90:10423
 7 0x0000000000435bb0 u_model_()  /scratch/p66/txz599/UM/UM_ACCESS-ESM1p5_r343/submodels/UM/ummodel_hg3/ppsrc/control/top_level/u_model.f90:5008
 8 0x000000000041481e um_shell_()  /scratch/p66/txz599/UM/UM_ACCESS-ESM1p5_r343/submodels/UM/ummodel_hg3/ppsrc/control/top_level/um_shell.f90:3930
 9 0x000000000040d968 MAIN__()  /scratch/p66/txz599/UM/UM_ACCESS-ESM1p5_r343/submodels/UM/ummodel_hg3/ppsrc/control/top_level/flumeMain.f90:40
10 0x000000000040d8a2 main()  ???:0
11 0x000000000003ad85 __libc_start_main()  ???:0
12 0x000000000040d7ae _start()  ???:0
=================================

I’m told by Tilo and @RachelLaw that this is probably just a numerical instability error that arises from an unfortunate set of conditions in just a few grid cells (a “grid cell storm” apparently). While it seems to seldom occur under normal circumstances, it’s not unusual for the ESM1.5 version of the UM. Jhan mentioned that later versions of the UM are less susceptible to this problem. But since I’m stuck using ESM1.5 for now, workarounds will do.

According to the ARCCSS CMS wiki, the workaround is to restart the model from a previous state with some small perturbations to avoid the grid cell storm. This seems to have worked for me, but it creates some problems with reproducibility of the simulation.

I’ve been using payu, but the procedure to apply the perturbations is manual and completely external to payu or its configuration. Does anyone have any suggestions on how to keep track of when this perturbation script is applied, so that simulations can be accurately reproduced if this problem does occur? Should I call the script in pre.sh or should it be in a separate script? It seems like payu can’t handle multiple scripts in the setup field of config.yaml, because it wouldn’t execute them when I tried.

Also @MartinDix, the perturbations are noise generated by np.random.random(). This means that running the script again will produce different perturbations, which makes the run unreproducible (I still have the perturbed restart file, but if I lose it the run cannot be reproduced). There is a comment in the perturbation script saying that a seed should probably be set, but none has been.
What seed number should be passed to np.random.default_rng() to make this reproducible? The restart file’s year? or something else?
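
For illustration, here is a minimal sketch of how a fixed seed makes the noise repeatable. It is not the code in perturbIC.py: the function name `perturb`, the uniform noise range and the amplitude value are assumptions made up for this example. Note that `np.random.default_rng` needs NumPy 1.17 or later.

```python
import numpy as np


def perturb(field, seed, amplitude=0.01):
    """Return a copy of `field` with uniform noise in [-amplitude, amplitude) added.

    Hypothetical helper for illustration only; not taken from perturbIC.py.
    """
    rng = np.random.default_rng(seed)  # same seed -> same noise sequence
    noise = amplitude * (2.0 * rng.random(field.shape) - 1.0)
    return field + noise


# Stand-in for a field read from a restart file (made-up values).
theta = np.full((3, 4), 300.0)

a = perturb(theta, seed=2023)
b = perturb(theta, seed=2023)  # same seed: identical perturbation
c = perturb(theta, seed=2024)  # different seed: different perturbation

print(np.array_equal(a, b))
print(np.array_equal(a, c))
```

The particular seed value should not matter for reproducibility, as long as it is recorded somewhere alongside the run.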

I’ve done the above for my simulation, but I’ll leave it up to the NRI to decide the best solution to this.

~access/apps/pythonlib/umfile_utils/perturbIC.py is outdated and not the version I use myself with CM2.

At some stage I added a seed to get reproducibility (for the reasons you found) but didn’t copy this to the version in ~access (python2 vs python3 issues). I need to work out some module versioning and then I’ll update it.

I normally just use the default seed and if it still fails increase the amplitude. I don’t have a good systematic way of recording this, just noting it in my diary file.

Hi Tammas

payu keeps a ‘manifest’ – a list of the ancillary files as well as their checksums. This is added to the (local) git repository automatically. By local, I mean the .git subdirectory under the directory containing the config.yaml file – so if you delete that folder, the history is gone.

I’m fairly certain that this would pick up when and where the perturbation happened. If you want to be more specific, you can always commit a manual change: Make the changes, run

git commit -a

This will open an editor where you can leave a very explicit description of what you did in the commit log together with the changes you made.

The manifest and the git history do not track the ancillary files directly, only their name and checksums. If you need to preserve bit-reproducibility, you want to add the specific seed you used, and maybe even copy the file with a new name somewhere else. The script takes a specific seed with the -s <number> option:

Usage: perturbIC [-a amplitude] [-v variable (stashcode)] [-s seed] file
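
One possible way to keep that record next to the experiment is sketched below. It is only an assumption about how this could be done, not something payu or perturbIC.py provides: the helper names, the log file name `perturbation_log.json` and the choice of SHA-256 are all made up for this example, and the hash is not necessarily the same kind of checksum that payu stores in its manifest.

```python
import hashlib
import json
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large restart files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_perturbation(restart, seed, amplitude, log="perturbation_log.json"):
    """Append the seed, amplitude and file hash to a JSON log (hypothetical format)."""
    log_path = Path(log)
    entries = json.loads(log_path.read_text()) if log_path.exists() else []
    entries.append(
        {
            "file": str(restart),
            "seed": seed,
            "amplitude": amplitude,
            "sha256": sha256sum(restart),
        }
    )
    log_path.write_text(json.dumps(entries, indent=2) + "\n")
    return entries[-1]
```

Calling `record_perturbation` on the perturbed restart file, with the seed and amplitude that were passed to the script, and then committing the log file would put the details in the same git history as the manifest.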

I would also like @Aidan to have a quick read through my answer to check whether I made some obvious mistake here. He knows payu better than me.

Does this answer your questions?

Cheers
Holger

I’m confused. The script that Tammas shared definitely contains an option to set a seed: [-s seed] and it seems to me from looking at the code that this setting is actually acted on. Why do you think it’s not?

Is this code in a public GitHub repository @MartinDix?

The one on ~access does not. The one I shared above is a version I altered to set a seed, and I added that to my own setup. The open question is how it should be run. I had thought it should probably run automatically in the year that the crash occurs. But payu doesn’t seem to like multiple scripts being passed to setup in config.yaml.

For example, I can’t do something like:

userscripts:
    # Setup land-use changes.
    setup: |
        ./scripts/pre.sh /g/data/p66/tfl561/sensitivity_lu_map/1850_no_humans_CABLE_fraction.nc
        ./scripts/run_perturbIC.sh

The other question around what the seed should be probably doesn’t matter so much. I just set it to be the year the crash happens.

Tammas’s version has the seed. The access module version referred to on the CMS page doesn’t.

I can’t see any issue with what you’ve suggested @holger.

To be explicit, I would do as @holger suggests, with these specific steps

  1. Run perturbIC.py with a known seed
  2. Then payu setup, which will rewrite the manifest file with your new (perturbed) restart(s)
  3. git commit -a and write a commit message documenting the steps you have taken to perturb the restarts, with the seed value and the location of the script used
  4. payu run

You shouldn’t need to invoke a specific script with userscripts. If the run reproduces then it will reliably crash at the same location, and checking the git log will then give the instructions for what was done and how to reproduce it.

Thanks, folks!

Is there any plan to create an ACCESS-NRI FAQ or troubleshooting document for common problems like this? Or will the ARCCSS CMS wiki fill this purpose in the future?

It’s a timely question @tammasloughran. We discussed this topic at an ACCESS-NRI meeting yesterday because your topic was something that definitely needs to be captured in documentation in some way.

We decided to start using the knowledge base plugin for discourse. I have created a first topic for this describing what the knowledge base is, how it works and how to contribute:

I’d like to use this topic as a guinea pig. Can you read that guide and let me know if it is clear if you should add this topic to the knowledge base, and if so how you should go about doing so?

Thanks!

As a layman, I was a bit confused by some of the overly formal phrasing. For example, what is “knowledge-base functionality”? Some newfangled framework or technology? The Discourse documentation takes this even further: “easier surfacing of knowledge-base style topics across a defined set of categories and/or tags.” I think all of this should be stated in simpler terms. The last paragraph is more like what I was expecting, so it should come first, at the top of the Introduction. Something like “The knowledge-base is an easy-to-find collection of important forum topics.”

Then, to get more detail, the “What should be added to the knowledge base?” should immediately follow the Introduction, rather than “What is shown in the knowledge base?” which flows more nicely into the next section. Or just merge the what should be added and what is shown sections together.

What is the review process for content in the knowledge-base? When and where will that be done? And by whom?

Back to this topic. So I could either make a new topic with the full problem and solution and add that, or I could edit the top post to include the solution and add this topic to the knowledge-base. Is the latter not an option here? The guide does not make that clear.

Ah actually, I somehow missed the part where solutions are included in the knowledge-base. I think the former is good enough.

Thanks for the great feedback @tammasloughran. I’ve made some changes in line with your suggestions. It is a big improvement I think. Let me know if you still think it isn’t clear enough.

Good question. I have also updated the topic to make it clear that currently there is no process. I could make something up, but it would be fiction, as I haven’t talked to anyone about how this would work, and we currently don’t have the resources to commit to ACCESS-NRI doing this by ourselves.

Ultimately we’ll rely on the community to do this, and the community will have to contribute wikipedia style I think.

Hi Tammas,

I was wondering if you would be happy to run through the steps you take to get your version of the script working on Gadi? I’m working through some model crashes and so it would be great to use your version which lets you set a random seed.

I’ve copied the script over, but think I’m getting problems with python and numpy versions. When I load only the pythonlib/umfile_utils module, I receive an error about the ‘default_rng’ attribute not existing, while if I additionally load the conda/analysis3 module, numpy fails to import.

Many thanks,
Spencer

Just load the default python3 module, that should have a more recent version of numpy. For the rest of the umfile libraries, I just put them in the same directory.

module load python3
cd scripts
# Make sure that all the umfile manipulation scripts are in the same directory also.
ls
cicedumpdatemodify.py  pre.sh       README.md            set_restart_year.sh  um_fileheaders.pyc  umfile.pyc               update_um_year.py  warm-start-csiro.sh
perturbIC.py           __pycache__  restart_dump.astart  um_fileheaders.py    umfile.py           update_cable_vegfrac.py  utils.sh           warm-start-payu.sh

python3 perturbIC.py -s 2023 restart_dump.astart

Hi Spencer

I had no problem importing numpy with the conda/analysis3 module loaded. Which error message did you get?

Holger

Thanks Tammas! I’ll give this a go

Hi Holger,

when I try to import numpy with both the umfile_utils and conda/analysis3 modules loaded, I get the following error:

import numpy
Traceback (most recent call last):
  File "/apps/python2/2.7.16/lib/python2.7/site-packages/numpy-1.16.5-py2.7-linux-x86_64.egg/numpy/core/__init__.py", line 40, in <module>
    from . import multiarray
  File "/apps/python2/2.7.16/lib/python2.7/site-packages/numpy-1.16.5-py2.7-linux-x86_64.egg/numpy/core/multiarray.py", line 13, in <module>
    from . import overrides
  File "/apps/python2/2.7.16/lib/python2.7/site-packages/numpy-1.16.5-py2.7-linux-x86_64.egg/numpy/core/overrides.py", line 6, in <module>
    from numpy.core._multiarray_umath import (
ImportError: dynamic module does not define module export function (PyInit__multiarray_umath)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/apps/python2/2.7.16/lib/python2.7/site-packages/numpy-1.16.5-py2.7-linux-x86_64.egg/numpy/__init__.py", line 142, in <module>
    from . import core
  File "/apps/python2/2.7.16/lib/python2.7/site-packages/numpy-1.16.5-py2.7-linux-x86_64.egg/numpy/core/__init__.py", line 71, in <module>
    raise ImportError(msg)
ImportError: 

...

Original error was: dynamic module does not define module export function (PyInit__multiarray_umath)

It looks like it’s trying to load a python 2 version of numpy here.

Thanks!
Spencer

Yeah, it’s running the python2 interpreter, not the one from the conda/analysis environment.

Either you don’t have that one loaded, or some other module loaded later has overwritten which python interpreter to use.

You can use the command:

$ which python
/g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/python
$ which python3
/g/data/hh5/public/apps/miniconda3/envs/analysis3-22.10/bin/python3

to find out which interpreter it’s using. Or you can manually select the python interpreter of your choice by running

python3 ~access/apps/pythonlib/umfile_utils/perturbIC.py [-a amplitude] [-v variable (stashcode)] [-s seed] file
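
The same check can be made from inside Python, which also shows which numpy is being picked up. This is just a small diagnostic sketch, not part of umfile_utils. `default_rng` was added in NumPy 1.17, so the numpy 1.16.5 in the traceback above would not have it even if it imported cleanly.

```python
import sys

# Which interpreter is actually running this script?
print("interpreter:", sys.executable)
print("python version:", sys.version.split()[0])

try:
    import numpy as np
except ImportError as err:
    # Typical when the module search path points at a numpy built for another Python.
    print("numpy failed to import:", err)
else:
    print("numpy:", np.__version__, "from", np.__file__)
    # np.random.default_rng only exists in NumPy 1.17 and later.
    print("has default_rng:", hasattr(np.random, "default_rng"))
```

If the numpy path printed here sits under a python2 site-packages directory while the interpreter is python3, the loaded modules are probably interfering with each other.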

Holger

The umfile_utils module depends on python2. I don’t understand why, as I’m not sure there is any python2-specific code in it. Maybe one day someone can add a umfile_utils library module for python3.