I’ve run an ACCESS-ESM1.5 simulation that has crashed. It’s a pre-industrial run with interactive carbon cycle enabled, except I’ve altered the land-cover such that there are no crops. It ran for 26 years before failing, so I doubt my modifications to the UM restart file are the cause.
I’m told by Tilo and @RachelLaw that this is probably just a numerical instability error that arises from an unfortunate set of conditions in just a few grid cells (a “grid cell storm” apparently). While it seldom occurs under normal circumstances, it’s not unusual for the ESM1.5 version of the UM. Jhan mentioned that later versions of the UM are less susceptible to this problem. But since I’m stuck using ESM1.5 for now, workarounds will do.
According to the ARCCSS CMS wiki, the workaround is to restart the model from a previous state with some small perturbations to avoid the grid cell storm. This seems to have worked for me, but it creates some problems with reproducibility of the simulation.
I’ve been using payu, but the procedure to apply the perturbations is manual and completely external to payu or its configuration. Does anyone have any suggestions on how to keep track of when this perturbation script is applied, so that simulations can be accurately reproduced if this problem does occur? Should I call the script in pre.sh or should it be in a separate script? It seems like payu can’t handle multiple scripts in the setup field of config.yaml, because it wouldn’t execute them when I tried.
Also @MartinDix, the perturbations are noise generated by np.random.random(). This means that running the script again will produce different perturbations and so would make the run unreproducible (I still have the perturbed restart file, but if I lose that, then it’s unreproducible). There is a comment in the perturbation script that says a seed should probably be set, but one hasn’t been.
What seed number should be passed to np.random.default_rng() to make this reproducible? The restart file’s year? Or something else?
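To illustrate the seeding question, here is a minimal sketch, not the actual perturbIC.py code. The function name `make_perturbation`, the grid shape, the amplitude and the seed values are all hypothetical. It only shows that np.random.default_rng() returns identical noise for the same seed, so any integer works as long as it is recorded; using the restart year is just one convenient convention.

```python
import numpy as np


def make_perturbation(shape, amplitude, seed):
    """Return uniform noise in [-amplitude, amplitude) from a seeded generator.

    Hypothetical helper for illustration; the real script may scale its
    noise differently.
    """
    rng = np.random.default_rng(seed)
    # rng.random() samples [0, 1); rescale to a symmetric interval around zero.
    return amplitude * (2.0 * rng.random(shape) - 1.0)


# Assumed values: a tiny grid, a small amplitude, and a year-like seed.
first = make_perturbation((3, 4), 1e-2, seed=126)
second = make_perturbation((3, 4), 1e-2, seed=126)
other = make_perturbation((3, 4), 1e-2, seed=127)

print(np.array_equal(first, second))  # same seed, same noise
print(np.array_equal(first, other))   # different seed, different noise
```

If a first perturbation does not get past the crash, changing the seed (or the amplitude) and noting the new value keeps the run reproducible.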
I’ve done the above for my simulation, but I’ll leave it up to the NRI to decide the best solution to this.
(Martin Dix ACCESS-NRI Associate Director for Model Development)
~access/apps/pythonlib/umfile_utils/perturbIC.py is outdated and not the version I use myself with CM2.
At some stage I added a seed to get reproducibility (for the reasons you found) but didn’t copy this to the version in ~access (python2 vs python3 issues). I need to work out some module versioning and then I’ll update it.
I normally just use the default seed and if it still fails increase the amplitude. I don’t have a good systematic way of recording this, just noting it in my diary file.
payu keeps a ‘manifest’ – a list of the ancillary files as well as their checksums. This is added to the (local) git repository automatically. By local, I mean the .git subdirectory under the directory containing the config.yaml file – so if you delete that folder, the history is gone.
I’m fairly certain that this would pick up when and where the perturbation happened. If you want to be more specific, you can always commit a manual change: make the changes, then run
git commit -a
This will open an editor where you can leave a very explicit description of what you did in the commit log together with the changes you made.
The manifest and the git history do not track the ancillary files directly, only their names and checksums. If you need to preserve bit-reproducibility, you want to record the specific seed you used, and maybe even copy the file with a new name somewhere else. The script takes a specific seed with the -s <number> option.
I’m confused. The script that Tammas shared definitely contains an option to set a seed: [-s seed] and it seems to me from looking at the code that this setting is actually acted on. Why do you think it’s not?
The one on ~access does not. The one I shared above is a version I altered to set a seed, and I added that to my own setup. The open question is how it should be run. I had thought it should probably run automatically on the year that the crash occurs. But payu doesn’t seem to like multiple scripts being passed to setup in config.yaml.
I can’t see any issue with what you’ve suggested @holger.
To be explicit, I would do as @holger suggests, with these specific steps:
Run perturbIC.py with a known seed
Then payu setup, which will rewrite the manifest file with your new (perturbed) restart(s)
git commit -a and write a commit message documenting the steps you have taken to perturb the restarts, with the seed value and the location of the script used
You shouldn’t need to invoke a specific script with userscripts. If the run reproduces then it will reliably crash at the same location, and checking the git log will then give the instructions for what was done and how to reproduce it.
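The record-keeping part of the steps above could be sketched as follows. This is an illustration under assumptions, not part of payu or perturbIC.py: the helper names `file_md5` and `perturbation_commit_message`, the message layout, and the choice of MD5 as the checksum are all hypothetical. The idea is simply to put the seed, the script location and a checksum of the perturbed restart into the commit message so that the git log alone is enough to repeat the perturbation.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def perturbation_commit_message(restart_path, script_path, seed):
    """Build commit message text documenting how a restart was perturbed."""
    return "\n".join([
        "Perturb restart to avoid grid cell storm",
        "",
        f"script: {script_path}",
        f"seed: {seed}",
        f"restart: {restart_path}",
        f"md5 after perturbation: {file_md5(restart_path)}",
    ])
```

The returned text can be pasted into the editor that git commit -a opens, after the perturbation script and payu setup have been run.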
As a layman, I was a bit confused by some of the overly formal phrasing. For example, what is “knowledge-base functionality”? Some newfangled framework or technology? The Discourse documentation takes this to an even further extreme: “easier surfacing of knowledge-base style topics across a defined set of categories and/or tags.” I think all of this should be stated in simpler terms. The last paragraph is more like what I was expecting, so it should go at the top of the Introduction. Something like “The knowledge-base is an easy-to-find collection of important forum topics.”
Then, to get more detail, the “What should be added to the knowledge base?” should immediately follow the Introduction, rather than “What is shown in the knowledge base?” which flows more nicely into the next section. Or just merge the what should be added and what is shown sections together.
What is the review process for content in the knowledge-base? When and where will that be done? And by whom?
Back to this topic. So I could either make a new topic with the full problem and solution and add that, or I could edit the top post to include the solution and add this topic to the knowledge-base. Is the latter not an option here? The guide does not make that clear.
Thanks for the great feedback @tammasloughran. I’ve made some changes in line with your suggestions. It is a big improvement I think. Let me know if you still think it isn’t clear enough.
Good question. I have also updated the topic to make it clear that currently there is no process. I could make something up, but it would be fiction, as I haven’t talked to anyone about how this would work, and we currently don’t have the resources to commit to ACCESS-NRI doing this by ourselves.
Ultimately we’ll rely on the community to do this, and I think the community will have to contribute Wikipedia-style.