Exp. crashing

Hello,
I managed to run a meltwater experiment for 20 years without issue, after which it crashed. I tried restarting it from the previous year, but it crashed again after 17 min of run time and exited with error code 9.


I am trying to include here the screenshot from access.err, which shows that it was a UM error: “overwriting due to dim_e_out size.”
I have been running another experiment, so I don’t think it is a space issue.

Any help appreciated!
Thanks,
Laurie

Hi @LaurieM,

The “over-writing due to dim_e_out size” error from the UM can unfortunately be difficult to debug, as it can have many different causes. It often happens when the UM runs into a random instability, in which case applying a small perturbation to the last atmosphere restart often helps. If the crash keeps recurring with different perturbations, it might point to something wrong in the setup.

There’s a script available in /g/data/access for applying perturbations to the atmosphere restart files. To run it, go to the latest archive/restartXYZ/atmosphere directory and run:

# Load dependencies
module use /g/data/xp65/public/modules
module load conda/analysis3

# Make a backup of the restart file
cp restart_dump.astart restart_backup

# Apply perturbation to restart file
/g/data/access/projects/access/apps/pythonlib/umfile_utils/access_cm2/perturbIC.py -s <random-seed> restart_dump.astart

<random-seed> can be any positive integer of your choice. It’s worth keeping track of the random seed in case you need to reproduce the perturbation in the future. An optional -a flag can be added to set the amplitude of the perturbation (default 0.01).
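
If you end up trying several seeds, a small wrapper can make the bookkeeping easier. The following is only a sketch: the function name perturb_restart, the PERTURB_CMD variable and the log file perturbation_seeds.log are hypothetical names chosen for illustration. Only the script path and the -s flag come from the commands above. It assumes the modules have already been loaded and that it is run from the atmosphere restart directory.

```shell
#!/usr/bin/env bash
# Sketch only: perturb_restart, PERTURB_CMD and perturbation_seeds.log are
# hypothetical names. Only the script path and the -s flag come from the
# commands above. Load the modules first, as shown above.

PERTURB_CMD="${PERTURB_CMD:-/g/data/access/projects/access/apps/pythonlib/umfile_utils/access_cm2/perturbIC.py}"

perturb_restart() {
    local restart="$1" seed="$2" log="${3:-perturbation_seeds.log}"

    # Accept only positive integers as seeds
    case "$seed" in
        ''|*[!0-9]*) echo "seed must be a positive integer" >&2; return 1 ;;
    esac
    if [ "$seed" -le 0 ]; then
        echo "seed must be a positive integer" >&2
        return 1
    fi

    if [ ! -f "$restart" ]; then
        echo "restart file not found: $restart" >&2
        return 1
    fi

    # Back up only once, so retries never overwrite the unperturbed original
    if [ ! -e restart_backup ]; then
        cp "$restart" restart_backup || return 1
    fi

    # Perturb in place, then record the seed only if the script succeeded
    "$PERTURB_CMD" -s "$seed" "$restart" || return 1
    printf '%s seed=%s file=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$seed" "$restart" >> "$log"
}
```

After sourcing this, a call such as `perturb_restart restart_dump.astart 12345` would create restart_backup if it does not exist yet, apply the perturbation, and append the seed to the log. The sketch does not restore restart_backup before a retry with a new seed; whether to do that is left to you.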

Once you’ve added the perturbation, you can try continuing the run to see if it gets around the error. Let me know if you run into any problems.
