Need help finding and interpreting ACCESS-ESM1-5 errors

I’m running some perturbation experiments with the pre-industrial control in ACCESS-ESM1-5 but since last week a number of them have started crashing during the run and I don’t know why.

I get the error message: “payu: Model exited with error code 139; aborting.” I can see this has been mentioned in other posts, but I don’t think those are relevant here because I started these experiments from the PI-02 restarts.
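For readers landing here with the same message: under the usual shell convention an exit status above 128 means the process was killed by signal (status minus 128), so 139 corresponds to signal 11. That is consistent with the segmentation fault reported in access.err further down. The helper below is only an illustrative sketch (the function name `describe_exit_code` is made up here and is not part of payu):

```python
import signal

def describe_exit_code(code):
    """Turn a shell-style exit status into a short readable description."""
    if code > 128:
        # Convention used by most shells: status = 128 + signal number
        signum = code - 128
        try:
            return "killed by " + signal.Signals(signum).name
        except ValueError:
            return "killed by unknown signal %d" % signum
    return "exited with status %d" % code

print(describe_exit_code(139))
```

On Linux, signal 11 is SIGSEGV, so the status reported by payu points to a segmentation fault somewhere in the model rather than to a payu problem as such.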

I’d like some help finding and interpreting the errors. An example control directory which has crashed at time step 4525 is here: /home/561/hd4873/PostDoc/ACCESS-ESM/access-esm-payu

The work directory is here: /scratch/e14/hd4873/access-esm/work/access-esm-payu-ocean-warm-upwelling_year780-6213760d

and the error logs are here: /scratch/e14/hd4873/access-esm/archive/access-esm-payu-ocean-warm-upwelling_year780-6213760d.

I think it crashed in the UM, but I’m not sure which atmosphere logs give more detail. This is a snippet from the access.err file:

[gadi-cpu-clx-0429:974522:0:974522] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
um7.3x             00000000012FCBC4  Unknown               Unknown  Unknown
libpthread-2.28.s  0000154A43EB5D20  Unknown               Unknown  Unknown
mca_pml_ucx.so     0000154A30389E27  mca_pml_ucx_recv      Unknown  Unknown
libmpi.so.40.20.2  0000154A444CAF55  MPI_Recv              Unknown  Unknown
libmpi_mpifh.so    0000154A447CA170  Unknown               Unknown  Unknown
um7.3x             0000000001111A08  mpl_recv_                  67  mpl_recv.F90
um7.3x             000000000110AF66  gc_rrecv_                 168  gc_rrecv.F90
um7.3x             0000000000989C5D  bi_linear_h_              613  bi_linear_h.f90
um7.3x             0000000000CC92BC  ritchie_                 2557  ritchie.f90
um7.3x             00000000009DFB7D  departure_point_          382  departure_point.f90
um7.3x             00000000008AA46F  sl_thermo_                681  sl_thermo.f90
um7.3x             00000000006EC376  ni_sl_thermo_             778  ni_sl_thermo.f90
um7.3x             00000000004BD235  Unknown               Unknown  Unknown
um7.3x             0000000000435BB0  Unknown               Unknown  Unknown
um7.3x             000000000041481E  um_shell_                3930  um_shell.f90
um7.3x             000000000040D968  MAIN__                     40  flumeMain.f90
um7.3x             000000000040D8A2  Unknown               Unknown  Unknown
libc-2.28.so       0000154A439037E5  __libc_start_main     Unknown  Unknown
um7.3x             000000000040D7AE  Unknown               Unknown  Unknown
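When reading a traceback like this, the frames with a real line number and source file are usually the useful ones. Below is a rough sketch for pulling them out of an access.err file; the regular expression assumes the five-column layout shown above (image, address, routine, line, source), and the names `FRAME` and `resolved_frames` are made up for this example. Note also that this particular traceback belongs to a process that received SIGTERM, so it may show where that rank was waiting when the job was aborted rather than where the segmentation fault itself happened.

```python
import re

# One traceback frame: image name, 16-digit hex address, routine, line, source
FRAME = re.compile(r"^(\S+)\s+([0-9A-Fa-f]{16})\s+(\S+)\s+(\S+)\s+(\S+)\s*$")

def resolved_frames(text):
    """Return (routine, line, source) for frames that have a numeric line."""
    frames = []
    for line in text.splitlines():
        m = FRAME.match(line.strip())
        # Frames without debug information have "Unknown" in the line column
        if m and m.group(4).isdigit():
            frames.append((m.group(3), int(m.group(4)), m.group(5)))
    return frames

sample = """\
mca_pml_ucx.so     0000154A30389E27  mca_pml_ucx_recv      Unknown  Unknown
um7.3x             0000000001111A08  mpl_recv_                  67  mpl_recv.F90
um7.3x             000000000110AF66  gc_rrecv_                 168  gc_rrecv.F90
um7.3x             0000000000989C5D  bi_linear_h_              613  bi_linear_h.f90
"""

for routine, lineno, source in resolved_frames(sample):
    print(routine, lineno, source)
```

The frames come out innermost first, so the first UM science routine after the MPI wrappers (here bi_linear_h_) is typically the one to search for.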

I know it might sound unlikely, but have you checked if the error occurs again if you simply do a sweep and re-run it?
ACCESS-ESM1.5 has a habit of crashing for unknown reasons, and I find it’s best to first check if the error occurs twice. (Sometimes it just runs fine the second time you try…)

Yep, I’ve tried that. If I recall correctly, it’s crashed at the exact same time step. I can try again now to confirm.

Twice is enough to confirm. Thanks Hannah.

Dr David Hutchinson (he/him)
ARC DECRA fellow in paleoclimate modelling
Climate Change Research Centre, UNSW Sydney
david.hutchinson@unsw.edu.au

yep, confirming it’s crashed at the same spot.

Have you tried perturbing the model using /projects/access/apps/pythonlib/umfile_utils/perturbIC.py? I encountered this error too, and I’m not sure why, but perturbing sometimes does the trick.

A crash in the bi_linear_h routine is often the result of the model becoming unstable, and if so it can be fixed with a small perturbation of the atmosphere, as @HIMADRI_SAINI suggested.

See also

http://climate-cms.wikis.unsw.edu.au/ACCESS#Coupled_Model_Crashes

Thanks for the suggestion @HIMADRI_SAINI. I’ve not tried this. I’ll give it a go.

@Aidan @HIMADRI_SAINI sorry, what’s the syntax for running this? I’ve just tried like so (from the CLEX instructions): /projects/access/apps/pythonlib/umfile_utils/perturbIC.py restart_dump.astart

and I get the following message:

  File "/projects/access/apps/pythonlib/umfile_utils/perturbIC.py", line 23
    print "Usage: perturbIC [-a amplitude] [-v variable (stashcode)] file"
          ^

So I tried copying the script and running perturbIC -arguments file as suggested above (what’s considered a small perturbation by the way?), but no luck there - think I’m getting the syntax wrong.

Hi Hannah,
This seems like a python2 / python3 problem. The perturbIC.py in the example uses the old Python 2 print statement, which is written without brackets. A couple of ways to deal with this (not sure what Himadri did…):

  • explicitly call the python2 interpreter
  • copy the script to a local directory and update it to use print() with brackets.

The first way is probably easier, because the script depends on the umfile.py and um_fileheaders.py scripts in the same directory.
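To illustrate the difference, here is a small check that can be run under either interpreter. It only tests whether a line of source parses; the helper name `parses_under_this_interpreter` is made up for this example, and the first string is the usage line from the error message above.

```python
import sys

# The usage line from perturbIC.py, written as a Python 2 print statement
py2_line = 'print "Usage: perturbIC [-a amplitude] [-v variable (stashcode)] file"'
# The same kind of line written as a function call, valid in Python 2 and 3
py3_line = 'print("Usage: perturbIC [-a amplitude] [-v variable (stashcode)] file")'

def parses_under_this_interpreter(source):
    """Return True if the running interpreter can compile the given source."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False

print("Python major version:", sys.version_info.major)
print("print statement parses:", parses_under_this_interpreter(py2_line))
print("print() call parses:", parses_under_this_interpreter(py3_line))
```

Under Python 3 the first form is rejected at compile time, which is why the script fails before doing any work.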

By default, the perturbIC script makes a perturbation of order 0.01 to the temperature field.

@dkhutch thanks! I loaded python2/2.1.7 and ran python2 /projects/access/apps/pythonlib/umfile_utils/perturbIC.py restart_dump.astart and I think that’s worked.

There’s also a version updated for python3 available in the pythonlib/umfile_utils/access_cm2 module. It differs in requiring a seed as an argument, which makes the perturbation reproducible:

module use ~access/modules
module load pythonlib/umfile_utils/access_cm2
perturbIC.py -h
usage: perturbIC.py [-h] [-a AMPLITUDE] -s SEED ifile

Perturb UM initial dump

positional arguments:
  ifile         Input file (modified in place)

options:
  -h, --help    show this help message and exit
  -a AMPLITUDE  Amplitude of perturbation
  -s SEED       Random number seed (must be non-negative integer)
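
As a rough illustration of why a seed makes the perturbation reproducible, here is a toy sketch. It is not the perturbIC.py code: the function name `perturb`, the uniform noise and the sample values are all assumptions made for this example, and the real script modifies the UM dump file in place.

```python
import random

def perturb(values, amplitude=0.01, seed=0):
    """Add uniform noise in [-amplitude, amplitude] to each value.

    A dedicated generator seeded with `seed` means the same seed always
    produces exactly the same noise.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-amplitude, amplitude) for v in values]

theta = [288.0, 287.5, 290.2]  # made-up temperatures in kelvin

a = perturb(theta, amplitude=0.01, seed=42)
b = perturb(theta, amplitude=0.01, seed=42)
c = perturb(theta, amplitude=0.01, seed=43)

print("same seed, same result:", a == b)
print("different seed, same result:", a == c)
```

Recording the seed and amplitude alongside the run is then enough for someone else to regenerate the same perturbed start.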

Great, that worked - thanks all for your help!

What does this mean for publication purposes though? At the moment I’m testing so it’s fine, but if this had to be done for an ensemble that I wanted to publish - is this perturbing method considered okay by the Earth System community if you use the reproducible method that @MartinDix posted? Or is it not really okay to publish runs that have had to be perturbed in this way?

My feeling is that making a small perturbation during spin up like this is not at all problematic for publication. Plenty of models have to do tricks like this… It is basically a non-issue. (If you want to document instances of perturbations like this then great, that’s probably better than what most people do.)

This was covered in a previous topic

Oops I hadn’t seen that, thank you for linking @Aidan.

No worries. Hope it is helpful.