Reporting ESM1.6 netCDF conversion failures

The netCDF postprocessing job has been failing intermittently in ESM1.6 test simulations. @jo-basevi is planning to submit an NCI help ticket on the issue.

To help with the investigation, we’re hoping to keep a record of the failures people have encountered and when they happened. Reports of very recent failures would be especially useful, as NCI would be able to look into recent logs from Gadi.

If you run into a netCDF conversion failure, it would be great to attach the relevant error logs here.

Finding logs for failed jobs:

To find the logs for a failing job, from the experiment control directory you can run:

$ grep Exit UM_conversion_job.sh.o*
...
UM_conversion_job.sh.o144939834:   Exit Status:        0
UM_conversion_job.sh.o144946933:   Exit Status:        0
UM_conversion_job.sh.o144949411:   Exit Status:        0
UM_conversion_job.sh.o144953388:   Exit Status:        1
UM_conversion_job.sh.o144957035:   Exit Status:        0
UM_conversion_job.sh.o144964175:   Exit Status:        0
UM_conversion_job.sh.o144967382:   Exit Status:        0
...
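To pick out only the failing jobs, the grep above can be filtered on the exit status and each `.o` file mapped to its matching `.e` error log. This is just a convenience sketch (the awk/sed filter is my own, not part of any payu tooling), assuming the standard `UM_conversion_job.sh.o<jobid>` / `.e<jobid>` naming:

```shell
# List the .e error logs of conversion jobs that exited with a nonzero status.
# Run from the experiment control directory.
grep 'Exit Status' UM_conversion_job.sh.o* \
  | awk -F: '$NF+0 != 0 { print $1 }' \
  | sed 's/\.sh\.o/.sh.e/'
```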

and then inspect the .e error log for the failing job. E.g. in UM_conversion_job.sh.e144953388:

  File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/scipy/stats/_distn_infrastructure.py", line 26, in <module>
    from scipy import integrate
  File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/scipy/__init__.py", line 134, in __getattr__
    return _importlib.import_module(f'scipy.{name}')
  File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/scipy/integrate/__init__.py", line 106, in <module>
    from ._ode import *
  File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/scipy/integrate/_ode.py", line 90, in <module>
    from . import _dop
ImportError: /g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/scipy/integrate/_dop.cpython-310-x86_64-linux-gnu.so: cannot read file data: Input/output error
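Since the ImportError above is really the filesystem reporting EIO when Python tries to read the `.so`, one way to confirm that (and to check whether the error is transient) is to try reading the file directly, outside the import machinery. A minimal sketch, with `readable` a hypothetical helper name:

```python
from pathlib import Path

def readable(path: str) -> bool:
    """Return True if the whole file can be read; False on OSError (e.g. EIO)."""
    try:
        Path(path).read_bytes()
        return True
    except OSError:
        return False
```

If this returns False for the `.so` named in the traceback, the problem is with reading the file from `/g/data`, not with the conda environment itself.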

The Input/output error crash above seems to be the most common, however there have also been occasional walltime failures where jobs take much longer than usual. It would be helpful to keep track of any of these failures too.

What to include:

If you encounter a conversion failure, it would be helpful to report:

  1. The path to the output directory being converted, e.g. /scratch/tm70/sw6175/access-esm/archive/esm1.6-preind-apr+cice5-continue-80-202504-dev-preindustrial+concentrations+cice5-6a4744ed/output119/
  2. The path to the PBS error log, e.g: /home/565/sw6175/esm1.6/simulations/esm1.6-preind-apr+cice5-continue/UM_conversion_job.sh.e144953388
  3. A description of the type of failure (e.g. walltime or input/output error), and the contents of the PBS error logs.
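To save some copy-pasting, the three items above can be gathered with a small helper. This is only a sketch; the job id and output directory are placeholders you would fill in for your own run, and the `.e` log is assumed to sit in the control directory with the standard name:

```shell
# Hypothetical report helper: print the items requested above for one job.
jobid=144953388                       # replace with your failing PBS job id
outdir=/path/to/archive/outputNNN     # replace with the converted output dir
errlog="UM_conversion_job.sh.e${jobid}"
echo "1. Output directory: $outdir"
echo "2. Error log: $PWD/$errlog"
echo "3. Error log contents:"
cat "$errlog"
```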

Thank you!

  1. /scratch/p66/rml599/access-esm/archive/test-sla-test-sla-8018eb87/output005
  2. /g/data/p66/rml599/amip-test/test-sla/UM_conversion_job.sh.e144960044
  3. ImportError: /g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/scipy/stats/_stats.cpython-310-x86_64-linux-gnu.so: cannot read file data: Input/output error

4 instances of conversion failure between output618 and output727 in our current JuneSpinUp:

Path to output is /scratch/p66/jxs599/access-esm/archive/JuneSpinUp-JuneSpinUp-bfaa9c5b/

  • output633 - no files created [edit: typo]
  • output644 - partial success
  • output654 - partial success
  • output675 - partial success

More details to be added.

edit: no additional conversion failures apparent between years 728-785 (as of 31/7/2025)

edits2: no additional conversion failures apparent between years 786-863 (as of 4/8/2025)


A conversion failure from yesterday during a test run

Path to output is /scratch/tm70/sw6175/access-esm/archive/ibergmask+UM7.7-basin+alt-basin-outflow+umerror-balance+river-balance+LVAP-fix-add-rivfix+lvap-fix-f1cb6158/output034

Path to error log is /scratch/tm70/sw6175/access-esm/archive/ibergmask+UM7.7-basin+alt-basin-outflow+umerror-balance+river-balance+LVAP-fix-add-rivfix+lvap-fix-f1cb6158/output034/UM_conversion_job.sh.e146109112

Error message:

ImportError: /g/data/vk83/apps/base_conda/envs/payu-1.1.7/lib/python3.10/site-packages/netCDF4/../../../././libzstd.so.1: cannot read file data: Input/output error

Two new failures in the June-Spinup this week (8/8/2025):

Paths to output are

  • /scratch/p66/jxs599/access-esm/archive/JuneSpinUp-JuneSpinUp-bfaa9c5b/output912
  • /scratch/p66/jxs599/access-esm/archive/JuneSpinUp-JuneSpinUp-bfaa9c5b/output917

It appears that there is no UM_conversion log for either year - so perhaps look at the access-esm1.6.err files in those directories.

More conversion failures in the two Spinups over the week-end (to 11/8/2025)

Paths to output are - in the June-Spinup:

  • /scratch/p66/jxs599/access-esm/archive/JuneSpinUp-JuneSpinUp-bfaa9c5b/output942
  • /scratch/p66/jxs599/access-esm/archive/JuneSpinUp-JuneSpinUp-bfaa9c5b/output944
  • /scratch/p66/jxs599/access-esm/archive/JuneSpinUp-JuneSpinUp-bfaa9c5b/output959
  • /scratch/p66/jxs599/access-esm/archive/JuneSpinUp-JuneSpinUp-bfaa9c5b/output978

Path to PBS logs for these (and all other JuneSpinup failures) is

  • /home/599/jxs599/ESM16/PAYU/Dev/JuneSpinUp

In the new August-Spinup:

  • /scratch/p66/jxs599/access-esm/archive/AugustSpinUp-Jhan-dev-20250808-1-d1b0b669/output031
  • /scratch/p66/jxs599/access-esm/archive/AugustSpinUp-Jhan-dev-20250808-1-d1b0b669/output040

Path to PBS error logs for these is

  • /home/599/jxs599/ESM16/PAYU/Dev/AugustSpinUp
  • for output031: UM_conversion_job.sh.o146766402:Maximum number of 2 singularity invocation attempts reached. Exiting
  • for output040: UM_conversion_job.sh.o146787545: Exit Status: -29 (Job failed due to exceeding walltime)

Note that output040 (UM_conversion_job.sh.e146787545) failed because it

  1. couldn’t find /g/data/hr22/modulefiles
  2. was unable to locate a modulefile for ‘cylc7/23.09-cdev’
  3. Loading requirement: singularity =>> PBS: job killed: walltime 2405 exceeded limit 2400

output031 (UM_conversion_job.sh.e146766402) was more interesting - the first two errors are the same as for output040, then

  1. couldn’t find /g/data/hr22/modulefiles
  2. was unable to locate a modulefile for ‘cylc7/23.09-cdev’
  3. two instances of FATAL: container creation failed: with further information that while mounting image /proc/self/fd/11: failed to find loop device: could not attach image file to loop device: failed to attach loop device: transient error, please retry: resource temporarily unavailable