The netCDF postprocessing job has been intermittently failing in ESM1.6 test simulations. @jo-basevi is planning on submitting an NCI help ticket on the issue.
To help with the investigation, we’re hoping to keep a record of the failures people have encountered and when they happened. Reports of very recent failures would be especially useful, as NCI would be able to look into recent logs from Gadi.
If you run into a netCDF conversion failure, it would be great to attach the relevant error logs here.
Finding logs for failed jobs:
To find the logs for a failing job, from the experiment control directory you can run:
$ grep Exit UM_conversion_job.sh.o*
...
UM_conversion_job.sh.o144939834: Exit Status: 0
UM_conversion_job.sh.o144946933: Exit Status: 0
UM_conversion_job.sh.o144949411: Exit Status: 0
UM_conversion_job.sh.o144953388: Exit Status: 1
UM_conversion_job.sh.o144957035: Exit Status: 0
UM_conversion_job.sh.o144964175: Exit Status: 0
UM_conversion_job.sh.o144967382: Exit Status: 0
...
and then inspect the .e error log for the failing job, e.g. in UM_conversion_job.sh.e144953388:
File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/scipy/stats/_distn_infrastructure.py", line 26, in <module>
from scipy import integrate
File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/scipy/__init__.py", line 134, in __getattr__
return _importlib.import_module(f'scipy.{name}')
File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/scipy/integrate/__init__.py", line 106, in <module>
from ._ode import *
File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/scipy/integrate/_ode.py", line 90, in <module>
from . import _dop
ImportError: /g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/scipy/integrate/_dop.cpython-310-x86_64-linux-gnu.so: cannot read file data: Input/output error
The Input/output error crash above seems to be the most common failure. However, there have also been occasional walltime failures, where a job takes much longer than usual. It would be helpful to keep track of all of these failures.
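If it saves some typing, the two steps above (finding jobs with a non-zero exit status, then locating the matching .e log) could be combined into a small shell function. This is only a rough sketch: the function name failed_conversion_logs is made up for illustration, and it assumes that every UM_conversion_job.sh.o<jobid> log has a matching UM_conversion_job.sh.e<jobid> log in the same directory and that the exit code is the third whitespace-separated field on the "Exit Status:" line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of payu or the ESM1.6 configuration).
# Prints the path of the PBS error log for every conversion job whose
# output log reports a non-zero exit status.
failed_conversion_logs() {
    local dir="${1:-.}"   # experiment control directory, defaults to the current one
    local out_log status job_id
    for out_log in "$dir"/UM_conversion_job.sh.o*; do
        [ -e "$out_log" ] || continue   # the glob matched nothing
        # Assumption: the exit code is the third field of the "Exit Status:" line.
        status=$(awk '/Exit Status:/ {print $3}' "$out_log" | tail -n 1)
        # Skip logs without an exit status line and jobs that exited with 0.
        if [ -n "$status" ] && [ "$status" != "0" ]; then
            job_id="${out_log##*.o}"    # everything after the final ".o"
            echo "$dir/UM_conversion_job.sh.e$job_id"
        fi
    done
}

# Usage, from the experiment control directory:
#   failed_conversion_logs
```

Each printed path can then be opened directly, or attached to a report here.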
What to include:
If you encounter a conversion failure, it would be helpful to report:
- The path to the output directory being converted, e.g.
/scratch/tm70/sw6175/access-esm/archive/esm1.6-preind-apr+cice5-continue-80-202504-dev-preindustrial+concentrations+cice5-6a4744ed/output119/
- The path to the PBS error log, e.g.
/home/565/sw6175/esm1.6/simulations/esm1.6-preind-apr+cice5-continue/UM_conversion_job.sh.e144953388
- A description of the type of failure (e.g. walltime or input/output error), and the contents of the PBS error logs.
Thank you!