Transient payu errors

Background

There have recently been some reports of transient errors with payu:

ImportError: /g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250501T210937Z-028805b/lib/python3.10/lib-dynload/_socket.cpython-310-x86_64-linux-gnu.so: cannot read file data: Input/output error
Detailed stack trace:
$ payu run -f
Traceback (most recent call last):
  File "/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250501T210937Z-028805b/bin/payu", line 5, in <module>
    from payu.cli import parse
  File "/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250501T210937Z-028805b/lib/python3.10/site-packages/payu/cli.py", line 23, in <module>
    from payu.models import index as supported_models
  File "/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250501T210937Z-028805b/lib/python3.10/site-packages/payu/models/__init__.py", line 4, in <module>
    from payu.models.cesm_cmeps import AccessOm3
  File "/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250501T210937Z-028805b/lib/python3.10/site-packages/payu/models/cesm_cmeps.py", line 21, in <module>
    from payu.models.fms import fms_collate
  File "/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250501T210937Z-028805b/lib/python3.10/site-packages/payu/models/fms.py", line 10, in <module>
    import multiprocessing
  File "/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250501T210937Z-028805b/lib/python3.10/multiprocessing/__init__.py", line 16, in <module>
    from . import context
  File "/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250501T210937Z-028805b/lib/python3.10/multiprocessing/context.py", line 6, in <module>
    from . import reduction
  File "/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250501T210937Z-028805b/lib/python3.10/multiprocessing/reduction.py", line 16, in <module>
    import socket
  File "/g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250501T210937Z-028805b/lib/python3.10/socket.py", line 51, in <module>
    import _socket
ImportError: /g/data/vk83/prerelease/apps/base_conda/envs/payu-dev-20250501T210937Z-028805b/lib/python3.10/lib-dynload/_socket.cpython-310-x86_64-linux-gnu.so: cannot read file data: Input/output error

The errors are transient and therefore not reproducible. For this reason we’re assuming the cause is not payu itself but an issue on Gadi, potentially filesystem related.

Solution

Luckily there is a simple solution: run your experiment again.

With payu this is accomplished by removing the existing work directory (sweep) and then running the experiment again:

payu sweep
payu run

or

payu run -f

as the -f option does the sweep for you.
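
If you would rather not re-run by hand, the same idea can be wrapped in a small retry loop. This is only a sketch, not part of payu: the `retry` function name, the number of attempts, and the delay are all assumptions chosen for illustration. It also assumes the failure shows up as a non-zero exit status from the command you run at the prompt, as in the stack trace above. It will not catch a failure that only happens later inside the submitted job.

```shell
#!/usr/bin/env bash
# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper (not provided by payu): re-runs COMMAND until it
# exits with status 0, or until MAX_TRIES attempts have been made.
retry() {
    local max_tries="$1"
    local delay="$2"
    shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}
```

For example, `retry 3 30 payu run -f` would try up to three times with a 30 second pause between attempts. Keeping the attempt count small is deliberate: if the error persists it is probably not transient, and is worth reporting instead.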

If that doesn’t work for you, please follow the guidelines for requesting help.

Reporting

If you have this problem, please reply to this topic and let us know. If there is a pattern to the failures, it could help NCI track down the source.

Note: there have also been some other transient errors when loading the payu module:

FATAL:   container creation failed: mount /proc/self/fd/10->/opt/nci/singularity/3.11.3/var/singularity/mnt/session/overlay-images/0 error: while mounting image /proc/self/fd/10: failed to find loop device: could not attach image file to loop device: failed to attach loop device: transient error, please retry: resource temporarily unavailable

and

FATAL:   container creation failed: mount /proc/self/fd/3->/opt/nci/singularity/3.11.3/var/singularity/mnt/session/rootfs error: while mounting image /proc/self/fd/3: failed to mount squashfs filesystem: file exists

If you see either of these errors, please report it below. A fix has been added to the payu container environment in the payu/dev prerelease module, but it won’t be available in the release payu environments until version 1.1.7 is released.