Restart problem (atmosphere)

Hi,

My latest attempt at running ACCESS-ESM1.5 failed and threw this error (in the stdout file):

payu: error: Model has not produced a restart dump file:
/scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/atmosphere/aiihca.da59110 does not exist.
Check DUMPFREQim in namelists

The atmosphere namelist sets dumpfreqim = -9999, 0, 0, 0.
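(For reference, this is roughly how I checked that value; the namelists file name is taken from the atmosphere work directory, so it may differ elsewhere:)

# Check the dump frequency setting in the atmosphere work directory
# (path from the error message above; file name may differ in other configs)
grep -in dumpfreqim \
    /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/atmosphere/namelists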

I am a bit surprised because an almost identical run finished smoothly many times over the last few days, including one this morning. For context, I am applying Anderson Acceleration to spin up the water age, and for this purpose I am repeatedly running the 1850s (i.e. 10-year runs) using the ACCESS-NRI historical config, but replacing the water-age restart file with my own "accelerated" age each time. So I am not modifying anything except the water-age restart file for starting year 1850.
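Schematically, each acceleration iteration does something like the following (this is only a sketch, not my exact script; ocean_age.res.nc and the restart path are placeholders for wherever the age tracer restart actually lives):

# One acceleration iteration, roughly (illustrative file names and paths only)
cd /home/561/bp3051/access-esm1.5/andersonacceleration_test

# replace the water-age restart for 1850 with my "accelerated" age
# (ocean_age.res.nc is a placeholder for whichever file holds the age tracer)
cp my_accelerated_age.res.nc /path/to/1850/restart/ocean/ocean_age.res.nc

# clear any leftover work directory and redo the 10 one-year runs
payu sweep
payu run -n 10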

Could this have to do with the recent payu update? Is this similar to @dkhutch's Ice restart problem with payu/1.1.6? I'm new to running ACCESS-ESM1.5 and pretty confused about what exactly is happening and how to fix it, so any guidance would be greatly appreciated (run dir is /home/561/bp3051/access-esm1.5/andersonacceleration_test).

FWIW, the traceback from the stderr file is:

Traceback (most recent call last):
  File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/shutil.py", line 805, in move
    os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/ocean/RESTART' -> '/scratch/xv83/bp3051/access-esm/archive/andersonacceleration_test-n10-5415f621/restart008/ocean'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/bin/payu-run", line 10, in <module>
    sys.exit(runscript())
  File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/payu/subcommands/run_cmd.py", line 135, in runscript
    expt.archive(force_prune_restarts=run_args.force_prune_restarts)
  File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/payu/experiment.py", line 799, in archive
    model.archive()
  File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/site-packages/payu/models/fms.py", line 250, in archive
    shutil.move(self.work_restart_path, self.restart_path)
  File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/shutil.py", line 825, in move
    copy_function(src, real_dst)
  File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/shutil.py", line 434, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/g/data/vk83/apps/base_conda/envs/payu-1.1.6/lib/python3.10/shutil.py", line 254, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/ocean/RESTART'

A possibly related side question about payu versions: I am confused that both payu v1.1.5 and v1.1.6 appear to be in use at the same time. In the same stderr file, a few lines before the traceback, I see:

/g/data/vk83/apps/payu/1.1.5/lib/python3.10/site-packages/mule/validators.py:198: UserWarning: 

Is that supposed to happen? Did I mess up some configuration somewhere?

Hi Benoit, I think I have an idea of why there are different payu versions: mule isn't used in the payu source code, so it is being imported by a user-script defined in the configuration. The release historical configuration has a Python user-script that uses the payu/1.1.5 Python executable, so I'll create an issue to update that.
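(If you want to see which script it is, a quick way is to look at the userscripts section of the experiment's config.yaml, e.g. something like:)

# List any user-scripts defined in the experiment configuration
# (control directory path taken from the original post)
cd /home/561/bp3051/access-esm1.5/andersonacceleration_test
grep -A 5 -i userscripts config.yaml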

In terms of the model error, I am not too sure why there are missing restart files in the work directory, sorry. I'll ask around among those who have more experience running these configurations.


For a coupled run, make sure all components completed successfully in the previous cycle and ran to the expected end date. The OASIS logs may be helpful; nothing should be left waiting to send or receive any fields.


@Scott Just to be sure, are what you call the OASIS logs the log files with a "_c" in their name? If yes, I'm not sure what to look for, but I did find one error in one stdout log file, copied in its entirety below:

ERROR: Could not execv /g/data/vk83/./apps/conda_scripts/payu-1.1.6.d/bin/python! ret=-1 errno=2

======================================================================================
                  Resource Usage on 2025-02-27 12:36:02:
   Job Id:             136022097.gadi-pbs
   Project:            xv83
   Exit Status:        1
   Service Units:      0.00
   NCPUs Requested:    1                      NCPUs Used: 1               
                                           CPU Time Used: 00:00:00        
   Memory Requested:   4.0GB                 Memory Used: 7.02MB          
   Walltime requested: 01:00:00            Walltime Used: 00:00:01        
   JobFS requested:    100.0GB                JobFS used: 0B              
======================================================================================

However, it seems that my payu job error happened between runs 8 and 9 (I submit a single payu job with 10 one-year runs), while this coupler log error happened earlier (run 3). Could it still be the culprit even though ACCESS-ESM1.5 seems to have run fine for another 4–5 years after that coupler error?

The OASIS logs should end with *.prt0000, though people familiar with ACCESS-ESM can probably confirm. There will be files for each of the components; if there are different numbers at the end, these represent the different MPI ranks, and you only need to look at rank 0 for each component.

You want these files to end with something like this, showing the component exited correctly (it may not look exactly the same; I'm referencing a different model here):

Tabulating mpp_clock statistics across    1 PEs...

                                          tmin          tmax          tavg          tstd  tfrac
Total runtime                      1647.657989   1647.657989   1647.657989      0.000000  1.000
 MPP_STACK high water mark=           0
 | | | Leaving : psmile_io_cleanup

 lg_mpiflag= F
Called MPI_Finalize in prism_terminate ...
| | | Leaving prism_terminate_proto - exit status <mpi   0>

If one of the components' logs ends with something like this, the model has deadlocked, with one component waiting on data that the other components are not providing:

 | | | Entering prism_get_proto_r28 for field           28
Get - ohicn02
Get - <from: 2> <step:   594000> <len: 139968> <type: 8> <tag:   8388209>
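Something like this should turn them up if they exist (the *.prt0000 pattern is a guess for ACCESS-ESM1.5, so adjust it if the file names differ); if nothing shows up in the work directory, they may have been moved into the archived outputNNN directories:

# Look for rank-0 coupler logs and show the last lines of each
# (paths taken from earlier in the thread; the name pattern is a guess)
for f in $(find /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621 \
                /scratch/xv83/bp3051/access-esm/archive/andersonacceleration_test-n10-5415f621 \
                -name '*.prt0000' 2>/dev/null); do
    echo "=== ${f} ==="
    tail -n 20 "${f}"
done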

I must admit I don't know where these log files are.

In my work directory (/scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621), I have these files:

.
├── access.err
├── access.out
├── atmosphere
│   ├── atm.fort6.pe0
│   ├── cable.nml
│   ├── debug.root.01
│   ├── errflag
│   ├── hnlist
│   ├── ihist
│   ├── input_atm.nml
│   ├── namelists
│   ├── nout.000000
│   ├── prefix.PRESM_A
│   ├── STASHC
│   ├── UAFILES_A
│   ├── UAFLDS_A
│   └── um_env.yaml
├── config.yaml
├── coupler
│   ├── a2i.nc
│   ├── areas.nc -> /g/data/vk83/configurations/inputs/access-esm1p5/modern/share/coupler/grids/global.oi_1deg.a_N96/2020.05.19/areas.nc
│   ├── grids.nc -> /g/data/vk83/configurations/inputs/access-esm1p5/modern/share/coupler/grids/global.oi_1deg.a_N96/2020.05.19/grids.nc
│   ├── i2a.nc
│   ├── masks.nc -> /g/data/vk83/configurations/inputs/access-esm1p5/modern/share/coupler/grids/global.oi_1deg.a_N96/2020.05.19/masks.nc
│   ├── namcouple
│   ├── o2i.nc
│   ├── rmp_cice_to_um1t_CONSERV_FRACNNEI.nc -> /g/data/vk83/configurations/inputs/access-esm1p5/modern/share/coupler/remapping_weights/global.oi_1deg.a_N96/2020.05.19/rmp_cice_to_um1t_CONSERV_FRACNNEI.nc
│   ├── rmp_cice_to_um1u_CONSERV_FRACNNEI.nc -> /g/data/vk83/configurations/inputs/access-esm1p5/modern/share/coupler/remapping_weights/global.oi_1deg.a_N96/2020.05.19/rmp_cice_to_um1u_CONSERV_FRACNNEI.nc
│   ├── rmp_cice_to_um1v_CONSERV_FRACNNEI.nc -> /g/data/vk83/configurations/inputs/access-esm1p5/modern/share/coupler/remapping_weights/global.oi_1deg.a_N96/2020.05.19/rmp_cice_to_um1v_CONSERV_FRACNNEI.nc
│   ├── rmp_um1t_to_cice_CONSERV_DESTAREA.nc -> /g/data/vk83/configurations/inputs/access-esm1p5/modern/share/coupler/remapping_weights/global.oi_1deg.a_N96/2020.05.19/rmp_um1t_to_cice_CONSERV_DESTAREA.nc
│   ├── rmp_um1t_to_cice_CONSERV_FRACNNEI.nc -> /g/data/vk83/configurations/inputs/access-esm1p5/modern/share/coupler/remapping_weights/global.oi_1deg.a_N96/2020.05.19/rmp_um1t_to_cice_CONSERV_FRACNNEI.nc
│   ├── rmp_um1u_to_cice_CONSERV_FRACNNEI.nc -> /g/data/vk83/configurations/inputs/access-esm1p5/modern/share/coupler/remapping_weights/global.oi_1deg.a_N96/2020.05.19/rmp_um1u_to_cice_CONSERV_FRACNNEI.nc
│   └── rmp_um1v_to_cice_CONSERV_FRACNNEI.nc -> /g/data/vk83/configurations/inputs/access-esm1p5/modern/share/coupler/remapping_weights/global.oi_1deg.a_N96/2020.05.19/rmp_um1v_to_cice_CONSERV_FRACNNEI.nc
├── env.yaml
├── ice
│   ├── a2i.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/a2i.nc
│   ├── areas.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/areas.nc
│   ├── cice_access_360x300_12x1_12p.exe -> /g/data/vk83/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v4/intel-19.0.3.199/cice4-git.2024.05.21_access-esm1.5-hhtnigwxdyz7ta4dv3gvhwulze6hxqra/bin/cice_access_360x300_12x1_12p.exe
│   ├── cice_in.nml
│   ├── grids.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/grids.nc
│   ├── HISTORY
│   ├── i2a.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/i2a.nc
│   ├── INPUT
│   │   ├── grid.nc -> /g/data/vk83/configurations/inputs/access-esm1p5/modern/share/ice/grids/global.1deg/2020.05.19/grid.nc
│   │   ├── kmt.nc -> /g/data/vk83/configurations/inputs/access-esm1p5/modern/share/ice/grids/global.1deg/2020.05.19/kmt.nc
│   │   └── monthly_sstsss.nc -> /g/data/vk83/configurations/inputs/access-esm1p5/modern/share/ice/climatology/global.1deg/2020.05.19/monthly_sstsss.nc
│   ├── input_ice.nml
│   ├── masks.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/masks.nc
│   ├── namcouple -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/namcouple
│   ├── o2i.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/o2i.nc
│   ├── RESTART
│   │   ├── cice_in.nml
│   │   ├── iced.18580101
│   │   ├── ice.restart_file
│   │   ├── input_ice.nml
│   │   ├── mice.nc
│   │   ├── o2i.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/ice/o2i.nc
│   │   ├── README
│   │   └── restart_date.nml
│   ├── rmp_cice_to_um1t_CONSERV_FRACNNEI.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/rmp_cice_to_um1t_CONSERV_FRACNNEI.nc
│   ├── rmp_cice_to_um1u_CONSERV_FRACNNEI.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/rmp_cice_to_um1u_CONSERV_FRACNNEI.nc
│   ├── rmp_cice_to_um1v_CONSERV_FRACNNEI.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/rmp_cice_to_um1v_CONSERV_FRACNNEI.nc
│   ├── rmp_um1t_to_cice_CONSERV_DESTAREA.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/rmp_um1t_to_cice_CONSERV_DESTAREA.nc
│   ├── rmp_um1t_to_cice_CONSERV_FRACNNEI.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/rmp_um1t_to_cice_CONSERV_FRACNNEI.nc
│   ├── rmp_um1u_to_cice_CONSERV_FRACNNEI.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/rmp_um1u_to_cice_CONSERV_FRACNNEI.nc
│   └── rmp_um1v_to_cice_CONSERV_FRACNNEI.nc -> /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621/coupler/rmp_um1v_to_cice_CONSERV_FRACNNEI.nc
├── job.yaml
├── manifests
│   ├── exe.yaml
│   ├── input.yaml
│   └── restart.yaml
└── ocean
    ├── data_table
    ├── debug.root.02
    ├── debug.root.03
    ├── diag_table
    ├── field_table
    ├── input.nml
    ├── logfile.000000.out
    └── time_stamp.out

Not sure if this is useful information, but the access.err file starts with

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 381 in communicator MPI_COMM_WORLD
with errorcode 0.

Hi @Benoit,

This seems like quite a strange thing that's come up. Would you be able to share the contents of the latest outputXYZ and restartXYZ folders within the archive directory? I'm wondering whether it will help us understand what payu had already done, and what files it had moved.


Hi @spencerwong: Yes, of course, although I'm not sure how best to share them. AFAIU the latest output and restart files are in

/scratch/xv83/bp3051/access-esm/archive/andersonacceleration_test-n10-5415f621

Should I just copy that directory somewhere else if you can't access xv83?

Oh, but maybe you just meant the list of files, which is actually huge, most of it in output003/ (about 35k files instead of ~300!), which I guess is the output of the run where the collating script threw an error and aborted.

So maybe the issue is that not collating after run 3 made the job hit a hard limit on the number of files later, at run 8 or 9?

(EDIT: I thought it could be memory, but the directory size of output003/ is 21G, which is similar to the other output directories at 15G, so that can't be it, hence why I think the number of files may be the issue.)

(EDIT2: sweeping collating)
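(For reference, a quick way to compare file counts and sizes across the output directories is something along these lines:)

# Count files and report the size of each output directory (paths as above)
for d in /scratch/xv83/bp3051/access-esm/archive/andersonacceleration_test-n10-5415f621/output*; do
    echo "${d}: $(find "${d}" -type f | wc -l) files, $(du -sh "${d}" | cut -f1)"
done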

Thanks @Benoit,

The collation failure is another issue we've been having. Fortunately, when the collation fails for one run, it shouldn't have any effect on subsequent ones.

One option would be to copy the latest restart, output, work, and control directories into the scratch/public space.

If you are ok with this option, I've attached a PBS job script which you could use to copy them over:

#!/bin/bash
#PBS -l ncpus=1
#PBS -l mem=20GB
#PBS -q copyq
#PBS -l walltime=00:30:00
#PBS -l wd
#PBS -l storage=scratch/xv83

SHARE_DIR=/scratch/public/bp3051/

mkdir ${SHARE_DIR}
rsync -a /scratch/xv83/bp3051/access-esm/archive/andersonacceleration_test-n10-5415f621/output007 ${SHARE_DIR} 
rsync -a /scratch/xv83/bp3051/access-esm/archive/andersonacceleration_test-n10-5415f621/output008 ${SHARE_DIR} 
rsync -a /scratch/xv83/bp3051/access-esm/archive/andersonacceleration_test-n10-5415f621/restart007 ${SHARE_DIR} 
rsync -a /scratch/xv83/bp3051/access-esm/archive/andersonacceleration_test-n10-5415f621/restart008 ${SHARE_DIR} 
rsync -a /home/561/bp3051/access-esm1.5/andersonacceleration_test ${SHARE_DIR}
rsync -a /scratch/xv83/bp3051/access-esm/work/andersonacceleration_test-n10-5415f621 ${SHARE_DIR}

chmod -R +rx ${SHARE_DIR}

OK! Thanks for helping! Done!


PS: I only rsynced

  • output007
  • restart007
  • restart008

as these are the latest ones I have. I also had to add scratch/public to the PBS storage directive.
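i.e. the storage line ended up as something like:

#PBS -l storage=scratch/xv83+scratch/public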

A few more details / updates:

Maybe this is not relevant, but I had recently "touched" all the files in that project to keep them from the automatic scratch file expiry. Could that have caused problems?

Also note that I have successfully rerun the 10 years a couple of times since my last post. However, I have been seeing much more frequent failures of the collate jobs: in my last 10-year simulation, 4 out of the 10 collate jobs failed. These are some of the collate-job error logs, in case they are useful:

Loading access-esm1p5/2024.05.1
  Loading requirement: cice4/2024.05.21 mom5/access-esm1.5_2024.08.23
    um7/2024.07.03
payu: error: Thread 91 crashed with error code 9.
 Error message:
Error: cannot write variable "6"'s values!
ERROR: missing at least oceanbgc-3d-zoo-1monthly-mean-ym_1850_01.nc.0001 from the input fileset.  Exiting.

payu: error: Thread 105 crashed with error code 9.
 Error message:
Error: cannot write variable "6"'s values!
ERROR: missing at least ocean-3d-v-1monthly-mean-ym_1850_01.nc.0002 from the input fileset.  Exiting.

payu: error: Thread 106 crashed with error code 9.
 Error message:
Error: cannot write variable "6"'s values!
ERROR: missing at least ocean-3d-salt_tendency-1monthly-mean-ym_1850_01.nc.0000 from the input fileset.  Exiting.

payu: error: Thread 129 crashed with error code 9.
 Error message:
Error: cannot write variable "6"'s values!
ERROR: missing at least ocean-3d-ty_trans-1monthly-mean-ym_1850_01.nc.0000 from the input fileset.  Exiting.

payu: error: Thread 137 crashed with error code 9.
 Error message:
Error: cannot write variable "6"'s values!
ERROR: missing at least ocean-3d-age_global-1monthly-mean-ym_1850_01.nc.0000 from the input fileset.  Exiting.

payu: error: Thread 141 crashed with error code 9.
 Error message:
Error: cannot write variable "6"'s values!
ERROR: missing at least ocean-3d-ty_trans_rho_gm-1monthly-mean-ym_1850_01.nc.0002 from the input fileset.  Exiting.

payu: error: Thread 145 crashed with error code 9.
 Error message:
ERROR: missing at least oceanbgc-2d-caco3_sediment-1monthly-mean-ym_1850_01.nc.0094 from the input fileset.  Exiting.

payu: error: Thread 146 crashed with error code 9.
 Error message:
Error: cannot write variable "6"'s values!
ERROR: missing at least ocean-3d-temp_tendency-1monthly-mean-ym_1850_01.nc.0002 from the input fileset.  Exiting.

payu: error: Thread 147 crashed with error code 9.
 Error message:
Error: cannot write variable "6"'s values!
ERROR: missing at least ocean-3d-salt_vdiffuse_impl-1monthly-mean-ym_1850_01.nc.0000 from the input fileset.  Exiting.

payu: error: Thread 149 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-temp_runoff-1monthly-mean-ym_1850_01.nc.0118 from the input fileset.  Exiting.

payu: error: Thread 151 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-ubott-1monthly-mean-ym_1850_01.nc.0117 from the input fileset.  Exiting.

payu: error: Thread 152 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-sfc_hflux_coupler-1monthly-mean-ym_1850_01.nc.0097 from the input fileset.  Exiting.

payu: error: Thread 154 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-wfimelt-1monthly-mean-ym_1850_01.nc.0134 from the input fileset.  Exiting.

payu: error: Thread 155 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-rossby-1monthly-mean-ym_1850_01.nc.0125 from the input fileset.  Exiting.

payu: error: Thread 156 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-eddy_depth-1monthly-mean-ym_1850_01.nc.0172 from the input fileset.  Exiting.

payu: error: Thread 157 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-vbott-1monthly-mean-ym_1850_01.nc.0162 from the input fileset.  Exiting.

payu: error: Thread 159 crashed with error code 9.
 Error message:
Error: cannot write variable "6"'s values!
ERROR: missing at least ocean-3d-temp-1monthly-mean-ym_1850_01.nc.0000 from the input fileset.  Exiting.

payu: error: Thread 161 crashed with error code 9.
 Error message:
ERROR: missing at least oceanbgc-2d-atm_co2-1monthly-mean-ym_1850_01.nc.0033 from the input fileset.  Exiting.

payu: error: Thread 162 crashed with error code 9.
 Error message:
Error: cannot write variable "6"'s values!
ERROR: missing at least oceanbgc-3d-phy-1monthly-mean-ym_1850_01.nc.0003 from the input fileset.  Exiting.

payu: error: Thread 163 crashed with error code 9.
 Error message:
ERROR: missing at least oceanbgc-2d-surface_no3-1monthly-mean-ym_1850_01.nc.0100 from the input fileset.  Exiting.

payu: error: Thread 164 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-frazil_2d-1monthly-mean-ym_1850_01.nc.0109 from the input fileset.  Exiting.

payu: error: Thread 165 crashed with error code 9.
 Error message:
ERROR: missing at least oceanbgc-2d-surface_adic-1monthly-mean-ym_1850_01.nc.0166 from the input fileset.  Exiting.

payu: error: Thread 166 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-urhod-1monthly-mean-ym_1850_01.nc.0170 from the input fileset.  Exiting.

payu: error: Thread 167 crashed with error code 9.
 Error message:
ERROR: missing at least oceanbgc-2d-wnd-1monthly-mean-ym_1850_01.nc.0020 from the input fileset.  Exiting.

payu: error: Thread 168 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-patm_t-1monthly-mean-ym_1850_01.nc.0066 from the input fileset.  Exiting.

payu: error: Thread 169 crashed with error code 9.
 Error message:
ERROR: missing at least oceanbgc-2d-surface_alk-1monthly-mean-ym_1850_01.nc.0048 from the input fileset.  Exiting.

payu: error: Thread 170 crashed with error code 9.
 Error message:
Error: cannot write variable "6"'s values!
ERROR: missing at least oceanbgc-3d-dic-1monthly-mean-ym_1850_01.nc.0002 from the input fileset.  Exiting.

payu: error: Thread 172 crashed with error code 9.
 Error message:
Error: cannot write variable "6"'s values!
ERROR: missing at least ocean-3d-bv_freq-1monthly-mean-ym_1850_01.nc.0000 from the input fileset.  Exiting.

payu: error: Thread 173 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-eta_u-1monthly-mean-ym_1850_01.nc.0059 from the input fileset.  Exiting.

payu: error: Thread 174 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-vrhod-1monthly-mean-ym_1850_01.nc.0004 from the input fileset.  Exiting.

payu: error: Thread 175 crashed with error code 9.
 Error message:
ERROR: missing at least oceanbgc-2d-pprod_gross_2d-1monthly-mean-ym_1850_01.nc.0000 from the input fileset.  Exiting.

payu: error: Thread 176 crashed with error code 9.
 Error message:
Error: cannot copy variable "xt_ocean"'s attributes!
ERROR: missing at least ocean-2d-energy_flux-1monthly-mean-ym_1850_01.nc.0006 from the input fileset.  Exiting.

payu: error: Thread 177 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-mld_sq-1monthly-mean-ym_1850_01.nc.0000 from the input fileset.  Exiting.

payu: error: Thread 178 crashed with error code 9.
 Error message:
Error: cannot copy variable "xu_ocean"'s attributes!
ERROR: missing at least ocean-3d-lap_fric_u-1yearly-mean-ym_1850_07.nc.0006 from the input fileset.  Exiting.

payu: error: Thread 179 crashed with error code 9.
 Error message:
Error: cannot copy variable "xt_ocean"'s attributes!
ERROR: missing at least ocean-3d-cabbeling-1yearly-mean-ym_1850_07.nc.0013 from the input fileset.  Exiting.

payu: error: Thread 180 crashed with error code 9.
 Error message:
Error: cannot copy variable "xt_ocean"'s attributes!
ERROR: missing at least ocean-3d-thermobaricity-1yearly-mean-ym_1850_07.nc.0008 from the input fileset.  Exiting.

payu: error: Thread 181 crashed with error code 9.
 Error message:
Error: cannot copy variable "xt_ocean"'s attributes!
ERROR: missing at least ocean-2d-langmuirfactor-1monthly-mean-ym_1850_01.nc.0007 from the input fileset.  Exiting.

payu: error: Thread 182 crashed with error code 9.
 Error message:
Error: cannot copy variable "xt_ocean"'s attributes!
ERROR: missing at least oceanbgc-3d-det-1monthly-mean-ym_1850_01.nc.0034 from the input fileset.  Exiting.

payu: error: Thread 183 crashed with error code 9.
 Error message:
ERROR: missing at least ocean-2d-psiu-1monthly-mean-ym_1850_01.nc.0000 from the input fileset.  Exiting.

payu: error: Thread 184 crashed with error code 9.
 Error message:
Error: cannot copy variable "xu_ocean"'s attributes!
ERROR: missing at least ocean-2d-temp_xflux_ndiffuse_int_z-1monthly-mean-ym_1850_01.nc.0002 from the input fileset.  Exiting.

payu: error: Thread 185 crashed with error code 9.
 Error message:
Error: cannot copy variable "xt_ocean"'s attributes!
ERROR: missing at least ocean-2d-geolon_t.nc.0021 from the input fileset.  Exiting.

payu: error: Thread 186 crashed with error code 9.
 Error message:
Error: cannot copy variable "xt_ocean"'s attributes!
ERROR: missing at least ocean-2d-pme_river-1monthly-mean-ym_1850_01.nc.0015 from the input fileset.  Exiting.

payu: error: Thread 187 crashed with error code 9.
 Error message:
Error: cannot copy variable "xu_ocean"'s attributes!
ERROR: missing at least ocean-2d-bmf_u-1monthly-mean-ym_1850_01.nc.0026 from the input fileset.  Exiting.

payu: error: Thread 188 crashed with error code 9.
 Error message:
Error: cannot copy variable "xu_ocean"'s attributes!
ERROR: missing at least ocean-3d-salt_xflux_adv-1yearly-mean-ym_1850_07.nc.0013 from the input fileset.  Exiting.
Loading access-esm1p5/2024.05.1
  Loading requirement: cice4/2024.05.21 mom5/access-esm1.5_2024.08.23
    um7/2024.07.03
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/g/data/vk83/apps/payu/1.1.5/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/g/data/vk83/apps/payu/1.1.5/lib/python3.10/site-packages/payu/models/fms.py", line 37, in cmdthread
    output = sp.check_output(shlex.split(cmd), cwd=cwd, stderr=sp.STDOUT)
  File "/g/data/vk83/apps/payu/1.1.5/lib/python3.10/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/g/data/vk83/apps/payu/1.1.5/lib/python3.10/subprocess.py", line 501, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/g/data/vk83/apps/payu/1.1.5/lib/python3.10/subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/g/data/vk83/apps/payu/1.1.5/lib/python3.10/subprocess.py", line 1842, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 5] Input/output error: '/scratch/xv83/bp3051/access-esm/archive/andersonacceleration_test-n10-5415f621/output001/ocean'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/g/data/vk83/apps/payu/1.1.5/bin/payu-collate", line 10, in <module>
    sys.exit(runscript())
  File "/g/data/vk83/apps/payu/1.1.5/lib/python3.10/site-packages/payu/subcommands/collate_cmd.py", line 111, in runscript
    expt.collate()
  File "/g/data/vk83/apps/payu/1.1.5/lib/python3.10/site-packages/payu/experiment.py", line 822, in collate
    model.collate()
  File "/g/data/vk83/apps/payu/1.1.5/lib/python3.10/site-packages/payu/models/fms.py", line 253, in collate
    fms_collate(self)
  File "/g/data/vk83/apps/payu/1.1.5/lib/python3.10/site-packages/payu/models/fms.py", line 209, in fms_collate
    rc, op = result.get()
  File "/g/data/vk83/apps/payu/1.1.5/lib/python3.10/multiprocessing/pool.py", line 771, in get
    raise self._value
OSError: [Errno 5] Input/output error: '/scratch/xv83/bp3051/access-esm/archive/andersonacceleration_test-n10-5415f621/output001/ocean'
Loading access-esm1p5/2024.05.1
  Loading requirement: cice4/2024.05.21 mom5/access-esm1.5_2024.08.23
    um7/2024.07.03
=>> PBS: job killed: walltime 3603 exceeded limit 3600

Hi Benoit,

@jo-basevi and I have had a look at the shared directories from the simulation. The crash seems to be quite a strange one: the model crashed early during run 008, producing the MPI_ABORT error that you mentioned earlier:

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 381 in communicator MPI_COMM_WORLD
with errorcode 0.

Usually when the model crashes during the simulation, Payu will exit and the subsequent steps (archiving model output and restarts, collation, etc.) won't be performed. In this case, however, the errorcode was 0, which Payu interpreted as a successful simulation. It attempted to proceed with the archiving even though the model hadn't produced any restarts or output. This created a half-empty restart008 directory, and then led to the FileNotFound errors that you shared.

After asking a few people, it sounds like the MPI error could be quite rare. I tried setting up a new simulation from the last successful restart (restart007), and this ran without any issues, suggesting that it may have been a random error. We'll keep an eye out to see if it occurs in other simulations.

To continue the crashed simulation, it should work to delete the incomplete restart directory restart008 and then rerun the remaining years. Payu will pick up from the latest remaining restart directory, restart007, and hopefully will avoid the previous error. Let us know if any issues come up when trying this.
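In practice that would be something along these lines, run from the control directory (double-check the path before deleting anything):

# Remove the half-empty restart from the failed run, clear the stale work
# directory if it is still around, then rerun the remaining years
rm -r /scratch/xv83/bp3051/access-esm/archive/andersonacceleration_test-n10-5415f621/restart008
payu sweep
payu run -n 2   # or however many years are left of the 10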


We're currently investigating the collation issues, as they have been impacting ESM1.5 and 1.6 simulations. We suspect there could be problems resulting from the large number of files being collated, and are looking into some solutions for this.

For the moment, you can manually rerun failed collations with the payu collate command. E.g., to collate output directory K, run the following command from the experiment control directory:

payu collate -i K
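For example, to redo the collation for output003 from earlier in this thread:

# Re-run the collation for run 3 from the experiment control directory
cd /home/561/bp3051/access-esm1.5/andersonacceleration_test
payu collate -i 3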

While a longer-term solution isn't yet in place, it might also be worth increasing the collation walltime in this section of the config.yaml file to 2 hours to reduce the chance of the walltime error:

# Collation
collate:
    exe: mppnccombine.spack
    restart: true
    mem: 4GB
    walltime: 1:00:00
    mpi: false

Thanks for all the info. I guess I'm glad that this looks like a fluke? :man_shrugging: :sweat_smile:

I think I will just keep re-running manually when needed.

About increasing the walltime, I don't think that would help in my case. Almost all my collation jobs lasted 15–25 minutes (about 500 of them), which leads me to believe that the one hitting the walltime limit was also a fluke of its own, somehow stuck in a loop or similar, and that it might have run indefinitely. (But I may be missing something.)

Hi @Benoit, no worries, and glad that re-running is working!

Do let us know if these issues come up again; if it's something that ends up reappearing, we might need to do some more investigation.

In the meantime are you happy for me to close this topic?

Yes, feel free to close it if you want!
