ACCESS-ESM1.5 collate error

I’m running a 3-year experiment using 1-year runs and payu run --nruns 3. I woke up to a PBS error email. One job (with the name of the experiment and “_c”) had failed with

PBS: job killed: walltime 3652 exceeded limit 3600

(Other experiments had failed in other spectacular ways, but I’ll leave those for another post.)

Looking at the output, it looks like the model ran fine, but the output0001 folder is 16GB instead of the usual 5.4GB and doesn’t have a netCDF subfolder. The ocean folder also has too many files. I guess the collation script had a problem.

Is it usual for this job to run out of time, or might this be a symptom of something else that’s wrong in my config? I’m using the release pre-industrial configuration. The only thing I changed was adding an SST nudging file. A previous run starting from a different year ran correctly.

Is it possible to re-run only that step manually to solve the issue?

Hi @eliocamp,

The collation job unfortunately fails occasionally with the error you have reported. We suspect this type of crash happens due to the filesystem being overwhelmed by the number of uncollated ocean output files. When the collation job crashes, any subsequent post-processing, like the UM netCDF conversion, then fails to run.

To manually rerun the collation, you can use the payu command:

payu collate -d <path to output001>

See the documentation here for more detail.

This will rerun both the ocean collation and the atmosphere netCDF conversion steps. Let me know if you run into any issues.
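If you want to confirm that the rerun cleaned things up, one rough check is to count the leftover uncollated ocean files before and after. The sketch below is only an illustration: the helper name count_uncollated is made up, and it assumes that uncollated tiles have names ending in ".nc." followed by four digits (for example ocean_month.nc.0000), which may not match every configuration.

```shell
# Hypothetical helper, not part of payu: count files under a directory whose
# names end in ".nc." plus four digits (an ASSUMED pattern for uncollated tiles).
count_uncollated() {
    # find matches the glob against the file name only; tr strips the
    # padding that some wc implementations add around the number.
    find "$1" -type f -name '*.nc.[0-9][0-9][0-9][0-9]' | wc -l | tr -d ' '
}

# Example usage (the path is a placeholder, fill in your own):
# count_uncollated "<path to output001>/ocean"
```

A count of 0 after payu collate finishes would suggest the ocean files were collated; a large count would suggest the step did not complete.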


Thanks, that did the trick.