ACCESS-CM2 filemove error

wghuneke · 6 August 2025 06:46

Hi,

I’m currently running a CM2-025 simulation which was working fine. I now get this filemove error:
Expected name /home/581/wgh581/cylc-run/u-cz861_piControl/work/00170101/coupled/ATM_RUNDIR/History_Data/cz861_piControla.da00170701_00

It seems to be looking for the file from the next run, i.e. this cycle is 00170101 and it is looking for 00170701. The directory has files for 00170101, which looks right to me. Is the error message a glitch? Has anyone encountered this before? Is there a recommended way to deal with this error so I can continue the run?

Many thanks,
Wilma

atteggiani · 7 August 2025 00:46

Hi Wilma,
Would you please share the project you are using to run the suite? I would then be able to have a look at the logs and try to help.

Thank you
Davide

wghuneke · 7 August 2025 00:54

Hi Davide,

The logs are at /scratch/x77/wgh581/cylc-run/u-cz861_piControl/ .

Thanks for looking into it,
Wilma

atteggiani · 7 August 2025 01:02

Unfortunately, I’m not a member of x77.
I just requested access, but to speed up the process, would you be able to copy the /scratch/x77/wgh581/cylc-run/u-cz861_piControl/ folder to a public folder (e.g., /scratch/public/)?
Thank you

wghuneke · 7 August 2025 01:41

I copied it to /scratch/public/wgh581/

atteggiani · 11 August 2025 07:02

Hi @wghuneke,

Sorry for the late reply but I’ve had time to look at this only now.

At the end of the u-cz861_piControl/log/job/00170101/filemove/01/job.err log (first try) there is:

=>> PBS: job killed: walltime 602 exceeded limit 600
Terminated
2025-08-06T06:07:09Z CRITICAL - failed/TERM

So it seems the filemove task was terminated only because it exceeded the walltime.
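A quick way to check other job.err logs for the same kind of termination is to search them for the PBS walltime message. This is only a minimal sketch: the function name walltime_kills is made up for illustration and is not part of the suite.

```shell
# Print every PBS walltime-kill line found in the given log files.
# walltime_kills is a hypothetical helper, not part of the suite.
walltime_kills() {
    # -H prefixes each match with the file name;
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -H 'PBS: job killed: walltime' "$@" || true
}
```

For example, `walltime_kills ~/cylc-run/u-cz861_piControl/log/job/00170101/filemove/*/job.err` would list the matching line for each try that hit the limit.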

I think the task was then retried (either you triggered it manually or it retried by itself), and the error you see in the second try (u-cz861_piControl/log/job/00170101/filemove/02/job.err) is:

FileNotFoundError: [Errno 2] No such file or directory: 'cz861_piControl.xhist'

because the file had already been moved in the first try.

My suggestion at this point would be:

To avoid this problem in future cycles, increase the walltime for the filemove task (the job was almost done, so I don’t think it will need much more time, maybe 10-20% more).
In the suite.rc, under the [[filemove]] task change the execution time limit from PT10M (10 minutes) to PT12M (12 minutes):
```
     [[filemove]]
         inherit = POSTPROC, NCI, SHARE
         script = """
             filemove_access.sh
             """
         [[[job]]]
-            execution time limit = PT10M
+            execution time limit = PT12M
         [[[directives]]]
```
Then, to update the suite, run rose suite-run --reload.
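For reference, the arithmetic behind that suggestion: a 10-20% margin on the current 600 s limit gives 660-720 s, i.e. PT11M to PT12M. A minimal sketch of the calculation follows; padded_limit is a hypothetical helper name, not something provided by Rose or Cylc.

```shell
# Suggest a padded time limit, in whole minutes, as an ISO 8601 duration.
# padded_limit is a hypothetical helper.
# Arguments: base time in seconds, margin in percent.
padded_limit() {
    local base_s=$1 margin_pct=$2
    # Scale by (100 + margin)% and round up to whole seconds (integer arithmetic).
    local padded_s=$(( (base_s * (100 + margin_pct) + 99) / 100 ))
    # Round up to whole minutes.
    echo "PT$(( (padded_s + 59) / 60 ))M"
}

padded_limit 600 20   # prints PT12M
```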
Then to continue your experiment you have a few options.
The quickest solution, in my opinion, is to:
1. Manually handle the remaining processing that filemove should have carried out (the output in u-cz861_piControl/log/job/00170101/filemove/01/job.err and job.out shows that almost all the steps were carried out successfully).
  You can achieve this by running:
```
tar -cvf /scratch/x77/wgh581/archive/cz861_piControl/restart/ocn/restart-00170630.tar /scratch/x77/wgh581/cz861_piControl/work/00170101/coupled/OCN_RUNDIR/RESTART/*.res*
```
  The line above was computed by resolving lines 82 and 83 of the /scratch/x77/wgh581/u-cz861_piControl/share/access-cm2-drivers/src/filemove_access.sh file.
2. Start the run (on hold) at the 00170101 cycle (where it failed) by running:
```
rose suite-run -C ~/roses/u-cz861_piControl/ -- --start-cycle=00170101 --hold
```
  then in the Cylc GUI manually mark the coupled, filemove and ocean_ke_check tasks as succeeded (right click on the task > reset state > succeeded).
  Then, release the run (“play” button in the Cylc GUI top-left corner). The suite should continue running without problems.
  In case you still get errors within the cycle, you might have to re-run the entire 00170101 cycle. In that case, simply run:
```
rose suite-run -C ~/roses/u-cz861_piControl/ -- --start-cycle=00170101
```
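Before marking filemove as succeeded in step 2, it may be worth confirming that the tar archive created in step 1 is readable and not empty. The following is a minimal sketch of such a check; archive_has_members is a hypothetical helper name, and the check only lists the archive, it does not compare contents against the RESTART directory.

```shell
# Succeed only if the archive can be listed and has at least one member.
# archive_has_members is a hypothetical helper name.
archive_has_members() {
    local archive=$1 count
    # tar -tf lists member names without extracting anything;
    # a missing or unreadable archive produces no listing.
    count=$(tar -tf "$archive" 2>/dev/null | wc -l)
    [ "$count" -gt 0 ]
}
```

For example, `archive_has_members /scratch/x77/wgh581/archive/cz861_piControl/restart/ocn/restart-00170630.tar && echo OK` prints OK only if the archive lists at least one file.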

Hopefully this solves your issue.

As for the filemove task being impossible to “retry” (it fails if any file has already been moved), I think this is a poor implementation, and I have opened an issue in the access-cm2-drivers repo about it.
Thank you for reporting this!
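To illustrate what a retry-safe move could look like, here is a minimal sketch. It is not the actual access-cm2-drivers code and not necessarily how the issue will be resolved; move_if_needed is a hypothetical helper, and it assumes the destination is given as a full file path rather than a directory.

```shell
# Move src to dest, treating "already moved" as success so a retry can proceed.
# move_if_needed is a hypothetical helper, not a function from filemove_access.sh.
move_if_needed() {
    local src=$1 dest=$2
    if [ -e "$src" ]; then
        # Normal case: the source is still in place.
        mv -- "$src" "$dest"
    elif [ -e "$dest" ]; then
        # Source gone but destination present: a previous try already moved it.
        echo "skip: $src already moved to $dest"
    else
        # Neither exists: this is a genuine error.
        echo "error: neither $src nor $dest exists" >&2
        return 1
    fi
}
```

With this pattern, a second try after a walltime kill would skip files that the first try had already moved, and would still fail if a file were missing from both locations.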

Davide

wghuneke · 12 August 2025 04:01

Hi, thanks for getting back! I had restarted the suite to rerun the year in question and the model ran fine matching what you suggested!
