I’m currently running a CM2-025 simulation which was working fine. I now get this filemove error: Expected name /home/581/wgh581/cylc-run/u-cz861_piControl/work/00170101/coupled/ATM_RUNDIR/History_Data/cz861_piControla.da00170701_00
It seems to be looking for the file from the next run, i.e. this is cycle 00170101 and it is looking for 00170701. The directory has files for 00170101, which looks right to me. Is the error message a glitch? Has anyone encountered this before? Is there a recommended way to deal with this error so I can continue the run?
Unfortunately, I’m not a member of x77.
I just requested access, but to speed up the process, would you be able to copy the /scratch/x77/wgh581/cylc-run/u-cz861_piControl/ folder to a public folder (e.g., /scratch/public/)?
Thank you
So it seems the filemove task was terminated only because it exceeded the walltime.
I think what happened next is that the task was retried (either you triggered it manually or it retried automatically), and the error you see in the second attempt (u-cz861_piControl/log/job/00170101/filemove/02/job.err) is:
FileNotFoundError: [Errno 2] No such file or directory: 'cz861_piControl.xhist'
because the file had already been moved in the first try.
My suggestion at this point would be:
To avoid this problem in future cycles, increase the walltime for the filemove task (the job was almost done, so I don’t think it will need much more time, maybe 10-20% more).
In the suite.rc, under the [[filemove]] task, change the execution time limit from PT10M (10 minutes) to PT12M (12 minutes).
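For illustration, this is roughly where that setting tends to live in a Cylc 7 suite.rc. It is only a sketch: it assumes the filemove task sets its own limit in a [[[job]]] section rather than inheriting it from a family, and the surrounding layout of this particular suite.rc may differ.

```
    [[filemove]]
        [[[job]]]
            # was PT10M; PT12M gives roughly 20% more walltime
            execution time limit = PT12M
```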
Then, to update the suite, run rose suite-run --reload.
Then to continue your experiment you have a few options.
The quickest solution, in my opinion, is to:
Manually complete the remaining processing that filemove should have carried out (the output in u-cz861_piControl/log/job/00170101/filemove/01/job.err and job.out shows that almost all the steps completed successfully).
You can achieve this by running:
tar -cvf /scratch/x77/wgh581/archive/cz861_piControl/restart/ocn/restart-00170630.tar /scratch/x77/wgh581/cz861_piControl/work/00170101/coupled/OCN_RUNDIR/RESTART/*.res*
The line above was computed by resolving lines 82 and 83 of the /scratch/x77/wgh581/u-cz861_piControl/share/access-cm2-drivers/src/filemove_access.sh file.
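Before moving on, it may be worth checking that the archive written by the tar command is readable and not empty. The helper below is a minimal sketch of such a check; archive_has_members is a hypothetical name and is not part of access-cm2-drivers.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of access-cm2-drivers).
# Returns 0 only if the tar archive can be read and lists at least one member.
archive_has_members() {
    local archive="$1" listing
    # tar -tf lists member names without extracting anything; a non-zero
    # exit status means the archive is missing or unreadable.
    listing=$(tar -tf "$archive" 2>/dev/null) || return 1
    # An archive that lists no members is treated as a failure too.
    [ -n "$listing" ]
}
```

For example: archive_has_members /scratch/x77/wgh581/archive/cz861_piControl/restart/ocn/restart-00170630.tar && echo "archive looks readable"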
Start the run (on hold) at the 00170101 cycle (where it failed) by running:
rose suite-run -C ~/roses/u-cz861_piControl/ -- --start-cycle=00170101 --hold
Then, in the Cylc GUI, manually mark the coupled, filemove and ocean_ke_check tasks as succeeded (right-click on the task > reset state > succeeded).
Then, release the run (“play” button in the Cylc GUI top-left corner). The suite should continue running without problems.
In case you still get errors within the cycle, you might have to re-run the entire 00170101 cycle. In that case, simply run:
rose suite-run -C ~/roses/u-cz861_piControl/ -- --start-cycle=00170101
Hopefully this solves your issue.
As for the filemove task not being safely retryable (it fails if any file has already been moved), I think this is a poor implementation, and I have opened an issue in the access-cm2-drivers repo about it.
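To illustrate what a retry-safe move could look like, here is a minimal shell sketch. It is not the actual filemove code, which may be structured quite differently (the traceback suggests at least part of it is Python), and move_once is a hypothetical name. It assumes the destination is given as a full file path rather than a directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not taken from access-cm2-drivers.
# Move SRC to DEST (a full file path), treating "already moved" as success
# so that re-running the task after a partial run is harmless.
move_once() {
    local src="$1" dest="$2"
    if [ -e "$src" ]; then
        # Normal case: the source is still in place, so move it.
        mv -- "$src" "$dest"
    elif [ -e "$dest" ]; then
        # Source is gone but the destination exists: a previous attempt
        # already moved it, so there is nothing left to do.
        echo "skip: $src already moved to $dest"
    else
        # Neither file exists: this is a genuine error.
        echo "error: neither $src nor $dest exists" >&2
        return 1
    fi
}
```

With a pattern like this, a second attempt would skip files such as cz861_piControl.xhist that the first attempt had already moved, instead of stopping with FileNotFoundError.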
Thank you for reporting this!