ACCESS-CM2 filemove error

Hi,

I’m currently running a CM2-025 simulation which was working fine. I now get this filemove error:
Expected name /home/581/wgh581/cylc-run/u-cz861_piControl/work/00170101/coupled/ATM_RUNDIR/History_Data/cz861_piControla.da00170701_00

It seems to be looking for the file from the next run, i.e. this is 00170101 and it is looking for 00170701. The directory has files for 00170101, which looks right to me. Is the error message a glitch? Has anyone encountered this before? Is there a recommended way to deal with this error so I can continue the run?

Many thanks,
Wilma


Hi Wilma,
Would you please share the project you are using to run the suite? I would then be able to have a look at the logs and try to help.

Thank you
Davide

Hi Davide,

The logs are at /scratch/x77/wgh581/cylc-run/u-cz861_piControl/ .

Thanks for looking into it,
Wilma


Unfortunately, I’m not a member of x77.
I just requested access, but to speed things up, would you be able to copy the /scratch/x77/wgh581/cylc-run/u-cz861_piControl/ folder to a public folder (e.g., /scratch/public/)?
Thank you

I copied it to /scratch/public/wgh581/


Hi @wghuneke,

Sorry for the late reply, but I’ve only now had time to look at this.

At the end of the u-cz861_piControl/log/job/00170101/filemove/01/job.err log (first try) there is:

=>> PBS: job killed: walltime 602 exceeded limit 600
Terminated
2025-08-06T06:07:09Z CRITICAL - failed/TERM

So it seems the filemove task was terminated only because it exceeded the walltime.
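
To check whether other cycles or tries hit the same limit, one option is to search the job.err files for the PBS walltime message. This is only a sketch: find_walltime_kills is a made-up helper name, not part of the suite, and it assumes GNU grep and the log/job/<cycle>/<task>/<try>/job.err layout seen in the paths quoted in this thread.

```shell
# Hypothetical helper (not part of the suite): print the job.err files
# under a log directory that contain the PBS walltime-kill message.
find_walltime_kills() {
    local log_root="$1"   # e.g. <suite run dir>/log/job
    # -r: recurse, -l: print only names of matching files,
    # --include: restrict the search to files named job.err (GNU grep).
    grep -rl --include='job.err' 'PBS: job killed: walltime' "$log_root"
}

# Example call (path taken from this thread):
# find_walltime_kills /scratch/x77/wgh581/cylc-run/u-cz861_piControl/log/job
```

Note that grep exits with a non-zero status when nothing matches, so an empty result here also means a "failed" return code.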

I think the task was then retried (either you triggered it manually or it retried automatically), and the error you see in the second try (u-cz861_piControl/log/job/00170101/filemove/02/job.err) is:

FileNotFoundError: [Errno 2] No such file or directory: 'cz861_piControl.xhist'

because the file had already been moved in the first try.

My suggestion at this point would be:

  1. To avoid this problem in future cycles, increase the walltime for the filemove task. The job was almost done, so I don’t think it will need much more time, maybe 10-20% more.
    In the suite.rc, under the [[filemove]] task change the execution time limit from PT10M (10 minutes) to PT12M (12 minutes):

        [[filemove]]
            inherit = POSTPROC, NCI, SHARE
            script = """
                filemove_access.sh
                """
            [[[job]]]
    -           execution time limit = PT10M
    +           execution time limit = PT12M
            [[[directives]]]
    

    Then, to update the suite, run rose suite-run --reload.

  2. Then to continue your experiment you have a few options.
    The quickest solution, in my opinion, is to:

    1. Manually carry out the remaining processing that filemove should have done. The output in u-cz861_piControl/log/job/00170101/filemove/01/job.err and job.out shows that almost all the steps completed successfully.
      You can achieve this by running:
      tar -cvf /scratch/x77/wgh581/archive/cz861_piControl/restart/ocn/restart-00170630.tar /scratch/x77/wgh581/cz861_piControl/work/00170101/coupled/OCN_RUNDIR/RESTART/*.res*
      
      The line above was computed by resolving lines 82 and 83 of the /scratch/x77/wgh581/u-cz861_piControl/share/access-cm2-drivers/src/filemove_access.sh file.
    2. Start the run (on hold) at the 00170101 cycle (where it failed) by running:
      rose suite-run -C ~/roses/u-cz861_piControl/ -- --start-cycle=00170101 --hold
      
      then in the Cylc GUI manually mark the coupled, filemove and ocean_ke_check tasks as succeeded (right click on the task > reset state > succeeded).
      Then, release the run (“play” button in the Cylc GUI top-left corner). The suite should continue running without problems.
      In case you still get errors within the cycle, you might have to re-run the entire 00170101 cycle. In that case, simply run:
      rose suite-run -C ~/roses/u-cz861_piControl/ -- --start-cycle=00170101
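
After creating the restart archive by hand in step 2.1, it may be worth confirming that it is readable and not empty before marking the tasks as succeeded. A minimal sketch follows; check_tar is a hypothetical helper name, and the archive path in the example is simply the one from the tar command above.

```shell
# Hypothetical helper: succeed only if the archive can be listed and
# contains at least one member.
check_tar() {
    local archive="$1"
    local listing
    # tar -tf lists member names; a missing or corrupt archive makes tar fail.
    listing=$(tar -tf "$archive") || return 1
    # An archive with no members produces an empty listing.
    [ -n "$listing" ]
}

# Example call (path taken from the tar command in step 2.1):
# check_tar /scratch/x77/wgh581/archive/cz861_piControl/restart/ocn/restart-00170630.tar && echo "archive OK"
```

This only checks that the archive can be read. It does not confirm that every expected restart file is present.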
      

Hopefully this solves your issue.

As for the filemove task not being retryable (it fails if any file has already been moved), I think this is a bad implementation, and I opened an issue in the access-cm2-drivers repo about it.
Thank you for reporting this!
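
To illustrate what a retry-safe move could look like, here is a rough sketch. It is not the actual filemove_access.sh logic, and safe_move is a name invented for this example. The idea is to treat "the file is already at the destination" as success rather than as an error.

```shell
# Hypothetical retry-safe move (NOT the real filemove implementation):
# succeeds if the file is moved now or was already moved by an earlier try.
safe_move() {
    local src="$1" dest_dir="$2"
    local dest
    dest="$dest_dir/$(basename "$src")"
    if [ -e "$src" ]; then
        mkdir -p "$dest_dir"
        mv "$src" "$dest"
    elif [ -e "$dest" ]; then
        # A previous attempt already moved this file; nothing left to do.
        echo "already moved: $dest"
    else
        # Neither source nor destination exists: a genuine error.
        echo "missing: $src" >&2
        return 1
    fi
}
```

With this pattern, a second try after a walltime kill would skip the files handled by the first try and only fail if a file is missing from both locations.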

Davide


Hi, thanks for getting back! I had restarted the suite to rerun the year in question, and the model ran fine, matching what you suggested!
