MDSS tape delay of > 24 hrs

Has anyone found mdss tape storage to be slow recently? A cylc task that uses mdss get, in a suite using copyq, has been failing for ~24 hours. When I’ve used this in the past it took < 1 hr. I’m trying to work out whether this is just a “normal” delay in accessing the tape storage or another problem in this task, which uses mdss get programmatically.

Output from qstat -q:

copyq              --      --       --     --      0    44   --   E R
copyq-exec         --      --       --     --     56    20   --   E R

Probably something best directed to help@nci.org.au as they are able to investigate issues with your jobs and the mdss.

Having said which, can you run the same command directly from the command line? If so, can you run an interactive copyq job and run the same command interactively?

Also check that the project running the job hasn’t exhausted its compute allocation, and that the project being used isn’t over its storage allocation.

Thanks Aidan,

For your questions:

  • yes, I have now emailed help@nci.org.au, and will follow up here if I get a solution.

  • yes I can run the same command directly from the command line from a login node, and from an interactive copyq, but staging remains incomplete.

  • yes I’ve checked the project has compute and storage remaining.

That does sound like a technical issue with mdss.

I am curious that you have an mdss get step as part of a suite. I’m not sure I’ve ever seen that.

We’ve had issues with MDSS for a while now. One of them was the system not properly recognising when data had been staged off the tapes.

Hi Scott, that is exactly the issue, and I have a solution now. I’m just waiting for NCI to acknowledge it and then I’ll post here.

I am curious that you have an mdss get step as part of a suite. I’m not sure I’ve ever seen that.

@Aidan, this is part of a BoM developed suite that uses BARRA-R2 lateral boundaries (on tape) rather than ERA5 (on disk, but requires conversion). However, two technical issues had to be overcome to regain the previous functionality.

  1. The response to a query on whether a file has been staged has changed. mdss -P <project> dmls -l <file> used to return (DUL) when staged, but now returns either (DUL) or (REG), depending on how old the file is (older files return REG). This change in behaviour means that programmatic mdss scripts within suites need updating.
  2. If one of the four 1PB volumes used to stage data from tape runs out of space (as one did over this weekend), retrieval can hang. This showed itself through the dmls query changing from migrating (MIG) to offline (OFL) and staying there until NCI intervened.
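
For anyone updating a similar script, here is a minimal Python sketch of the two checks described above. It is not the code from the BoM suite. The function names are hypothetical, and it assumes that the dmls -l output carries the state as a three-letter code in parentheses, as in the (DUL), (REG), (MIG) and (OFL) values mentioned in this thread. Only those four states are handled here; the real list may be longer.

```python
import re
import subprocess

# States that this thread reports as meaning "staged and readable".
# Assumption: any other state means the file is not yet available.
STAGED_STATES = {"DUL", "REG"}

# Assumption: the state appears as three capital letters in parentheses.
_STATE_RE = re.compile(r"\(([A-Z]{3})\)")


def parse_state(dmls_line):
    """Return the state code found in one dmls -l output line, or None."""
    match = _STATE_RE.search(dmls_line)
    return match.group(1) if match else None


def is_staged(state):
    """Accept both DUL and REG, since older files now report REG."""
    return state in STAGED_STATES


def looks_stalled(previous_state, current_state):
    """Flag the MIG -> OFL transition seen when a staging volume filled up."""
    return previous_state == "MIG" and current_state == "OFL"


def query_state(project, path):
    """Run the dmls query quoted in this thread and parse its output.

    Needs the mdss command on PATH, so it only works on the NCI system.
    """
    result = subprocess.run(
        ["mdss", "-P", project, "dmls", "-l", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_state(result.stdout)
```

A polling loop could call query_state repeatedly, stop once is_staged returns True, and raise an error if looks_stalled becomes True, rather than waiting indefinitely.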

After addressing these issues my suite is now running. :tada: