Has anyone found mdss tape storage to be slow recently? A cylc task that uses mdss get, in a suite using copyq, has been failing for ~24 hours. When I’ve used this in the past it would take < 1 hr. I’m trying to work out whether this is just a “normal” delay in accessing the tape storage, or another problem in this task, which uses mdss get programmatically.
Output from qstat -q:
copyq -- -- -- -- 0 44 -- E R
copyq-exec -- -- -- -- 56 20 -- E R
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
Probably something best directed to help@nci.org.au as they are able to investigate issues with your jobs and the mdss.
Having said which, can you run the same command directly from the command line? If so, can you run an interactive copyq job and run the same command interactively?
Also check that the project running the job hasn’t exhausted its compute allocation, and that the project being used isn’t over its storage allocation.
I am curious that you have an mdss get step as part of a suite. I’m not sure I’ve ever seen that.
@Aidan, this is part of a BoM-developed suite that uses BARRA-R2 lateral boundaries (on tape) rather than ERA5 (on disk, but requires conversion). However, there were two technical issues that had to be overcome to regain the previous functionality.
The response to a query on whether a file has been staged has changed. mdss -P <project> dmls -l <file> used to return (DUL) when staged, but now returns either (DUL) or (REG), depending on how old the file is (older files return REG). This change in behaviour means that programmatic mdss scripts within suites need updating.
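In case it helps anyone updating their own scripts, here is a minimal sketch of a staged check that accepts both states. It only parses a line of dmls -l output that you have already captured. The function names are my own, and I am assuming the state appears as a three-letter code in parentheses before the file name, so adjust the pattern if your output differs.

```python
import re

# States that mean the file is available on disk. (DUL) was the only
# one returned previously; (REG) is now also returned for older files.
STAGED_STATES = {"DUL", "REG"}

# Assumption: the state is a three-letter upper-case code in parentheses,
# e.g. "(DUL)", and it appears before the file name on the line.
_STATE_RE = re.compile(r"\(([A-Z]{3})\)")


def dmls_state(line):
    """Return the state code from one line of dmls -l output, or None."""
    match = _STATE_RE.search(line)
    return match.group(1) if match else None


def is_staged(line):
    """True if the line reports a state that counts as staged."""
    return dmls_state(line) in STAGED_STATES
```

Anything other than (DUL) or (REG), including a line with no recognisable state, is treated as not staged, so a change in output format fails safe rather than triggering a read of a file that is still on tape.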
If one of the four 1PB volumes used to stage data from tape runs out of space (as one did over this weekend), retrieval can hang. This showed up as the dmls query changing from migrating (MIG) to offline (OFL) and staying there until NCI intervened.
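To stop a task from waiting indefinitely in this situation, a polling loop with a deadline is one option. This is only a sketch: wait_until_staged and its parameters are hypothetical names, get_state stands for whatever callable your script uses to run the dmls query and return the state code, and the one hour default is just my past experience of typical retrieval times, not an NCI figure.

```python
import time

# States that count as staged, per the dmls behaviour described above.
STAGED = {"DUL", "REG"}


def wait_until_staged(get_state, timeout_s=3600.0, poll_s=60.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns a staged state.

    get_state is a caller-supplied callable (hypothetical) returning a
    state code such as "MIG", "OFL", "DUL" or "REG".
    Raises TimeoutError if the file is not staged by the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state in STAGED:
            return state
        if clock() >= deadline:
            # A file stuck at OFL ends up here instead of hanging forever.
            raise TimeoutError(
                f"file still in state {state} after {timeout_s} s"
            )
        sleep(poll_s)
```

Raising an error lets the cylc task fail promptly with a message naming the stuck state, which is useful detail to include when contacting NCI.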
After addressing these issues my suite is now running.