I’ve got two problems related to running MOM6 on Setonix:
After running up to 31st December of year one (1991), it won’t start a new run crossing over to year two, outputting the error:
FATAL from PE 2091: diag_manager_mod::register_diag_field: file=ocean_month_z: Invalid date. Date=1992-02-31 00:00:00
This happens even when trying to run a single day. I suspect it’s due to how I ran the first year: rather than single monthly run increments, the runs have been messy and staggered (e.g., 5 days here, 3 months there). This was due to troubleshooting and also the following payu error…
Payu doesn’t listen to arguments given after payu run (e.g., payu run -n 3). This is working fine for @ChrisC28 on the same machine, so I might’ve done something along the way to disable this accidentally.
Has anyone come across either of these errors in the past?
We are using the “NO LEAP” calendar, with JRA55-do RYF forcing.
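For reference, the date in the error message (1992-02-31) cannot exist in a no-leap calendar, since February always has 28 days there. Here is a minimal sketch of that check, assuming a plain 365-day calendar; the function name is made up for illustration and is not part of FMS or payu.

```python
# Days per month in a no-leap (365-day) calendar: February always has 28 days.
NOLEAP_DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def is_valid_noleap_date(year, month, day):
    """Return True if (year, month, day) exists in a no-leap calendar.

    The year is accepted only for readability; every year has the same layout.
    """
    if not 1 <= month <= 12:
        return False
    return 1 <= day <= NOLEAP_DAYS_IN_MONTH[month - 1]


# The date reported by diag_manager, and the last real day of that month.
print(is_valid_noleap_date(1992, 2, 31))
print(is_valid_noleap_date(1992, 2, 28))
```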
(base) jreilly@setonix-01:/software/projects/pawsey0410/jreilly/eac_003_v5_copy1> payu run -n 2
payu: warning: Job request includes 118 unused CPUs.
payu: warning: CPU request increased from 6154 to 6272
sbatch -A pawsey0410 --time=06:30:00 --ntasks=6272 --exclusive --ntasks-per-node=128 --wrap="/software/setonix/2022.11/software/cray-sles15-zen3/gcc-12.1.0/python-3.9.15-xjiu6sfxngs3gl5nmq6sqxlicihla66p/bin/python3.9 /software/projects/pawsey0410/jreilly/setonix/python/bin/payu-run"
Submitted batch job 1973498
The first two lines are from some debugging Chris and I were doing. Chris had also previously fixed some other issues related to strict specification of the number of tasks per node and the number of nodes, so you might see some differences from the Gadi version.
Separate installs, but as far as I can tell they are the same. I just wasn’t entirely sure how to point to Chris’ install in my conda environment. I tried conda-develop /path/to/chris/payu/library/ which didn’t seem to work.
Chris and I have been weighing up whether it is worth persisting with payu here, as it has been a bit of a headache for the past few months. Especially once we start setting up IAF configurations, I anticipate a lot more challenges. Any advice on alternative approaches? Or best ways forward?
As far as “not listening to command line arguments” like -n 2 goes, payu passes information such as the number of runs remaining to subsequent submissions via environment variables.
If anything is interfering with environment variables in your job submission, that might be the culprit.
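One way to check this from inside a job is to dump whatever payu-related variables actually arrive. The sketch below assumes those variables share a PAYU_ prefix (I believe the run count is carried in something like PAYU_N_RUNS, but treat the exact names as an assumption and check the payu source for your install).

```python
import os


def payu_env(environ=None):
    """Return the environment variables whose names start with PAYU_.

    The PAYU_ prefix is an assumption about how payu names its variables.
    """
    environ = os.environ if environ is None else environ
    return {name: value for name, value in environ.items()
            if name.startswith("PAYU_")}


if __name__ == "__main__":
    found = payu_env()
    if not found:
        print("no PAYU_* variables are set in this environment")
    for name, value in sorted(found.items()):
        print(f"{name}={value}")
```

Running this at the start of the batch job (rather than on the login node) shows what survived the submission.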
It’d be a damn shame to have come this far and drop it now.
I don’t know what the plans are for ACCESS-NRI supporting users at Pawsey, but it is not an impossibility, and we’re about to get an influx of new staff in the next few months so resourcing some assistance is on the cards if we were to go down that route. I don’t want to give you false hope, but there is a glimmer …
As far as alternative approaches, I don’t really have a good suggestion. It needs someone to log in and take a gander, and I’m afraid I just don’t have the time to do that these days. Maybe if you asked @angus-g really really nicely he’d take a look.
Thanks for that direction Aidan. I’m going to spend some time understanding the workflow of payu a bit more, then it might be good to catch up with @angus-g some time soon to find what’s tripping things up. Angus, if it’s alright with you, I’ll message you directly when I know enough to at least follow along with what you might think the problem is. Maybe we could catch up over zoom with @ChrisC28 for a quick discussion on best way forward on Setonix?
For what it’s worth, I don’t think this has anything to do with payu. It’s probably hitting some edge case in FMS setting up a monthly diagnostic file, due to the sporadic run segments. It probably makes sense to disable monthly diagnostics for such short run segments anyway, since they won’t contain anything.
Looks like you’re right (once again) about the FMS problem, @angus-g. It’s able to run now that I’ve removed the monthly diagnostics. It’s not that important now, but any suggestion on how to start saving monthly output without starting from day 0 again?
There’s still the issue of payu not responding to command line arguments but I should be able to find the problem for this one.
In regards to setting up Interannually Forced runs, are there any quick pointers? Otherwise I’ll have more of a read and then make a new post if/when I get stuck.
I guess you’d probably want to run a segment to line back up with the start of a month, and then resume running segments of at least one month in length from there on.
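As a rough sketch of working out how long that catch-up segment needs to be, assuming the no-leap calendar and a restart that begins at 00:00 on the given day (the helper name is made up for illustration):

```python
# Days per month in a no-leap (365-day) calendar.
NOLEAP_DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def days_to_next_month_start(month, day):
    """Days to run from 00:00 on (month, day) to reach 00:00 on the 1st
    of the following month. Returns 0 if already on the 1st of a month.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    if not 1 <= day <= NOLEAP_DAYS_IN_MONTH[month - 1]:
        raise ValueError("day does not exist in this month")
    if day == 1:
        return 0
    return NOLEAP_DAYS_IN_MONTH[month - 1] - day + 1


# Example: a restart dated 6 January needs a 26-day segment to reach 1 February.
print(days_to_next_month_start(1, 6))
```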
I think payu prints its submission command when you run it. That would be a useful diagnostic for checking whether it’s setting the correct environment variable but the variable isn’t making it through. Alternatively, look at the Slurm output for a run you expect to continue, and see if it fails to execute the resubmission command.
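A small sketch for inspecting a printed submission command is below. It assumes that forwarded variables would show up as an sbatch --export option, which is a guess about how payu's Slurm support passes them rather than something confirmed here. The command string is an abbreviated stand-in for the real one.

```python
import shlex


def export_options(command):
    """Return any --export options found in a printed sbatch command line."""
    tokens = shlex.split(command)
    found = []
    for i, token in enumerate(tokens):
        if token.startswith("--export="):
            found.append(token)
        elif token == "--export" and i + 1 < len(tokens):
            # Option and value given as two separate words.
            found.append(token + "=" + tokens[i + 1])
    return found


# Abbreviated stand-in for the command payu printed (paths shortened).
cmd = 'sbatch -A pawsey0410 --time=06:30:00 --ntasks=6272 --wrap="python3.9 payu-run"'
print(export_options(cmd))
```

An empty result only means that no --export option was given on the command line. It does not by itself show that the environment was dropped.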
I don’t think there’s anything automated for this, unfortunately. @ashjbarnes might have some experience with this by now? You can fiddle around with the data_table after a year of running, although I expect FMS might complain about trying to interpolate beyond the end of the year unless everything is set up perfectly on the time axes.
What I’ve done in the automated pipeline is to reset the calendar of all my forcing files to be DAYS SINCE your experiment start date. I don’t yet know whether this will mess other things up, but at least it handles issues related to calendar type (unless, god forbid, you try to run your model on an actual leap year).
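To illustrate one possible reading of that step (not necessarily what the pipeline does), here is a sketch that re-expresses time values relative to a new reference date in a 365-day calendar. The original reference date, the sample values and the function names are all hypothetical, and the experiment start is assumed to be 1991-01-01.

```python
# Days per month in a no-leap (365-day) calendar.
NOLEAP_DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def noleap_ordinal(date):
    """Day count of a (year, month, day) date in a 365-day calendar."""
    year, month, day = date
    return year * 365 + sum(NOLEAP_DAYS_IN_MONTH[:month - 1]) + (day - 1)


def rebase_times(times, old_epoch, new_epoch):
    """Re-express 'days since old_epoch' values as 'days since new_epoch'."""
    offset = noleap_ordinal(new_epoch) - noleap_ordinal(old_epoch)
    year, month, day = new_epoch
    units = f"days since {year:04d}-{month:02d}-{day:02d} 00:00:00"
    return [t - offset for t in times], units


# Hypothetical forcing times, originally in days since 1990-01-01.
new_times, new_units = rebase_times([365.5, 366.5], (1990, 1, 1), (1991, 1, 1))
print(new_times, new_units)
```

In practice the new values and units string would be written back to the time variable of each forcing file with whatever NetCDF tool you already use.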
I haven’t automated data_table modification for IAF runs though and probably won’t. It would be awesome if someone put this functionality into the pipeline though!
If we use a sufficiently modern version of FMS, with -Duse_yaml specified when building, it actually supports a data_table.yaml. That would be pretty easy to process with Python-based tools if necessary. I’m not sure about how to automatically grab the date from the current/latest run though. Otherwise, just replacing a sentinel value like YEAR from a template file is easy enough.
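The sentinel approach could look something like the sketch below. The template entry is a placeholder in the general shape of a legacy data_table line, not copied from a real configuration, and the field and file names are invented.

```python
def fill_year(template_text, year, sentinel="YEAR"):
    """Replace every occurrence of the sentinel with a four-digit year."""
    return template_text.replace(sentinel, f"{year:04d}")


# Hypothetical template entry; field and file names are placeholders.
template = '"ATM", "t_bot", "tas", "./INPUT/tas_YEAR.nc", "bilinear", 1.0\n'

print(fill_year(template, 1992), end="")
```

A plain string replace is enough as long as the sentinel cannot occur anywhere else in the file. If it could, a more distinctive marker would be safer.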