I have been running mom6-panan-005 fine for 3 months now without a problem, submitting each month individually. Yesterday, I submitted another 3 months with payu run -n 3. The first month (i.e. April 1991) ran fine, but then MOM6 crashed in the second month right after initialization:
That’s a new one to me! Just to get the dumb questions out of the way, was your job terminated by PBS for running out of walltime/memory? The PBS outputs mom6_panan-005.{e,o}* are only readable by you by default.
Yes, it has indeed exceeded its walltime, which I set to 3:30. Normally it takes about 3 hours for one month. It shouldn't take 3.5 hours to initialize, though, or whatever else it's trying to do.
I seem to recall payu taking quite a while to generate the manifests for the restart files, which you wouldn't have seen on the first segment. Not sure what the right advice is here; maybe @aekiss would have a suggestion from experience with ACCESS-OM2-01?
I guess I was suggesting that it's only a weird error because the model was killed due to running out of walltime (which was indeed the case). I suspect the root of that is that running a restart is taking longer than the original cold-start segment, which could be down to a few things:
slow restart input/initialisation?
payu calculating a restart manifest, which takes a while?
aidanheerdegen
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
Maybe. 135G of restarts is significant, and there are only 24 of them, so you're not getting the full benefit of parallelising across the 48 cores.
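As a quick sanity check on how much work the manifest step has to do, you could look at the size and count of the restart set first (just a sketch; archive/restart002/ here is simply this run's latest restart directory):
$ du -sh archive/restart002/
$ find archive/restart002/ -iname "*.nc" | wc -l
The first gives the total size (~135G in this case) and the second the file count (24 here), which bounds how much parallel hashing across 48 cores can actually help.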
To test how long it is taking, you can just try
$ time payu setup
before your next run. This will compute the hashes for the restart manifest and give you an idea if that is a significant overhead.
I did have a notion to compute those as a post-processing job, or to have a mode where this calculation was done outside the main PBS job, like a pre-processing step, but it wasn't straightforward and didn't seem necessary based on the use cases I'd tested up till then, which weren't as taxing as this example.
For reference, this is the issue @aekiss was referring to.
aidanheerdegen
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
Interesting, timing:
$ time parallel md5sum {} ::: $(find archive/restart002/ -iname "*.nc")
real 2m5.028s
user 6m40.891s
sys 4m7.886s
but for a single 7G file:
$ time md5sum archive/restart002/MOM.res_2.nc
9ca2d6455361062694d35609a05eec0d archive/restart002/MOM.res_2.nc
real 0m16.347s
user 0m10.153s
sys 0m6.012s
I guess it is saturating file IO, limiting the speed with which they can be read in. So parallelisation is of some value, but not a universal panacea.
It is faster than serial, though not by as much as I would have thought:
$ time md5sum $(find archive/restart002/ -iname "*.nc")
real 5m15.216s
user 3m19.317s
sys 1m54.614s
and using 4 processes is actually faster than 24:
$ time parallel -j 4 md5sum {} ::: $(find archive/restart002/ -iname "*.nc")
real 1m38.735s
user 3m27.332s
sys 2m26.171s
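A rough way to find the IO sweet spot is to scan a few job counts in one pass (a sketch only: it assumes GNU parallel is available and reuses the restart002 path from above, and note that once the files have been read the page cache will flatter any repeat timings):
$ for j in 1 2 4 8 24; do echo "jobs: $j"; time parallel -j $j md5sum {} ::: $(find archive/restart002/ -iname "*.nc") > /dev/null; done
The md5sums themselves are discarded; only the real time per job count matters for picking a sensible level of parallelism.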
I am through the first month now. However, the output files 19910401.*.nc are empty (time = UNLIMITED ; // (0 currently)).
And it continues to run (currently at 1991/05/06) in the same PBS job 71980789.gadi-pbs. (I changed the walltime again, to 5 hours for now, just in case.) So I guess it will run for another ~1.5 hours and then run out of walltime some time in the second month.
I used payu sweep && payu run -n 3 and assumed that three different PBS jobs would be submitted one after another. Shouldn't I be seeing a different PBS job number for each month?
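One way to check which PBS jobs have actually been running (a hedged sketch; it only assumes the standard PBS tools on Gadi and the log naming mentioned earlier in the thread):
$ qstat -u $USER              # currently queued/running jobs and their IDs
$ ls -lt mom6_panan-005.o*    # each completed job leaves a stdout file suffixed with its job number
If payu did submit a separate job per month, there should be one .o file per completed segment, each with a different job number.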
aidanheerdegen
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
These are the diagnostic output files. It isn't always useful to use them to diagnose the status of a run, as the frequency of output to these files is determined by the configuration in diag_table.
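For anyone unfamiliar with it, diag_table is the FMS diagnostics control file, and the output frequency in its file lines is what decides when records are written to files like 19910401.*.nc. An illustrative entry (not this configuration's actual table) might look like:
panan
1991 1 1 0 0 0
# "file_name", output_freq, "freq_units", file_format, "time_axis_units", "time_axis_name"
"ocean_month", 1, "months", 1, "days", "time"
# "module", "field", "output_name", "file_name", "sampling", "reduction", "region", packing
"ocean_model", "SSH", "SSH", "ocean_month", "all", "mean", "none", 2
With a monthly mean like this, for example, no record appears in the file until a full month has been averaged, so a file can legitimately look empty partway through a run.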
I don’t quite understand this.
In any case, I can see your run finished, does that mean your problem is resolved?