Hi,
I have a concern about the collation of restart files using the 1deg_jra55_ryf release supported by ACCESS-NRI, and am seeking help to resolve it.
I have noticed that when you run:
payu run -i K
then upon run completion, payu collate acts on output folder K but on restart folder K-1. This means that the model runs with the uncollated restart inputs on each resubmission, and the restart is only collated retrospectively, once the subsequent run has completed.
The problem with doing that is when the new work directory is set up, it will link both uncollated restart files such as ocean_temp_salt.res.nc.0000 and 0001 etc from the latest restart folder, and it will also link the ocean_temp_salt.res.nc from the initial conditions that are included in the config.yaml file.
If I look at my work directory during a run, this link for ocean_temp_salt.res.nc points to /g/data/vk83/experiments/inputs/access-om2/ocean/initial_conditions/global.1deg/2020.10.22/ocean_temp_salt.res.nc
On every job submission! The tiled (uncollated) restart files from the last restart folder are there too, but it seems problematic that the working directory has a TS restart file that is repeatedly linked from the starting ICs.
(P.S. Here I don't know whether the model prioritises the "collated" or "uncollated" TS restart file if it happens to see both... but at the least this seems like a problematic blending of old and new restart files.)
Under the "old" config.yaml structure this would not have occurred, because it only looked for a restart folder, and if the new restart folder existed, the original restart folder is ignored. But under the new config.yaml structure, there is a problem that TS restart files (and potentially others) are being unwittingly carried into all the subsequent work directories.
The first submission works as intended... it initialises the run from the ocean_temp_salt.res.nc specified in config.yaml. But when the next run is submitted, restart000 does not contain a file called ocean_temp_salt.res.nc; it only contains the uncollated tiled version (with extensions 0000...0011).
So when the next jobs are submitted, the run directory contains both the most recent (uncollated) TS restart files, and the original TS restart, which gets repeatedly linked into every job submission for K=1,2,3, and onwards (in collated form).
The question then is, if the run sees both a collated restart file and uncollated restart file for ocean_temp_salt.res.nc, which one would it use as the TS restart file?
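To see which files are affected, a quick check along these lines could be run against the work directory during a run. This is only a minimal sketch: the function name `find_mixed_restarts` is made up for illustration, and the only thing it assumes is the naming shown above, i.e. a tiled file is the collated name plus a four-digit suffix such as `.0000`. It does not say anything about which file the model actually reads.

```python
import re
from pathlib import Path

# Tiled (uncollated) files carry a four-digit tile suffix, e.g.
# ocean_temp_salt.res.nc.0000 (naming taken from the post above).
TILE_SUFFIX = re.compile(r"^(?P<base>.+\.nc)\.(?P<tile>\d{4})$")


def find_mixed_restarts(work_dir):
    """Return {collated_name: [tile names]} for files present in both forms.

    Hypothetical helper: it only compares file names in one directory.
    """
    names = {p.name for p in Path(work_dir).iterdir()}
    tiles = {}
    for name in sorted(names):
        match = TILE_SUFFIX.match(name)
        if match:
            tiles.setdefault(match.group("base"), []).append(name)
    # Keep only the bases for which a collated file of the same name exists.
    return {base: parts for base, parts in tiles.items() if base in names}
```

Calling `find_mixed_restarts(...)` on the directory holding the linked ocean inputs (in my case the one containing the ocean_temp_salt.res.nc link) would list ocean_temp_salt.res.nc together with its tiles if both forms are present.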
I am moderately confident the restart reproducibility tests run before release would have failed if it was picking up the wrong restart.
I did three 1 month runs and it's not clear from the logs. The initial start log says this:
Initializing tracer number 1
at time level tau. This tracer is called temp
Reading restart for prog tracer temp from file ocean_temp_salt.res.nc
After reading ic, linearly interpolate temp to partial cell bottom.
Completed initialization of tracer temp at time level tau
Initializing tracer number 2
at time level tau. This tracer is called salt
Reading restart for prog tracer salt from file ocean_temp_salt.res.nc
After reading ic, linearly interpolate salt to partial cell bottom.
Completed initialization of tracer salt at time level tau
and the version from a restart says:
Initializing tracer number 1
at time level tau. This tracer is called temp
Expecting only one time record from the tracer restart.
Reading restart for prog tracer temp from file ocean_temp_salt.res.nc
Completed initialization of tracer temp at time level tau
Initializing tracer number 2
at time level tau. This tracer is called salt
Expecting only one time record from the tracer restart.
Reading restart for prog tracer salt from file ocean_temp_salt.res.nc
Completed initialization of tracer salt at time level tau
There are minor differences between the two, implying maybe they are reading different files.
Hi @anton,
I did a check of this by first saving a backup of an uncollated restart folder, then
running the model "as is", i.e. where the ocean_temp_salt.res.nc gets linked from original ICs
running the model again from the same starting point but commenting out ocean_temp_salt.res.nc from config.yaml, i.e. forcing the model to use the uncollated restart.
The good news is: I get identical checksums on the restarts in both cases. So yes, I think it ignores the original ICs, and just uses the uncollated (most recent) restarts.
It seems like strange behaviour to link a file in the work directory that really shouldn't be used. But I'm glad this is indeed not causing a problem!
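For anyone wanting to repeat this kind of check, here is a rough sketch of one way to compare the restart folders produced by two runs. It is not necessarily how the comparison above was done: it hashes the files byte by byte with SHA-256, and the names `file_digest` and `compare_restart_dirs` are made up for illustration. Note that byte-level hashes can differ when files carry differing metadata even if the fields match, so a mismatch here would need a closer look rather than being proof of a different answer.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restart_dirs(dir_a, dir_b):
    """Return (differing, only_in_a, only_in_b) as sorted lists of names.

    Hypothetical helper: compares regular files with matching names only.
    """
    files_a = {p.name: p for p in Path(dir_a).iterdir() if p.is_file()}
    files_b = {p.name: p for p in Path(dir_b).iterdir() if p.is_file()}
    common = sorted(files_a.keys() & files_b.keys())
    differing = [
        name for name in common
        if file_digest(files_a[name]) != file_digest(files_b[name])
    ]
    only_in_a = sorted(files_a.keys() - files_b.keys())
    only_in_b = sorted(files_b.keys() - files_a.keys())
    return differing, only_in_a, only_in_b
```

Three empty lists from `compare_restart_dirs(run_a_restart, run_b_restart)` would mean the two restart folders hold the same files with identical contents.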
I agree this is confusing, and it may just be good luck that the correct restart file is getting loaded. There's no obvious fix to me, but I'll give it some thought!
But under the new config.yaml structure, there is a problem that TS restart files (and potentially others) are being unwittingly carried into all the subsequent work directories.
Hi @anton,
Now that I consider it, I'm not that familiar with the older versions of ACCESS-OM2-1deg. I was probably thinking of the ACCESS-ESM1.5 configs from before and after the recent releases... so my comment about the "old" and "new" versions here may not be relevant.
I'm probably thinking more of how in ACCESS-ESM1.5 configs, the ocean_temp_salt.res.nc is specified under the restart part of the config file, which I think is a better place for it than specifying it under the input sub-heading, since you do in fact want this file to be ignored after the first submission.
Anyhow, thanks for looking into this, and glad to know it's not causing problems. Like you said, the reproducibility tests would have picked up something like this if it was affecting the simulations.
Sorry, I'm not familiar with how MOM5 restarts are specified or picked up. MOM6 uses a completely different method, as far as I'm aware. I suppose it would've been pretty clear if it was getting the wrong file!
Sorry, I also don't know how the model picks up restarts.
In terms of payu, as David has already said, payu will only automatically collate the previous restart, not the latest one. I think this is so payu can start a new model run without waiting for the separate collation job to complete. Currently, the only way to collate the latest restart is to run the collate command with the directory manually, e.g. payu collate -d archive/restart000.
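As a rough illustration of that manual step, the snippet below finds the highest-numbered restart directory under archive/ and builds the same command as above for it. It is only a sketch: `latest_restart_dir` and `collate_command` are made-up helper names, the `restartNNN` naming is taken from the example path, and it assumes payu is available on the PATH and that the script is run from the control directory.

```python
import re
import subprocess
from pathlib import Path

# Directory naming taken from the example above (archive/restart000).
RESTART_DIR = re.compile(r"^restart(\d+)$")


def latest_restart_dir(archive_dir):
    """Return the restartNNN directory with the highest number, or None."""
    candidates = []
    for path in Path(archive_dir).iterdir():
        match = RESTART_DIR.match(path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


def collate_command(restart_dir):
    """Build the manual collate command quoted above for one directory."""
    return ["payu", "collate", "-d", str(restart_dir)]


def collate_latest(archive_dir="archive", dry_run=True):
    """Print the collate command for the latest restart, or run it."""
    latest = latest_restart_dir(archive_dir)
    if latest is None:
        return None
    command = collate_command(latest)
    if dry_run:
        print(" ".join(command))
    else:
        # Assumes payu is on the PATH and we are in the control directory.
        subprocess.run(command, check=True)
    return command
```

With `dry_run=True` (the default) `collate_latest()` only prints the command, which is a safe way to confirm it has picked the directory you expect before running it for real.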
If some input files should only be used for the first submission, then I think David's suggestion of having them under the directory set in restart: in config.yaml would work.
Thanks @jo-basevi - I'll make an issue. The docs say the restart: line only supports a directory... we might want it to support individual files like the input: sections do.
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
At the risk of pointing out what has become obvious, this is fine. If you check the code, the uncollated restarts are preferred over the initial conditions file if present.
I can see that it might be a little confusing, but this is the way it has always worked.
The most recent restarts are not collated on purpose for a couple of reasons:
It is difficult to guarantee the restarts will be collated before the next run begins (as @jo-basevi already pointed out)
High res models benefit (slightly) from parallelism when restarting the model from uncollated restarts
A definite downside is that manifests contain checksums for restart files that are subsequently collated, so it makes it more difficult to find restarts associated with a specific run.