Restarting payu mid-experiment under a different project

Hi,

Hope you had a good Christmas.

I needed to change projects during an experiment (2 actually).

I edited config.yaml in my payu control directory:

project: rp23

# Force payu to always find, and save, files in this scratch project directory

shortpath: /scratch/p66

and then I did:

payu setup

assuming that it would need to do this to pick up the new project (perhaps this was a mistake?), and then:

payu run -f -n 1200

It seemed to pick everything up and restart. At the end of the first simulation year I checked more thoroughly and found that archive/ or work/ was being written to rp23, not p66. I could manually move this if need be. Unfortunately (as always happens) the rp23 space filled up and jobs stalled in the queue. I decided to just start from scratch, running under rp23 but writing to p66. So I used

payu setup
payu sweep
payu run -f -n 1200

Links to p66 were set up correctly and it all seemed to be running again. After Jan 1, when quotas are renewed and I go back to running under p66, what changes do I need to make to continue smoothly and avoid this?

Thanks,
Jhan

Hi, I hope you had a good Christmas too!

I would have expected that, with project: rp23 in config.yaml, payu would use rp23 as the project for the PBS job submission, and that shortpath: /scratch/p66 would be used as the top-level laboratory directory (which contains the archive/ and work/ sub-directories). If shortpath is not defined in config.yaml, the laboratory directory defaults to /scratch/$PROJECT, so it would be /scratch/rp23.
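The defaulting rule described above can be sketched as a small shell function. This only illustrates the rule as stated in this thread; it is not payu's actual code, and the function name resolve_lab_root is made up for the example.

```shell
# Hypothetical helper: mimics the rule described above, it is NOT payu's own code.
# $1 = value of shortpath in config.yaml (empty string if not set)
# $2 = project that the default /scratch/$PROJECT would be built from
resolve_lab_root() {
    shortpath="$1"
    project="$2"
    if [ -n "$shortpath" ]; then
        # shortpath is set: it is used regardless of the project
        printf '%s\n' "$shortpath"
    else
        # shortpath is not set: fall back to /scratch/<project>
        printf '%s\n' "/scratch/$project"
    fi
}

resolve_lab_root "/scratch/p66" "rp23"   # prints /scratch/p66
resolve_lab_root "" "rp23"               # prints /scratch/rp23
```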

So that I can try to replicate the issue, do you know which version/module of payu you were using? When you started from scratch running under project rp23 but writing to p66, what config.yaml changes did you make? Thanks!

gadi:PI-case2a-expt-11659a9b> payu --version
payu 1.1.5

Thanks Jo

Thanks Jhan. There is an issue in payu/1.1.5 when only project is set in config.yaml and neither shortpath nor laboratory is defined (laboratory is a rarely used payu config option that sets an absolute path for the model laboratory). On the login node the payu laboratory path would default to /scratch/$PROJECT, where $PROJECT was the default project, but in the PBS payu run job it defaulted to the value of project in config.yaml. This bug is fixed in the next version of payu, which is currently in prerelease and which we are working on releasing soon. The workaround in the meantime is to define both project and shortpath in config.yaml.
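As a rough guard against the situation described above, a pre-flight check could look for a config.yaml that sets project without shortpath or laboratory. This is a sketch only: the function name is hypothetical, it is not part of payu, and it assumes the keys appear uncommented at the start of a line.

```shell
# Hypothetical pre-flight check, not part of payu.
# Returns non-zero when the config sets project but neither shortpath nor laboratory.
check_project_has_shortpath() {
    config="$1"
    if grep -Eq '^project:' "$config" &&
       ! grep -Eq '^(shortpath|laboratory):' "$config"; then
        echo "warning: project is set in $config without shortpath or laboratory" >&2
        return 1
    fi
    return 0
}
```

It could be run in the control directory before submitting, for example as check_project_has_shortpath config.yaml, and the run submitted only if it succeeds.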

So I wonder if that is what caused the issue, but initially you said you had shortpath defined in the first instance of the config.yaml? If so, could I have an example of the config.yaml or configuration you were using?

/scratch/public/jxs599/jo/config.yaml

I don't know that this will be helpful though. As you can see I changed both project and shortpath. After I made the decision to sacrifice the XXX years it had done and restarted cleanly (with payu sweep), it did what I expected. The problem is that I don't want to sweep and lose 73 years. Alternatively I could point to restarts from year 73. It is not even clear that this is the final model configuration we will want. This is just a spinup run; it’ll conceivably take 1000 years for the model (especially the C-N-P pools) to equilibrate. However, given that the big ticket items are “fixed”, relatively small changes on the land-side would at least (almost) guarantee this as a better starting point for any further spinup. That being said, the potential changes on the land-side would possibly affect the C-N-P cycle more so than the climate. The spinup issue again. A decent spinup of PI-case2a is still potentially useful though, even if just for a decently long simulation analysis. So I guess what I need to be able to do is knock “project” back to p66 (or simply re-#). It sounds like this way has a better chance of success as my default $PROJECT is p66 anyway?

Hi Jhan, thanks for the configuration. Sorry I’ve tried reproducing the error you had but I am having no luck so far.

So I guess what I need to be able to do is knock “project” back to p66 (or simply re-#). It sounds like this way has a better chance of success as my default $PROJECT is p66 anyway?

Yes I think changing back the project to p66 should work. I’ve run a small test experiment with a configuration with the following (my default project is tm70):

project: tm70
shortpath: /scratch/tm70

I changed the project to lg87, and tested that it used the same archive in /scratch/tm70/. After a couple runs using the different project, I switched the project back to my default project, and tested that it again used the same archive in subsequent runs.

If you switch the project back and it does use a different archive, the files should still be somewhere on the filesystem. So please let me know if you do run into any errors!

OK thanks Jo. I’ll change it back to p66.

Cheers,

Jhan

FYI we’ve used this approach for payu for a number of years and it has worked fine. As long as you set shortpath you should be able to change the project and still pick up previous restarts from the same location, and write outputs there.

One little trick to consider is to use setgid to ensure all files are written with the same project code as the shortpath project location.

Note that this will mean setting chmod +s for the work and archive subdirectories in the shortpath/$USER/$MODEL directory.
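A minimal sketch of that step follows. The function name is hypothetical, and it uses chmod g+s, which sets only the setgid bit (a plain +s on a directory also requests setuid). It assumes the laboratory directory follows the shortpath/$USER/$MODEL layout mentioned above.

```shell
# Hypothetical helper: set the setgid bit on work/ and archive/ under a laboratory
# directory so that files created there inherit the directory's group.
# $1 = laboratory directory, e.g. <shortpath>/$USER/<model>
set_lab_setgid() {
    lab_dir="$1"
    for sub in work archive; do
        if [ -d "$lab_dir/$sub" ]; then
            chmod g+s "$lab_dir/$sub"
        fi
    done
}
```

After running it, ls -ld on either directory should show an s in the group execute position. Subdirectories that already exist are not changed by this sketch and would need the same treatment.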

On Friday afternoon I re-commented the lines which I had previously un-commented and changed (project: rp23 and shortpath: /scratch/p66). So it should all go back to normal (the default), i.e. p66, right? I checked on the Saturday and the next CRUN submitted to the queue had died (before even starting). I did

payu run -f -n 1126

Today it has started again from 000 and is up to 010.

/scratch/p66/jxs599/access-esm/archive/PI-case2a-expt-c108b00c/output010/

The data from the previous 77 year run is still there at:

/scratch/p66/jxs599/access-esm/archive/PI-case2a-expt-11659a9b

but it recreated the link and started again. I can't see how group ID permissions would affect this? @Aidan

Great that the data is still there, but I wanted it to continue from year 77, not start again. It would still be beneficial if we could force this in a new run, if only because it is several days of compute time.

You haven’t said what the error was, so it’s hard to know what might have happened.

It won’t. Sorry, my suggestion wasn’t to solve your error, simply something else worth doing when running under multiple compute projects to ensure all your files are owned by the correct group.

This should be quite straightforward, but we don’t have enough information to diagnose why it isn’t working for you.

Where is your control directory? The directory where you run payu for this experiment?

TBH I was just looking at the table of wall times used in the .o files to conclude that it didn't crash in the model.

The payu control directory is here:
/g/data/p66/jxs599/ESM16/PAYU/PI-case2a/PI-case2a

It looks like this was the failed run on the Friday:
PI-case2a.e131819553

It seems reminiscent of that same git message we encountered earlier? This would gel, I was using the git diff_wraper which was problematic earlier.

It’s looking like we are going to start a new run with some different root fractions that Rachel wants to look at. We’ll probably start from the same point for comparison’s sake, but it would perhaps be beneficial to start this run from the 77 year restarts from PI-case2a. In the payu config are they all specified in the same place?

Thanks for the control directory! I’ve had a quick look at the git history and the error logs, but I’m not finding a reason yet as to why payu is breaking when the project is changed so I will keep investigating.

In the meantime, to have a new experiment start from a restart, you can create and checkout a new branch in the control directory and start from the last restart in PI-case2a-expt-11659a9b by running

$ payu checkout -b restart-expt -r /scratch/p66/jxs599/access-esm/archive/PI-case2a-expt-11659a9b/restart077

Note that -b creates and checks out a new git branch named restart-expt, and -r/--restart sets restart in config.yaml. This starts a new experiment, so it will have a new experiment UUID and archive; because there are no existing restarts in the new archive, payu will use the restart path set in config.yaml.
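To avoid typing the restart number by hand, the most recent restart directory in an archive can be looked up first. The helper below is a hypothetical sketch, not a payu feature; it assumes restart directories are named restartNNN with equal-width, zero-padded numbers, as in the paths in this thread.

```shell
# Hypothetical helper: print the highest-numbered restartNNN directory in an archive.
# $1 = archive directory of the experiment
latest_restart() {
    archive_dir="$1"
    # Zero-padded names of equal width (e.g. restart077) sort correctly as plain text.
    ls -d "$archive_dir"/restart[0-9]* 2>/dev/null | sort | tail -n 1
}
```

Its output could then be passed to the -r option of payu checkout in place of the hard-coded path. If the archive has no restart directories, it prints nothing.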

Hi Jhan, I’ve tried investigating this further, but I haven’t been able to replicate the issue even with using the extra information from the control directory.

Going back through the git history, I saw that payu created new archives when project was defined but shortpath was not, so I think that was what caused the issues initially.

But what I am still confused about is what happened after several runs with both project and shortpath defined in config.yaml, e.g.

project: rp23
shortpath: /scratch/p66

After both were commented out, payu should still have been able to find the pre-existing archive in /scratch/p66. Instead, payu logged that it couldn’t find a pre-existing archive and then created a new experiment UUID and archive. I’ve tried to replicate this issue using the same file permissions, commands, and config.yaml, and checked with others to see if they could reproduce the issue when running the ESM1.6 configuration. But after all the tests, payu keeps finding the pre-existing archive. The only explanation I can think of is that the metadata.yaml file in the control directory was manually modified before the first payu run after removing both project and shortpath, resulting in a different experiment_uuid.
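One way to check for such a mismatch is to compare the experiment_uuid in metadata.yaml with the archive directory name. The sketch below is hypothetical, not a payu command. It assumes that metadata.yaml has a top-level experiment_uuid line and that the archive name ends with the first 8 characters of the UUID, which is inferred from names in this thread such as PI-case2a-expt-11659a9b.

```shell
# Hypothetical helper: print the experiment_uuid recorded in a metadata.yaml file.
experiment_uuid() {
    sed -n 's/^experiment_uuid:[[:space:]]*//p' "$1" | head -n 1
}

# Hypothetical helper: succeed if the archive directory name ends with the first
# 8 characters of the UUID (an ASSUMPTION about the naming, inferred from this thread).
# $1 = path to metadata.yaml, $2 = archive directory
archive_matches_uuid() {
    metadata="$1"
    archive_dir="$2"
    short=$(experiment_uuid "$metadata" | cut -c1-8)
    [ -n "$short" ] || return 1
    case "$(basename "$archive_dir")" in
        *-"$short") return 0 ;;
        *) return 1 ;;
    esac
}
```

Running archive_matches_uuid metadata.yaml against each of the two archive paths would show which archive the current control directory points at.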

I’m sorry that I haven’t figured out why yet so I’ll be very interested to know if you or others run into this issue again. I’m also happy to meet online or in person if you have any questions about running payu.

Thanks for the thorough investigation Jo. We hopefully won’t have to deal with this changing projects issue again. The 40 and 77 year runs should be enough for us to make a decision on this aspect of the land base configuration. We’ll have to pull in all of the ocean side changes after that anyway, so we will start from zero again. Given that you can't reproduce the problem, it is likely my environment. A prime candidate that ties into payu is my .gitconfig. There was an earlier discussion related to that with Aidan that is on here somewhere, although I can't find it right now.