Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
1
How do I start a new perturbation experiment?
User story
As a new user I want to be able to start a new perturbation experiment from an existing ACCESS-OM2 control run.
How do I identify which control run to choose?
Which experiment do I clone?
How do I know where to branch my experiment?
How do I determine which restart files to use for my chosen branch point?
Where can I find those restart files on disk?
How do I configure payu to use the correct restart files?
How do I know I’ve done things correctly?
Background
At the COSIMA meeting discussing the scope of an ACCESS-NRI release of ACCESS-OM2, one use case identified as useful for ACCESS-NRI to assist with was getting new users up and running with new experiments branched from existing control runs.
Current workflow:
Talk to supervisor
Ask data owner
Ask data owner
Ask data owner: they will tell the user to consult the git log of the experiment repository, which has the run number, and to check the restart manifest for the restart file directory
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
2
I have made this a wiki, with the intention that it should be edited to better reflect the experience of a new user (which I am not). So feel free to dive in and change as required.
I encourage anyone to create more user stories, not just ones related to COSIMA; just put them in the correct category. User stories are a great way to capture workflows and how they might be blocked or inefficient, so that we can improve them.
ACCESS-NRI would like to use user-stories as qualitative measures of impact and improvement for the community. In many cases what we might do can’t be well measured in metrics, but the community will “just know” it is a lot easier than it used to be. User stories are a way to capture this improvement, by documenting the improvement in the workflow for a particular user story.
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
3
@adele157 and @rmholmes Does this user story accurately reflect the “use case” discussed in that COSIMA meeting?
Looks good to me Aidan. I’ve now added one additional item to the first list:
How do I know I’ve done things correctly (i.e. the only difference between my new simulation and the previous control simulation is what I want)? [side note: Are our simulations bitwise reproducible? I can’t remember.]
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
6
Yes, they should be bitwise reproducible in a deterministic sense: the same model configuration with the same inputs will produce the same outputs. Some of the models are known not to be bitwise reproducible if the processor layout is changed, for example, but that is a more stringent reproducibility criterion.
The issue @adele157 linked to is weird, and I would say anomalous, but we just don’t know why.
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
7
I think we should add an initial step for every forked experiment: fork the experiment, run it without changes, and confirm the outputs are unchanged. That confirms that, when changes are made, the control run is a valid comparison.
Yes I agree. It’s pretty common to run the control simulation forward anyway as often the diagnostics that we want aren’t available.
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
9
That’s an important piece of information too. Check that the required diagnostics are present in the control. If they’re not, and you have to re-run it, you need to factor in the compute and storage required.
Yes, but it depends on how long your perturbation runs are. If you’re just running one perturbation then re-running the control doubles the cost (with more perturbations it becomes relatively cheaper).
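To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes the control re-run is the same length (and so the same cost) as each perturbation run, which may not hold for your experiment; `relative_cost` is a hypothetical helper, not part of payu.

```python
def relative_cost(n_perturbations, rerun_control=True):
    """Total cost relative to running the perturbations alone.

    Assumes every run (control or perturbation) costs the same.
    """
    if n_perturbations < 1:
        raise ValueError("need at least one perturbation run")
    n_runs = n_perturbations + (1 if rerun_control else 0)
    return n_runs / n_perturbations


for n in (1, 2, 5):
    print(n, relative_cost(n))
# 1 2.0
# 2 1.5
# 5 1.2
```

So the overhead of re-running the control falls off as 1/N for N perturbations of equal length.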
Disconcertingly, we now have a 2nd example of a non-reproducible run. It’s unclear how often this occurs, as we don’t discover it unless we re-run and check the restarts.
The lesson is: always check the md5 hashes in the manifests in a re-run - they should match those from the original run (the only exception being ocean_barotropic.res.nc.* which always differ - but this doesn’t affect any other restarts so is presumably benign, maybe a datestamp in the file or something).
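A minimal sketch of that check, in case it is useful. It assumes you have already pulled the md5 hashes out of the two restart manifests into plain dictionaries mapping file path to md5 string; how you load the manifests is left out, and the function name, paths and hash values below are made up for illustration.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Restart files expected to differ between otherwise identical runs.
IGNORED_PATTERNS = ("ocean_barotropic.res.nc*",)


def md5_mismatches(original, rerun, ignored=IGNORED_PATTERNS):
    """Return sorted paths whose md5 differs or that appear in only one run.

    Both arguments map a restart file path to its md5 string (an assumed,
    simplified view of the manifest contents).
    """
    mismatches = []
    for path in sorted(set(original) | set(rerun)):
        name = PurePosixPath(path).name
        if any(fnmatch(name, pattern) for pattern in ignored):
            continue  # known to differ, so skip it
        if original.get(path) != rerun.get(path):
            mismatches.append(path)
    return mismatches


# Made-up example data: only the barotropic restart differs.
original = {
    "work/ocean/INPUT/ocean_temp_salt.res.nc": "aaa111",
    "work/ocean/INPUT/ocean_barotropic.res.nc.0000": "bbb222",
}
rerun = {
    "work/ocean/INPUT/ocean_temp_salt.res.nc": "aaa111",
    "work/ocean/INPUT/ocean_barotropic.res.nc.0000": "ccc333",
}
print(md5_mismatches(original, rerun))  # prints []
```

An empty list means the re-run reproduced the original restarts, apart from the ignored files. Anything else in the list is worth investigating before starting the perturbation.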