Here are a couple of relevant papers discussing reproducibility and testing. The first defines four categories of reproducibility, along with statistical tests to automate categorising the non-bit-for-bit cases:
Changes, additions and updates to CICE fall into four categories: (I) BFB [bit-for-bit] with no further assessment required; (II) non-BFB but unlikely to be climate changing; (III) non-BFB and climate changing; and (IV) a new model configuration option requiring separate scientific assessment. This section describes the automated methods used to flag the first three categories.
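As a rough illustration (not the paper’s actual procedure, which is more careful about things like autocorrelation), an automated check for the non-BFB categories can boil down to: if the outputs are identical it’s Category I; otherwise run a statistical test on the paired differences and flag Category III only if the difference is significant. The choice of field and the simple one-sample t-test below are assumptions for the sketch.

```python
# Minimal sketch of automated BFB/non-BFB categorisation. The monitored field
# (e.g. annual-mean SST as a (time, y, x) array) and the plain t-test are
# illustrative assumptions, not the procedure from the paper.
import numpy as np
from scipy import stats

def categorise(baseline, candidate, alpha=0.05):
    """Return a rough Category I/II/III flag for a candidate run vs. a baseline."""
    if np.array_equal(baseline, candidate):
        return "I: bit-for-bit"
    # Paired differences of the (unweighted, for brevity) spatial mean per time step.
    diffs = candidate.mean(axis=(1, 2)) - baseline.mean(axis=(1, 2))
    t_stat, p_value = stats.ttest_1samp(diffs, popmean=0.0)
    if p_value < alpha:
        return f"III: non-BFB and statistically significant (p={p_value:.3g})"
    return f"II: non-BFB but not distinguishable from noise (p={p_value:.3g})"
```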
I suppose most of the preexisting configurations around would be using payu, which is why I suggested the config.yaml (it also gives an idea of resource requirements). But this probably ends up being a question for the technical implementation of actually running the tests. There would likely be a little modification required to produce a testing-suitable run anyway. I think a lower barrier to entry by not requiring payu is fine?
I know that GFDL runs their tests through a pipeline on an internal GitLab instance. I wouldn’t be surprised if there is a range of solutions out there, from manual running, to Makefiles handed down from a supreme being (some of the developers use these for their own tests), to modern pipelines. I can try to dig around for a bit more info there.
I think the control inputs are often version controlled. GFDL has MOM6-examples, ESMG has an equivalent with their configurations, etc. There are probably private configurations, but they’d be within version control on the inside of the firewall. As for other (binary) inputs, I’m not sure! That is probably an issue we’ll have to think about too, particularly for full-chain reproducibility and provenance.
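For the binary inputs, one low-tech option is to keep the large files themselves outside version control but commit a small manifest of their checksums alongside the configuration, so at least the provenance travels with the config. A minimal sketch, where the directory layout and manifest format are my assumptions rather than anything payu or MOM6-examples currently does:

```python
# Sketch: record provenance of binary inputs as a small, committable manifest
# of checksums. Paths and the manifest filename are illustrative only.
import hashlib
import json
from pathlib import Path

def hash_file(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large inputs needn't fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(input_dir, manifest_path="input_manifest.json"):
    """Walk input_dir and write {relative_path: {sha256, bytes}} to manifest_path."""
    input_dir = Path(input_dir)
    manifest = {
        str(p.relative_to(input_dir)): {"sha256": hash_file(p), "bytes": p.stat().st_size}
        for p in sorted(input_dir.rglob("*")) if p.is_file()
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Checking the manifest against the actual inputs at run time would then catch silent changes to forcing or grid files, which is most of what full-chain reproducibility needs from the input side.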
Sure. But we might want to spin out a separate discussion on the technical implementation (running tests, validating tests, how to organise configurations, etc.). Ideally it can all be authoritative, so we don’t get a desync between what we’re testing and what’s actually being run. The testing is also only as valuable as the tests’ coverage of the codepaths in the model that people are actually interested in.
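On keeping it authoritative: one pattern is to have the test suite discover and parametrise over the same configuration directories that are actually run, so there is a single source of truth. A rough pytest sketch, where the `configurations/` layout and the required keys are assumed conventions:

```python
# Sketch: parametrise tests over the configuration directories that are actually
# run, so the tested set can't drift from the real configurations. The
# configurations/ layout is an assumed convention, not an existing one.
from pathlib import Path
import pytest
import yaml

CONFIG_ROOT = Path("configurations")
CONFIGS = (
    sorted(p for p in CONFIG_ROOT.iterdir() if (p / "config.yaml").is_file())
    if CONFIG_ROOT.is_dir() else []
)

@pytest.mark.parametrize("config_dir", CONFIGS, ids=lambda p: p.name)
def test_config_is_runnable(config_dir):
    """Cheap sanity check: every runnable configuration at least parses and names a model."""
    with open(config_dir / "config.yaml") as f:
        config = yaml.safe_load(f)
    assert isinstance(config, dict) and "model" in config
```

The actual reproducibility checks (checksums, statistical tests) could hang off the same parametrisation, so adding a configuration automatically adds it to the test matrix.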
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
I was worried I had derailed your topic @angus-g, so I’ve moved the discussion to this topic. Hope you don’t mind being scooped up and moved too, @aekiss, but your post seemed to fit here quite well. I can move it back if you want me to.
Aidan
(Aidan Heerdegen, ACCESS-NRI Release Team Lead)
Sorry, I didn’t notice the config.yaml reference.
Yes, I don’t think there is a problem with a lower barrier to entry, but if payu is the preferred way to go (and I think it is), then non-payu configs will have to be converted to run with payu in any case.
I’m wondering aloud about a few things:
@MartinDix was enquiring a while ago about ways to version inputs for the rose+cylc experiments. It got me thinking about IPFS.
Are there modifications we might usefully make to payu to facilitate running test cases programmatically like this? The ACCESS-OM2 testing writes to config.yaml files; we could make it more seamless than that, I reckon.
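For reference, the "writes to config.yaml" approach looks roughly like the sketch below; which keys need overriding is experiment-specific, and the ones shown are only examples.

```python
# Sketch of the "writes to config.yaml" approach: load the experiment's payu
# config, override a few settings so it suits a short test run, write it back.
# The overridden keys here are illustrative examples only.
from pathlib import Path
import yaml

def prepare_test_config(config_dir, overrides=None):
    config_path = Path(config_dir) / "config.yaml"
    config = yaml.safe_load(config_path.read_text())
    config.update({"jobname": "repro-test", "walltime": "01:00:00", **(overrides or {})})
    config_path.write_text(yaml.safe_dump(config, default_flow_style=False))
    return config
```

More seamless might be payu accepting overrides at submission time (e.g. from a separate test profile), so the committed config.yaml never has to be mutated in place.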
True. The Intel compiler can also generate codecov data, which might be worth thinking about as a way to quantify coverage.