Provenance of ACCESS Models

Provenance

What is it?

At its heart, provenance is being able to determine, with certainty, where some data came from: its origin, and how it has been transformed from that original source.

A concrete example: a researcher produces an analysis from a netCDF data file contained in a data collection on gadi.nci.org.au, e.g. CMIP6. When including this analysis in a paper, the researcher needs to know some information about the data used in the analysis.

Some examples:

  • How was it derived?
  • Where does it come from?
  • If it is the output of a model configuration, what version of the various model codes were used?
  • What are the settings of relevant model parameterisations?
  • What is the source of a relevant data input to the model?

Why is it important?

As the example above shows, data provenance is critical to the scientific process. Analyses often rely on details of how the data was produced and transformed, and what other data products were used to create it. Being able to access this information is crucial. Being able to access it easily and conveniently should be standard. If this process is opaque and difficult, it is a massive productivity drain.

Provenance is also an essential element of reproducibility. To be able to reproduce the data, full provenance information is required. If the data can’t be reproduced, is it really science?

Why are we talking about it now?

Provenance is something that must be built into systems from the very beginning. If the correct metadata, tags and identifiers are not included in the processes that build, configure and run models, then that data will not be available to be used in a provenance system. So this needs to be considered now, at the very early stages of the ACCESS-NRI model development and release process.

ACCESS-NRI has a new staff member, @JamesWilmot, who is one of the lead developers of Open Data Fit, a system that facilitates scientific analysis with full provenance and reproducibility. Because of this existing experience, James will be researching current best practice for provenance in climate modelling, understanding what provenance information is already produced by existing tools, and how we might improve our systems to get closer to the best-practice ideal.

How do you fit in?

We need input from YOU. The community. We need to understand how you currently access provenance data from climate model outputs, or climate model analyses. Where it works, where it doesn’t. Where it works, but badly.

WE NEED YOUR USER STORIES!

See this explanation of what user stories are and why they are essential.

I hope you don’t feel like I’m picking on you @willrhobbs, but I seem to recall you talking about issues with the availability of provenance-related data from CMIP model outputs. If you have any insights, particularly when it comes to who does it well, who does it poorly, and the good and bad of current ACCESS models, that would be great.

@adele-morrison @aekiss @PSpence @JuliaN @rmholmes do you have any stories about frustrations in getting model run information for model outputs? Your own or other people’s.

Thanks Aidan.

I’m not sure if this is a good example of what you’re asking for. However, in the context of the CMIP ACCESS runs I would point to issues such as this one as a case where some more thought might be useful. For my analysis of heat budgets in the CMIP ACCESS runs I’ve been using the raw output data on NCI, which is still in MOM5 format. Others trying to do similar analysis with the post-processed CMIP/ESMF-format data sometimes run into problems. While it’s clear that the post-processing step is essential given the need to compare between different models, it can be problematic in terms of tracking the data flow and finding issues.

Don’t feel picked on at all here, but I’m not sure that I have much to add to this; I have a lot of experience with CMIP5 but haven’t done a whole lot yet with CMIP6, so my insights are a bit out of date. @Paola-CMS is probably the local expert.

The thing I’d really like to have for CMIP is an easy/quick look up for each model’s science characteristics, like a table/database. This would include the model name/version for each component (so taking ACCESS as an example, ocean = MOM5, atmos = UM etc), the approx resolution of each of those components, and ideally the relevant citation for the model and all its components.

Also useful from a science perspective, albeit only of interest to a select few, would be info on where to find the parameter values that were used for each run (even if just in the form of a setup file).

There was nothing like this in CMIP5 (so I kind of created my own on an ad-hoc basis); I’ve seen steps towards this for CMIP6 but I can’t remember where or find it (which I guess is already an issue).

The essential point is that it’s not reasonable for modelling groups to keep this on their own sites/github repos, no matter how detailed the info. Researchers trying to incorporate info from dozens of models need a central data source.
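
Purely to illustrate the kind of record a central database could hold (the schema and most values here are invented; MOM5/UM just follow the ACCESS example above):

```python
# Hypothetical entry in a central model-characteristics database.
# Field names and values are illustrative, not an existing schema.
record = {
    "model": "ACCESS-CM2",  # example model
    "components": {
        "ocean": {"name": "MOM5", "resolution": "1 degree", "citation": "doi:..."},
        "atmosphere": {"name": "UM", "resolution": "N96", "citation": "doi:..."},
    },
    "model_citation": "doi:...",
    # per-run pointer to the parameter values actually used
    "parameter_files": ["ocean/input.nml", "atmosphere/namelists"],
}
```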

I have a note to myself about making a model parameter database, which I think ticks this box. It is only a thought bubble at the moment, but is something I am very keen on.

You have spurred me into action, and I’ve made a topic specifically for that.

That is an excellent example I think. There is currently no resolution in that topic, and clearly it is not something that a user who isn’t familiar with the post-processing can determine.

This is an issue that could be easily resolved by having a metadata standard for post-processed files (that people actually used).

It shouldn’t be a big ask to include the following as global attributes on any post-processed file (a rough sketch of setting these is below):

  • model name and version
  • experiment name and experiment member id
  • file creation date
  • name of the script that was used to create the post-processed file (if the script is on a GitHub repo then bingo! a link to the script)
  • any relevant citations and an email address for a contact person
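
For illustration, here’s a minimal sketch of stamping these attributes onto a file with the netCDF4 Python library. The attribute names follow the proposal above, not an existing standard, and all values are made up:

```python
import datetime
import netCDF4

# Hypothetical provenance attributes following the proposal above;
# every name and value here is illustrative, not an agreed standard.
provenance = {
    "model_name": "ACCESS-ESM1-5",
    "model_version": "1.5",
    "experiment_name": "historical",
    "experiment_member_id": "r1i1p1f1",
    "creation_date": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    # hypothetical URL: link straight to the post-processing script
    "postprocessing_script": "https://github.com/example/postproc/blob/main/regrid.py",
    "contact": "someone@example.edu",
    "references": "doi:10.xxxx/placeholder",
}

# Open in append mode so existing data and attributes are preserved.
with netCDF4.Dataset("postprocessed.nc", "a") as ds:
    for name, value in provenance.items():
        ds.setncattr(name, value)
```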

Would that address the issues @rmholmes, and if so how could we encourage/enforce that standard?

Yes that would be great. If scripts are on github then it’s easy to put in issues/PRs to correct mistakes.

It seems like this is something that needs to be agreed upon by the people in charge of post-processing.

CMIP6 tried to address some of these issues, such as enforcing the sharing of more detailed model information and adding more global attributes to the files, covering experiment_id etc., which are missing from CMIP5. It’s far from perfect and probably doesn’t cover parameter history, as parameters look like they’re often overlooked in model documentation in general. However, it should probably be a starting point for anyone just starting to work on provenance in a climate model context: CMIP6 Participation Guidance for Modelers.

Not a CMIP story, but I make a lot of use of payu’s git-tracking of run history, including manifests with hashes of all executables, inputs and restarts. It’s a great feature and could be leveraged for providing specific provenance information.
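
For anyone unfamiliar with the idea, here’s a rough sketch of what building a manifest of hashed run files could look like. This is not payu’s actual implementation or manifest format, just an illustration of the concept:

```python
import hashlib
import json
from pathlib import Path

def file_hash(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large model inputs don't exhaust memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(run_dir):
    """Map every file under run_dir (executables, inputs, restarts) to its hash."""
    return {
        str(p.relative_to(run_dir)): file_hash(p)
        for p in sorted(Path(run_dir).rglob("*"))
        if p.is_file()
    }

# Committing the manifest alongside the run config means any later change
# to an executable, input or restart is detectable by re-hashing.
manifest = build_manifest("work")  # "work" is a placeholder run directory
Path("manifest.json").write_text(json.dumps(manifest, indent=2))
```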

+1 to @aekiss’s comment here. The only thing that is missing is having a copy of these config git histories somehow automatically included along with the output data.

A clone of the git history is added to the experiment output directory if you use sync_data.sh to copy your data over.

Not really a user story, but here is an example of how this kind of issue has been handled in a different scientific area that might inspire you and/or provide some interesting ideas. There are several frameworks developed in the materials science community to run workflows and keep track of provenance. One that I find particularly well done is AiiDA, which stores provenance information for all the inputs, outputs, workflows and codes as a directed acyclic graph (the same type of graph used in git). Other tools then take advantage of this information in different ways. For example, this website allows users to upload their database of calculations and make them public. You can then browse the different inputs, outputs, workflows, etc., including the provenance graph, here.
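
To make the graph idea concrete, here’s a toy sketch of provenance stored as a directed acyclic graph, where the full provenance of a result is simply everything upstream of it. This assumes the networkx Python library and is not AiiDA’s actual data model; all node names are invented:

```python
import networkx as nx

# Toy provenance DAG: edges run from inputs into the step that consumed
# them, and from each step to whatever it produced.
g = nx.DiGraph()
g.add_edge("forcing.nc", "model_run")    # data -> calculation
g.add_edge("namelist.nml", "model_run")  # config -> calculation
g.add_edge("model_v1.2", "model_run")    # code version -> calculation
g.add_edge("model_run", "output.nc")     # calculation -> data
g.add_edge("output.nc", "heat_budget")   # downstream analysis
g.add_edge("heat_budget", "figure_3.png")

assert nx.is_directed_acyclic_graph(g)

# Full provenance of the figure = all of its ancestors in the graph.
print(sorted(nx.ancestors(g, "figure_3.png")))
```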
