Provenance
What is it?
At its heart, provenance is being able to determine, with certainty, where some data came from: its origin and how it has been transformed from that original source.
A concrete example: a researcher produces an analysis from a netCDF data file contained in a data collection on gadi.nci.org.au, e.g. CMIP6. When including this analysis in a paper the researcher needs to know some information about the data used in the analysis.
Some examples:
- How was it derived?
- Where does it come from?
- If it is the output of a model configuration, which versions of the various model codes were used?
- What are the settings of relevant model parameterisations?
- What is the source of a relevant data input to the model?
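To make these questions concrete, here is a minimal sketch of how the answers might be captured alongside a data file. It is only an illustration: the `ProvenanceRecord` class, its field names and the placeholder values are all hypothetical, and do not describe any existing ACCESS-NRI schema or tool.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceRecord:
    """Hypothetical record; one field per question in the list above."""

    data_file: str    # Where does it come from?
    derived_by: str   # How was it derived?
    # Which versions of the various model codes were used?
    model_code_versions: dict[str, str] = field(default_factory=dict)
    # What are the settings of relevant model parameterisations?
    parameter_settings: dict[str, str] = field(default_factory=dict)
    # What is the source of each relevant data input to the model?
    input_sources: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of fields left empty (questions still unanswered)."""
        return [name for name, value in asdict(self).items() if not value]


# Placeholder values only; nothing here refers to a real file or model.
record = ProvenanceRecord(
    data_file="<path to netCDF file>",
    derived_by="<description of how the file was produced>",
)

print(json.dumps(asdict(record), indent=2))
print("Unanswered:", record.missing_fields())
```

A record like this makes the gaps visible: any field reported by `missing_fields()` is a question the researcher cannot currently answer from the data alone.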
Why is it important?
As the example above shows, data provenance is critical to the scientific process. Analyses often rely on details of how the data was produced and transformed, and what other data products might have been used to create it. Being able to access this information is crucial. Being able to access it easily and conveniently should be standard. If this process is opaque and difficult, it is a massive productivity drain.
Provenance is also an essential element of reproducibility. To be able to reproduce the data, full provenance information is required. If the data can’t be reproduced, is it really science?
Why are we talking about it now?
Provenance is something that must be built into systems from the very beginning. If the correct metadata, tags and identifiers are not included in the process that builds, configures and runs models, then that information will not be available to be used in a provenance system. So this needs to be considered now, at the very early stages of the ACCESS-NRI model development and release process.
ACCESS-NRI has a new staff member, @JamesWilmot, who is one of the lead developers of Open Data Fit, a system that facilitates scientific analysis with full provenance and reproducibility. Because of his existing experience, James will be researching current best practice for provenance in climate modelling, understanding what provenance information is already produced by the existing tools, and how we might improve our systems to get closer to the best practice ideal.
How do you fit in?
We need input from YOU. The community. We need to understand how you currently access provenance data from climate model outputs, or climate model analyses. Where it works, where it doesn’t. Where it works, but badly.
WE NEED YOUR USER STORIES!
See this explanation of what user stories are and how they are essential.