Versioning Model Inputs


Earth system models usually rely on data from many input files, ranging from grid information and layout to chlorophyll concentrations and iceberg climatologies.

Historically the provenance of many of these files has not been available: in many cases there is no information on how a file was created, and no way to recreate it from the original source data.

When multiple copies of an input file exist it can also be difficult to find out why they differ.

Future work

We need some way to store better input file provenance, including an ability to version these files with meaningful metadata describing why a new version was required. This should also include information on how the updated version differs from the previous one, and a method to reproduce the modifications so the updated version can be regenerated from the previous version.
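As a very rough sketch of what such a provenance record might contain (every field name here is hypothetical, not an existing standard, and the file and script names are made up):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def make_provenance_record(path, version, parent_version, reason, recipe):
    """Build a minimal provenance record for one input file.

    All field names are illustrative -- this is not an existing convention.
    """
    return {
        "file": path,
        "version": version,
        "parent_version": parent_version,  # the version this one was derived from
        "reason": reason,                  # why a new version was required
        "recipe": recipe,                  # command/script to regenerate the file
        "md5": hashlib.md5(Path(path).read_bytes()).hexdigest(),
        "created": datetime.now(timezone.utc).isoformat(),
    }

# Example with a throwaway file standing in for a real input:
Path("bathymetry.nc").write_bytes(b"stand-in for netCDF bytes")
record = make_provenance_record(
    "bathymetry.nc",
    version="1.1",
    parent_version="1.0",
    reason="Fix an issue discovered in the bathymetry",
    recipe="python make_bathymetry.py",  # hypothetical script name
)
print(json.dumps(record, indent=2))
```

Something this simple, stored alongside each input file, would already answer "why does this version exist and how do I remake it".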

Related topics and issues

Call to action

If you know of any existing tools or approaches that would help address this issue please reply and give some details: links to software, personal experience with other tools, or dob in someone you know who might be able to help.


TileDB is an open source project implementing an n-dimensional array database that claims to support versioning, which is a real pain for binary data.

This is an interesting presentation about using TileDB in the AusSeabed project from a Geoscience Australia technical team lead.

They were very impressed with TileDB.

@MartinDix asked if it was possible to have a way to refer to/point to an input file that was universal, and not tied to a specific HPC or filesystem.

The answer is yes, IPFS, but I don’t know how feasible it would be to use in a climate model context. Cool though.
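The core idea behind IPFS is content addressing: a file is identified by a hash of its bytes, so the same content gets the same address on any machine or filesystem. A toy illustration of the principle with a local store (this is just `hashlib`, not IPFS itself, and the store directory name is made up):

```python
import hashlib
from pathlib import Path

STORE = Path("cas_store")  # hypothetical local content-addressed store

def put(data: bytes) -> str:
    """Store bytes under the hash of their content; return the address."""
    address = hashlib.sha256(data).hexdigest()
    STORE.mkdir(exist_ok=True)
    (STORE / address).write_bytes(data)
    return address

def get(address: str) -> bytes:
    """Retrieve bytes by their content address."""
    return (STORE / address).read_bytes()

# The same content always yields the same address, regardless of
# which HPC or filesystem happens to hold a copy.
addr = put(b"iceberg climatology v2")
assert get(addr) == b"iceberg climatology v2"
print(addr)
```

The address is universal in the sense @MartinDix asked about: it names the content, not a path on a particular system.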

What is the frequency of changes for a specific input file?
A DOI (or whatever other identifier you prefer) can be attached to a dataset, and it would link to complete information on the specific file.
Do you need a different method?


Good question. The infuriatingly vague answer is “it depends”. The grid data that motivated this initially changes on the scale of years. There was a series of updates to the 0.25° ocean bathymetry to address issues that had been discovered, and there might have been a few changes over the space of a year.

If there was an input derived from a regularly updated product then it might be monthly, or quarterly.

Frequencies also vary a lot over the development cycle of a model. The ACCESS-OM3 team are at the initial stages of model configuration development, so the changes could well be very frequent, perhaps weekly at some stages when there is rapid iteration on an aspect of a model.

It’s a good suggestion, and yes, I think it is probably a good solution. It is probably something that should be done anyway, even if there are other methods, like TileDB, that would be useful for an operational implementation (apologies for the appalling jargon).

right now I’m feeling like a data agony aunt always ready for all your data woes


I think there are two (similar) approaches we could take here.

Option 1. Treat input files like other configuration files

  • Efforts to version control how input files are made are relatively minor (i.e. they are descriptive only, to capture the process and source files; scripts only have cursory code review).
  • The main version control is the input files themselves, in Git Large File Storage (LFS) or similar.
  • Every so often there is a formal release (probably done in conjunction with a config release, could be as simple as a tag on github).
  • Model configs could point to a Git hash?
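For example, pinning inputs in a config might look something like this (an entirely hypothetical fragment, not existing payu or config syntax):

```yaml
# Hypothetical config fragment -- invented field names and paths
inputs:
  bathymetry:
    lfs_repo: <url-of-inputs-repository>
    commit: 3f2a9c1        # Git hash pinning the exact file version
    path: ocean/topog.nc   # illustrative path within that repository
```

The Git hash uniquely identifies the state of the inputs repository, so the config is reproducible even as the input files continue to evolve.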

This option suits where we are currently at a bit better. Input files have unclear provenance and have sometimes been generated from other data files, although their history and versioning is not that clear. This option allows manipulation of input files in the current manner: developers and model users change and use files with their preferred scripting/analysis tools.

Option 2. Treat input files like data

  • Create input files through software under version control, fully documented and ensuring unique identification of any data files and software versions used.
  • The input files are version controlled through releases (in the same way as data) but development copies between releases are not kept.
  • Model configs then need to specify a specific version of the input files, which needs to be checked through a binary/MD5 hash. (I don’t quite understand what payu does here; I think it records the hash but doesn’t enforce that it is unchanged.)
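The hash check in the last point could be as simple as the sketch below (the manifest format here is invented, and payu’s actual behaviour may well differ):

```python
import hashlib

def md5sum(path, chunk_size=65536):
    """MD5 of a file, read in chunks so large inputs don't exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_inputs(manifest):
    """Check each input file against its recorded hash.

    `manifest` maps file path -> expected MD5 (an invented format,
    not payu's actual manifest).  Returns a dict of mismatches.
    """
    mismatches = {}
    for path, expected in manifest.items():
        actual = md5sum(path)
        if actual != expected:
            mismatches[path] = (expected, actual)
    return mismatches

# Example: write a stand-in file, record its hash, then verify.
with open("topog.nc", "wb") as f:
    f.write(b"grid data")
manifest = {"topog.nc": md5sum("topog.nc")}
print(verify_inputs(manifest))  # -> {} when every file matches
```

Enforcing (rather than just recording) the hash would then be a matter of refusing to run when `verify_inputs` returns a non-empty dict.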

This option is more sustainable long term and gives a better record of provenance. It will be hard to transition our current input files (eventually they would all need to be remade). It forces us all to make input files with the same/consistent toolset (i.e. another python package :slight_smile: ) and makes it more likely we will get the CF metadata correct and consistent.

Comparing the input files is always going to be hard. In the OM case they are netCDF files, so they can be compared using normal netCDF tools (although there could be versions of the files that are functionally the same but formatted differently in netCDF).

I haven’t investigated GitHub LFS in detail. It would be most useful if we could use it in a way where we had one cache of the files on gadi, and model configs could just point to the right versions of the files (i.e. without the user messing with file paths, or hashes!). This would be like a cross between a locally-hosted cloud source for the input files (e.g. a netCDF DAP server?) and an on-premises Git LFS server.
