ACCESS-NRI Postgres DB for capturing provenance, usage metrics and COSIMA Cookbook index

ACCESS-NRI is in the process of asking for some NCI virtual machine (VM) resources for running continuous integration/automated testing for model configurations, as well as analysis testing, e.g. for COSIMA Recipes.

After chatting with @micael and @harshula it became obvious that ACCESS-NRI should run a Postgres DB on a VM for a number of uses:

  • Capture metrics: data access and model runs (for reporting and justifying our existence)
  • Provenance: link model runs with outputs
  • Cookbook indexing: high performance DB, supports concurrent write access. Tracking runs could also streamline indexing by triggering indexing

I am interested in feedback on use cases, how this might work, and what sort of resources would be required on the VM.

@dale.roberts, in the other topic about the Grafana database cleanup you talked about being able to use the DB for other tasks like those above. Can random users write safely to a Postgres DB? How is this configured/controlled? Do there need to be safeguards so they don’t accidentally (or maliciously) fill the DB with garbage?

Thanks!

Meant to ping @rbeucher, @rogeredberg and @clairecarouge into this conversation.

Just one thing to add: although I mentioned Postgres during our chat, depending on the access pattern (lots of reads vs lots of writes) and what you care about most (speed vs DB integrity), MySQL could be a better choice than Postgres.

To be clear, MySQL is faster for reads and prioritises speed, while Postgres is faster for writes and places a premium on integrity?

Good to know. I don’t think speed would be such a factor compared to just using something that folks have some experience with (Postgres). But definitely worth being explicit about pros/cons and why a choice is made.

Hi all.

@Aidan it is absolutely possible to configure a database to be read from/written to safely by ‘random’ users. There are four things that need to be configured correctly to do this:

  1. Database roles: each tool should have a database role associated with it that is limited to only the permissions it requires to do its job. This usually means a database owner role, a tool role that can SELECT, INSERT and UPDATE on only the necessary tables, and an _ro (read-only) role which is used for data analysis. Each of these roles should have a strong password set, stored in a password manager. Note that there should be only one database superuser, the ‘postgres’ user, which should only be accessible via sudo on the database host. (A minimal sketch of this role layout follows after this list.)
  2. The pg_hba.conf file: in Postgres, this configures which roles can access the database and from where. The owner role should only be allowed access from the database host. The tool role can have access from wherever the tool will be running, and the _ro role can have access from a wider set of IPs, as you may choose to perform analysis from, e.g., local machines or OOD.
  3. DB host firewall: configure it to match the pg_hba.conf file, plus whatever other services will run on that VM.
  4. Database credentials placed in conf files on the client with appropriate permissions: the database should only ever be written to by a tool. The tool should read the database credentials for its role from a configuration file that can only be read by the user/group that owns the tool, e.g. for hh5 analytics, the conf file would only be readable by hh5_w (see the second sketch after this list). There may be some cases in which the tool has to run as a generic NCI user. In that case, whilst it wouldn’t be possible to completely obscure a database password from users with POSIX permissions, it is possible to apply general cryptographic principles within the tool itself (which would need to be compiled, not interpreted) that make the actual database password impossible for a user to determine without reverse engineering the application.
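
To make point 1 concrete, here is a minimal sketch of that role layout. All role, table and database names are hypothetical, and in practice the passwords would come from a password manager rather than being hard-coded:

```python
# Minimal sketch of the role layout described in point 1, run once as the
# database superuser. Role/table/database names are hypothetical.
import psycopg2

ddl = """
CREATE ROLE metrics_owner LOGIN PASSWORD 'from-password-manager';
CREATE ROLE metrics_tool  LOGIN PASSWORD 'from-password-manager';
CREATE ROLE metrics_ro    LOGIN PASSWORD 'from-password-manager';

CREATE TABLE file_usage (
    path     text        NOT NULL,
    username text        NOT NULL,
    accessed timestamptz NOT NULL DEFAULT now()
);
ALTER TABLE file_usage OWNER TO metrics_owner;

-- The tool role can only modify the tables it needs; the _ro role can only read.
GRANT SELECT, INSERT, UPDATE ON file_usage TO metrics_tool;
GRANT SELECT ON file_usage TO metrics_ro;
"""

with psycopg2.connect("dbname=metrics user=postgres host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(ddl)
```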
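
And a sketch of point 4, where the tool reads its credentials from a conf file that only its owning user/group can read. The file path, section and key names here are all hypothetical:

```python
# Sketch of point 4: the tool reads database credentials from a conf file that
# is restricted by POSIX permissions (e.g. mode 0600 or group-readable only).
# The path, section and key names are hypothetical.
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("/opt/metrics_tool/metrics_tool.conf")  # readable only by the tool's user/group

db = config["database"]
conn = psycopg2.connect(
    host=db["host"],
    dbname=db["dbname"],
    user=db["user"],          # the tool role from point 1
    password=db["password"],
)
```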

NCI has a large Postgres database system accessible from Gadi that follows these principles in order to prevent unauthorised users from gaining access to it. The difference is that NCI has the ability to run setuid binaries on Gadi, whereas normal users don’t, so they don’t have to worry about anything beyond POSIX permissions for their tool configuration files.

Also, for your purposes you’d probably want postgres. If you’re seriously looking at capturing file usage, your database will very much be write-many read-once, and it’ll be hit with many thousands of entries a day.
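
With that write-heavy pattern in mind, batching the inserts is worth doing from the start. A rough sketch of what the ingest side might look like, with the table, column, path and connection names hypothetical (matching the earlier sketch):

```python
# Sketch of batched ingestion for a write-many workload: one round trip per
# batch instead of one INSERT per record. All names are hypothetical.
import psycopg2
from psycopg2.extras import execute_values

records = [
    ("/g/data/xx00/experiment/output000/ocean.nc", "abc123", "2023-01-01T00:00:00+00:00"),
    # ... many thousands of these per day
]

with psycopg2.connect("dbname=metrics user=metrics_tool host=db-host") as conn:
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO file_usage (path, username, accessed) VALUES %s",
            records,
        )
```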

Dale

  • Capture metrics: data access and model runs (for reporting and justifying our existence)
  • Provenance: link model runs with outputs
  • Cookbook indexing: high performance DB, supports concurrent write access. Tracking runs could also streamline indexing by triggering indexing

Re cookbook indexing, would the idea be to provide an interface to the database via Intake (e.g. intake-sql: https://github.com/intake/intake-sql), or to build our own? Or do we not know yet?

The assumption would be to use the Cookbook catalogue with a Postgres back-end, but it is all still up for discussion.
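
For illustration only, and assuming intake-sql’s open_sql driver takes a connection URI plus a SQL expression (I haven’t checked this against the current plugin), the Intake route might look something like this, with the connection string and table name entirely hypothetical:

```python
# Hypothetical sketch of querying a Postgres-backed Cookbook index via intake-sql.
# The open_sql(uri, sql_expr) call is an assumption about the plugin's API;
# the read-only role, host and table name are made up.
import intake

uri = "postgresql://cookbook_ro:password@db-host/cookbook"
source = intake.open_sql(uri, "SELECT * FROM ncfiles WHERE variable = 'temp'")
df = source.read()   # returns a pandas DataFrame of matching index entries
```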

You may be interested in this issue where I’m promoting the idea of a DB per experiment, with a central “index” querying the individual DBs and only storing enough information to find experiments. One of the thought bubbles was supporting intake as a “per experiment” DB.

AFAIK intake catalogues are standalone, right? So having some functionality to index and search them to find the variables you want might be quite useful, and isn’t a currently available feature?

I think a mixed approach with a Postgres database backend and Intake catalogues is the way to go.
That would give us the speed and flexibility we need.
There’s no need for the one-database-per-experiment approach; I think that idea came from the current sqlite approach. How would that work with Postgres?

Note that I asked Varvara to look at replacing sqlite with Postgres in the Cookbook.
The code is tailored to sqlite, and replacing the backend is not entirely trivial. We may want to think about some simple tests before starting the refactoring.
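
If the Cookbook’s database layer goes through SQLAlchemy (which is my understanding), the first-order change is just the engine URL, and the real work is in the sqlite-specific assumptions around it. Here is a sketch of the kind of backend-parametrised smoke test that could guard the refactor; the connection strings are placeholders and the indexing/querying steps are elided because they depend on the Cookbook’s own API:

```python
# Hypothetical smoke test: the same index/query round trip should behave the
# same on either backend. Connection strings are placeholders, not real servers.
import pytest
import sqlalchemy

BACKENDS = [
    "sqlite:///cookbook_test.db",
    "postgresql://cookbook_test:password@localhost/cookbook_test",
]

@pytest.mark.parametrize("db_url", BACKENDS)
def test_index_round_trip(db_url):
    engine = sqlalchemy.create_engine(db_url)
    # 1. index a small, known set of test netCDF files into `engine`
    # 2. query them back and assert the same files/variables are returned,
    #    so sqlite and Postgres backends can be compared like-for-like
    assert engine.dialect.name in ("sqlite", "postgresql")
```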

This is somewhat like the idea of using Kerchunk to unify experiments/datasets, except that in the Kerchunk case, the “individual DBs” are just files that tell you where the individual netcdf chunks that comprise the dataset live, and how they fit together. I looked into the Kerchunk approach very briefly here:

https://github.com/ACCESS-NRI/dev_data_querying/blob/main/cosima_intake/cosima_kerchunk.ipynb

The advantage of such an approach is that you end up with far fewer files to index in your DB, so one could potentially use something very lightweight (e.g. a csv file via Intake-esm). The problem at the moment is that the Kerchunk reference files are just JSON files that end up being very large, but that could change in the not-too-distant future.
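
For anyone who hasn’t tried it, consuming a Kerchunk reference file looks roughly like this; the JSON file name is hypothetical, and the call pattern is the standard fsspec “reference” filesystem one:

```python
# Sketch of opening a dataset via a Kerchunk reference file (file name is hypothetical).
# The JSON maps zarr-style keys onto byte ranges inside the original netCDF files.
import fsspec
import xarray as xr

mapper = fsspec.get_mapper(
    "reference://",
    fo="ocean_daily_kerchunk.json",  # the (currently large) JSON reference file
    remote_protocol="file",          # the underlying netCDF files live on a local filesystem
)
ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})
```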

(I know you know all this already @Aidan, but sharing again in case others are interested :slight_smile: ).