ACCESS-NRI Postgres DB for capturing provenance, usage metrics and COSIMA Cookbook index

ACCESS-NRI is in the process of asking for some NCI virtual machine (VM) resources for running continuous integration/automated testing for model configurations, as well as analysis testing, e.g. for COSIMA Recipes.

After chatting with @micael and @harshula it became obvious that ACCESS-NRI should run a Postgres DB on a VM for a number of uses:

  • Capture metrics: data access and model runs (for reporting and justifying our existence)
  • Provenance: link model runs with outputs
  • Cookbook indexing: high performance DB, supports concurrent write access. Tracking runs could also streamline indexing by triggering indexing

I am interested in feedback on use cases, how this might work, and what sort of resources would be required on the VM.

@dale.roberts, in the other topic about the Grafana database cleanup you talked about being able to use the DB for other tasks like those above. Can random users write safely to a Postgres DB? How is this configured/controlled? Do there need to be safeguards so they don’t accidentally (or maliciously) fill the DB with garbage?

Thanks!

Meant to ping @rbeucher, @rogeredberg and @clairecarouge into this conversation.

Just one thing to add: although I mentioned Postgres during our chat, depending on the access pattern (lots of reads vs lots of writes) and what you care about most (speed vs DB integrity), MySQL could be a better choice than Postgres.

To be clear, MySQL is faster for reads and prioritises speed, while Postgres is faster for writes and places a premium on integrity?

Good to know. I don’t think speed would be such a factor compared to just using something that folks have some experience with (Postgres). But definitely worth being explicit about pros/cons and why a choice is made.

Hi all.

@Aidan it is absolutely possible to configure a database to be read from/written to safely by ‘random’ users. There are four things that need to be configured correctly to do this:

  1. Database roles: each tool should have a database role associated with it that is limited to only the permissions it requires to do its job. This usually means a database owner role, a tool role that can SELECT, INSERT and UPDATE on only the necessary tables, and an _ro (read-only) role which is used for data analysis. Each of these roles should have a strong password set, stored in a password manager. Note that there should be only one database superuser, the ‘postgres’ user, which should only be accessible via sudo on the database host. (A minimal sketch of this role layout follows after this list.)
  2. The pg_hba.conf file: in Postgres, this configures which roles can access the database and from where. The owner role should only be allowed access from the database host. The tool role can have access from wherever the tool will be running, and the _ro role can have access from a wider set of IPs, as you may choose to perform analysis from, e.g., local machines or OOD.
  3. DB host firewall: configure it to match the pg_hba.conf file, plus whatever other services will run on that VM.
  4. Database credentials placed in conf files on the client with appropriate permissions: the database should only ever be written to by a tool. The tool should read the database credentials for its role from a configuration file that can only be read by the user/group that owns the tool, e.g. for hh5 analytics, the conf file would only be readable by hh5_w (see the second sketch after this list). There may be some cases in which the tool has to run as a generic NCI user. In that case, whilst it wouldn’t be possible to completely obscure a database password from users with POSIX permissions, it is possible to apply general cryptographic principles within the tool itself (which would need to be compiled, not interpreted) that make the actual database password impossible for a user to determine without reverse engineering the application.
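
To make point 1 concrete, here is a minimal sketch of that role layout. All role, table and database names are hypothetical, and in practice the passwords would come from a password manager rather than being hard-coded:

```python
# Minimal sketch of the role layout described in point 1, run once as the
# database superuser. Role/table/database names are hypothetical.
import psycopg2

ddl = """
CREATE ROLE metrics_owner LOGIN PASSWORD 'from-password-manager';
CREATE ROLE metrics_tool  LOGIN PASSWORD 'from-password-manager';
CREATE ROLE metrics_ro    LOGIN PASSWORD 'from-password-manager';

CREATE TABLE file_usage (
    path     text        NOT NULL,
    username text        NOT NULL,
    accessed timestamptz NOT NULL DEFAULT now()
);
ALTER TABLE file_usage OWNER TO metrics_owner;

-- The tool role can only modify the tables it needs; the _ro role can only read.
GRANT SELECT, INSERT, UPDATE ON file_usage TO metrics_tool;
GRANT SELECT ON file_usage TO metrics_ro;
"""

with psycopg2.connect("dbname=metrics user=postgres host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(ddl)
```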
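
And a sketch of point 4, where the tool reads its credentials from a conf file that only its owning user/group can read. The file path, section and key names here are all hypothetical:

```python
# Sketch of point 4: the tool reads database credentials from a conf file that
# is restricted by POSIX permissions (e.g. mode 0600 or group-readable only).
# The path, section and key names are hypothetical.
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("/opt/metrics_tool/metrics_tool.conf")  # readable only by the tool's user/group

db = config["database"]
conn = psycopg2.connect(
    host=db["host"],
    dbname=db["dbname"],
    user=db["user"],          # the tool role from point 1
    password=db["password"],
)
```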

NCI has a large Postgres database system accessible from Gadi that follows these principles in order to prevent unauthorised users from gaining access to it. The difference is that NCI has the ability to run setuid binaries on Gadi, whereas normal users don’t, so they don’t have to worry about anything beyond POSIX permissions for their tool configuration files.

Also, for your purposes you’d probably want postgres. If you’re seriously looking at capturing file usage, your database will very much be write-many read-once, and it’ll be hit with many thousands of entries a day.
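
With that write-heavy pattern in mind, batching the inserts is worth doing from the start. A rough sketch of what the ingest side might look like, with the table, column, path and connection names hypothetical (matching the earlier sketch):

```python
# Sketch of batched ingestion for a write-many workload: one round trip per
# batch instead of one INSERT per record. All names are hypothetical.
import psycopg2
from psycopg2.extras import execute_values

records = [
    ("/g/data/xx00/experiment/output000/ocean.nc", "abc123", "2023-01-01T00:00:00+00:00"),
    # ... many thousands of these per day
]

with psycopg2.connect("dbname=metrics user=metrics_tool host=db-host") as conn:
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO file_usage (path, username, accessed) VALUES %s",
            records,
        )
```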

Dale

  • Capture metrics: data access and model runs (for reporting and justifying our existence)
  • Provenance: link model runs with outputs
  • Cookbook indexing: high performance DB, supports concurrent write access. Tracking runs could also streamline indexing by triggering indexing

Re cookbook indexing, would the idea be to provide an interface to the database via Intake (e.g. intake-sql: https://github.com/intake/intake-sql), or to build our own? Or do we not know yet?

The assumption would be to use the Cookbook catalogue with a Postgres back-end, but it is all still up for discussion.
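
For illustration only, and assuming intake-sql’s open_sql driver takes a connection URI plus a SQL expression (I haven’t checked this against the current plugin), the Intake route might look something like this, with the connection string and table name entirely hypothetical:

```python
# Hypothetical sketch of querying a Postgres-backed Cookbook index via intake-sql.
# The open_sql(uri, sql_expr) call is an assumption about the plugin's API;
# the read-only role, host and table name are made up.
import intake

uri = "postgresql://cookbook_ro:password@db-host/cookbook"
source = intake.open_sql(uri, "SELECT * FROM ncfiles WHERE variable = 'temp'")
df = source.read()   # returns a pandas DataFrame of matching index entries
```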

You may be interested in this issue where I’m promoting the idea of a DB per experiment, with a central “index” querying the individual DBs and only storing enough information to find experiments. One of the thought bubbles was supporting intake as a “per experiment” DB.

AFAIK intake catalogues are standalone, right? So having some functionality to index and search them to find the variables you want might be quite useful, and isn’t a currently available feature?

I think a mixed approach with a Postgres database backend and Intake catalogues is the way to go.
That would give us the speed and flexibility we need.
There’s no need for the one-database-per-experiment approach; I think that idea came from the current sqlite approach. How would that work with Postgres?

Note that I asked Varvara to look at replacing sqlite with Postgres in the Cookbook.
The code is tailored to sqlite, and replacing the backend is not entirely trivial. We may want to think about some simple tests before starting the refactoring.
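
If the Cookbook’s database layer goes through SQLAlchemy (which is my understanding), the first-order change is just the engine URL, and the real work is in the sqlite-specific assumptions around it. Here is a sketch of the kind of backend-parametrised smoke test that could guard the refactor; the connection strings are placeholders and the indexing/querying steps are elided because they depend on the Cookbook’s own API:

```python
# Hypothetical smoke test: the same index/query round trip should behave the
# same on either backend. Connection strings are placeholders, not real servers.
import pytest
import sqlalchemy

BACKENDS = [
    "sqlite:///cookbook_test.db",
    "postgresql://cookbook_test:password@localhost/cookbook_test",
]

@pytest.mark.parametrize("db_url", BACKENDS)
def test_index_round_trip(db_url):
    engine = sqlalchemy.create_engine(db_url)
    # 1. index a small, known set of test netCDF files into `engine`
    # 2. query them back and assert the same files/variables are returned,
    #    so sqlite and Postgres backends can be compared like-for-like
    assert engine.dialect.name in ("sqlite", "postgresql")
```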

This is somewhat like the idea of using Kerchunk to unify experiments/datasets, except that in the Kerchunk case, the “individual DBs” are just files that tell you where the individual netcdf chunks that comprise the dataset live, and how they fit together. I looked into the Kerchunk approach very briefly here:

https://github.com/ACCESS-NRI/dev_data_querying/blob/main/cosima_intake/cosima_kerchunk.ipynb

The advantage of such an approach is that you end up with far fewer files to index in your DB, so one could potentially use something very lightweight (e.g. a csv file via Intake-esm). The problem at the moment is that the Kerchunk reference files are just JSON files that end up being very large, but that could change in the not-too-distant future.
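
For anyone who hasn’t tried it, consuming a Kerchunk reference file looks roughly like this; the JSON file name is hypothetical, and the call pattern is the standard fsspec “reference” filesystem one:

```python
# Sketch of opening a dataset via a Kerchunk reference file (file name is hypothetical).
# The JSON maps zarr-style keys onto byte ranges inside the original netCDF files.
import fsspec
import xarray as xr

mapper = fsspec.get_mapper(
    "reference://",
    fo="ocean_daily_kerchunk.json",  # the (currently large) JSON reference file
    remote_protocol="file",          # the underlying netCDF files live on a local filesystem
)
ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})
```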

(I know you know all this already @Aidan, but sharing again in case others are interested :slight_smile: ).