Grafana database cleanup

After poking around the grafana setup over the last few days, I have some thoughts about how we can simplify the stats capture process. It requires a few changes to the postgres setup that need to happen all at once, but it shouldn’t disrupt the gathering of data.

The basic setup I have in mind is as follows:

- Create a role within the postgres database to take ownership of the tables and views.
- Create a role within the postgres database with SELECT, INSERT, and USAGE privileges on the tables used by the stats capture process. Use pg_hba.conf to allow this user to connect only from a specific external IP range (either accessdev's or Gadi's external range, depending on what the database sees as the source of the connections).
- Add a reverse proxy between accessdev and port 5432 on jenkins7 to the httpd configuration, and update the firewall on jenkins7 to only allow connections on that port from accessdev.

This negates the need to establish an ssh tunnel from Gadi when capturing stats, meaning a private key does not need to be generated for the service user on Gadi.
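
For concreteness, the grants and pg_hba.conf entry for the capture role might look something like this sketch; the role, database, and schema names are placeholders, not the real values:

```sql
-- Hypothetical capture role: may connect, read existing rows, append
-- new ones, and use the sequences behind any serial primary keys.
CREATE ROLE stats_capture LOGIN PASSWORD '...';
GRANT SELECT, INSERT ON ALL TABLES IN SCHEMA public TO stats_capture;
GRANT USAGE ON ALL SEQUENCES IN SCHEMA public TO stats_capture;
```

with a matching pg_hba.conf line along the lines of `host grafana stats_capture 192.0.2.0/24 md5`, where 192.0.2.0/24 stands in for whichever external range the connections actually appear to come from.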

I understand that exposing the database to Gadi may be a concern, but this is fairly standard: NCI uses this model with its internal database servers. As long as the firewall and database are configured correctly, there are no additional security concerns. This would require modifying the puppet config for accessdev and the ansible config for jenkins7. Does that still exist?

Once that is complete, the stats capture scripts can be simplified, as they will no longer need to perform the ssh tunnelling. It would also be advisable to modify the stats capture processes to no longer parse human-readable output (like nci_account_json already does). In the case of nci-files-report, use the WSGI service used by the client utility. In the case of lquota, use the lustre API packaged with lquota (/opt/nci/lquota) instead of calling lquota directly.
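
To illustrate, a JSON/API-backed parser boils down to something like this minimal sketch (the function name, field names, and JSON shape here are made up for illustration; the real schema comes from whatever the NCI tools actually emit):

```python
import json

def parse_usage_records(raw):
    """Convert JSON usage output into (project, user, size_bytes) rows.

    Parsing structured output like this is trivial compared to scraping
    the human-readable tables, and it won't break when a header changes.
    """
    rows = []
    for rec in json.loads(raw):
        rows.append((rec["project"], rec["user"], int(rec["size"])))
    return rows

# Illustrative payload only -- not real nci-files-report output.
sample = '[{"project": "w35", "user": "abc123", "size": "1048576"}]'
print(parse_usage_records(sample))
```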

The use of ‘doit’ and the ‘ncigrafana’ package doesn’t need to change, but direct access to the database negates the need to separate the ‘upload’ steps from the ‘dump’ steps. Direct database access also opens up the possibility of user-level analytics, without needing to rely on world-writable temp files, service users, or other additional steps. Any stats capture mechanism can insert straight into the database, with appropriately restricted postgres roles.

Any feedback would be very much appreciated.

Sounds brilliant @dale.roberts

Designing away the requirement for ssh keys is a great idea, making the whole process much more robust.

What do you mean by “analytics at a user-level”? Adding hooks to run scripts/tools to report PBS job analytics for personal or systemic use? Logging conda package loads, data access through tools like clef? That would be awesome!

The other way works too. The DB could be used as a source of data for CLI tools. Grafana is great, but sometimes CLI is nice, and an API means anyone can hack up the tool they want, or just dump the data into a pandas table and play with it.

Do you have a plan for how to transition to this?

@rbeucher some of what Dale is planning for capturing usage might be very interesting for you I think.

Those are great ideas. In time I think we can do all of them. My plan was to start with python imports as a way to maintain the hh5 environment.

As for the transition, I think the work on the postgres database would be fairly straightforward to do without any downtime. It’s a series of ‘CREATE ROLE …’ and ‘ALTER TABLE …’ postgres commands. Then I’d need to dig into the ansible config for jenkins7 and the puppet config for accessdev to enact the firewall/reverse-proxy changes. Stats gathering can run as-is in the background while all of this is happening. Once that’s all complete, I can get rid of start_tunnel in dumpstats and modify the config file to point to the new database URL (same database, just a different URL to get to it). After that, I’ll start working on the updated dump_* tasks.
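
For concreteness, that ‘CREATE ROLE …’/‘ALTER TABLE …’ series would be along these lines (role, table, and view names are placeholders):

```sql
-- Hypothetical owner role for the tables and views; no login required.
CREATE ROLE grafana_owner NOLOGIN;
ALTER TABLE usage_by_project OWNER TO grafana_owner;
ALTER VIEW usage_summary OWNER TO grafana_owner;
-- ...and so on for each remaining table and view.
```

None of these statements touch the data itself, which is why they can run while stats gathering carries on in the background.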

If you’re looking at re-doing the database it may be worthwhile splitting the grafana server into a separate VM on its own cloud tenant. That way you can have the reverse proxy on the same VM and not need to deal with accessdev, which we’re trying to move off of.

I wasn’t going to redo the database entirely, just harden it to make it safe to expose to Gadi. You wouldn’t need a reverse proxy at all if it lived on a VM with an external IP, just a firewall with port 5432 open (again, just like NCI does with its own internal databases). And in any case, my time frame for implementing the database changes would ideally be in the next week or two. I suspect accessdev will still be around then. Personally, I don’t think accessdev should be decommissioned at all, but that’s out of scope of this discussion…

Fair enough. It is still around, though there might be a firewall around it restricting which IPs can access it? The repository is too; modules/jenkins/manifests/proxy.pp shows how the jenkins proxy was set up.

Just a quick comment for @aidanheerdegen: we are already collecting CleF stats when the tool is run from the command line. It could be done better, of course; the main issue is that they haven’t really been used yet, as they would only be useful if there were a plan for retiring some of the CMIP data that never gets accessed, to make space for new downloads. However, there’s no such procedure in place, and CleF stats aren’t a direct measure of data usage anyway. Potentially, measuring data access via the central Intake catalogues, provided they become more popular, would be an interesting stat.

Well, it looks like NCI has changed the order in which I’ll take on these tasks. They’ve introduced a new header in nci-files-report that breaks the parse_user_storage_data script, which is preventing the ingest of new data. Rather than just working around that, I’ll re-work it and the other grafana tools so that they pull data from the backing API rather than trying to parse human-readable output. Then I’ll harden the database, etc.

The first pass of this is done; I’ve made a PR on the ncigrafana repo. I’m not sure what the procedure is to actually get this deployed, but I suppose the failing tests need to be addressed first. The test that fails is irrelevant now that parse_user_storage_data no longer parses nci-files-report output. I could re-work the test too: set up a pretend server on a unix domain socket and have it feed pretend data to the script? I suspect that’d be more work than the actual change to parse_user_storage_data, though. Would appreciate thoughts on how to proceed.
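
For what it’s worth, the pretend-server idea is only a few lines of stdlib Python. A rough sketch (socket path, payload, and function names are all made up; a real fixture would feed data shaped like the actual API responses):

```python
import os
import socket
import tempfile
import threading

# A test fixture that listens on a unix domain socket and feeds canned
# data to whatever connects, standing in for the real backing service.
sock_path = os.path.join(tempfile.mkdtemp(), "fake_api.sock")
canned = b'{"project": "w35", "usage": 42}'

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(sock_path)
srv.listen(1)

def feed_one_client(payload):
    # Accept a single connection, send the canned payload, and hang up.
    conn, _ = srv.accept()
    conn.sendall(payload)
    conn.close()

t = threading.Thread(target=feed_one_client, args=(canned,))
t.start()

# The code under test would connect here instead of to the real service.
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(sock_path)
received = client.recv(4096)
client.close()
t.join()
srv.close()
print(received)
```

Whether that is less work than the change it tests is another matter, but it keeps the test self-contained with no network or real service involved.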

I can give some feedback directly on the PR.

I didn’t notice this before. NCI will allow VMs with an external IP? I’m thinking about exposing information via web interfaces and just assumed it wouldn’t be allowed. Now I am intrigued.

@aidanheerdegen this is why I keep going on about accessdev. It has an external IP, a well-configured firewall, and automated provisioning. All the hard work is done; it just needs updating.

@aidanheerdegen sorry, it just occurred to me that the use of ‘external’ above may be ambiguous. All VMs created by OpenStack are visible on an internal 10.x.x.x network. An external IP just means that it can be accessed from the wider internet. It does not mean you can assign some arbitrary IP to a VM, it has to be an IP in a pool that is assigned to OpenStack.

Is that a static IP? I’m assuming not every VM gets an external IP address, and that there is some process to get one, as it makes the VM more vulnerable to intrusion.

There is a bit of a distinction here: all OpenStack VMs get an internal, static IP. They can’t be reached from the internet at this IP; it’s more for OpenStack to be able to issue commands and whatnot. A ‘floating IP’, which is a static, external IP, can be attached to any one VM. Vulnerability to intrusion then comes down to OpenStack security groups, firewall policy, etc. In order to gather stats from Gadi, you’ll need a VM that Gadi is able to contact. There are options: the database-hosting VM doesn’t necessarily need a public IP; you could use another VM on the same internal VLAN as a jump node with port forwarding set up, similar to the relationship between accessdev and jenkins7. Though creating ad-hoc ssh port-forwards whenever you need to gather data is… less than ideal.

I gather from your reply that by default Gadi cannot connect to an OpenStack VM? But the options given were a public IP, or jumping via a VM with a public IP. Is there an option for a local IP address that Gadi can see but that isn’t visible outside the NCI network?