Model parameter database

It occurred to me that a database of model parameters would be a very useful tool.

This would be a database from which you could find out what parameters were used for any specific run of any experiment that was indexed.

Some possible uses:

  • Easily discover parameter settings used for an experiment
  • Compare parameters across experiments
  • Track experiment parameter changes over time
  • Have the model output its default values, and include these as a special experiment
  • Track default values over time, to ensure they don’t change unless expected
  • Add CI to experiment configurations to check that values either don’t change from the defaults or stay within certain bounds, and otherwise throw an error (see the sketch after this list)
  • Generate model inputs from DB
  • Easily compare parameterisations with other models
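
To make the CI idea above concrete, here’s a minimal sketch of what such a check could look like, assuming namelist-based configuration read with f90nml. The group, parameter and bounds are placeholders, not real requirements.

```python
# Minimal sketch of a CI-style parameter check. Assumes namelist-based
# configuration read with f90nml; the group/parameter/bounds are placeholders.
import sys
import f90nml

# Hypothetical expectations: (namelist group, parameter) -> (min, max)
EXPECTED_BOUNDS = {
    ("ocean_adv_vel_diag_nml", "max_cfl_value"): (0.0, 100.0),
}

def check(nml_path):
    nml = f90nml.read(nml_path)
    problems = []
    for (group, param), (lo, hi) in EXPECTED_BOUNDS.items():
        try:
            value = nml[group][param]
        except KeyError:
            continue  # parameter not set: the model default applies
        if not (lo <= value <= hi):
            problems.append(f"{group}%{param} = {value} outside [{lo}, {hi}]")
    return problems

if __name__ == "__main__":
    errors = check(sys.argv[1])
    if errors:
        print("\n".join(errors))
        sys.exit(1)  # non-zero exit fails the CI job
```
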

Note that this is a #bluesky idea, but if there was enough interest it could happen.

Feel free to leave your thoughts and add use cases I haven’t thought of.


For the technically minded (bearing in mind this was just thinking about a single model as an example; other models use configuration methods other than namelists):

Store it in an SQL DB, with a schema something like:

model.parameters

| id | model | group | parameter |
|----|-------|-------|-----------|
| 1 | MOM5 | ocean_adv_vel_diag_nml | max_cfl_value |
| … | … | … | … |

experiments

| id | uuid | name | url |
|----|------|------|-----|
| 1 | d7802c5c-92de-11ed-a1eb-0242ac120002 | 1deg_jra55_iaf | https://github.com/COSIMA/1deg_jra55_iaf |

run

| id | experiment id | git hash |
|----|---------------|----------|
| 1 | 1 | 4786a55fcc7e769aa7941941bab213fc6fcbbe2e |

experiment.parameters

| id | run id | model.parameters.id | parameter value |
|----|--------|---------------------|-----------------|
| 1 | 1 | 1 | 100. |
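
To make this concrete, here’s a rough sketch of the schema above as SQLite DDL driven from Python. Table and column names just mirror the tables above ("group" is renamed to avoid the SQL keyword); nothing here is a committed design.

```python
# Rough sketch of the schema above in SQLite (illustrative only).
import sqlite3

conn = sqlite3.connect("parameters.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS model_parameters (
    id        INTEGER PRIMARY KEY,
    model     TEXT,     -- e.g. 'MOM5'
    grp       TEXT,     -- e.g. 'ocean_adv_vel_diag_nml'
    parameter TEXT      -- e.g. 'max_cfl_value'
);
CREATE TABLE IF NOT EXISTS experiments (
    id   INTEGER PRIMARY KEY,
    uuid TEXT,
    name TEXT,          -- e.g. '1deg_jra55_iaf'
    url  TEXT
);
CREATE TABLE IF NOT EXISTS runs (
    id            INTEGER PRIMARY KEY,
    experiment_id INTEGER REFERENCES experiments(id),
    git_hash      TEXT
);
CREATE TABLE IF NOT EXISTS experiment_parameters (
    id                 INTEGER PRIMARY KEY,
    run_id             INTEGER REFERENCES runs(id),
    model_parameter_id INTEGER REFERENCES model_parameters(id),
    value              TEXT    -- stored as text; type varies by parameter
);
""")
conn.commit()
```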

Or store it in GitHub, or in some file-based DB, e.g. JSON files?

Fundamentally a great idea; however, it seems like a lot of work to keep on top of. Who would be responsible for maintaining this database?


There are several devils in the detail, to be sure. It isn’t worth doing unless it can be an automatic part of the model-running process, IMO. In the first instance I’d imagine it just being run for ACCESS-NRI-released models, but if we bake the information gathering into the tools we use, then anyone running the models at supporting centres (like NCI) could benefit from it, I should think.

I did tag it as bluesky for a reason.

You’re right. And I shouldn’t be such a miser. I just came back to delete my comment because, after thinking about it for a bit longer, I noticed that “It’s really hard to do” doesn’t contribute all that much to the idea, but by that time you’d already answered, so I’ll leave the evidence of my small-mindedness here :wink:


I was thinking about something along these lines, but from the other direction: parameter discoverability in MOM6 is painful. The only real way to do it is to run the model and observe the MOM_parameter_doc.all file, which will give the docstring from the first/last (can’t remember which) place a given parameter was read, and its value. If a parameter isn’t read, e.g. because it’s hierarchical, there’s no real way to know about it other than grepping through the source.
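
For what it’s worth, pulling values back out of a MOM_parameter_doc-style file doesn’t look too bad. Here’s a rough sketch that assumes the usual `NAME = VALUE !   [units] default = X` layout with docstrings on the following comment lines; treat it as an approximation rather than a complete parser.

```python
# Rough sketch of extracting name/value/default from a MOM_parameter_doc-style
# file. Assumes "NAME = VALUE ! [units] default = X" lines with docstring
# comment lines following; an approximation, not a complete parser.
import re

LINE_RE = re.compile(
    r"^(?P<name>[A-Za-z0-9_]+)\s*=\s*(?P<value>[^!]+?)\s*"
    r"(?:!\s*(?:\[(?P<units>[^\]]*)\])?\s*(?:default\s*=\s*(?P<default>.*))?)?$"
)

def parse_parameter_doc(path):
    params = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("!"):
                continue  # skip module headers and docstring lines
            m = LINE_RE.match(line)
            if m:
                params[m.group("name")] = {
                    "value": m.group("value").strip(),
                    "units": m.group("units"),
                    "default": (m.group("default") or "").strip() or None,
                }
    return params
```
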

Recently, in my Python wrapping of ALE, I swapped out the pure Fortran parameter-file parsing and based it on just a Python dictionary. This way I can store the parameters in YAML/TOML and change them easily at runtime (maybe the clearest example is in the regridding test).
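
In case it helps picture it, something like this (a generic sketch, not the actual ALE wrapper code; names and values are illustrative): defaults live in a plain dict, a YAML file can override them, and a test can override again at call time.

```python
# Generic sketch (not the actual ALE wrapper): parameters in a plain dict,
# loaded from YAML, overridable at runtime. Names/values are illustrative.
import yaml  # pip install pyyaml

DEFAULTS = {"regridding_scheme": "ZSTAR", "nk": 50, "min_thickness": 1e-3}

def load_params(path=None, **overrides):
    params = dict(DEFAULTS)
    if path is not None:
        with open(path) as f:
            params.update(yaml.safe_load(f) or {})
    params.update(overrides)  # runtime overrides win, e.g. in a test
    return params

# e.g. in a test: params = load_params("params.yaml", nk=10)
```
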

I believe there’s a desire to keep MOM6 as a purely Fortran model – it used to have a C source file, but that was re-implemented in Fortran. It also only depends externally on MPI, NetCDF and FMS. Regardless, I was interested in seeing what other solutions are out there for specifying model parameters, particularly with discoverability in mind. One (significantly more complex) system I’ve used is spud, which defines the parameters in the RelaxNG Compact syntax (the .rnc files under fluidity/schemas in the FluidityProject/fluidity repository on GitHub). A graphical tool called Diamond parses the schema and can provide a graphical interface for editing the options. This is probably the other end of the complexity spectrum compared to what we’d need for FV ocean models (we don’t have to define Python functions to initialise fields in the model at runtime, for example)!

At least being able to see all the parameters available in the model up-front, without having to browse the source code directly would be a huge help toward discoverability, and perhaps prevent misconfigurations.


Good point. We want both discoverability and comparability.

I buried it a bit, but did mention

  • Generate model inputs from DB

which is part of what you’d like to be able to do. So if we added discoverability to the list of use cases then that would cover what you’re after too, am I right?

Then it’s a technical problem of how to do this: passively sucking parameter values out of runs and adding them to a DB, or actively hunting for them in the code.

The former would definitely not be complete for MOM6 until all parameter uses and combinations were covered (ever?).

The latter relies on static code analysis (I believe grep counts if you’re skilled with -E :wink: ). Could you easily determine all the parameters and their default values with a proper Fortran linter (like flint)?
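
To give a feel for the “hunt them in the code” option without a proper linter, here’s a grep-level sketch that scans Fortran source for MOM6-style get_param calls and pulls out parameter names and (where given) defaults. It’s deliberately crude: calls that span lines in unexpected ways, literal module-name strings, or defaults that are expressions will trip it up, which is exactly why a real linter would be nicer.

```python
# Very rough sketch of hunting parameters in the source: a regex scan for
# MOM6-style get_param() calls. Grep-level only; a real tool needs proper parsing.
import re
from pathlib import Path

# Assumed call shape: call get_param(param_file, mdl, "NAME", var, ..., default=X, ...)
CALL_RE = re.compile(
    r'call\s+get_param\s*\([^)]*?"(?P<name>\w+)"(?P<rest>[^)]*)\)',
    re.IGNORECASE,
)
DEFAULT_RE = re.compile(r"default\s*=\s*(?P<default>[^,)\s]+)", re.IGNORECASE)

def scan(src_root):
    found = {}
    for path in Path(src_root).rglob("*.F90"):
        text = path.read_text(errors="ignore")
        # join Fortran continuation lines so multi-line calls match
        text = re.sub(r"&\s*\n\s*&?", " ", text)
        for m in CALL_RE.finditer(text):
            d = DEFAULT_RE.search(m.group("rest"))
            found[m.group("name")] = d.group("default") if d else None
    return found

# e.g. scan("MOM6/src") -> {"DT": None, ...}
```
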

I waved my hands a bit when I said

  • Have the model output its default values, and include these as a special experiment

which it sounds like wouldn’t work for MOM6 as it currently stands, but it’s also a laudable goal, as it’s super useful to be able to determine when parameters deviate from the default.

Having a system where user-settable parameter defaults are defined in a file which is either read at run-time, or used to generate code at compile time (like spud?) would make discoverability a lot easier.
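
To be concrete about the kind of thing I mean (entirely hypothetical names and docstrings), even a simple declarative table of definitions would give tools everything they need:

```python
# Hypothetical sketch of a declarative parameter definition: one place holding
# name, default, units and docstring, readable by tools without running the model.
from dataclasses import dataclass

@dataclass(frozen=True)
class ParamDef:
    name: str
    default: object
    units: str
    doc: str

PARAM_DEFS = [
    ParamDef("DT", 3600.0, "s", "The (baroclinic) dynamics time step."),
    ParamDef("MAX_CFL_VALUE", 100.0, "nondim", "CFL threshold for diagnostics/warnings."),
]

# A discoverability tool (or the DB above) can ingest PARAM_DEFS directly; the
# same definitions could be read at run time or used to generate code at compile time.
```
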

Peripherally related: I wasn’t familiar with TOML, and for others in the same boat this is a great explanation of how TOML and YAML differ, their use cases, and their strengths and weaknesses.

That’s right. Being able to also reverse from a database to a configuration is probably good for most cases, but knowing the full set of available parameters is a slightly different problem. I guess part of this is also knowing what the parameters do. I suppose for MOM5, it’s probably possible to point to some section of the book, but at least a documentation string for the parameter would be more useful than its configuration key in the model.

Possibly a combination of both? Clearly, for the majority of your proposed uses, you want to see what was used for the actual run. Perhaps my point was that, given we’d be resorting to hunting for parameters in the code, why not change the code to be friendlier for discoverability, while not necessarily affecting existing workflows? This could also help to avoid possible issues: currently, the docstring for a parameter is provided when you request that parameter. If you request the same parameter from different modules, you need to make sure you’re documenting it in the same way, or at least that updates to the documentation are synchronised.

Do you mean to use this as a mechanism for querying the default values? I guess at least MOM6 has the advantage of including these defaults in the MOM_parameter_doc.all (as opposed to the MOM_parameter_doc.short, which only includes those parameters which deviate from the default). I feel like something along these lines should be included as a baseline in a model: you want to know what parameters were actually used in all cases!


Definitely. At the very least, reverse-engineering the parameters from the run outputs will be required to document a run, unless the machinery that sets the parameters also spits out a nice machine-readable parameter file (entirely possible and desirable, as you say).

This is also model dependent. New models that are still being actively developed would definitely be a target for modifications such as this (MOM6, CICE6), but the case for doing this for mature models in maintenance mode (MOM5, CICE5) is not so strong. So a reverse-engineered solution might be the right one in that case.

Yep.

Absolutely. So what I was suggesting is redundant for MOM6, but the older models might benefit from something like that.

FURIOUS AGREEMENT!


Agreed that this would be a useful capability to have, if it can be done without inordinate effort. Model inputs (e.g. hashes from manifests) could also be useful to have in a DB, as would information on run resource use.

I’ve made attempts in this direction (detailed below), but a DB would make this sort of info easier to make use of.

nmltab can do this for namelist files, and was used to decide on the parameters used in ACCESS-OM2 and to tabulate differences between experiments in the appendices of the draft ACCESS-OM2 tech report.

run-summary uses nmltab to track nml parameter changes automatically (e.g. the rightmost columns here). It also collects run resource use information, which has been useful for performance and scaling testing.
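
For anyone wanting a quick feel for the idea, here’s a minimal sketch along the same lines using f90nml (not nmltab’s interface; the paths are just examples): tabulate which namelist settings differ between two experiments.

```python
# Sketch of diffing two namelist-based configurations with f90nml
# (not nmltab's actual interface; paths are illustrative).
import f90nml

def nml_diff(path_a, path_b):
    a, b = f90nml.read(path_a), f90nml.read(path_b)
    diffs = {}
    for group in sorted(set(a) | set(b)):
        ga, gb = a.get(group, {}), b.get(group, {})
        for key in sorted(set(ga) | set(gb)):
            va, vb = ga.get(key), gb.get(key)
            if va != vb:
                diffs[(group, key)] = (va, vb)
    return diffs

# e.g. nml_diff("expt1/ice/cice_in.nml", "expt2/ice/cice_in.nml")
```
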

However, these give an incomplete picture, as they’re based on nml input files and don’t include the default values. Also, some parameters are subject to a master switch that can make their values irrelevant.