Organising global driving data for CABLE

Hi all,

The different use cases of offline CABLE have different requirements and expectations of their input data - much of this, unfortunately, is hard-wired into the codebase and reflects a period when storage was more limited than it is now. Some use cases (such as single-site studies) provide meteorology at 30-minute resolution, others such as global simulations using GSWP3 use gridded meteorology at 3- or 6-hourly time steps, and yet others such as BIOS or TRENDY use gridded daily data. Different use cases also expect different input variables (though there is quite a lot of overlap).

As an example of the question around pre-processing vs runtime, consider BIOS. This configuration can operate at 5 km resolution over Australia and runs for hundreds of years. It uses daily data for 7 input variables. CABLE-BIOS does the time interpolation within the model run (it's actually weather generation, so not just linear interpolation or anything that can be done with a simple nco-type pre-processing step). If instead we pre-processed the inputs and then stored them, we would a) need to rewrite large chunks of the model code (and check that this hadn't changed answers) and b) find storage for around 24 times (maybe 48 times) the data (and you may want to trial different weather generators, so possibly more).
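To put a rough number on that storage factor, here is a minimal back-of-the-envelope sketch. The grid size, simulation length and precision are illustrative assumptions rather than actual BIOS numbers; only the 24x/48x factor for going from daily data to hourly or half-hourly time steps comes from the argument above.

```python
# Back-of-the-envelope storage estimate for pre-interpolated BIOS-style forcing.
# All sizes below are illustrative assumptions; only the 24x/48x factors
# reflect moving from daily data to hourly / half-hourly model time steps.

n_cells = 300_000        # ~5 km land cells over Australia (assumed order of magnitude)
n_vars = 7               # daily forcing variables (from the discussion above)
n_years = 300            # "hundreds of years" of simulation (assumed)
bytes_per_value = 4      # 32-bit floats (assumed)

daily_bytes = n_cells * n_vars * n_years * 365 * bytes_per_value

for steps_per_day in (1, 24, 48):
    total_tb = daily_bytes * steps_per_day / 1e12
    print(f"{steps_per_day:>2} steps/day: ~{total_tb:,.1f} TB")
```

Even with these rough assumptions, the jump from under a terabyte of daily forcing to tens of terabytes per pre-interpolated version (and per weather generator you want to trial) illustrates why the interpolation is done at runtime.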

Regardless of approach, you have to do this bespoke work for each data set (they are all different), meaning there's quite a lot of burden, especially for those data sets that are regularly updated such as the BIOS and TRENDY forcing.

There's also the element of experimental flexibility. If, for instance, I want to do a simple experiment - "What happens if the world were uniformly 1 K warmer?" - this is a lot easier to do as a one-liner in the code (with a switch) than by re-processing all the meteorology. However, equally you don't want to allow lots and lots of this kind of stuff in the code (consider the TRENDY sections of the code, where depending on the experiment you use different combinations of forcing, different switches etc. - it's all embedded in the code and distributed in multiple places).
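For comparison, the pre-processing alternative for that "+1 K" experiment would look something like the sketch below (the directory layout and the Tair variable name are assumptions for illustration). It is trivial for one variable and one experiment, but it duplicates the whole forcing set every time you change the perturbation.

```python
# Sketch of the pre-processing route for a "+1 K everywhere" experiment,
# assuming netCDF forcing readable with xarray and an air temperature
# variable named "Tair" (both assumptions for illustration).
from pathlib import Path
import xarray as xr

forcing_dir = Path("forcing/original")   # assumed layout
output_dir = Path("forcing/plus1K")      # a full second copy of the forcing
output_dir.mkdir(parents=True, exist_ok=True)

for infile in sorted(forcing_dir.glob("*.nc")):
    ds = xr.open_dataset(infile)
    if "Tair" in ds:
        ds["Tair"] = ds["Tair"] + 1.0    # uniform 1 K warming
    ds.to_netcdf(output_dir / infile.name)
    ds.close()
```

The in-code route, by contrast, is a single conditional applied as the forcing is read, guarded by a switch - but, as noted above, every such switch adds to the experiment logic scattered through the code.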

Another aspect to this is data and code provenance - it's a lot easier to say (and justify and version control) that we used external data source x (DOI) and model y (git repo) than to try to point to or publish a temporary data set.

Regarding ACCESS - I don't particularly see ACCESS-generated forcing as something that different. We still have to worry about which variables, units, naming conventions, resolution and metadata - i.e. making sure that the ACCESS forcing can be used by offline CABLE. We still have to worry about consistency with the other ancillaries. There is a bit of a thought process needed around how we generate and store the ACCESS output - do we output daily data or data on the model time step? Also, do we try to bias-correct the forcing data (for multiple sources of error)? And what do we do about data needed for the spin-up process?

Basically I can see lots of issues but haven’t landed on a firm opinion on the way to go - or at least on a way forward that doesn’t involve stopping everything for a substantive period and rewriting the code.
