Organising global driving data for CABLE


This is a first draft at organising a standardised collection of global meteorological forcing data sets used to drive CABLE in an offline spatial configuration.

The structure below is designed so that only the name of the data set needs to be specified to access driving data for different variables and years.

Input files are organised into the following directory structure:

.
├── <name>
│   ├── <name>_gridinfo.nc
│   ├── <name>_landmask.nc
│   ├── <variable>
│   │   ├── <name>_<variable>_<year>.nc
│   │   └── ...
│   └── ...
├── ...
└── README.md

where <name> is the name of the data set, <variable> is the meteorological variable name, and <year> is the year of the first time step of the meteorological forcing.

Valid variables include: Rainf, Snowf, LWdown, SWdown, PSurf, Qair, Tair and Wind.

Each forcing data set contains the following files:

| Path | Description |
| --- | --- |
| <name>/<name>_gridinfo.nc | Grid info file. |
| <name>/<name>_landmask.nc | Land mask file. |
| <name>/<variable>/<name>_<variable>_<year>.nc | Meteorological forcing input file for a given variable and year. |
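As a rough illustration, here is a minimal Python sketch of how paths could be resolved under this layout. The collection root and the example GSWP3 data set name are hypothetical placeholders, not part of the proposal:

```python
from pathlib import Path

# Hypothetical collection root; not part of the proposed standard itself.
ROOT = Path("/data/cable-forcing")

VALID_VARIABLES = {"Rainf", "Snowf", "LWdown", "SWdown", "PSurf", "Qair", "Tair", "Wind"}


def forcing_path(name: str, variable: str, year: int) -> Path:
    """Path to the forcing file for a given data set, variable and year."""
    if variable not in VALID_VARIABLES:
        raise ValueError(f"Unknown forcing variable: {variable}")
    return ROOT / name / variable / f"{name}_{variable}_{year}.nc"


def gridinfo_path(name: str) -> Path:
    """Path to the grid info file for a given data set."""
    return ROOT / name / f"{name}_gridinfo.nc"


# e.g. forcing_path("GSWP3", "Tair", 1901)
# -> /data/cable-forcing/GSWP3/Tair/GSWP3_Tair_1901.nc
```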

Any feedback on the organisation is welcome.


Some feedback raised at the land working group meeting:

  1. Accessibility of these data sets is crucial - they are no good sitting in an ACCESS-NRI project directory on NCI.
  2. The standard should distinguish forcing data at different temporal and spatial resolutions.
  3. We will also want to use ACCESS model output to drive CABLE.
    • There is a question of how to balance post-processing the data against writing “data set specific” code when reading in the forcing. How much flexibility should we allow for in the standard?
  4. These data sets should sit behind some data provenance framework, especially for things like the grid info file.

I’m curious at what level customisation is needed. Finding the data? Finding specific fields within data files?

I was more getting at post-processing the specific fields. Disclaimer: I haven’t looked at the model outputs from ACCESS, so I’m not sure how much work would be needed.

Hopefully finding the data will be easy once we have some sort of framework set up.

I guess I’m wondering whether it is the fields themselves that need altering (e.g. a field that needs to be generated/synthesised from other existing variables, or variables that need regridding), or just the metadata (e.g. variable or coordinate names, or attributes that need adding)?

If it is just a question of finding the right variables/coordinates, the cf-xarray library is awesome for inferring important coordinates and variables.
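For example, here is a rough sketch of what that looks like with cf-xarray; the file name is hypothetical, and the lookups assume the files carry CF attributes (standard_name, axis, units):

```python
import xarray as xr
import cf_xarray  # noqa: F401  (registers the .cf accessor on xarray objects)

ds = xr.open_dataset("GSWP3_Tair_1901.nc")  # hypothetical forcing file

# Look up coordinates and variables by their CF meaning rather than their names,
# so it does not matter whether the file calls them "lat", "latitude", "y", ...
lat = ds.cf["latitude"]
time = ds.cf["time"]
tair = ds.cf["air_temperature"]  # found via the standard_name attribute
```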

Have you considered using an intake catalogue as the front-end for accessing your datasets? Unnecessarily complex?
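For what it’s worth, a minimal sketch of what that could look like (the catalogue file and entry names are made up, and this assumes the intake-xarray plugin for netCDF sources):

```python
import intake

# Hypothetical YAML catalogue describing the forcing collection.
cat = intake.open_catalog("cable_forcing.yaml")

# A catalogue entry would map onto <name>/<variable>/<name>_<variable>_<year>.nc files.
ds = cat["gswp3_tair"].to_dask()  # open lazily as an xarray Dataset
```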


We are talking about data that needs to be read in by CABLE, and about how much pre-processing is done outside CABLE versus inside. So the difficult question is: how much do we want to be doing in Fortran? :slight_smile:

@inh599 @RachelLaw I am not sure I understand why ACCESS data would be different from any other source of meteorological forcing data. Is the question around time interpolation? Isn’t that also a question for other meteorological forcing datasets? Shouldn’t we use the same approach for all of them? That is, either interpolate the met. forcings before CABLE, or allow CABLE to do the time interpolation. We could even support both approaches, with CABLE doing the time interpolation based on the time step found in the met. forcing rather than on the met. forcing type. That way, turning the interpolation on or off is driven by the actual data and not by a name.

To start with, I would say the fastest approach is to pre-process the data entirely outside CABLE, and implement the bits we want CABLE to handle afterwards.
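As a sketch of the “driven by the actual data” idea above (in Python for illustration, even though the decision would ultimately sit inside CABLE; the file name and model time step are hypothetical):

```python
import numpy as np
import xarray as xr

MODEL_DT = np.timedelta64(30, "m")  # example CABLE time step of 30 minutes

ds = xr.open_dataset("GSWP3_Tair_1901.nc")  # hypothetical forcing file
forcing_dt = np.diff(ds["time"].values[:2])[0]  # time step of the forcing data

# Interpolate (or run a weather generator) only when the forcing is coarser
# than the model time step, regardless of which data set it came from.
needs_interpolation = forcing_dt > MODEL_DT
```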


AS LITTLE AS POSSIBLE! :slight_smile:

Ok, I can see the problem now. One issue is that the more datasets you want to support, the larger the burden.

If the data that is read in by CABLE is relatively small, it might be easier to have a Python pre-processing step to grab the data and plonk it in a standard place, and to call this with payu before each run. Then the complexity of different data sources, metadata and formats is handled in Python, which is a lot easier for this sort of functionality. Just a thought.
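A minimal sketch of what such a pre-processing step might look like (using xarray; the source file, variable names and target data set name are all hypothetical), which payu could then call before each run:

```python
import xarray as xr

# Hypothetical source file from some external archive, with its own conventions.
src = xr.open_dataset("some_source_tas_1901.nc")

# Rename to the variable/coordinate names the standard expects, fix metadata,
# and write the result into the standard directory layout.
out = src.rename({"tas": "Tair", "longitude": "lon", "latitude": "lat"})
out["Tair"].attrs["units"] = "K"
out.to_netcdf("MyForcing/Tair/MyForcing_Tair_1901.nc")
```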

Yes, that is something I’m interested in integrating with - it would be useful for people doing data analysis. But I think it will work with any directory standard, as long as we can define roughly one directory standard (and not one standard per data source, as it seems to be right now).

Hi all,

The different use cases of offline CABLE have different requirements/expectations of their input data - much of this, unfortunately, is hard-wired into the codebase and reflects a period when storage was more limited than it is now. Some use cases (such as single-site studies) provide meteorology at 30-minute resolution, others such as global simulations using GSWP3 use gridded meteorology at 3- or 6-hourly time steps, and yet others such as BIOS or TRENDY use gridded daily data. Different use cases also expect different input variables (though there is quite a lot of overlap).

As an example of the question around pre-processing vs runtime, consider BIOS. This configuration can operate at 5 km resolution over Australia and runs for hundreds of years. It uses daily data for 7 input variables. CABLE-BIOS does the time interpolation within the model run (it’s actually weather generation, so not just linear interpolation or anything that can be done with a simple nco-type pre-processing step). If instead we pre-processed the inputs and then stored them, we would a) need to rewrite large chunks of the model code [and check that this hadn’t changed answers] and b) find storage for around 24 times (maybe 48 times) the data, since the daily forcing would become hourly or half-hourly (and you may want to trial different weather generators, so possibly more).

Regardless of the approach, you have to do this bespoke work for each data set (they are all different) - meaning there’s quite a lot of burden, especially for data that are regularly updated, such as the BIOS and TRENDY forcing.

There’s also the element of experimental flexibility. If, for instance, I want to do a simple experiment such as “What happens if the world were uniformly 1 K warmer?”, this is a lot easier to do as a one-liner in the code (with a switch) than by re-processing all the meteorology. However, equally you don’t want to allow lots and lots of this kind of stuff in the code (consider the TRENDY sections of the code, where depending on the experiment you use different combinations of forcing, different switches, etc. It’s all embedded in the code and distributed across multiple places).

Another aspect to this is data and code provenance - it’s a lot easier to say (and justify, and version control) that we used external data source x (DOI) and model y (git repo) than to try to point to or publish a temporary data set.

Regarding ACCESS - I don’t see ACCESS-generated forcing as something particularly different. We still have to worry about which variables, units, naming conventions, resolution and metadata - i.e. making sure that the ACCESS forcing can be used by offline CABLE. We still have to worry about consistency with the other ancillaries. There is a bit of a thought process needed around how we generate and store the ACCESS output - do we output daily data or data on the model time step? Also, do we try to bias-correct the forcing data (for multiple sources of error)? And what do we do about the data needed for the spin-up process?

Basically I can see lots of issues but haven’t landed on a firm opinion on the way to go - or at least on a way forward that doesn’t involve stopping everything for a substantial period and rewriting the code.


Second this! :grin:
In my experience the standard global datasets need changes to attributes and maybe units rather than to the data itself. This is usually fairly easy with standard netCDF tools.
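For example, a metadata-only fix of that kind might look something like this (here with the netCDF4 Python library rather than NCO; the file, variable and attribute values are hypothetical):

```python
from netCDF4 import Dataset

# Open the hypothetical forcing file in place and patch its metadata only.
with Dataset("MyForcing_Tair_1901.nc", "r+") as nc:
    tair = nc.variables["Tair"]
    tair.units = "K"                         # correct or insert the units attribute
    tair.standard_name = "air_temperature"   # add a CF standard_name
```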

Ideally we could give the pre-processing tool data at daily resolution and it would then run a weather generator, rather than doing this inside CABLE. We are already using the CABLE-BIOS weather generator outside CABLE to make sub-daily AWAP forcings.

It seems to me we will need a 2-step process (at least):

  1. Regroup what we currently have into a “dirty” collection. Use a specific naming standard for the directory and filenames. This contains the met. forcing data we already have scattered all over the place. We can’t make it discoverable through the NCI data catalogue, but it can be listed in the CABLE documentation with the proper disclaimers around the lack of provenance information. This way, we can at least start understanding what people are using.

  2. Publish a proper collection (or collections) with the original met. forcing data, if possible, and a pre-processing tool to format the forcing data for CABLE. We may need to modify the original met. forcing data if it isn’t CF-compliant because we will want our published data to follow that standard.