Organising global driving data for CABLE


This is a first draft at organising a standardised collection of global meteorological forcing data sets used to drive CABLE in an offline spatial configuration.

The structure below is designed so that only the name of the data set needs to be specified to access driving data for different variables and years.

Input files are organised into the following directory structure:

.
├── <name>
│   ├── <name>_gridinfo.nc
│   ├── <name>_landmask.nc
│   ├── <variable>
│   │   ├── <name>_<variable>_<year>.nc
│   │   └── ...
│   └── ...
├── ...
└── README.md

where <name> is the name of the data set, <variable> is the meteorological variable name, and <year> is the year of the first time step of the meteorological forcing.

Valid variables include: Rainf, Snowf, LWdown, SWdown, PSurf, Qair, Tair and Wind.

Each forcing data set contains the following files:

| Path | Description |
| --- | --- |
| <name>/<name>_gridinfo.nc | Grid info file. |
| <name>/<name>_landmask.nc | Land mask file. |
| <name>/<variable>/<name>_<variable>_<year>.nc | Meteorological forcing input file for a given variable and year. |
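As a rough illustration, here is a minimal Python sketch of how paths could be resolved under this layout. The collection root and the example GSWP3 data set name are hypothetical placeholders, not part of the proposal:

```python
from pathlib import Path

# Hypothetical collection root; not part of the proposed standard itself.
ROOT = Path("/data/cable-forcing")

VALID_VARIABLES = {"Rainf", "Snowf", "LWdown", "SWdown", "PSurf", "Qair", "Tair", "Wind"}


def forcing_path(name: str, variable: str, year: int) -> Path:
    """Path to the forcing file for a given data set, variable and year."""
    if variable not in VALID_VARIABLES:
        raise ValueError(f"Unknown forcing variable: {variable}")
    return ROOT / name / variable / f"{name}_{variable}_{year}.nc"


def gridinfo_path(name: str) -> Path:
    """Path to the grid info file for a given data set."""
    return ROOT / name / f"{name}_gridinfo.nc"


# e.g. forcing_path("GSWP3", "Tair", 1901)
# -> /data/cable-forcing/GSWP3/Tair/GSWP3_Tair_1901.nc
```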

Any feedback on the organisation is welcome.


Some feedback raised at the land working group meeting:

  1. Accessibility of these data sets is crucial - they are no good sitting in an ACCESS-NRI project directory on NCI.
  2. The standard should distinguish forcing data at different temporal and spatial resolutions.
  3. We will also want to use ACCESS model output to drive CABLE.
    • There is a question of how to balance post-processing the data against writing “data set specific” code when reading in the forcing. How much flexibility should we allow for in the standard?
  4. These data sets should sit behind some data provenance framework, especially for things like the grid info file.

I’m curious at what level customisation is needed. Finding the data? Finding specific fields within data files?

I was more getting at post-processing the specific fields. Disclaimer: I haven’t looked at the model outputs from ACCESS, so I’m not sure how much work would be needed.

Hopefully finding the data will be easy once we have some sort of framework set up.

I guess I’m wondering whether it is the fields themselves that need altering (e.g. a field that needs to be generated/synthesised from other existing variables, or variables that need regridding), or just the metadata (e.g. variable or coordinate names, or attributes that need adding)?

If it is just a question of finding the right variables/coordinates, the cf-xarray library is awesome for inferring important coordinates and variables.
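For example, here is a rough sketch of what that looks like with cf-xarray; the file name is hypothetical, and the lookups assume the files carry CF attributes (standard_name, axis, units):

```python
import xarray as xr
import cf_xarray  # noqa: F401  (registers the .cf accessor on xarray objects)

ds = xr.open_dataset("GSWP3_Tair_1901.nc")  # hypothetical forcing file

# Look up coordinates and variables by their CF meaning rather than their names,
# so it does not matter whether the file calls them "lat", "latitude", "y", ...
lat = ds.cf["latitude"]
time = ds.cf["time"]
tair = ds.cf["air_temperature"]  # found via the standard_name attribute
```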

Have you considered using an intake catalogue as the front-end for accessing your datasets? Unnecessarily complex?
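For what it’s worth, a minimal sketch of what that could look like (the catalogue file and entry names are made up, and this assumes the intake-xarray plugin for netCDF sources):

```python
import intake

# Hypothetical YAML catalogue describing the forcing collection.
cat = intake.open_catalog("cable_forcing.yaml")

# A catalogue entry would map onto <name>/<variable>/<name>_<variable>_<year>.nc files.
ds = cat["gswp3_tair"].to_dask()  # open lazily as an xarray Dataset
```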


We are talking about data that needs to be read in by CABLE, and about how much pre-processing is done outside CABLE versus inside. So the difficult question is: how much do we want to be doing in Fortran? :slight_smile:

@inh599 @RachelLaw I am not sure I understand why ACCESS data would be different from any other source of meteorological forcing data. Is the question around time interpolation? Isn’t that also a question for other meteorological forcing datasets? Shouldn’t we use the same approach for all of them? That is, either interpolate the met. forcings before CABLE, or allow CABLE to do the time interpolation. We could even support both approaches, with CABLE doing the time interpolation based on the time step found in the met. forcing rather than on the met. forcing type. That way, turning the interpolation on or off is driven by the actual data and not by a name.

To start with, I would say the fastest approach is to pre-process the data entirely outside CABLE, and implement the bits we want CABLE to handle afterwards.
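As a sketch of the “driven by the actual data” idea above (in Python for illustration, even though the decision would ultimately sit inside CABLE; the file name and model time step are hypothetical):

```python
import numpy as np
import xarray as xr

MODEL_DT = np.timedelta64(30, "m")  # example CABLE time step of 30 minutes

ds = xr.open_dataset("GSWP3_Tair_1901.nc")  # hypothetical forcing file
forcing_dt = np.diff(ds["time"].values[:2])[0]  # time step of the forcing data

# Interpolate (or run a weather generator) only when the forcing is coarser
# than the model time step, regardless of which data set it came from.
needs_interpolation = forcing_dt > MODEL_DT
```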


AS LITTLE AS POSSIBLE! :slight_smile:

Ok, I can see the problem now. One issue is that the more datasets you want to support, the larger the burden.

If the data that is read in by CABLE is relatively small, it might be easier to have a Python pre-processing step to grab the data and plonk it in a standard place, and to call this with payu before each run. Then the complexity of different data sources, metadata and formats is handled in Python, which is a lot easier for this sort of functionality. Just a thought.
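A minimal sketch of what such a pre-processing step might look like (using xarray; the source file, variable names and target data set name are all hypothetical), which payu could then call before each run:

```python
import xarray as xr

# Hypothetical source file from some external archive, with its own conventions.
src = xr.open_dataset("some_source_tas_1901.nc")

# Rename to the variable/coordinate names the standard expects, fix metadata,
# and write the result into the standard directory layout.
out = src.rename({"tas": "Tair", "longitude": "lon", "latitude": "lat"})
out["Tair"].attrs["units"] = "K"
out.to_netcdf("MyForcing/Tair/MyForcing_Tair_1901.nc")
```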

Yes, that is something I’m interested in integrating with - it would be useful for people doing data analysis. But I think it will work with any directory standard, as long as we can define roughly one directory standard (and not one standard per data source, as it seems to be right now).

Hi all,

The different use cases of offline CABLE have different requirements/expectations of their input data - much of this, unfortunately, is hard-wired into the codebase and reflects a period when storage was more limited than it is now. Some use cases (such as single-site studies) provide meteorology at 30-minute resolution, others such as global simulations using GSWP3 use gridded meteorology at 3- or 6-hourly time steps, and yet others such as BIOS or TRENDY use gridded daily data. Different use cases also expect different input variables (though there is quite a lot of overlap).

As an example of the question around pre-processing vs runtime, consider BIOS. This configuration can operate at 5 km resolution over Australia and runs for hundreds of years. It uses daily data for 7 input variables. CABLE-BIOS does the time interpolation within the model run (it’s actually weather generation, so not just linear interpolation or anything that can be done with a simple nco-type pre-processing step). If instead we pre-processed the inputs and then stored them, we would a) need to rewrite large chunks of the model code [and check that this hadn’t changed answers] and b) find storage for around 24 times (maybe 48 times) the data, since the daily forcing would become hourly or half-hourly (and you may want to trial different weather generators, so possibly more).

Regardless of the approach, you have to do this bespoke work for each data set (they are all different) - meaning there’s quite a lot of burden, especially for data that are regularly updated, such as the BIOS and TRENDY forcing.

There’s also the element of experimental flexibility. If, for instance, I want to do a simple experiment such as “What happens if the world were uniformly 1 K warmer?”, this is a lot easier to do as a one-liner in the code (with a switch) than by re-processing all the meteorology. However, equally you don’t want to allow lots and lots of this kind of stuff in the code (consider the TRENDY sections of the code, where depending on the experiment you use different combinations of forcing, different switches, etc. It’s all embedded in the code and distributed across multiple places).

Another aspect to this is data and code provenance - it’s a lot easier to say (and justify, and version control) that we used external data source x (DOI) and model y (git repo) than to try to point to or publish a temporary data set.

Regarding ACCESS - I don’t see ACCESS-generated forcing as something particularly different. We still have to worry about which variables, units, naming conventions, resolution and metadata - i.e. making sure that the ACCESS forcing can be used by offline CABLE. We still have to worry about consistency with the other ancillaries. There is a bit of a thought process needed around how we generate and store the ACCESS output - do we output daily data or data on the model time step? Also, do we try to bias-correct the forcing data (for multiple sources of error)? And what do we do about the data needed for the spin-up process?

Basically I can see lots of issues but haven’t landed on a firm opinion on the way to go - or at least on a way forward that doesn’t involve stopping everything for a substantial period and rewriting the code.


Second this! :grin:
In my experience the standard global datasets need changes to attributes and maybe units rather than to the data itself. This is usually fairly easy with standard netCDF tools.
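For example, a metadata-only fix of that kind might look something like this (here with the netCDF4 Python library rather than NCO; the file, variable and attribute values are hypothetical):

```python
from netCDF4 import Dataset

# Open the hypothetical forcing file in place and patch its metadata only.
with Dataset("MyForcing_Tair_1901.nc", "r+") as nc:
    tair = nc.variables["Tair"]
    tair.units = "K"                         # correct or insert the units attribute
    tair.standard_name = "air_temperature"   # add a CF standard_name
```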

Ideally we could give the pre-processing tool data at daily resolution and it would then run a weather generator, rather than doing this inside CABLE. We are already using the CABLE-BIOS weather generator outside CABLE to make sub-daily AWAP forcings.

It seems to me we will need a 2-step process (at least):

  1. Regroup what we currently have into a “dirty” collection. Use a specific naming standard for the directory and filenames. This contains the met. forcing data we already have scattered all over the place. We can’t make it discoverable through the NCI data catalogue, but it can be listed in the CABLE documentation with the proper disclaimers around the lack of provenance information. This way, we can at least start understanding what people are using.

  2. Publish a proper collection (or collections) with the original met. forcing data, if possible, and a pre-processing tool to format the forcing data for CABLE. We may need to modify the original met. forcing data if it isn’t CF-compliant because we will want our published data to follow that standard.