Proposed CABLE Output Redesign

As part of our work improving the usability, performance and maintainability of the CABLE code, we are redesigning the way CABLE handles output. Internally, this means leveraging parallel I/O systems and providing a uniform way of defining output in the code. For users, this means the opportunity for us to provide a much more flexible and powerful API for specifying exactly what you want from your simulations.

Below is our proposal of what that API would look like: what it’s capable of, its limitations, and ways to achieve common goals. There’s also a fairly plain text PDF (96.8 KB) if you’d prefer to read that. We want to get your feedback before we go ahead with this: are there things you want to achieve that this wouldn’t address, or features you’d like to see that aren’t included?


Output Configuration File

The output configuration file is a YAML file, designed to allow flexible output patterns for a range of use cases. The output configuration is centred around two ideas: output streams and output variables. An output stream is linked to a single output file and write frequency (to avoid multiple time axes in a single file), and each output variable is associated with a single stream. As such, the configuration file is split into two sections: streams and variables.
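
As a sketch of the overall layout, a complete configuration combining the two sections might look like the following (the individual keys are explained in the sections below; the stream number, file name and variable here are purely illustrative):

streams:
    1:
        file_name: cable_output.nc
        frequency: daily

variables:
    - name: GPP
      stream: 1
      aggregation: mean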

Streams

An output stream directs a set of variables to a given output file. An output configuration can have any number of streams. Each stream must define:

  • file_name: File name to write the stream to.
  • frequency: The writing frequency, which is the same as the aggregation frequency (see the Variables section for aggregation methods). The available frequencies are:
    • timestep: write on the base model timestep.
    • 3hrly: write every 3 hours.
    • daily: write at the end of each day.
    • monthly: write at the end of each month.
    • yearly: write at the end of each year.

These are the only settings required for a minimum stream specification. There are additional optional settings which can be defined:

  • netcdf_name: The NetCDF variable name template to apply by default to each variable in the stream. This can be overridden per variable in the variables section. It supports string substitution. Defaults to "{field_name}".
  • shuffle: Whether to apply shuffle compression. Defaults to true.
  • compression_level: What compression level to apply. Defaults to 1.
  • separate_file_per_variable: Whether to write each variable in this stream to an individual file, e.g. to match CMIP standards. The file names match the netcdf_name, so this overrides the file_name specification. Defaults to false.
  • metadata: A sub-dictionary listing global attributes to apply to the NetCDF file. It supports string substitution. Defaults to an empty dictionary.

An example of a minimum valid stream specification would be:

streams:
    1:
        file_name: cable_output.nc
        frequency: daily

In this case, every variable directed to stream 1 would be aggregated and written with daily frequency to cable_output.nc. The NetCDF variable names would be the field names for each variable. An example of a maximum valid stream specification would be:

streams:
    1:
        file_name: cable_output.nc
        frequency: monthly
        netcdf_name: "{field_name}_{aggregation}"
        shuffle: true
        compression_level: 4
        separate_file_per_variable: false
        metadata:
            model: CABLE
            experiment: example_experiment_id

Here, every variable directed to stream 1 would be aggregated and written with monthly frequency to cable_output.nc, compressed with _Shuffle=True and _DeflateLevel=4. The NetCDF variable names would be the field names followed by the aggregation method. The global metadata would contain the model: "CABLE" and experiment: "example_experiment_id" entries.

Variables

An output variable defines precisely what quantities should be written for the given simulation. The variables are provided in a list format. Each variable must define:

  • name: The name of the variable being described. This is the same as the field_name (see String Substitution).
  • stream: The stream to direct the current variable to. Each variable can only be directed to a single stream.
  • aggregation: The aggregation method to apply to the variable. The aggregation period is defined by the frequency of the target stream. The available aggregation methods are:
    • mean: write the per-element average over the period.
    • sum: write the per-element sum over the period.
    • max: write the per-element maximum over the period.
    • min: write the per-element minimum over the period.
    • instant: write the instantaneous state at the time of writing.

These are the only settings required for a minimum variable specification. The full list of variables which can be written will be described in the CABLE documentation. There are additional optional settings which can be defined:

  • reduction: Which reduction method to apply to the variable. See the Reduction Methods section for information on reduction methods. Defaults to none.
  • netcdf_name: The NetCDF name to use for the specified variable. It supports string substitution. If specified, overrides the stream netcdf_name. Defaults to "{field_name}".
  • metadata: A sub-dictionary listing variable attributes to apply to the NetCDF variable. It supports string substitution. There are some attributes that are reserved, which are always applied:
    • standard_name: The CF-compliant name where one is already defined; otherwise a name will be created based on the CF guidelines.
    • long_name: A long name describing the variable.
    • units: Units of the data.
    • cell_methods: The appropriate cell methods for the specified frequency and aggregation method.

An example of a minimum variable specification would be:

variables:
    - name: GPP
      stream: 1
      aggregation: mean

In this instance, the mean of the GPP would be written to stream 1, with the frequency specified in the configuration of stream 1. An example of a maximum valid variable specification would be:

variables:
    - name: GPP
      stream: 1
      aggregation: mean
      reduction: none
      netcdf_name: GPP_mean
      metadata:
        description: Computed using X algorithm.

Groups and Modules

To facilitate configuration of many linked variables at once, groups and modules are provided as convenience tools. Each group encompasses a set of variables, and each module encompasses a set of groups. Groups and modules permit all the same settings as a variable configuration, but they apply the settings to all variables within the group or module. An example of a valid group specification would be:

groups:
    - name: carbon_pools
      stream: 1
      aggregation: mean
      netcdf_name: "{field_name}_{aggregation}"

In this case, all variables within the carbon_pools group would have daily means written to cable_output.nc, with the NetCDF names for variables within the group being named using the "{field_name}_{aggregation}" template. A valid module example would be:

modules:
    - name: biogeophysics
      stream: 1
      aggregation: sum

In this case, all variables within the biogeophysics module would have daily sums written to cable_output.nc, with the NetCDF names set according to the stream’s netcdf_name. Note that these are purely convenience constructs: an identical result would be achieved by specifying every variable in the carbon_pools group with the same settings in the first case (sketched below), or every variable (or every group) in the biogeophysics module with the same settings in the second case. The list of possible groups and modules, and their contents, will be provided in the CABLE documentation.
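
As a sketch of this equivalence, if carbon_pools contained labile_carbon (as in the example further below) plus a hypothetical soil_carbon field, the group specification above would expand to the following explicit variable entries:

variables:
    - name: labile_carbon
      stream: 1
      aggregation: mean
      netcdf_name: "{field_name}_{aggregation}"

    - name: soil_carbon
      stream: 1
      aggregation: mean
      netcdf_name: "{field_name}_{aggregation}"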

Specifications at different levels (variable, group, and module) are always additive. There are two rules that must be followed:

  1. There cannot be two variables with the same netcdf_name (after substitution) in the same stream.
  2. Numerically identical variables cannot be directed to the same stream.

An example of a configuration that would violate rule 1 would be an instance where a group, and a variable within that group but with a different aggregation method, are directed to the same stream, like this:

groups:
    - name: carbon_pools
      stream: 1
      aggregation: mean
        
variables:
    - name: labile_carbon
      stream: 1
      aggregation: instant

The carbon_pools group contains labile_carbon, so labile_carbon will have both the means and the instantaneous values written. NetCDF names must be unique within a file, so the above specification would be illegal without a netcdf_name specification for either carbon_pools or labile_carbon, as both a mean and an instantaneous variable would be trying to use the labile_carbon NetCDF name.
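
As a sketch, one way to make the above specification legal would be to give the explicit labile_carbon entry its own netcdf_name (the template could equally be applied to the carbon_pools group instead):

groups:
    - name: carbon_pools
      stream: 1
      aggregation: mean

variables:
    - name: labile_carbon
      stream: 1
      aggregation: instant
      netcdf_name: labile_carbon_instant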

An example which would violate rule 2 would be an instance where a group and a variable within that group were directed to the same stream with the same aggregation and reduction methods, even if the NetCDF names were unique, like this:

groups:
    - name: carbon_pools
      stream: 1
      aggregation: mean
        
variables:
    - name: labile_carbon
      stream: 1
      aggregation: mean
      netcdf_name: labile_carbon_mean

In this example, both entries have the same aggregation and reduction methods, so the explicit labile_carbon specification would be a numerical duplicate of the labile_carbon output already produced by the carbon_pools group, even though it would have a different NetCDF name.

Reduction Methods

Reduction methods can be applied to the internal tiled representation of the data within a grid cell. These typically reduce the dimensionality of the data. The available reduction methods are:

  • grid_cell_average: Reduce a variable defined on a per-tile basis to a single value per grid cell, by applying an area-weighted average.
  • first_tile_on_cell: Reduce the variable defined on a per-tile basis to the first tile on the cell. Typically used for writing data that is constant across tiles in a cell e.g. atmospheric forcing.
  • dominant_tile: Reduce the variable defined on a per-tile basis to the dominant tile on the cell.
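
As a sketch of how a reduction would be requested under this proposal, the following entry would write area-weighted grid-cell means of GPP rather than per-tile values (assuming GPP is defined per tile and stream 1 exists as in the earlier examples):

variables:
    - name: GPP
      stream: 1
      aggregation: mean
      reduction: grid_cell_average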

String Substitution

String substitution is used to programmatically generate strings based on the current context. A set of pre-defined substitution targets can be used as {<substitution target>}, combined with any other characters to build a string with substitution. The available substitution targets are:

  • field_name: The in-built field name for the variable, e.g. gpp for gross primary production, ps for the surface air pressure. For variables that have a CMIP-defined field id, this name will be used.
  • frequency: The frequency at which the aggregation is applied, as defined in the Streams section.
  • aggregation: The aggregation method applied to the variable, as defined in the Variables section.
  • reduction: The reduction method applied to the variable, as defined in the Reduction Methods section.
  • start_date: The start date of the data in YYYY-MM-DD format.
  • end_date: The end date of the data in YYYY-MM-DD format.
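
As an illustration of string substitution, the following hypothetical entry combines several targets in a netcdf_name template. For a monthly mean of GPP over the year 2000, the generated variable name would be something like GPP_mean_2000-01-01_2000-12-31 (assuming the field name resolves to GPP; the exact dates are illustrative):

variables:
    - name: GPP
      stream: 1
      aggregation: mean
      netcdf_name: "{field_name}_{aggregation}_{start_date}_{end_date}"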

Examples

Specifying multiple streams

If we want to separate the biogeophysical and biogeochemical variables, we can create multiple data streams and direct the respective modules to different streams. The following configuration would direct the monthly means of all biogeophysical variables to cable_biogeophysics_output.nc and the monthly means of all biogeochemistry variables to cable_biogeochemistry_output.nc:

streams:
    1:
        file_name: cable_biogeophysics_output.nc
        frequency: monthly
    
    2:
        file_name: cable_biogeochemistry_output.nc
        frequency: monthly
        
modules:
    - name: biogeophysics
      stream: 1
      aggregation: mean
     
    - name: biogeochemistry
      stream: 2
      aggregation: mean

Adding more output to a stream

We may want to inspect how well the model conserves water and energy. We can direct all variables describing this conservation to the same stream. The following configuration would direct the daily sums of the water_cycle and energy_cycle variables to cable_water_energy.nc:

streams:
    1:
        file_name: cable_water_energy.nc
        frequency: daily
        
groups:
    - name: water_cycle
      stream: 1
      aggregation: sum
      
    - name: energy_cycle
      stream: 1
      aggregation: sum

Saving variables with different frequencies

We may want to investigate the diurnal cycle of a few select variables, while leaving others at a lower frequency. The following configuration would write the monthly means of all the water_cycle variables to cable_water_cycle.nc, with 3-hourly instantaneous values of evaporation and Qle written to cable_evaporation_3hrly.nc:

streams:
    1:
        file_name: cable_water_cycle.nc
        frequency: monthly
  
    2:
        file_name: cable_evaporation_3hrly.nc
        frequency: 3hrly
        
groups:
    - name: water_cycle
      stream: 1
      aggregation: mean

variables:
    - name: evaporation
      stream: 2
      aggregation: instant
      
    - name: Qle
      stream: 2
      aggregation: instant

Directing different aggregations to the same stream

We may want to inspect the instantaneous heat and water fluxes, to compare how they change with the model state, while also keeping track of the sums of the rest of the water and energy variables. The following example would write 3-hourly instantaneous values of evaporation, transpiration, Qh and Qle, as well as the sums of all the variables in the water_cycle and energy_cycle groups (which are evaporation, transpiration, Qh and Qle). We need to ensure that there are no NetCDF naming conflicts by providing a netcdf_name to either the groups or the respective variables (or both).

streams:
    1:
        file_name: cable_water_diurnal.nc
        frequency: 3hrly
        
groups:
    - name: water_cycle
      stream: 1
      aggregation: sum
      netcdf_name: "{field_name}_{aggregation}"
        
    - name: energy_cycle
      stream: 1
      aggregation: sum
      netcdf_name: "{field_name}_{aggregation}"
      
variables:
    - name: evaporation
      stream: 1
      aggregation: instant
      netcdf_name: evaporation_instant

    - name: transpiration
      stream: 1
      aggregation: instant
      netcdf_name: transpiration_instant
      
    - name: Qh
      stream: 1
      aggregation: instant
      netcdf_name: Qh_instant
      
    - name: Qle
      stream: 1
      aggregation: instant
      netcdf_name: Qle_instant
         

Compressing a stream

Data intended for long-term storage can be compressed to reduce its footprint on disk. The following configuration would apply shuffle compression and level 4 deflation to the output:

streams:
    1:
       file_name: cable_compressed_output.nc
       frequency: daily
       shuffle: true
       compression_level: 4
       
modules:
    - name: biogeophysics
      stream: 1
      aggregation: mean

Splitting a stream into separate files

To split the output into individual files, as required by the conventions of many MIPs, set the separate_file_per_variable setting in the stream specification. The following example would write each of the variables in the water_cycle group to a separate file:

streams:
    1:
        file_name: unused_file_name.nc
        frequency: daily
        separate_file_per_variable: true
        netcdf_name: "{field_name}_{aggregation}_{start_date}-{end_date}"

groups:
    - name: water_cycle
      stream: 1
      aggregation: mean

In this case, the file_name is unused, as the individual files are named according to the netcdf_name (with ".nc" appended). The netcdf_name is not explicitly required here; if left out, the default "{field_name}" is used. The same rule applies for file names as for NetCDF variable names: there cannot be duplicate generated file names.


Thanks Lachlan, that looks like a very good system.

I just wanted to ask about the variable names. They are often not as obvious as one thinks. For example, I always thought that GPP is rather clear, but Georg Wohlfahrt convinced me otherwise (Wohlfahrt and Gu, PCE 2015, https://doi.org/10.1111/pce.12569).

The name of the variable in the code would of course be pretty straightforward but not very explicit (rather cryptic indeed). More explicit names with underscores are used in other projects, like the standard_name in CMIP.

I guess every option might have its pros and cons. So wherever the list of possible variables is made accessible, there should be a column with the variable name in the code, or how it is derived in the code (e.g. GPP = -canopy%fpn + canopy%frday).

Usability-wise, would it be better to use a string to name the streams rather than an integer? Assuming you can’t have multiple streams going to the same file you could use the file name to make it clearer where each variable goes.

On the variable names: yes, the choice of variable names is not an obvious one, but we will ensure that the documentation we eventually release makes the meaning of the variables clear from both a scientific (what does the variable mean) and a technical (what internal variable does it represent) point of view.

On naming the streams: it would be quite easy to make the streams named; technically the “1”, “2” etc. will be interpreted as strings by default. @SeanBryan51 is doing the lion’s share of the work to implement this, so I’ll get his thoughts before I promise anything.

I’m happy to go with this approach.

And yes - there cannot be any duplicate file names in the spec (i.e. multiple streams going to the same file).

Also, if we specify the file name for each variable, the streams section could probably be a list structure rather than key-value pairs.

I agree @mcuntz. My point of view here is to make use of the self-describing nature of netCDF to help with this. So instead of improving the variable name itself (as long as it is an acceptable one), I would focus on providing standard_name and long_name by default for each variable. I know it isn’t enough. We could add more description attributes if people have ideas, and as Lachlan says, we will also ensure the documentation is solid and descriptive.

Once we get to it, input on how much documentation is enough and what should be in it will be very welcome.

Hi all, and thanks for the effort to date. Three areas that may need a bit of thought:

  1. It wasn’t obvious to me whether there is user-side flexibility in determining which variables can be assigned to which groups, e.g. is the example carbon_pools user-determined somewhere in the YAML file, or would changes to its composition require code-side changes? At the moment we do have the equivalent of groups, but they are defined in the code and triggered via namelist options (e.g. cable_output%carbon = .true.). Strike that: I’ve spotted the note that groups and modules would be provided via the docs. We may need a user feedback session at some point to determine a sensible degree of aggregation though.
  2. spatial vs vector output: There are justifiable reasons why we could want flexibility in whether we output as time x land point, time x grid-cell x tile, or time x latitude x longitude x tile (with/without reduction on the tile dimension). My guess is that this is possible using the routines in current HEAD*, but it could be worthwhile building flexibility to decide which way to go into this new output approach as well (likely at the stream level).
  3. future proofing: we have existing developments (C13, POP, POPLUC) and developments in train (BLAZE, plant hydraulics) where the output (or at least some of it) is not handled by the existing routines in cable_output. How simple will it be for users to extend this new approach to these other areas of the code?

*This flexibility exists for the CABLE output; the POP-TRENDY branch has equivalent capability for CASA output.

On 2, are the reasons primarily storage concerns? Honestly, we were hoping that the NetCDF compression would perform better than it did, so we could justify not supporting the compressed grid.

I just did a quick test of how well the compression performs on a naive nveg x nlon x nlat x ntime grid, using the gridinfo file at /g/data/rp23/no_provenance/gridinfo/gridinfo_CSIRO_1x1.nc, which has only 1 tile per grid cell (so 17 x 360 x 150 x 100 grid of Float64 with ~98% sparsity), vs a flattened ntile x ntime grid.

Compression settings vs. size on disk:

  • DeflateLevel=1, shuffle=true: full grid 18 MB, flat grid 319 kB
  • DeflateLevel=4, shuffle=true: full grid 6.7 MB, flat grid 298 kB
  • DeflateLevel=9, shuffle=true: full grid 4.5 MB, flat grid 298 kB

Without compression, the size on disk of the full grid would be ~730MB, so the compression has been effective, but we’re still an order of magnitude larger than the flattened grid at DeflateLevel=4, which is barely slower than DeflateLevel=1.