As part of our work improving the usability, performance and maintainability of the CABLE code, we are redesigning the way CABLE handles output. Internally, this means leveraging parallel I/O systems and providing a uniform way of defining output in the code. For users, this means the opportunity for us to provide a much more flexible and powerful API for specifying exactly what you want from your simulations.
Below is our proposal for what that API would look like: what it is capable of, its limitations, and ways to achieve common goals. There’s also a fairly plain text PDF (96.8 KB) if you’d prefer to read that. We want your feedback before we go ahead with this: are there things you want to achieve that this wouldn’t address, or features you’d like to see that aren’t included?
Output Configuration File
The output configuration file is a YAML file, designed to allow flexible output patterns for a range of use cases. The output configuration is centred around two ideas: output streams and output variables. An output stream is linked to a single output file and write frequency (to avoid multiple time axes in a single file), and each output variable is associated with a single stream. As such, the configuration file is split into two sections: streams and variables.
Streams
An output stream directs a set of variables to a given output file. An output configuration can have any number of streams. Each stream must define:
- file_name: File name to write the stream to.
- frequency: The writing frequency, which is the same as the aggregation frequency (see the Variables section for aggregation methods). The available frequencies are:
  - timestep: write on the base model timestep.
  - 3hrly: write every 3 hours.
  - daily: write at the end of each day.
  - monthly: write at the end of each month.
  - yearly: write at the end of each year.
These are the only settings required for a minimum stream specification. There are additional optional settings which can be defined:
- netcdf_name: The NetCDF variable name template to apply by default to each variable. This can be overridden per variable in the variables section. It supports string substitution. Defaults to "{field_name}".
- shuffle: Whether to apply shuffle compression. Defaults to true.
- compression_level: What compression level to apply. Defaults to 1.
- separate_file_per_variable: Whether to write each variable in this stream to an individual file, e.g. to match CMIP standards. The file names match the netcdf_name, so this overrides the file_name specification. Defaults to false.
- metadata: A sub-dictionary listing global attributes to apply to the NetCDF file. It supports string substitution. Defaults to an empty dictionary.
An example of a minimum valid stream specification would be:
streams:
  1:
    file_name: cable_output.nc
    frequency: daily
In this case, every variable directed to stream 1 would be aggregated and written with daily frequency to cable_output.nc. The NetCDF variable names would be the field names for each variable. An example of a stream specification that sets every available option would be:
streams:
  1:
    file_name: cable_output.nc
    frequency: monthly
    netcdf_name: "{field_name}_{aggregation}"
    shuffle: true
    compression_level: 4
    separate_file_per_variable: false
    metadata:
      model: CABLE
      experiment: example_experiment_id
Here, every variable directed to stream 1 would be aggregated and written with monthly frequency to cable_output.nc, compressed with _Shuffle=True and _DeflateLevel=4. The NetCDF variable names would be the field names followed by the aggregation method. The global metadata would contain the model: "CABLE" and experiment: "example_experiment_id" entries.
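To make the required and optional stream settings concrete, here is a minimal Python sketch of how a stream entry could be checked and completed with the defaults listed above. The names resolve_stream, STREAM_DEFAULTS and VALID_FREQUENCIES are hypothetical and only for illustration; this is not how CABLE itself will implement it.

```python
# Hypothetical helper names; a sketch of the rules described above only.
VALID_FREQUENCIES = {"timestep", "3hrly", "daily", "monthly", "yearly"}

# Defaults as listed for the optional stream settings.
STREAM_DEFAULTS = {
    "netcdf_name": "{field_name}",
    "shuffle": True,
    "compression_level": 1,
    "separate_file_per_variable": False,
    "metadata": {},
}


def resolve_stream(stream):
    """Check the required settings of a stream and fill in the defaults."""
    for key in ("file_name", "frequency"):
        if key not in stream:
            raise ValueError(f"stream is missing required setting '{key}'")
    if stream["frequency"] not in VALID_FREQUENCIES:
        raise ValueError(f"unknown frequency '{stream['frequency']}'")
    resolved = dict(STREAM_DEFAULTS)
    resolved["metadata"] = {}  # fresh dict so streams never share metadata
    resolved.update(stream)    # explicit settings override the defaults
    return resolved


minimal = resolve_stream({"file_name": "cable_output.nc", "frequency": "daily"})
print(minimal["compression_level"])  # 1
```

With the minimum specification, every optional setting takes its default, which is why the NetCDF names fall back to the plain field names.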
Variables
An output variable defines precisely what quantities should be written for the given simulation. The variables are provided in a list format. Each variable must define:
- name: The name of the variable being described. This is the same as the field_name (see String Substitution).
- stream: The stream to direct the current variable to. Each variable can only be directed to a single stream.
- aggregation: The aggregation method to apply to the variable. The aggregation period is defined by the frequency of the target stream. The available aggregation methods are:
  - mean: write the per-element average over the period.
  - sum: write the per-element sum over the period.
  - max: write the per-element maximum over the period.
  - min: write the per-element minimum over the period.
  - instant: write the instantaneous state at the time of writing.
These are the only settings required for a minimum variable specification. The full list of variables which can be written will be described in the CABLE documentation. There are additional optional settings which can be defined:
- reduction: Which reduction method to apply to the variable. See the Reduction Methods section for information on reduction methods. Defaults to none.
- netcdf_name: The NetCDF name to use for the specified variable. It supports string substitution. If specified, overrides the stream netcdf_name. Defaults to "{field_name}".
- metadata: A sub-dictionary listing variable attributes to apply to the NetCDF variable. It supports string substitution. Some attributes are reserved and are always applied:
  - standard_name: The CF-compliant name where one is already defined; otherwise a name will be created based on the CF guidelines.
  - long_name: A long name describing the variable.
  - units: Units of the data.
  - cell_methods: The appropriate cell methods for the specified frequency and aggregation method.
An example of a minimum variable specification would be:
variables:
  - name: GPP
    stream: 1
    aggregation: mean
In this instance, the mean of the GPP would be written to stream 1, with the frequency specified in the configuration of stream 1. An example of a variable specification that sets every available option would be:
variables:
  - name: GPP
    stream: 1
    aggregation: mean
    reduction: none
    netcdf_name: GPP_mean
    metadata:
      description: Computed using X algorithm.
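As an illustration of what the aggregation methods mean per element, the following Python sketch aggregates a short series of timestep values. The function name aggregate is hypothetical, and we assume here that instant corresponds to the last timestep of the period, since writing happens at the end of the period. It is a sketch of the semantics, not the model code.

```python
def aggregate(samples, method):
    """Aggregate per-timestep samples element by element.

    `samples` is a list of equally sized lists, one per model timestep in
    the aggregation period, ordered in time.
    """
    if not samples:
        raise ValueError("no samples in the aggregation period")
    columns = list(zip(*samples))  # one tuple of values per element
    if method == "mean":
        return [sum(col) / len(col) for col in columns]
    if method == "sum":
        return [sum(col) for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    if method == "min":
        return [min(col) for col in columns]
    if method == "instant":
        # Assumption: the state at the time of writing is the last sample.
        return list(samples[-1])
    raise ValueError(f"unknown aggregation method '{method}'")


# Three timesteps of a variable defined on two elements.
period = [[1.0, 4.0], [2.0, 5.0], [3.0, 9.0]]
print(aggregate(period, "mean"))     # [2.0, 6.0]
print(aggregate(period, "instant"))  # [3.0, 9.0]
```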
Groups and Modules
To facilitate configuration of many linked variables at once, groups and modules are provided as convenience tools. Each group encompasses a set of variables, and each module encompasses a set of groups. Groups and modules permit all the same settings as a variable configuration, but they apply the settings to all variables within the group or module. An example of a valid group specification would be:
groups:
  - name: carbon_pools
    stream: 1
    aggregation: mean
    netcdf_name: "{field_name}_{aggregation}"
In this case, assuming stream 1 is the daily stream from the minimum stream example, all variables within the carbon_pools group would have daily means written to cable_output.nc, with their NetCDF names generated from the "{field_name}_{aggregation}" template. A valid module example would be:
modules:
  - name: biogeophysics
    stream: 1
    aggregation: sum
In this case, all variables within the biogeophysics module would have daily sums written to cable_output.nc, with the NetCDF names set according to the stream’s netcdf_name. Note that these are purely convenience constructors: an identical result would be achieved by specifying every variable in the carbon_pools group with the same settings in the first case, or every variable (or every group) in the biogeophysics module with the same settings in the second case. The list of possible groups and modules, and their contents, will be provided in the CABLE documentation.
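The idea that groups and modules are only shorthand for lists of variables can be sketched in Python as a simple expansion step. The group and module contents below are hypothetical placeholders (the real lists will come from the CABLE documentation), and expand is an illustrative name rather than part of the proposed API.

```python
# Hypothetical contents; the real lists will be in the CABLE documentation.
GROUPS = {
    "carbon_pools": ["labile_carbon", "plant_carbon"],
    "water_cycle": ["evaporation", "transpiration"],
}
MODULES = {"biogeochemistry": ["carbon_pools"]}


def expand(config):
    """Flatten module, group and variable entries into variable specs."""
    specs = []
    for module in config.get("modules", []):
        settings = {k: v for k, v in module.items() if k != "name"}
        for group in MODULES[module["name"]]:
            specs += [{"name": var, **settings} for var in GROUPS[group]]
    for group in config.get("groups", []):
        settings = {k: v for k, v in group.items() if k != "name"}
        specs += [{"name": var, **settings} for var in GROUPS[group["name"]]]
    # Explicit variables are added on top: levels are additive.
    specs += [dict(var) for var in config.get("variables", [])]
    return specs


config = {"groups": [{"name": "carbon_pools", "stream": 1, "aggregation": "mean"}]}
for spec in expand(config):
    print(spec)
```

Each resulting entry has exactly the shape of a hand-written variable specification, which is what makes the group form and the expanded form equivalent.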
Specifications made at different levels (modules, groups and variables) are always additive. There are two rules that must be followed:
1. There cannot be two variables with the same netcdf_name (after substitution) in the same stream.
2. Numerically identical variables cannot be directed to the same stream.
A configuration would violate rule 1 if a group, and a variable within that group with a different aggregation method, were directed to the same stream without distinct NetCDF names, like this:
groups:
  - name: carbon_pools
    stream: 1
    aggregation: mean
variables:
  - name: labile_carbon
    stream: 1
    aggregation: instant
The carbon_pools group contains labile_carbon, so both the mean and the instantaneous values of labile_carbon would be written. NetCDF names must be unique within a file, so the above specification is invalid unless netcdf_name is set for either carbon_pools or labile_carbon; otherwise both the mean and the instantaneous variable would try to use the labile_carbon NetCDF name.
An example which would violate rule 2 would be an instance where a group and a variable within that group were directed to the same stream with the same aggregation and reduction methods, even if the NetCDF names were unique, like this:
groups:
  - name: carbon_pools
    stream: 1
    aggregation: mean
variables:
  - name: labile_carbon
    stream: 1
    aggregation: mean
    netcdf_name: labile_carbon_mean
In this example, both specifications use the same aggregation and reduction methods, so the explicit labile_carbon specification would be a numerical duplicate of the labile_carbon output already requested through the carbon_pools group, even though it has a different NetCDF name.
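The two rules can be expressed as a small check over the expanded list of variable specifications. The Python sketch below is one possible reading of them; check_rules is a hypothetical name, the input is assumed to be already expanded from groups and modules, and only the field_name, aggregation and reduction substitution targets are handled to keep it short.

```python
def check_rules(specs, default_template="{field_name}"):
    """Return a list of rule violations for already-expanded variable specs.

    Each spec is a dict with name, stream and aggregation and, optionally,
    reduction and netcdf_name.
    """
    errors = []
    seen_names = set()    # (stream, NetCDF name after substitution)
    seen_numeric = set()  # (stream, field, aggregation, reduction)
    for spec in specs:
        stream = spec["stream"]
        reduction = spec.get("reduction", "none")
        template = spec.get("netcdf_name", default_template)
        netcdf_name = template.format(
            field_name=spec["name"],
            aggregation=spec["aggregation"],
            reduction=reduction,
        )
        name_key = (stream, netcdf_name)
        numeric_key = (stream, spec["name"], spec["aggregation"], reduction)
        if name_key in seen_names:
            errors.append(
                f"rule 1: duplicate NetCDF name '{netcdf_name}' in stream {stream}"
            )
        if numeric_key in seen_numeric:
            errors.append(
                f"rule 2: '{spec['name']}' duplicated numerically in stream {stream}"
            )
        seen_names.add(name_key)
        seen_numeric.add(numeric_key)
    return errors


# labile_carbon as it would arrive from the carbon_pools group, plus the
# explicit variable entry from each of the two examples above.
from_group = {"name": "labile_carbon", "stream": 1, "aggregation": "mean"}
rule1_case = [from_group, {"name": "labile_carbon", "stream": 1, "aggregation": "instant"}]
rule2_case = [from_group, {**from_group, "netcdf_name": "labile_carbon_mean"}]
print(check_rules(rule1_case))
print(check_rules(rule2_case))
```

The first case reports only a rule 1 violation and the second only a rule 2 violation, which shows that the two rules are independent of each other.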
Reduction Methods
Reduction methods can be applied to the internal tiled representation of the data within a grid cell. These typically reduce the dimensionality of the data. The available reduction methods are:
- grid_cell_average: Reduce a variable defined on a per-tile basis to a single value per grid cell, by applying an area-weighted average.
- first_tile_on_cell: Reduce the variable defined on a per-tile basis to the first tile on the cell. Typically used for writing data that is constant across tiles in a cell, e.g. atmospheric forcing.
- dominant_tile: Reduce the variable defined on a per-tile basis to the dominant tile on the cell.
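For readers less familiar with tiling, the Python sketch below shows the three reductions applied to the tiles of a single grid cell. It assumes that each tile has an area fraction and that the dominant tile is the one with the largest area fraction; the proposal does not define this precisely, so treat that as an assumption. The function signatures are illustrative only.

```python
def grid_cell_average(values, area_fractions):
    """Area-weighted average of the tile values in one grid cell."""
    total = sum(area_fractions)
    return sum(v * a for v, a in zip(values, area_fractions)) / total


def first_tile_on_cell(values, area_fractions):
    """Value on the first tile of the cell."""
    return values[0]


def dominant_tile(values, area_fractions):
    """Value on the tile with the largest area fraction (assumed definition)."""
    index = max(range(len(values)), key=lambda i: area_fractions[i])
    return values[index]


# One grid cell with three tiles: per-tile values and area fractions.
gpp = [2.0, 4.0, 10.0]
fractions = [0.5, 0.25, 0.25]
print(grid_cell_average(gpp, fractions))  # 4.5
print(dominant_tile(gpp, fractions))      # 2.0
```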
String Substitution
String substitution is used to programmatically generate strings based on the current context. A set of pre-defined substitution targets can be used as {<substitution target>}, combined with any other characters to build a string with substitution. The available substitution targets are:
- field_name: The in-built field name for the variable, e.g. gpp for gross primary production, ps for the surface air pressure. For variables that have a CMIP-defined field id, this name will be used.
- frequency: The frequency at which the aggregation is applied, as defined in the Streams section.
- aggregation: The aggregation method applied to the variable, as defined in the Variables section.
- reduction: The reduction method applied to the variable, as defined in the Reduction Methods section.
- start_date: The start date of the data in YYYY-MM-DD format.
- end_date: The end date of the data in YYYY-MM-DD format.
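The substitution syntax happens to match Python's str.format, so the behaviour can be tried out with a few lines of Python. This is only a stand-in to show the intended results: substitute is a hypothetical helper, the context values are made up, and the real implementation may differ in how it reports unknown targets.

```python
def substitute(template, context):
    """Expand {target} placeholders, rejecting targets that are not defined."""
    try:
        return template.format(**context)
    except KeyError as err:
        raise ValueError(f"unknown substitution target {err}") from None


# Example context for one variable; the values are illustrative.
context = {
    "field_name": "gpp",
    "frequency": "monthly",
    "aggregation": "mean",
    "reduction": "none",
    "start_date": "2000-01-01",
    "end_date": "2000-12-31",
}
print(substitute("{field_name}_{aggregation}", context))  # gpp_mean
print(substitute("{field_name}_{frequency}_{start_date}-{end_date}", context))
# gpp_monthly_2000-01-01-2000-12-31
```

A template that uses a name outside the list above, such as "{aggregation_method}", raises an error in this sketch rather than being written out literally.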
Examples
Specifying multiple streams
If we want to separate the biogeophysical and biogeochemical variables, we can create multiple data streams and direct the respective modules to different streams. The following configuration would direct the monthly means of all biogeophysics variables to cable_biogeophysics_output.nc and the monthly means of all biogeochemistry variables to cable_biogeochemistry_output.nc:
streams:
  1:
    file_name: cable_biogeophysics_output.nc
    frequency: monthly
  2:
    file_name: cable_biogeochemistry_output.nc
    frequency: monthly
modules:
  - name: biogeophysics
    stream: 1
    aggregation: mean
  - name: biogeochemistry
    stream: 2
    aggregation: mean
Adding more output to a stream
We may want to inspect how well the model conserves water and energy. We can direct all variables describing this conservation to the same stream. The following configuration would direct the daily sums of the water_cycle and energy_cycle variables to cable_water_energy.nc:
streams:
  1:
    file_name: cable_water_energy.nc
    frequency: daily
groups:
  - name: water_cycle
    stream: 1
    aggregation: sum
  - name: energy_cycle
    stream: 1
    aggregation: sum
Saving variables with different frequencies
We may want to investigate the diurnal cycle of a few select variables, while leaving others at a lower frequency. The following configuration would write the monthly means of all the water_cycle variables to cable_water_cycle.nc, and 3-hourly instantaneous values of evaporation and Qle to cable_evaporation_3hrly.nc:
streams:
  1:
    file_name: cable_water_cycle.nc
    frequency: monthly
  2:
    file_name: cable_evaporation_3hrly.nc
    frequency: 3hrly
groups:
  - name: water_cycle
    stream: 1
    aggregation: mean
variables:
  - name: evaporation
    stream: 2
    aggregation: instant
  - name: Qle
    stream: 2
    aggregation: instant
Directing different aggregations to the same stream
We may want to inspect the instantaneous heat and water fluxes, to compare how they change with the model state, while also keeping track of the sums of the rest of the water and energy variables. The following example would write 3-hourly instantaneous values of evaporation, transpiration, Qh and Qle, as well as the 3-hourly sums of all the variables in the water_cycle and energy_cycle groups (which include evaporation, transpiration, Qh and Qle). Because these four variables are written twice to the same stream, we need to avoid NetCDF naming conflicts by providing netcdf_name for either the groups or the respective variables (or both).
streams:
  1:
    file_name: cable_water_diurnal.nc
    frequency: 3hrly
groups:
  - name: water_cycle
    stream: 1
    aggregation: sum
    netcdf_name: "{field_name}_{aggregation}"
  - name: energy_cycle
    stream: 1
    aggregation: sum
    netcdf_name: "{field_name}_{aggregation}"
variables:
  - name: evaporation
    stream: 1
    aggregation: instant
    netcdf_name: evaporation_instant
  - name: transpiration
    stream: 1
    aggregation: instant
    netcdf_name: transpiration_instant
  - name: Qh
    stream: 1
    aggregation: instant
    netcdf_name: Qh_instant
  - name: Qle
    stream: 1
    aggregation: instant
    netcdf_name: Qle_instant
Compressing a stream
Data intended for long term storage can be compressed to reduce disk load. The following configuration would apply shuffle compression and level 4 deflation to the output:
streams:
  1:
    file_name: cable_compressed_output.nc
    frequency: daily
    shuffle: true
    compression_level: 4
modules:
  - name: biogeophysics
    stream: 1
    aggregation: mean
Splitting a stream into separate files
To split the output into individual files, as is required by the conventions of many MIPs, set separate_file_per_variable to true in the stream specification. The following example would write each variable in the water_cycle group to its own file:
streams:
  1:
    file_name: unused_file_name.nc
    frequency: daily
    separate_file_per_variable: true
    netcdf_name: "{field_name}_{aggregation}_{start_date}-{end_date}"
groups:
  - name: water_cycle
    stream: 1
    aggregation: mean
In this case, the file_name is unused, as the individual files are named according to the netcdf_name (with “.nc” appended). The netcdf_name is not explicitly required in this case; if omitted, it will use the default, which is {field_name}. The same rule applies to file names as to NetCDF variable names: there cannot be duplicate generated file names.
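To see which files a template would produce, and how a duplicate could arise, here is a short Python sketch. The function output_files, the variable names and the dates are all hypothetical; it only mirrors the naming rule described above.

```python
from collections import Counter


def output_files(field_names, template, context):
    """Build one file name per variable and reject duplicates."""
    names = [
        template.format(field_name=name, **context) + ".nc"
        for name in field_names
    ]
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate output file names: {duplicates}")
    return names


# Illustrative values for the remaining substitution targets.
context = {"aggregation": "mean", "start_date": "2000-01-01", "end_date": "2000-12-31"}
template = "{field_name}_{aggregation}_{start_date}-{end_date}"
for file_name in output_files(["evaporation", "transpiration"], template, context):
    print(file_name)
# evaporation_mean_2000-01-01-2000-12-31.nc
# transpiration_mean_2000-01-01-2000-12-31.nc
```

A template without {field_name}, for example "{aggregation}", would generate the same file name for every variable and so is rejected by this sketch.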