Details of our approach to evaluating the WOMBAT model.
The following datasets are used to evaluate the performance of WOMBAT-lite and WOMBAT-mid. Note that not all datasets are applicable to WOMBAT-lite. Those that are not applicable include surface PO4, surface Si(OH)4, the fraction of microphytoplankton, and depth-integrated nitrogen fixation rates.
Schematic representation of WOMBAT. Tracers and biomass pools are represented by circles of different colours. Components of the ecosystem model, such as nutrients or phytoplankton, are organised within the dashed outlines. WOMBAT-mid includes all tracers and biomass pools (black and red), while WOMBAT-lite only includes those pools outlined in black. (Dinitrogen gas (N2) produced by the denitrification of nitrate (NO3) is not represented.)
To optimise WOMBAT-lite, we ran 256 sensitivity experiments. These experiments explored a range of different values of 19 key input parameters to the biogeochemical model.
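As a minimal sketch of how such a parameter survey could be set up, the snippet below draws 256 parameter sets by sampling each parameter uniformly within a range. The sampling scheme, the function name `sample_parameter_sets`, and the three parameter names and ranges are all assumptions made for illustration; the actual survey varied 19 parameters and may have used a different design.

```python
import random


def sample_parameter_sets(ranges, n_experiments, seed=0):
    """Draw n_experiments parameter sets, one uniform draw per parameter."""
    rng = random.Random(seed)  # fixed seed so the survey is reproducible
    return [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        for _ in range(n_experiments)
    ]


# Hypothetical parameter names and ranges, for illustration only.
ranges = {
    "phyto_max_growth_rate": (0.5, 3.0),
    "zoo_max_grazing_rate": (0.5, 4.0),
    "detritus_sinking_speed": (5.0, 50.0),
}

experiments = sample_parameter_sets(ranges, n_experiments=256)
```

Each entry of `experiments` is a dictionary that could be written out as the input parameters for one model run.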
WOMBAT-lite was run for 10 years under JRA55 repeat-year forcing (RYF). It was initialised from observations of nutrients (WOA23), oxygen (WOA23) and carbon (GLODAP2), from dissolved Fe taken from the PISCES biogeochemical model, and from globally uniform phytoplankton, zooplankton and detritus concentrations.
The skill of these experiments relative to 15 observation-based products is shown below:
Some redundancy in the observation-based products is apparent. NPP is very similar to grazing pressure and thus offers the same information. The chlorophyll and POC datasets are similarly redundant with one another. We therefore focussed on 8 of the original 15 datasets to assess model performance (see below).
Based on the above results, we developed a traffic light system of model evaluation.
The approach is quantitative, using three univariate metrics for each key variable: the correlation coefficient, the mean bias and the normalised standard deviation. The results are then categorised into a simple, easy-to-understand “traffic light” framework:
Green: Indicates good performance.
Yellow: Indicates acceptable but suboptimal performance.
Red: Indicates poor performance.
We first define thresholds for each metric that determine the categorisation into green, yellow, or red. This requires knowledge of what is reasonable model skill given the observational product we are comparing to.
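The three metrics and the threshold-based categorisation can be sketched as follows. This is a simplified illustration, not the evaluation code itself: it treats the model and observed fields as flat lists of co-located values without area weighting, assumes neither field is constant, and uses hypothetical threshold values. The function names `skill_metrics` and `traffic_light` are also assumptions.

```python
import math


def skill_metrics(model, obs):
    """Return (correlation, mean bias, normalised standard deviation)."""
    n = len(obs)
    m_mean = sum(model) / n
    o_mean = sum(obs) / n
    m_std = math.sqrt(sum((m - m_mean) ** 2 for m in model) / n)
    o_std = math.sqrt(sum((o - o_mean) ** 2 for o in obs) / n)
    cov = sum((m - m_mean) * (o - o_mean) for m, o in zip(model, obs)) / n
    return cov / (m_std * o_std), m_mean - o_mean, m_std / o_std


def traffic_light(value, target, green_tol, yellow_tol):
    """Classify a metric by its absolute distance from the ideal target."""
    distance = abs(value - target)
    if distance <= green_tol:
        return "green"
    if distance <= yellow_tol:
        return "yellow"
    return "red"


# Hypothetical thresholds: (ideal value, green tolerance, yellow tolerance).
THRESHOLDS = {
    "correlation": (1.0, 0.3, 0.5),
    "mean_bias": (0.0, 0.25, 0.75),
    "normalised_std": (1.0, 0.25, 0.5),
}

# Toy data: the model reproduces the observed pattern with a constant offset.
obs = [1.0, 2.0, 3.0, 4.0]
model = [1.5, 2.5, 3.5, 4.5]

names = ("correlation", "mean_bias", "normalised_std")
lights = {
    name: traffic_light(value, *THRESHOLDS[name])
    for name, value in zip(names, skill_metrics(model, obs))
}
```

In practice the tolerances would differ for each observation-based product, because the mean bias carries the units of the variable and the achievable skill depends on the quality of the product.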
Our thresholds for the 8 key observation-based products:
surface dissolved iron
oxygen at 250 metres depth
surface chlorophyll
depth-integrated chlorophyll
depth of the chlorophyll maximum
depth of the particulate organic carbon maximum
depth-integrated NPP (CbPM model)
primary limiting nutrient (Browning & Moore 2023 Nature Communications dataset)
This left 32 experiments, all in excellent agreement with Browning & Moore (2023) but ranging from good to poor performance against the other key observations.
In this figure, the first circle marker is the correlation coefficient, the second is the global mean bias, and the third is the normalised standard deviation relative to the given observation-based dataset. The star marker is the overall performance of that model run relative to that observation-based data product. Here, we are ranking the remaining model realisations from best to worst.
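One way such a ranking could be produced is sketched below. How the overall (star) performance is actually derived from the three metrics is not stated here, so the scoring rule used (green = 2, yellow = 1, red = 0, summed over all lights) is purely an assumption, as are the experiment names and the function names.

```python
# Assumed scoring rule: higher is better.
SCORES = {"green": 2, "yellow": 1, "red": 0}


def overall_score(lights):
    """Sum the scores of every traffic light awarded to one experiment."""
    return sum(SCORES[light] for light in lights)


def rank_experiments(results):
    """Sort experiment names from best to worst by overall score."""
    return sorted(results, key=lambda name: overall_score(results[name]), reverse=True)


# Hypothetical traffic lights for three experiments.
results = {
    "exp_001": ["green", "yellow", "red"],
    "exp_002": ["green", "green", "yellow"],
    "exp_003": ["red", "red", "yellow"],
}

ranking = rank_experiments(results)
```

Because Python's `sorted` is stable, experiments with equal scores keep their original order, so a tie-breaking rule would need to be added if ties matter.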
The following figure shows the best- and worst-performing experiments of WOMBAT-lite relative to the key observation-based products we are using to assess performance.
NOTE: World Ocean Atlas O2 tends to be biased high in the low-oxygen zones. For NPP we use the MODIS-based CbPM model, which gives very high values that all models appear to underestimate.
I noticed that some of the ACCESS-ESM1.5 values in Ziehn et al. 2020 (PI, quadratic phy mortality, prey capture efficiency) lie outside the ranges in your parameter survey. Is this because the equations or units are different?