New WOMBAT evaluation

Details of our approach to evaluating the WOMBAT model.

The following datasets are used to evaluate the performance of WOMBAT-lite and WOMBAT-mid. Note that not all of them apply to WOMBAT-lite; the exceptions include surface PO4, surface Si(OH)4, the fraction of microphytoplankton, and depth-integrated nitrogen fixation rates.


Looks awesome @pearseb!


Schematic representation of WOMBAT. Tracers and biomass pools are represented by circles of different colours. Components of the ecosystem model, such as nutrients or phytoplankton, are organised within the dashed outlines. WOMBAT-mid includes all tracers and biomass pools (black and red), while WOMBAT-lite only includes those pools outlined in black. (Dinitrogen gas (N2) produced by the denitrification of nitrate (NO3) is not represented.)

To optimise WOMBAT-lite, we ran 256 sensitivity experiments. These experiments explored a range of different values of 19 key input parameters to the biogeochemical model.
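As an illustration only, one common way to spread 256 experiments across a 19-dimensional parameter space is Latin hypercube sampling. The sketch below assumes that scheme and uses placeholder parameter ranges; the sampling method and the ranges actually used for these experiments are not stated in this thread.

```python
import numpy as np

def latin_hypercube(n_samples, bounds, seed=0):
    """Return an (n_samples, n_params) array of parameter values.

    bounds: sequence of (low, high) pairs, one per parameter.
    """
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    n_params = bounds.shape[0]
    # One stratum per sample, with a random offset inside each stratum.
    strata = np.arange(n_samples)[:, None]
    unit = (strata + rng.random((n_samples, n_params))) / n_samples
    # Shuffle the strata independently for each parameter.
    for j in range(n_params):
        unit[:, j] = rng.permutation(unit[:, j])
    low, high = bounds[:, 0], bounds[:, 1]
    return low + unit * (high - low)

# Hypothetical ranges: 19 parameters, each varied between 0.5 and 2.0.
bounds = [(0.5, 2.0)] * 19
params = latin_hypercube(256, bounds)
print(params.shape)
```

Each row of `params` would then define the parameter set for one sensitivity experiment.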

WOMBAT-lite was run for 10 years under the JRA55 repeat year forcing (ryf) initialised from observations of nutrients (WOA23), dissolved Fe (PISCES bgc model), oxygen (WOA23), carbon (GLODAP2) and globally uniform phytoplankton, zooplankton and detritus concentrations.

The skill of these experiments relative to 15 observation-based products is shown below:

The inter-experiment standard deviations in these fields are shown below:

The inter-experiment NORMALISED standard deviations are shown below:
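The inter-experiment spread statistics could be computed along these lines. This is a minimal sketch: it assumes the experiments are stacked along the first axis of an array and that "normalised" means dividing the standard deviation by the absolute ensemble mean at each grid point, which is an assumption rather than something stated here.

```python
import numpy as np

def inter_experiment_spread(fields):
    """fields: array of shape (n_experiments, n_lat, n_lon), NaN where masked.

    Returns the standard deviation across experiments and the same
    quantity divided by the absolute ensemble mean (assumed normalisation).
    """
    std = np.nanstd(fields, axis=0)
    mean = np.nanmean(fields, axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        norm_std = np.where(mean != 0, std / np.abs(mean), np.nan)
    return std, norm_std

# Synthetic stand-in for 256 experiments on a tiny 4 x 5 grid.
rng = np.random.default_rng(0)
fields = rng.random((256, 4, 5)) + 1.0
std, norm_std = inter_experiment_spread(fields)
print(std.shape, norm_std.shape)
```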

Some redundancy in the observation-based products is apparent. NPP is very similar to grazing pressure and thus offers the same information. The same is true of the chlorophyll and POC datasets. We therefore focussed on 8 out of these original 15 datasets to assess model performance (see below).


Based on the above results, we developed a traffic light system of model evaluation.

The approach is quantitative, involving the univariate metrics of correlation coefficient, mean bias and normalised standard deviations for key variables. The results are then categorized into a simple, easy-to-understand “traffic light” framework:

  • Green: Indicates good performance.
  • Yellow: Indicates acceptable but suboptimal performance.
  • Red: Indicates poor performance.

We first define thresholds for each metric that determine the categorization into green, yellow, or red. This requires knowledge of what is reasonable model skill given the observational product we are comparing to.
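A minimal sketch of the three metrics and the traffic light categorisation is given below. The threshold values are hypothetical placeholders, not the ones used here, and the metrics are computed without area weighting for brevity. Each metric is scored by its distance from an ideal value (1 for correlation, 0 for mean bias, 1 for normalised standard deviation).

```python
import numpy as np

def skill_metrics(model, obs):
    """Correlation, mean bias and normalised std over points valid in both."""
    valid = np.isfinite(model) & np.isfinite(obs)
    m, o = model[valid], obs[valid]
    corr = np.corrcoef(m, o)[0, 1]
    bias = np.mean(m - o)
    norm_std = np.std(m) / np.std(o)
    return corr, bias, norm_std

IDEAL = {"corr": 1.0, "bias": 0.0, "norm_std": 1.0}
# Hypothetical (green, yellow) tolerances: maximum distance from the ideal value.
# In practice the bias tolerance depends on the units of each product.
THRESHOLDS = {"corr": (0.3, 0.6), "bias": (0.1, 0.3), "norm_std": (0.25, 0.5)}

def traffic_light(name, value):
    """Return 'green', 'yellow' or 'red' for one metric value."""
    green_tol, yellow_tol = THRESHOLDS[name]
    distance = abs(value - IDEAL[name])
    if distance <= green_tol:
        return "green"
    if distance <= yellow_tol:
        return "yellow"
    return "red"

# Toy example: a model field that is offset from the observations by 0.05.
obs = np.array([1.0, 2.0, 3.0, 4.0, np.nan])
model = obs + 0.05
corr, bias, norm_std = skill_metrics(model, obs)
lights = [traffic_light(k, v) for k, v in
          zip(("corr", "bias", "norm_std"), (corr, bias, norm_std))]
print(lights)
```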

Our thresholds for the 8 key observation-based products:

  • surface dissolved iron
  • oxygen at 250 metres depth
  • surface chlorophyll
  • depth-integrated chlorophyll
  • depth of the chlorophyll maximum
  • depth of the particulate organic carbon maximum
  • depth-integrated NPP (CbPM model)
  • primary limiting nutrient (Browning & Moore 2023 Nature Communications dataset)

As a first pass, we eliminated all but the model realisations that performed optimally for the nutrient limitation (LN) data.

This left 32 experiments, all with excellent agreement to Browning & Moore (2023) but with a range of good to poor performance in the other key observations.


In this figure, the first circle marker is the correlation coefficient, the second is the global mean bias, and the third is the normalised standard deviation relative to the given observation-based dataset. The star marker is the overall performance of that model run relative to that observation-based data product. Here, we are ranking the remaining model realisations from best to worst.
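The filter-then-rank step can be sketched as follows. The aggregation is an assumption made for illustration: each light is mapped to a score (green 2, yellow 1, red 0), the score for a product is the average over its three metrics, and experiments are ranked by the sum over products. The experiment names and lights are made up, and "performed optimally" for LN is taken to mean all lights green.

```python
SCORE = {"green": 2, "yellow": 1, "red": 0}

def overall(lights):
    """Average score of the metric lights for one product (assumed aggregation)."""
    return sum(SCORE[light] for light in lights) / len(lights)

def rank_experiments(results, required="LN"):
    """results: {experiment: {product: (corr_light, bias_light, nstd_light)}}.

    Keep only experiments whose lights for `required` are all green,
    then rank them from best to worst on the remaining products.
    """
    kept = {exp: prods for exp, prods in results.items()
            if all(light == "green" for light in prods[required])}
    totals = {exp: sum(overall(lights) for prod, lights in prods.items()
                       if prod != required)
              for exp, prods in kept.items()}
    return sorted(totals, key=totals.get, reverse=True)

# Made-up results for three experiments and three products.
results = {
    "exp001": {"LN": ("green", "green", "green"),
               "dFe": ("green", "yellow", "green"),
               "NPP": ("yellow", "red", "yellow")},
    "exp002": {"LN": ("green", "green", "green"),
               "dFe": ("green", "green", "green"),
               "NPP": ("yellow", "yellow", "yellow")},
    "exp003": {"LN": ("yellow", "green", "green"),
               "dFe": ("green", "green", "green"),
               "NPP": ("green", "green", "green")},
}
ranking = rank_experiments(results)
print(ranking)
```

Here `exp003` is dropped at the LN filter even though it scores well elsewhere, which mirrors the first-pass elimination described above.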

The following figure shows the best- and worst-performing experiments of WOMBAT-lite relative to the key observation-based products we are using to assess performance.

NOTE: World Ocean Atlas O2 tends to be biased high in the low-oxygen zones. For NPP we use the MODIS-based CbPM model, which gives very high values that all models apparently underestimate.


Wow, this is awesome. My takeaway is the NPP is bad. Iron is better than WOMBAT-old … other things OK??

I think to say that the NPP is bad is to believe that the MODIS CbPM productivity model is “true”. Iron is definitely better.

Thanks @pearseb, this looks like a great way to assess the model.

I was wondering

  • how the model initial condition looks under these metrics
  • whether you’d expect the rank order to change much in a longer model run
  • whether rate of drift relative to initial condition should also be a performance metric?

I noticed some of the ACCESS-ESM1.5 values in Ziehn et al 2020 (PI, quadratic phy mortality, prey capture efficiency) lie outside the ranges in your parameter survey. Is this because the equations or units are different?