Community Talk: Heather Rumbold (Met Office, UK)
Developing the next standard configuration for standalone JULES using a benchmarking system based on ModelEvaluation.org
Abstract
Benchmarking of land surface models (LSMs) involves adopting widely agreed standards for judging performance. Unlike evaluation or validation, benchmarking requires comparison of outputs with pre-defined targets or thresholds, allowing meaningful inter-comparisons of independent models or different configurations of a single model.
The benchmarking approach can be used to determine the suitability of new model configurations by comparing them with the performance of previous configurations of the same model. The work presented here assesses new components of JULES for use in future model configurations, using a benchmarking system based on the ModelEvaluation.org web application. The configurations generated have been run through a newly developed benchmarking suite that uses predefined metrics and previous standard configurations as benchmarks.

A workflow has been developed that runs JULES for all available FLUXNET sites, using existing meteorological driving data from PLUMBER2 (PALS Land Surface Model Benchmarking Evaluation Project 2). The suite then calculates statistical metrics for every site, variable, model configuration and benchmark. Each configuration is ranked relative to the benchmarks, and these rankings are averaged over all statistics and sites to give an average ranking for each variable. A final averaging over all variables gives an overall ranking. This method allows a clear comparison of the performance of the new configuration relative to the benchmarks.
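The rank-aggregation step described above can be illustrated with a minimal sketch. This is not the Met Office benchmarking suite; the metric values, configuration names ("new_config", "gl7", "gl9") and site/variable labels are hypothetical, and it assumes a lower-is-better metric such as RMSE.

```python
# Illustrative sketch of the rank-aggregation logic described in the abstract.
# Not the actual benchmarking suite; all names and numbers are hypothetical.
from collections import defaultdict

# metrics[(site, variable, statistic)][configuration] = score (lower is better).
# "new_config" is the candidate; the others stand in for benchmark configurations.
metrics = {
    ("AU-Tum", "Qle", "rmse"): {"new_config": 32.1, "gl7": 35.4, "gl9": 33.0},
    ("AU-Tum", "Qh",  "rmse"): {"new_config": 28.7, "gl7": 27.9, "gl9": 30.2},
    ("US-Ha1", "Qle", "rmse"): {"new_config": 41.0, "gl7": 44.2, "gl9": 43.8},
    ("US-Ha1", "Qh",  "rmse"): {"new_config": 25.3, "gl7": 24.1, "gl9": 26.0},
}

def rank_configs(scores):
    """Rank configurations for one (site, variable, statistic): 1 = best score."""
    ordered = sorted(scores, key=scores.get)
    return {config: rank for rank, config in enumerate(ordered, start=1)}

# Average the ranks over all sites and statistics, separately for each variable.
per_variable = defaultdict(lambda: defaultdict(list))
for (site, variable, statistic), scores in metrics.items():
    for config, rank in rank_configs(scores).items():
        per_variable[variable][config].append(rank)

variable_rankings = {
    variable: {config: sum(r) / len(r) for config, r in ranks.items()}
    for variable, ranks in per_variable.items()
}

# Final averaging over variables gives one overall ranking per configuration.
overall = defaultdict(list)
for ranks in variable_rankings.values():
    for config, mean_rank in ranks.items():
        overall[config].append(mean_rank)
overall_ranking = {config: sum(r) / len(r) for config, r in overall.items()}

print(variable_rankings)  # average ranking per variable
print(overall_ranking)    # overall ranking per configuration
```

For metrics where higher is better (e.g. correlation), the sort direction in the ranking step would simply be reversed.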
This talk will outline the development cycle used to generate the new standard standalone JULES configurations and demonstrate how the benchmarks have been used to assess the suitability of new science code.
Please use this thread for further discussion on this talk.
I think one of your on-site questioners raised this issue: there is a risk of getting the “right” answer for the wrong reason, e.g. could compensating effects of different “wrong” model physics components lead to “better” physics being rejected?
Hi Heather,
Thanks for your talk! Have you assessed whether previous model upgrades (in the pre-benchmarking era) would have been adopted if the benchmarking tests had been applied back then?
Thank you for your question Kim_Reid! We haven’t, because the previous model upgrades used for the benchmarks are all global model configurations that have been adapted for the standalone environment. The science that went into these was tested and benchmarked within the global coupled environment and is therefore not necessarily appropriate for the standalone context. However, they still provide a useful benchmark because we are aware of their limitations for standalone simulations. What we have done here is create the first standalone configuration that uses science designed for a land-only context and represents the best cutting-edge land science we can achieve right now. Our first iteration will become the next benchmark, and the performance of all future science can be benchmarked against this.
rolandwarner - Thanks for your question! Yes, there is a risk, and it’s not always easy to identify the compensating effects. This is why we don’t necessarily reject science changes that have poor benchmarking results. An important part of the benchmark analysis is to try to unpick the output from each science change. We ask users to provide their own testing and demonstrate that the change is well understood before it can be accepted for package testing. If compensating effects only emerge at the package-testing level, the unpicking becomes trickier and tickets will have to be rejected if a justification can’t be identified. This is our first configuration cycle, so there are lessons to learn and scope to improve the process going forward.