Community Talks 2: Heather Rumbold (Met Office, UK) - Developing the next standard configuration for standalone JULES using a benchmarking system based on ModelEvaluation.org

Community Talk: Heather Rumbold (Met Office, UK)

Developing the next standard configuration for standalone JULES using a benchmarking system based on ModelEvaluation.org

Abstract

Benchmarking of land surface models (LSMs) involves adopting widely agreed standards for judging performance. Unlike evaluation or validation, benchmarking requires comparison of outputs with pre-defined targets or thresholds, allowing meaningful inter-comparisons of independent models or different configurations of a single model.
The benchmarking approach can be used to determine the suitability of new model configurations by comparison with the performance of previous configurations of the same model. The work developed here assesses new components of JULES for use in future model configurations, using a benchmarking system based on the ModelEvaluation.org web application. The configurations generated have been run through a newly developed benchmarking suite that uses predefined metrics and previous standard configurations as benchmarks.
A workflow has been developed that runs JULES for all the available FLUXNET sites and utilizes existing meteorological driving data from PLUMBER2 (PALS Land Surface Model Benchmarking Evaluation Project 2). The suite then calculates statistical metrics for every site, variable, model configuration and benchmark. Each configuration is ranked relative to the benchmarks, and these rankings are averaged over all statistics and sites to give an average ranking for each variable separately. A final averaging is performed over all variables to give an overall ranking. This method allows a clear comparison to be made of the performance of the new configuration relative to the benchmarks.
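For readers curious about how the rank-then-average step might look in practice, here is a minimal sketch in Python. The DataFrame layout, column names, the assumption that lower metric values are better, and the toy numbers are all illustrative; they are not the actual suite's data model.

```python
# Minimal sketch of the ranking-and-averaging step described in the abstract.
# Assumes one row per (site, variable, statistic, model) combination, where
# "model" covers the new configuration and each benchmark configuration.
import pandas as pd

def overall_ranking(metrics: pd.DataFrame) -> pd.Series:
    """Rank every model within each (site, variable, statistic) group,
    average the ranks over statistics and sites to get a per-variable
    ranking, then average over variables for an overall ranking."""
    metrics = metrics.copy()
    # Rank within each group; here lower metric values are assumed better.
    metrics["rank"] = metrics.groupby(["site", "variable", "statistic"])["value"].rank()

    # Average ranks over statistics and sites -> one score per (variable, model).
    per_variable = metrics.groupby(["variable", "model"])["rank"].mean()

    # Average over variables -> one overall score per model (lower is better).
    return per_variable.groupby("model").mean().sort_values()

# Purely illustrative toy example:
toy = pd.DataFrame({
    "site": ["AU-Tum"] * 4,
    "variable": ["Qle", "Qle", "Qh", "Qh"],
    "statistic": ["RMSE"] * 4,
    "model": ["new_config", "benchmark_A", "new_config", "benchmark_A"],
    "value": [35.0, 42.0, 28.0, 25.0],
})
print(overall_ranking(toy))
```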
This talk will outline the development cycle used to generate the new standard standalone JULES configurations and demonstrate how the benchmarks have been used to assess the suitability of new science code.

Please use this thread for further discussion on this talk.

I think one of your on-site questioners raised this issue: there is a risk of getting the “right” answer for the wrong reason, e.g. compensating effects of different “wrong” model physics components could lead to “better” physics being rejected?

Hi Heather,
Thanks for your talk! Have you assessed whether previous model upgrades (in the pre-benchmarking era) would have been adopted if the benchmarking tests were applied back then?

Thank you for your question Kim_Reid! We haven’t, because the previous model upgrades used for the benchmarks are all global model configurations that have been adapted for the standalone environment. The science that went into these was tested and benchmarked within the global coupled environment and is therefore not necessarily appropriate for the standalone context. However, they still provide a useful benchmark because we are aware of their limitations for standalone simulations. What we have done here is create the first standalone configuration that uses science designed for a land-only context and represents the best cutting-edge land science we can achieve right now. Our first iteration will become the next benchmark, and the performance of all future science can be benchmarked against this.


rolandwarner - Thanks for your question! Yes, there is a risk, and it’s not always easy to identify the compensating effects. This is why we don’t necessarily reject science changes that have poor benchmarking results. An important part of the benchmark analysis is to try to unpick the output from each science change. We ask users to provide their own testing and demonstrate that the change is well understood before it can be accepted for package testing. If this happens at the package testing level then the unpicking can become more tricky, and tickets will have to be rejected if justification can’t be identified. This is our first configuration cycle, so there are lessons to learn and scope to improve the process going forward.