Issues with ACCESS-rAM: ocean domains (u-dg767) and NaNs in BICGstab (u-dg768)

Hi,

I’m new to running ACCESS models and would like to report a couple of issues I’ve run into with the recent beta release of ACCESS-rAM3. Both were encountered after moving the domain eastward to cover parts of the Tasman Sea and New Zealand. After working around them, I’ve managed to successfully run a three-day case over the new domain, where the inner domain covers parts of New Caledonia.

Issue 1: RAS (u-dg767) crashing over a purely ocean domain
Moving the inner nest southwards so that it lies purely over the ocean results in errors with the cap_vegfrac, land and soils_hydr ancil tasks.

The relevant parts of the log files seem to be:

job.out for ancil_cap_vegfrac:

 Calculating bi-linear interpolation coeffs
Finding coastal points
Setting coastal values
 WARNING - No source data is available in target domain
UNRESOLVED GRID POINTS IN SOIL DATASET
 Number of points unresolved is                      9
 POINT      78674 LAT   -29.0100 LONG   167.9304
 POINT      78675 LAT   -29.0100 LONG   167.9502
 POINT      79124 LAT   -29.0298 LONG   167.9304
 POINT      79125 LAT   -29.0298 LONG   167.9502
 POINT      79126 LAT   -29.0298 LONG   167.9700
 POINT      79127 LAT   -29.0298 LONG   167.9898
 POINT      79574 LAT   -29.0496 LONG   167.9304
 POINT      79575 LAT   -29.0496 LONG   167.9502
 POINT      79576 LAT   -29.0496 LONG   167.9700
 Search radius                      1
 NO DATA FROM WHICH TO SET UNRESOLVED POINTS
 ***ERROR: No source data available in target domain

job.err for ancil_land, with ancil_soils_hydr pretty much having the same issue:

Loading cylc7/23.09
  Loading requirement: mosrs-setup/1.0.1
Traceback (most recent call last):
  File "/home/565/cr7888/cylc-run/u-dg767/src/ants/bin/ancil_general_regrid.py", line 165, in <module>
    _run_app()
  File "/home/565/cr7888/cylc-run/u-dg767/src/ants/bin/ancil_general_regrid.py", line 152, in _run_app
    main(
  File "/home/565/cr7888/cylc-run/u-dg767/src/ants/bin/ancil_general_regrid.py", line 123, in main
    ants.analysis.make_consistent_with_lsm(
  File "/home/565/cr7888/cylc-run/u-dg767/src/ants/lib/ants/analysis/__init__.py", line 508, in make_consistent_with_lsm
    filler = Filler(cube, target_mask=mask)
  File "/home/565/cr7888/cylc-run/u-dg767/src/ants/lib/ants/analysis/_merge.py", line 835, in __init__
    self._call_spiral_search(source)
  File "/home/565/cr7888/cylc-run/u-dg767/src/ants/lib/ants/analysis/_merge.py", line 890, in _call_spiral_search
    raise ValueError(msg)
ValueError: The provided source doesn't appear to have any valid data.
[FAIL] python_env ancil_general_regrid.py --ants-config ${ANTS_CONFIG} \
[FAIL] ${source} --target-lsm ${target_lsm} -o ${output} # return-code=1
2025-04-11T06:27:13Z CRITICAL - failed/EXIT

This seems somewhat similar to the issues discussed in AUS2200 vegetation fraction ancil creation issues, except that there the problems were associated with land regions such as New Zealand rather than with a lack of land. My guess is that the suite is looking for land data that simply doesn’t exist over a purely ocean domain?

Issue 2: RNS (u-dg768) crashes depending on start date

When I change the start date of the simulation from 2018-01-03 to 2018-01-02, the model crashes during the first forecast cycle at d1000 resolution. No other changes were made to either suite, so I’m really not sure why one works fine and the other doesn’t.

In the job.out log file I see the following error output:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 1
?  Error from routine: EG_BICGSTAB
?  Error message: NaNs in error term in BiCGstab after      1 iterations
?        This is a common point for the model to fail if it
?        has ingested or developed NaNs or infinities
?        elsewhere in the code.
?        See the following URL for more information:
?        https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
?  Error from processor: 216
?  Error number: 22
????????????????????????????????????????????????????????????????????????????????

The UM wiki page linked in the error message says the following about this error:

NaNs in error term in BiCGstab

Why?: This is usually a catch all failure point where a NaN has been generated in a physics scheme (or read in from a corrupt input file) and has subsequently been passed to the dynamics.

How to investigate?: Run the model with output diagnostics set to high ([env]PRINT_STATUS=PrStatus_Diag) as this switches on the summary information for physics increments. This will identify if a NaN has been generated by a physics scheme and allows you to narrow down where the problem is.

I’ve tried following this advice for output diagnostics by going to um -> env -> Runtime Controls -> Atmosphere only in the rose GUI and changing PRINT_STATUS to “Extra diagnostic messages”, but I’m still getting the same message in the log files (job.err is a complete mess, with the same message repeated countless times).
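
Since the wiki also mentions NaNs being read in from a corrupt input file, I’m thinking of scanning the start dump / LBC fields for non-finite values before the run. A rough sketch of what I have in mind (the path is a placeholder, and I’m assuming the files can be read with the mule library):

import numpy as np
import mule

# Placeholder path: point this at the start dump or LBC file for the failing cycle
path = "/path/to/input_dump_or_lbcs"

umfile = mule.load_umfile(path)  # detects dump / fieldsfile / LBC format
for field in umfile.fields:
    if field.lbrel not in (2, 3):  # skip header-only / padding entries
        continue
    data = field.get_data()
    if not np.all(np.isfinite(data)):
        print("Non-finite values in STASH", field.lbuser4,
              "at validity time", field.lbyr, field.lbmon, field.lbdat, field.lbhr)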

Any help for either of these issues would be much appreciated. Thanks!

N.B. I will be overseas for most of May so nothing is super urgent – will be spending more time on this after getting back.

2 Likes

Hi Cory

I’ve created a notebook which scans for NaNs in ancillaries that lie outside of a land-sea mask.

If your problems are caused by moving your domain into a region where the reconfiguration incorrectly generates NaNs over land, this might help you track down the issue.

Have a look here - see if you can run it on your experiment. The notebook was created from the AUS2200 issues you linked to earlier.
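
For reference, the kind of check it does is roughly the sketch below (a simplified illustration rather than the notebook itself; the paths and the use of STASH code 30 for the land-sea mask are assumptions to verify against your files):

import numpy as np
import mule

# Placeholder paths: an ancillary to check and the matching land-sea mask
ancil = mule.AncilFile.from_file("/path/to/qrparm.veg.frac")
mask_ancil = mule.AncilFile.from_file("/path/to/qrparm.mask")

# Assumes the land-sea mask is STASH 30, with 1 = land and 0 = sea
land = next(f for f in mask_ancil.fields if f.lbuser4 == 30).get_data() == 1

for field in ancil.fields:
    data = field.get_data()
    bad = ~np.isfinite(data) & land
    if bad.any():
        print("STASH", field.lbuser4, "has", int(bad.sum()), "non-finite values over land")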

Is this suite using CABLE or JULES to generate land ancillaries?

I’m pretty sure the UM needs at least one land point to run - when I set up aqua planets ages ago you needed to put a land point on the pole to get it to run. That might have changed though.

2 Likes

Thanks both. I tried running the notebook and there weren’t any bad ancil files, so I think it might be as Scott suggested: the model needs at least one land point to run.

Would there be any workarounds to this? I’m still not 100% sure on the exact domains I will be using, but it would be nice if it could work over full ocean domains.

Try pausing the ancil suite after the land mask has been created, then use mule to flick one of the grid points in a corner over to land in both the mask and land-fraction ancil files. Set the orography for that grid point to something small. Continue the workflow and see how far it gets.
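
Something along these lines might do it for the mask (just a sketch; the path is a placeholder and STASH 30 for the land-sea mask is an assumption worth checking against your file). The land fraction and orography files would need a similar tweak at the same point:

import mule
from mule import ArrayDataProvider

# Placeholder path: the land-sea mask ancillary produced by the suite
mask_ancil = mule.AncilFile.from_file("/path/to/qrparm.mask")

for field in mask_ancil.fields:
    if field.lbuser4 == 30:            # assumed STASH code for the land-sea mask
        data = field.get_data()
        data[0, 0] = 1                 # flick a corner grid point over to land
        field.set_data_provider(ArrayDataProvider(data))

mask_ancil.to_file("/path/to/qrparm.mask.one_land_point")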

Actually, it turns out that the domain is not quite completely ocean: there are a few land grid points that I only noticed after zooming in. The vegfrac ancil files seem to be consistent with this too. However, the qrclim.land and qrparm.soil ancils are still the old versions with the domain further north (these correspond to the tasks that failed in the suite).
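
For anyone wanting to reproduce the check, something like this rough sketch lists the land points in the mask so they can be compared with the points in the log (the path is a placeholder; STASH 30 for the mask and the bzy/bdy/bzx/bdx coordinate reconstruction are assumptions, and on a rotated grid these would be rotated coordinates rather than true lat/lon):

import numpy as np
import mule

mask_ancil = mule.AncilFile.from_file("/path/to/d0198/qrparm.mask")
field = next(f for f in mask_ancil.fields if f.lbuser4 == 30)  # assumed land-sea mask STASH code
data = field.get_data()

# Reconstruct approximate lat/lon of each land point from the field headers
rows, cols = np.where(data == 1)
lats = field.bzy + (rows + 1) * field.bdy
lons = field.bzx + (cols + 1) * field.bdx
for lat, lon in zip(lats, lons):
    print(f"LAND POINT  LAT {lat:9.4f}  LONG {lon:9.4f}")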

Edit: it looks like the 9 land points at d0198 resolution correspond to the 9 points listed in the job.out log for ancil_cap_vegfrac (see my first post). Perhaps there’s missing data from the soil dataset being accessed?

Edit 2: Okay, I think this is the same issue as previously discussed for AUS2200, given that the same areas are coming up in Paul’s ancil notebook (turns out I had just forgotten to change the resolution to d0198!). I’m pretty sure this is Norfolk Island. I’ll check out the suggestions there and see if I can get it to work.

For the BICGstab issue, I will try re-running the model with a reduced time step since I’ve heard that can sometimes fix things. If that doesn’t work then I’ll report back here.

2 Likes

Thanks for the update, Corey.
Before you go away, if you post the paths to the two cylc directories in question and make sure they have read permissions set throughout (e.g. by running chmod -R a+r on each top-level directory), then other people can have a look at the log files while you’re away.

1 Like

Thanks Bethan. The cylc directories for both u-dg767 and u-dg768 are in /scratch/k10/cr7888/cylc-run/ and should have read permissions now (let me know if you’d like me to move them to a different project since I don’t think many people are in k10). For both suites the most recent log files should be in the log.202504...Z folder.

As Bethan mentioned, I am away for the next few weeks so will be limited in the amount of work I can do.

Enjoy your Easter everyone :smiley:

1 Like