Issues with ACCESS-rAM: ocean domains (u-dg767) and NaNs in BICGstab (u-dg768)

Hi,

I’m new to running ACCESS models and would like to report a couple of issues I’ve run into with the recent beta release of ACCESS-rAM3. Both were encountered after moving the domain eastward to cover parts of the Tasman Sea and New Zealand. After working around them, I’ve managed to successfully run a three-day case over the new domain, where the inner domain covers parts of New Caledonia.

Issue 1: RAS (u-dg767) crashing over a purely ocean domain
Moving the inner nest southwards so that it lies purely over the ocean results in errors with the cap_vegfrac, land and soils_hydr ancil tasks.

The relevant parts of the log files seem to be:

job.out for ancil_cap_vegfrac:

 Calculating bi-linear interpolation coeffs
Finding coastal points
Setting coastal values
 WARNING - No source data is available in target domain
UNRESOLVED GRID POINTS IN SOIL DATASET
 Number of points unresolved is                      9
 POINT      78674 LAT   -29.0100 LONG   167.9304
 POINT      78675 LAT   -29.0100 LONG   167.9502
 POINT      79124 LAT   -29.0298 LONG   167.9304
 POINT      79125 LAT   -29.0298 LONG   167.9502
 POINT      79126 LAT   -29.0298 LONG   167.9700
 POINT      79127 LAT   -29.0298 LONG   167.9898
 POINT      79574 LAT   -29.0496 LONG   167.9304
 POINT      79575 LAT   -29.0496 LONG   167.9502
 POINT      79576 LAT   -29.0496 LONG   167.9700
 Search radius                      1
 NO DATA FROM WHICH TO SET UNRESOLVED POINTS
 ***ERROR: No source data available in target domain

job.err for ancil_land, with ancil_soils_hydr pretty much having the same issue:

Loading cylc7/23.09
  Loading requirement: mosrs-setup/1.0.1
Traceback (most recent call last):
  File "/home/565/cr7888/cylc-run/u-dg767/src/ants/bin/ancil_general_regrid.py", line 165, in <module>
    _run_app()
  File "/home/565/cr7888/cylc-run/u-dg767/src/ants/bin/ancil_general_regrid.py", line 152, in _run_app
    main(
  File "/home/565/cr7888/cylc-run/u-dg767/src/ants/bin/ancil_general_regrid.py", line 123, in main
    ants.analysis.make_consistent_with_lsm(
  File "/home/565/cr7888/cylc-run/u-dg767/src/ants/lib/ants/analysis/__init__.py", line 508, in make_consistent_with_lsm
    filler = Filler(cube, target_mask=mask)
  File "/home/565/cr7888/cylc-run/u-dg767/src/ants/lib/ants/analysis/_merge.py", line 835, in __init__
    self._call_spiral_search(source)
  File "/home/565/cr7888/cylc-run/u-dg767/src/ants/lib/ants/analysis/_merge.py", line 890, in _call_spiral_search
    raise ValueError(msg)
ValueError: The provided source doesn't appear to have any valid data.
[FAIL] python_env ancil_general_regrid.py --ants-config ${ANTS_CONFIG} \
[FAIL] ${source} --target-lsm ${target_lsm} -o ${output} # return-code=1
2025-04-11T06:27:13Z CRITICAL - failed/EXIT

This seems somewhat similar to the issues discussed in AUS2200 vegetation fraction ancil creation issues, except that there the problems were associated with land regions such as New Zealand rather than with a lack of land. My guess is that the suite is looking for land data that simply doesn’t exist over a purely ocean domain?

Issue 2: RNS (u-dg768) crashes depending on start date

When I change the start date of the simulation from 2018-01-03 to 2018-01-02, the model crashes during the first forecast cycle at d1000 resolution. No other changes were made to either suite, so I’m really not sure why one works fine and the other doesn’t.

In the job.out log file I see the following error output:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 1
?  Error from routine: EG_BICGSTAB
?  Error message: NaNs in error term in BiCGstab after      1 iterations
?        This is a common point for the model to fail if it
?        has ingested or developed NaNs or infinities
?        elsewhere in the code.
?        See the following URL for more information:
?        https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
?  Error from processor: 216
?  Error number: 22
????????????????????????????????????????????????????????????????????????????????

The UM wiki page linked in the error message says the following about this error:

NaNs in error term in BiCGstab

Why?: This is usually a catch all failure point where a NaN has been generated in a physics scheme (or read in from a corrupt input file) and has subsequently been passed to the dynamics.

How to investigate?: Run the model with output diagnostics set to high ([env]PRINT_STATUS=PrStatus_Diag) as this switches on the summary information for physics increments. This will identify if a NaN has been generated by a physics scheme and allows you to narrow down where the problem is.

I’ve tried following this advice for output diagnostics by going to um -> env -> Runtime Controls -> Atmosphere only in the rose GUI and changing PRINT_STATUS to “Extra diagnostic messages”, but I’m still getting the same message in the log files (job.err is a complete mess, with the same message repeated countless times).
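
Since the wiki also mentions NaNs being read in from a corrupt input file, I’m thinking of scanning the start dump / LBC fields for non-finite values before the run. A rough sketch of what I have in mind (the path is a placeholder, and I’m assuming the files can be read with the mule library):

import numpy as np
import mule

# Placeholder path: point this at the start dump or LBC file for the failing cycle
path = "/path/to/input_dump_or_lbcs"

umfile = mule.load_umfile(path)  # detects dump / fieldsfile / LBC format
for field in umfile.fields:
    if field.lbrel not in (2, 3):  # skip header-only / padding entries
        continue
    data = field.get_data()
    if not np.all(np.isfinite(data)):
        print("Non-finite values in STASH", field.lbuser4,
              "at validity time", field.lbyr, field.lbmon, field.lbdat, field.lbhr)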

Any help for either of these issues would be much appreciated. Thanks!

N.B. I will be overseas for most of May so nothing is super urgent – will be spending more time on this after getting back.

2 Likes

Hi Cory

I’ve created a notebook which scans for NaNs in ancillaries that lie outside of a land-sea mask.

If your problems are caused by moving your domain into a region where the reconfiguration incorrectly generates NaNs over land, this might help you track down the issue.

Have a look here - see if you can run it on your experiment. The notebook was created from the AUS2200 issues you linked to earlier.
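
For reference, the kind of check it does is roughly the sketch below (a simplified illustration rather than the notebook itself; the paths and the use of STASH code 30 for the land-sea mask are assumptions to verify against your files):

import numpy as np
import mule

# Placeholder paths: an ancillary to check and the matching land-sea mask
ancil = mule.AncilFile.from_file("/path/to/qrparm.veg.frac")
mask_ancil = mule.AncilFile.from_file("/path/to/qrparm.mask")

# Assumes the land-sea mask is STASH 30, with 1 = land and 0 = sea
land = next(f for f in mask_ancil.fields if f.lbuser4 == 30).get_data() == 1

for field in ancil.fields:
    data = field.get_data()
    bad = ~np.isfinite(data) & land
    if bad.any():
        print("STASH", field.lbuser4, "has", int(bad.sum()), "non-finite values over land")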

Is this suite using CABLE or JULES to generate land ancillaries?

I’m pretty sure the UM needs at least one land point to run - when I set up aqua planets ages ago you needed to put a land point on the pole to get it to run. That might have changed though.

2 Likes

Thanks both. I tried running the notebook and there weren’t any bad ancil files, so I think it might be as Scott suggested: the model needs at least one land point to run.

Would there be any workarounds to this? I’m still not 100% sure on the exact domains I will be using, but it would be nice if it could work over full ocean domains.

Try pausing the ancil suite after the land mask has been created, then use mule to flick one of the grid points in a corner over to land in both the mask and land-fraction ancil files. Set the orography for that grid point to something small. Continue the workflow and see how far it gets.
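
Something along these lines might do it for the mask (just a sketch; the path is a placeholder and STASH 30 for the land-sea mask is an assumption worth checking against your file). The land fraction and orography files would need a similar tweak at the same point:

import mule
from mule import ArrayDataProvider

# Placeholder path: the land-sea mask ancillary produced by the suite
mask_ancil = mule.AncilFile.from_file("/path/to/qrparm.mask")

for field in mask_ancil.fields:
    if field.lbuser4 == 30:            # assumed STASH code for the land-sea mask
        data = field.get_data()
        data[0, 0] = 1                 # flick a corner grid point over to land
        field.set_data_provider(ArrayDataProvider(data))

mask_ancil.to_file("/path/to/qrparm.mask.one_land_point")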

Actually, it turns out that the domain is not quite completely ocean: there are a few land grid points that I only noticed after zooming in. The vegfrac ancil files seem to be consistent with this too. However, the qrclim.land and qrparm.soil ancils are still the old versions with the domain further north (these correspond to the tasks that failed in the suite).
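
For anyone wanting to reproduce the check, something like this rough sketch lists the land points in the mask so they can be compared with the points in the log (the path is a placeholder; STASH 30 for the mask and the bzy/bdy/bzx/bdx coordinate reconstruction are assumptions, and on a rotated grid these would be rotated coordinates rather than true lat/lon):

import numpy as np
import mule

mask_ancil = mule.AncilFile.from_file("/path/to/d0198/qrparm.mask")
field = next(f for f in mask_ancil.fields if f.lbuser4 == 30)  # assumed land-sea mask STASH code
data = field.get_data()

# Reconstruct approximate lat/lon of each land point from the field headers
rows, cols = np.where(data == 1)
lats = field.bzy + (rows + 1) * field.bdy
lons = field.bzx + (cols + 1) * field.bdx
for lat, lon in zip(lats, lons):
    print(f"LAND POINT  LAT {lat:9.4f}  LONG {lon:9.4f}")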

Edit: it looks like the 9 land points at d0198 resolution correspond to the 9 points listed in the job.out log for ancil_cap_vegfrac (see my first post). Perhaps there’s missing data from the soil dataset being accessed?

Edit 2: Okay, I think this is the same issue as previously discussed for AUS2200, given that the same areas are coming up in Paul’s ancil notebook (turns out I had just forgotten to change the resolution to d0198!). I’m pretty sure this is Norfolk Island. I’ll check out the suggestions there and see if I can get it to work.

For the BICGstab issue, I will try re-running the model with a reduced time step since I’ve heard that can sometimes fix things. If that doesn’t work then I’ll report back here.

2 Likes

Thanks for the update, Corey.
Before you go away, if you post the paths to the two cylc directories in question and make sure they have read permissions set throughout (e.g. by running chmod -R a+r on each top-level directory), then other people can have a look at the log files while you’re away.

1 Like

Thanks Bethan. The cylc directories for both u-dg767 and u-dg768 are in /scratch/k10/cr7888/cylc-run/ and should have read permissions now (let me know if you’d like me to move them to a different project since I don’t think many people are in k10). For both suites the most recent log files should be in the log.202504...Z folder.

As Bethan mentioned, I am away for the next few weeks so will be limited in the amount of work I can do.

Enjoy your Easter everyone :smiley:

1 Like