MOM6 Error | btstep: eta has dropped below bathyT

Running MOM6 on Setonix, starting from 1970-01-01, with IAF forcing, I receive this warning/error combo:

WARNING from PE  2726: btstep: eta has dropped below bathyT:  -2.0273254780607175E+02 vs.  -1.6440260515351190E+02 at  -2.0707E+02 -3.2295E+01    377    581
WARNING from PE  2851: MOM_diabatic_aux.F90, applyBoundaryFluxesInOut(): Mass created. x,y,dh=      -2.064E+02     -3.115E+01     -9.905E-05
FATAL from PE  2851: MOM_regridding: adjust_interface_motion() - implied h<0 is larger than roundoff!

The first warning, if I’m interpreting it correctly, suggests that the surface height (eta) is -202 m, below the bathymetry at -164 m.
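If that reading is right, the warning fires where the total water column thickness eta + bathyT goes negative. A toy numpy sketch of that check (all values invented except the pair from the log line):

```python
import numpy as np

# Illustration of the condition behind the btstep warning: eta is the
# free-surface height (m, negative below z=0) and depth is the positive
# bathymetric depth bathyT (m). The column thickness is eta + depth;
# the warning fires wherever that goes negative.
eta = np.array([[-1.5, -202.7], [-0.3, -10.0]])    # free-surface height (m)
depth = np.array([[50.0, 164.4], [20.0, 4000.0]])  # bathyT, positive down (m)

column_thickness = eta + depth
bad = np.argwhere(column_thickness < 0)
for i, j in bad:
    print(f"eta below bathyT at ({i},{j}): "
          f"eta={eta[i, j]:.1f} m vs depth={-depth[i, j]:.1f} m")
```

Only the cell with the logged values (-202.7 vs -164.4) trips the check here.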

The second warning, which originates from the same interior location, shows that heat_content_massout < 0 (in some cases on the order of -10E+01). Is mass being removed because the ocean surface is outcropping on the seafloor?

Then finally, things get too extreme I guess and the Fatal Error occurs.

Things to note:

  1. This instability is originating well away from the boundary in relatively shallow water.
  2. From the mom6.err, it looks like this instability propagates away from this initial source, with amplitudes growing ridiculously (at some points amplitudes exceed 10E+40 m).
  3. The error occurred after running fine for ~100 days.
  4. Baroclinic timestep is already quite short - DT = 300 | DT_THERM = 600

Places I’ve seen this pop up in the past:

  • @ashjbarnes and I have seen this happen when there was no “buffer region” in the unprocessed segments at the boundary for the interpolation of forcing segments - but if I remember correctly, the instabilities originated at the boundaries, whereas here, it’s in the interior.

  • I noticed in the mom6/panan wiki that this same warning/error is described as a result of having NaNs as the _FillValue attribute in forcing fields.

  • In this MOM6-examples GitHub issue, it was suggested this might occur due to “an uninitialised variable”/bug somewhere in the source code.

In all of these cases, the error occurred shortly after initialisation, whereas here it occurs after 100 days so I think it’s unlikely to be associated with any of these.
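For the NaN _FillValue case above, a small hypothetical helper to flag offending forcing variables (the attribute dicts here are invented; in practice they’d come from netCDF4 or xarray):

```python
import math

# Hypothetical check for the panan-wiki failure mode: flag variables
# whose _FillValue (or missing_value) attribute is NaN, which can leak
# NaNs into interpolated forcing fields. `var_attrs` maps variable
# name -> attribute dict.
def nan_fill_values(var_attrs):
    flagged = []
    for name, attrs in var_attrs.items():
        for key in ("_FillValue", "missing_value"):
            value = attrs.get(key)
            if isinstance(value, float) and math.isnan(value):
                flagged.append((name, key))
    return flagged

# Toy example - attributes assumed, not from the actual forcing files:
attrs = {
    "u10": {"_FillValue": float("nan"), "units": "m s-1"},
    "t10": {"_FillValue": -1.0e20, "units": "K"},
}
print(nan_fill_values(attrs))
```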

Any ideas what else could cause this? I should note, the warnings originate in the EAC separation region, which has extremely large surface height fluctuations due to the eddy field. Also, the vertical layers start with a thickness of ~0.5 m at the surface. Could this occur if the surface height fluctuation exceeds the surface layer thickness (|eta| > dz)?
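On that last question, a back-of-envelope sketch (the layer thicknesses here are assumed, not taken from the actual vertical coordinate) of how many nominal surface layers a given free-surface depression would fully empty:

```python
import numpy as np

# Assumed near-surface layer structure: ~0.5 m layers at the top.
dz = np.array([0.5, 0.5, 0.6, 0.8, 1.0, 1.5, 2.0])  # nominal thicknesses (m)
eta = -1.8                                           # surface depression (m)

# Interface depths below z=0 are the cumulative layer thicknesses; a
# layer is fully emptied once the depression reaches its bottom interface.
n_full = int(np.searchsorted(np.cumsum(dz), -eta, side="right"))
print(f"a {-eta} m depression fully empties the top {n_full} nominal layers")
```

So even a modest eddy-driven surface depression swallows several ~0.5 m layers, which the remapping then has to absorb.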


UPDATE: Halving DT and DT_THERM has prevented the error from occurring on the date it previously crashed. However, the timesteps are now DT=150, DT_THERM=300, which has almost doubled the runtime.

The model resolution is 1/30th degree - are there any other factors that can increase stability, other than shortening the above timesteps?

Would you have any luck keeping DT=300 as it was, but also setting DT_THERM=300? It sounds like you weren’t hitting a CFL limit before, so it shouldn’t be necessary to drop the baroclinic timestep. There were issues with panan when DT_THERM was too long.

Thanks for the suggestion Angus. The CFL was pushing up past 0.2 at times, but I’ll give this a go once the current month finishes and let you know.

0.2 is quite small; the hard cap on the CFL number is 0.5.
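For what it’s worth, a rough sketch of where a CFL of ~0.2 could come from at this resolution (all numbers assumed: a ~2 m/s jet at 1/30 degree near 30S, not values from the run):

```python
import math

# Rough advective CFL estimate: CFL = |u| * dt / dx.
# At 1/30 degree near 30S, the zonal spacing is about
# (111 km / 30) * cos(30 deg) ~ 3.2 km.
dx = (111e3 / 30) * math.cos(math.radians(30))  # grid spacing (m), assumed
u = 2.0    # assumed peak EAC jet speed (m/s)
dt = 300.0 # baroclinic timestep DT (s)

cfl = abs(u) * dt / dx
print(f"dx ~ {dx:.0f} m, CFL ~ {cfl:.2f} (hard cap 0.5)")
```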

Ah okay that’s good to know. Well, things are running smoothly for now with both timesteps at 300. Keeping DT at 300 has reduced the runtime by a third compared to having DT at 150. Thanks again!


On this issue, I’m hitting particular months where, with DT=300, DT_THERM=300, I still get the error noted in the title of this post - maybe one in every six months. For those months, I’m then reducing both timesteps to 150. This is extremely costly, and from reading back through the panan-01 work testing DT_THERM, it seems like this shouldn’t be necessary if everything is set up properly (even DT/DT_THERM ~ 300 seems too short, or at least DT_THERM should handle longer timesteps).

So, I’ve been trying to cover all bases to see where else I might be going wrong and have a couple of questions:

  1. Do we need to shift the atmospheric forcing (JRA) back (-360) to the ACCESS-OM2 longitudinal coordinates? I haven’t seen this done anywhere in the MOM6 regional pipeline, but I might have missed it. If it is necessary, then the forcing over the domain isn’t correct - I imagine that could cause error growth/instabilities?

  2. In my most recent run, it failed at initialisation with the following error:

Free surface height dilation attempted to exceed 10-fold.
MOM_ALE, check_grid: negative thickness encountered.
check_grid: i,j=          73          12 h(i,j,:)= -4.825435807313369E-002

For this run, I used a vertical coordinate from a previous run that I know worked fine, although that vertical coordinate has very thin surface layers which could lead to negative thickness I guess. Why would this not have caused errors when used previously though?
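For reference, a sketch of roughly what that 10-fold limit guards against - assuming (my reading, not confirmed from the source) that the dilation factor is the ratio of the actual column (depth + eta) to the resting column; the numbers are purely illustrative:

```python
import numpy as np

# Illustrative columns: deep water, near-drained shallow water, and a
# very shallow column where a modest eta implies a huge dilation.
depth = np.array([500.0, 5.0, 0.6])  # resting column thickness (m)
eta = np.array([1.0, -4.9, 5.8])     # free-surface height (m)

dilation = (depth + eta) / depth
print(np.round(dilation, 3))
```

The point being that very thin columns (or very thin surface layers over them) leave almost no headroom before the cap is hit.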

Any other suggestions on what to check? I might also confirm that the boundary segments are not too dissimilar to the raw slices in ACCESS-OM2.

Just tried running again after shifting longitudes back to align with ACCESS - I ran:

for jra_var_file in *.nc; do
    ncap2 -s "lon = lon - 360" -s "lon_bnds = lon_bnds - 360" "$jra_var_file" "$jra_var_file.tmp"
    mv "$jra_var_file.tmp" "$jra_var_file"
done

But then got the error:

FATAL from PE     0: lookup_es: saturation vapor pressure table overflow, nbad=   3816

So my guess is that this means MOM6 handles the offset ocean/atmosphere longitudes internally somehow?

Yes, I can’t imagine you’d need to manually deal with wrapping the longitudes around; it should be smart enough to handle that (and it’s never been a problem in any of the other runs using JRA).

I’d suggest having a look at the physical location it’s pointing you to in the errors, and maybe output diagnostics every timestep before the crash to see how it’s evolving. That’s likely to give you something of a clue.
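For the every-timestep output, a hypothetical diag_table fragment (file and field names assumed, snapshot interval matching DT=300 s; `.false.` requests snapshots rather than means) might look something like:

```
"prog_crash", 300, "seconds", 1, "days", "time"
"ocean_model", "SSH",   "SSH",   "prog_crash", "all", .false., "none", 2
"ocean_model", "speed", "speed", "prog_crash", "all", .false., "none", 2
```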


EDIT: Maybe this should be a new hive-post?

Sorry, I should’ve updated this - thanks Angus. I checked the locations where the error was happening, and the first four all had some weird missing values in the topography. I generated a new topography using the old approach of make_topog_parallel and deseas.f90, and the model’s now able to run with DT=600, DT_THERM=1200. I’ve run it for a couple of years (1990, 1991) with no fatal errors, although I realised this morning that there were a lot of velocity truncations. I’m hoping these explain the EAC separating much further north than it should, and some very strong zonal velocities on the north and south boundaries (see figures below). I’ve halved both timesteps (although I may have only needed to halve DT?) hoping that things will slowly adjust, but after two months with DT=300, DT_THERM=600, and no velocity truncations, things don’t look much better.
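For anyone hitting this later, a minimal sketch of the kind of topography scan that would have caught those cells (the array values are made up; `topo` stands in for the depth field in the topography file):

```python
import numpy as np

# Scan a topography array for stray missing values and unmasked fill
# values. Convention assumed: positive depths in the ocean, 0 on land.
topo = np.array([
    [0.0,   120.0, 140.0],
    [110.0, np.nan, 150.0],   # a stray missing value in open water
    [100.0, 130.0, -9999.0],  # an unmasked fill value
])

missing = np.argwhere(np.isnan(topo))
fillish = np.argwhere(topo < -1000)  # suspiciously large negative "depths"
print("NaN cells:", missing.tolist())
print("fill-value cells:", fillish.tolist())
```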

The figures show the last month run with the longer timesteps (DT=600, DT_THERM=1200) as the top row, and the most recent run with shorter timesteps (DT=300, DT_THERM=600) on the second row. The first two columns are monthly mean meridional/zonal surface velocities respectively, and the third column is a snapshot of daily surface speed.

There’s a lot of strange behaviour going on - i.e., northern EAC extension flow, ripples at the boundaries, and the northern boundary is completely wrong.

Do you think this is still an artefact from the previous months, or is there something else I should look into at the boundaries, e.g., nudging etc.?

UPDATE | After 4 months of the shorter DT, things didn’t seem to be improving much, so I started again from 1990 with DT=300, DT_THERM=600. Things look a lot more sensible, but I’m about to take a look at how it compares to the global run, as well as a few obs in the area.