Running MOM6 on Setonix, starting from 1970-01-01, with IAF forcing, I receive this warning/error combo:
WARNING from PE 2726: btstep: eta has dropped below bathyT: -2.0273254780607175E+02 vs. -1.6440260515351190E+02 at -2.0707E+02 -3.2295E+01 377 581
...
WARNING from PE 2851: MOM_diabatic_aux.F90, applyBoundaryFluxesInOut(): Mass created. x,y,dh= -2.064E+02 -3.115E+01 -9.905E-05
...
FATAL from PE 2851: MOM_regridding: adjust_interface_motion() - implied h<0 is larger than roundoff!
The first warning, if I'm interpreting it correctly, says that the surface height (eta) has dropped to about -202 m at a point where the seafloor is at -164 m, i.e. the free surface sits roughly 38 m below the bottom.
The second warning, which originates from the same interior location, shows that heat_content_massout < 0 (in some cases on the order of -10E+01). Is mass being removed because the ocean surface is outcropping on the seafloor?
Then finally, things get too extreme, I guess, and the fatal error occurs.
Things to note:
This instability is originating well away from the boundary in relatively shallow water.
From the mom6.err, it looks like this instability propagates away from its initial source, with amplitudes growing ridiculously (at some points they exceed 10E+40 m); a rough sketch for pulling the warning locations out of the log is included after this list.
The error occurred after running fine for ~100 days.
Baroclinic timestep is already quite short - DT = 300 | DT_THERM = 600
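In case it's useful context, here's a rough sketch (my own, not any official MOM6 tooling) of how the warning locations can be pulled out of mom6.err to map how the instability spreads; the regex just matches the btstep warning format quoted at the top of this post.

```python
import re

# Matches lines like:
# WARNING from PE 2726: btstep: eta has dropped below bathyT: -2.02...E+02 vs. -1.64...E+02 at -2.0707E+02 -3.2295E+01 377 581
pattern = re.compile(
    r"eta has dropped below bathyT:\s*(\S+)\s+vs\.\s+(\S+)\s+at\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)"
)

points = []
with open("mom6.err") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            eta, bathy, lon, lat, i, j = m.groups()
            points.append((float(lon), float(lat), float(eta), int(i), int(j)))

# The first few matches are usually the original source of the instability;
# later ones show where it has spread to.
for lon, lat, eta, i, j in points[:10]:
    print(f"lon={lon:10.4f} lat={lat:9.4f} eta={eta:12.4e} (i,j)=({i},{j})")
```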
Places I've seen this pop up in the past:
@ashjbarnes and I have seen this happen when there was no 'buffer region' in the unprocessed segments at the boundary for the interpolation of forcing segments - but if I remember correctly, those instabilities originated at the boundaries, whereas here it's in the interior.
I noticed in the mom6/panan wiki that this same warning/error is described as a result of having NaNs as the _FillValue attribute in forcing fields (a quick check for this is sketched just after this list).
In this MOM6-examples GitHub issue, where it's attributed to 'an uninitialised variable'/bug somewhere in the source code.
In all of these cases the error occurred shortly after initialisation, whereas here it occurs after ~100 days, so I think it's unlikely to be associated with any of these.
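For the NaN _FillValue case, a quick check could look something like the sketch below; the forcing/*.nc glob is only a placeholder for wherever your forcing and boundary segments actually live.

```python
import glob

import netCDF4
import numpy as np

# Flag any variable whose _FillValue attribute is NaN, which the panan wiki
# notes can trigger exactly this warning/error combination.
for path in sorted(glob.glob("forcing/*.nc")):       # placeholder path
    with netCDF4.Dataset(path) as ds:
        for name, var in ds.variables.items():
            fill = getattr(var, "_FillValue", None)
            if (
                fill is not None
                and np.issubdtype(np.asarray(fill).dtype, np.floating)
                and np.isnan(fill)
            ):
                print(f"{path}: '{name}' has NaN _FillValue")
```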
Any ideas what else could cause this? I should note that the warnings originate in the EAC separation region, which has extremely large surface height fluctuations due to the eddy field. Also, the vertical layers start with a thickness of ~0.5 m at the surface. Could this occur if eta > dz at the surface?
UPDATE: halving DT and DT_THERM has prevented the error happening on the date it was occurring previously. However, the timesteps are now DT=150, DT_THERM=300, which has almost doubled the runtime.
The model resolution is 1/30th degree. Are there any other factors that can increase stability, other than shortening the above timesteps?
Have you had any luck with keeping DT=300 as it was, but also setting DT_THERM=300? It sounds like you weren't hitting a CFL limit before, so it shouldn't be necessary to drop the baroclinic timestep. There were issues with panan when DT_THERM was too long.
Ah okay, that's good to know. Well, things are running smoothly for now with both timesteps at 300. Keeping DT at 300 has reduced the runtime by a third compared to having DT at 150. Thanks again!
On this issue: I'm hitting particular months where, with DT=300, DT_THERM=300, I still get the error noted in the title of this post - maybe one in every six months. For these months I'm then reducing both timesteps to 150. This is extremely costly, and from reading back through the panan-01 work testing DT_THERM, it seems like this shouldn't be necessary if everything is set up properly (even DT/DT_THERM ~ 300 seems too short, or at least DT_THERM should handle longer timesteps).
So, I've been trying to cover all bases to see where else I might be going wrong, and I have a couple of questions:
Do we need to shift the atmospheric forcing (JRA) back (-360) to the ACCESS-OM2 longitudinal coordinates? I haven't seen this done anywhere in the MOM6 regional pipeline, but I might have missed it. If it is necessary, then the forcing over the domain isn't correct - I imagine this could cause error growth/instabilities? (A quick check for this is sketched at the end of this post.)
In my most recent run, it failed at initialisation with the following error:
For this run I used a vertical coordinate from a previous run that I know worked fine, although that vertical coordinate has very thin surface layers, which I guess could lead to negative thicknesses. Why would this not have caused errors when it was used previously, though?
Any other suggestions on what to check? I might also confirm that the boundary segments are not too dissimilar to the raw slices from ACCESS-OM2.
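For the longitude question, a quick way to see whether the JRA fields and the regional domain use the same convention (0..360 vs. shifted); the file names and variable names below are just placeholders for an actual JRA field file and the regional supergrid.

```python
import xarray as xr

# Placeholder file names: any JRA55-do field file and the MOM6 supergrid.
jra = xr.open_dataset("jra_field.nc")
hgrid = xr.open_dataset("ocean_hgrid.nc")

print("JRA lon range: ", float(jra["lon"].min()), "to", float(jra["lon"].max()))
print("model x range: ", float(hgrid["x"].min()), "to", float(hgrid["x"].max()))
```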
Yes, I can't imagine you'd need to manually deal with wrapping around the longitudes; it should be smart enough to handle that (and it's never been a problem in any of the other runs using JRA).
I'd suggest having a look at the physical location the errors are pointing you to, and maybe outputting diagnostics every timestep before the crash to see how it's evolving. That's likely to give you something of a clue.
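Something along these lines could do it, assuming per-timestep output gets written somewhere; the file name, the SSH variable name, and the xh/yh coordinate names are all placeholders to adapt to whatever diagnostics are actually saved.

```python
import xarray as xr

# Placeholder names: a per-timestep diagnostic file and an SSH-like field.
ds = xr.open_dataset("ocean_every_timestep.nc")

# Location taken from the btstep warning quoted at the top of the thread.
pt = ds["SSH"].sel(xh=-207.07, yh=-32.295, method="nearest")

# Watch the last few dozen timesteps before the crash for runaway growth.
print(pt.isel(time=slice(-50, None)).values)
```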
Sorry, I should've updated this - thanks Angus. I checked the locations where the error was happening, and the first four all had some weird missing values in the topography. I generated a new topography using the old approach of make_topog_parallel and deseas.f90, and the model is now able to run with DT=600, DT_THERM=1200. I've run it for a couple of years (1990, 1991) with no fatal errors, although I realised this morning that there were a lot of velocity truncations - I'm hoping these are the cause of the EAC separating much further north than it should, and of some very strong zonal velocities on the north and south boundaries (see figures below). I've halved both timesteps (although I may have only needed to halve DT?), hoping that things will slowly adjust, but after two months with DT=300, DT_THERM=600, and no velocity truncations, things don't look much better.
The figures show the last month run with the longer timesteps (DT=600, DT_THERM=1200) in the top row, and the most recent run with the shorter timesteps (DT=300, DT_THERM=600) in the second row. The first two columns are monthly mean meridional/zonal surface velocities respectively, and the third column is a snapshot of daily surface speed.
There's a lot of strange behaviour going on: the northern EAC extension flow, ripples at the boundaries, and the northern boundary is completely wrong.
Do you think this is still an artefact from the previous months, or is there something else I should look into at the boundaries, e.g. nudging?
UPDATE | After 4 months of the shorter DT, things didn't seem to be improving much, so I started again from 1990 with DT=300, DT_THERM=600. Things look a lot more sensible, but I'm about to take a look at how it compares to the global run, as well as a few obs in the area.
Will do! So far all I've tried is changing DT_THERM in MOM_input, and all I've learned is that the model will still crash if you forget that it's also defined in MOM_override.
I think one of the main things to do here is to check the topog at the points where this first pops up. It's hopefully somewhat obvious why it's happening where it is (or maybe I'm getting this one confused with the 'extreme sfc values' issue), but either way, you could try some gridpoint editing with 'edit_topo.py' in the grid tools to see if that helps.
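If you just want to eyeball (or hand-edit) the flagged cells before reaching for edit_topo.py, something like the sketch below works directly on the topography file; the file name, the depth variable name, and the mapping from the warning's (377, 581) indices are assumptions, so check them against your grid and back the file up first.

```python
import netCDF4

with netCDF4.Dataset("topog.nc", "r+") as ds:   # open read/write; keep a backup
    depth = ds.variables["depth"]               # assumed variable name
    j, i = 580, 376                             # guess at 0-based indices for the flagged cell
    print("neighbourhood of the flagged cell:")
    print(depth[j - 1:j + 2, i - 1:i + 2])
    # depth[j, i] = 50.0                        # uncomment to overwrite with a sensible depth
```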