Payu: Model exited with error code 1; aborting. Stuck at a leap year

Hi all,

This is a follow-up to a problem mentioned in one of the technical topics (Payu potentially miscalculating a leap year).

I am having trouble running one particular year in ACCESS-ESM1.5. I have completed 299 years of a simulation, and it just won’t run for the 300th year. I tried starting again from a different restart (the 298th year): it runs and writes the output for the 299th year, then runs another year (the 300th), but crashes before writing the output for the 300th year.

Did anyone encounter anything similar or have any suggestions?

Thank you.

I will be posting some of the suggestions below from @holger, who has been working diligently on this.


From @holger:

Looking into the configuration, the model crashes in the year 400.
The crash comes from OASIS, with the error message:
oasis_advance_run at 31536000 31536000 ERROR: t_surf
oasis_advance_run ERROR model time beyond namcouple maxtime 31536000 31536000
oasis_advance_run abort by model : 2 proc : 0
This suggests that at least one of the submodels wants to keep going longer than the OASIS coupler.
But the configuration hasn’t been changed; it’s still a single year per run.
The model year when it crashes is the year 400, which (according to the Gregorian rules) would be a leap year. If OASIS doesn’t realise this, it might want to quit a single day early.
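
A quick sanity check of the numbers (a sketch in Python, not part of the model code): 31536000 seconds is exactly 365 days, i.e. the length of a non-leap year, while year 400 is a leap year under the Gregorian rules.

import calendar

print(31536000 // 86400)     # 365 -> the namcouple maxtime covers exactly one non-leap year
print(calendar.isleap(400))  # True -> 400 is divisible by 400, so it is a leap year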

Looks like OASIS is waiting for the data for December 31st from the ice/ocean – but those two submodels might have terminated because they also didn’t realise that it was a leap year.

I notice that work/cice/input_ice.nml seems to think that it’s the year 300, not 400.

So I modified the hack script to change the date for the ice model, but this will only work for the year 400, so the script has to be disabled after this year.
Even if this works (I still don’t know what’s up with the ocean), the issue is likely to re-appear in the year 500, when the ice submodel believes it is in the year 400 (a leap year) while the atmosphere submodel is in the year 500 (a non-leap year).

I’m still working on it. It seems as if MOM knows the year is 400, at least according to the file time_stamp.out.

Made any progress with this @holger? Is there anything we can do to assist?

Hello,

I tried one suggestion from @tiloz, which was successful, and my model has now crossed the 400th year; it is at the 404th year right now.

Below is the suggestion that worked (thanks a lot @tiloz):

I set up a new run as a copy of the existing one, but took the last restart from the current run (399th model year) and restarted the simulation year counter (so, starting again counting from year 101). So now I have new output and restart folders as output/restart001, 002, …, but the model year continues as 400, 401, etc.

I will keep checking my outputs.


In case it is useful in the future, you can tell payu what number to start the run from.

In your case, adding a restart option to your config.yaml file pointing to your restart directory and then using

payu run -i 400

would start the run counter at 400 to match up to your previous run.
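
For example (the restart path below is only a placeholder for wherever the last restart of your previous experiment is archived), the config.yaml entry would look something like

restart: /path/to/previous-experiment/archive/restart399

and then

payu run -i 400

starts the new experiment’s run counter at 400 so its output/restart numbering lines up with the previous run.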

Thanks. I will keep this in mind.


I had another thought: ACCESS-OM2 had issues with MOM5 nominally using a gregorian calendar but actually implementing proleptic_gregorian.

This causes offset issues compared with software that implements gregorian correctly (which is actually a mix of the Gregorian and Julian calendars).

The fix was to use a base time after 1582, e.g. 1900-01-01 works.
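
To illustrate the offset (a small sketch assuming the cftime Python package is available; not part of the model code), the same date gets a different numeric offset from a year-1 base time depending on the calendar:

import cftime

units = "days since 0001-01-01"
for cal in ("standard", "proleptic_gregorian"):
    d = cftime.datetime(1900, 1, 1, calendar=cal)
    print(cal, cftime.date2num(d, units))  # the two offsets should differ by two days

The “standard” (mixed Julian/Gregorian) calendar and proleptic_gregorian disagree for dates before 1582, so a year-1 base time bakes that disagreement into every offset. With a base time after 1582-10-15, e.g. "days since 1900-01-01", both calendars give the same numbers for modern dates, which is why moving the base time avoids the problem.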


Hi,
A quick note to report that I am still running into this problem in a different simulation. I have reached year 400 (a difficult leap/non-leap year to get right, apparently), and the model hangs. I’m going to bypass it for now by “jumping” the calendar to year 2400, which makes it easier to keep track of the continuing model years in my head.
Regards,
Dave

Hi @HIMADRI_SAINI, thanks for bringing this up in the working group meeting. I’d be interested in looking into this error and trying to reproduce it. Would you still have the configurations and restart files for the years just prior to the crash?

Hi @spencerwong. Sorry, I have been looking into my simulations and found that the last restart I have in that simulation is the 395th year. I guess I deleted the next 4 years while transferring or archiving my data. What I can do is run my simulation for the next 5 years, and if/when it gets stuck at the 399th year, I’ll let you know. Does that sound good?

That sounds like a good plan!

Hi @spencerwong . I now have the last working restart and the error in /home/561/hs2549/49ka-ic/

Hi everyone, just adding in an update for this.

Working through some of the simulations with @HIMADRI_SAINI, it looks like we have a good idea of how this crash is occurring. The short version is that payu’s calendar calculations for the cice model are currently a bit opaque and convoluted, which can cause cice to be out of sync with the other model components in specific situations when a restart directory is used across different configurations.

For anyone who is interested I’ll run through some details below, but will note that this is now on the agenda as something to improve in future updates of payu, so I’m hoping the following details will soon become outdated.


Currently payu sets the cice model’s start date by reading an initialisation date from the init_date parameter in the <control-directory>/input_ice.nml namelist file. Payu then adds a “number of seconds previously simulated” to this initialisation date, calculated from the runtime0 and runtime parameters in the identically named <restart-directory>/input_ice.nml file. The resulting date is then used as payu’s start date.

For example in the current pre-industrial configuration, <control-directory>/input_ice.nml has init_date = 00010101 (YYYYMMDD), and <restart-directory>/input_ice.nml has runtime0=3155673600 (seconds) and runtime=0. Adding these all together gives a start date of 01010101.
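
A minimal sketch of that arithmetic (not payu’s actual code, and assuming a proleptic Gregorian calendar, which is what Python’s datetime uses; payu’s own calendar handling may differ):

from datetime import date, timedelta

def cice_start_date(init_date, runtime0, runtime):
    # init_date (YYYYMMDD) comes from the control directory's input_ice.nml;
    # runtime0 and runtime (seconds) come from the restart directory's input_ice.nml.
    init = date(int(init_date[:4]), int(init_date[4:6]), int(init_date[6:8]))
    return init + timedelta(days=(runtime0 + runtime) // 86400)

# Pre-industrial example from above: 3155673600 s = 36524 days = 100 proleptic-Gregorian years
print(cice_start_date("00010101", 3155673600, 0))  # 0101-01-01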

The ocean and atmosphere read their start dates from text files in the restart directory if they exist, and otherwise fall back to settings in the control directory namelists, but don’t combine information between the two.

Payu’s mixing of information from the restart and control directories can go wrong when a single restart is used across two different experiments that happen to have different init_date settings in their ice control directories. The calculated start dates will differ, and cice will be out of sync with the other components in one of the experiments.

One way this can happen is when using the available warm-start.sh scripts to branch off from a CSIRO simulation. These control the ice start date by adjusting the init_date parameter to the desired start date in the control directory, and setting the runtime0 and runtime parameters to 0 in the restart directory. If the resulting restart directory is copied over to another experiment which doesn’t contain the corresponding init_date change, then payu will calculate the wrong start date for cice.
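
As a concrete (hypothetical) illustration: suppose warm-start.sh sets init_date = 03010101 in the branched experiment’s control directory and zeroes runtime0 and runtime in the restart directory, so cice correctly starts at 0301-01-01 there. If that restart directory is later copied into an experiment whose control input_ice.nml still has init_date = 00010101, payu computes a cice start date of 0001-01-01, leaving cice 300 years behind the ocean and atmosphere.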


The above behaviour is quite confusing, and intuitively I think it would make sense for a restart directory to contain all the required timing information. As noted, there’s currently a discussion here about cleaning up these calculations in payu to make them more transparent and to prevent these crashes. Feel free to bring up any ideas or suggestions!


Thank you @spencerwong for this really useful insight into how the cice calendar can get misaligned with the ocean and atmosphere. I am now using this advice to (hopefully) fix some calendar-related timeouts we’ve been having.

