MOM6 HDF crash error

We’ve been getting NetCDF crash errors on our MOM6 PanAntarctic configs since last week. These occur after the model has already been running for some months and has previously opened the named files without any problem. Examples of the errors:

FATAL from PE   547: NetCDF: HDF error: netcdf_read_data_4d: file:./INPUT/RYF.t_10.1990_1991.nc- variable:tas_10m

Most commonly it complains about the t_10 file above, but sometimes also this one:

FATAL from PE   148: Input/output error: netcdf_read_data_4d: file:./INPUT/RYF.runoff_all.1990_1991.panan01_cropped.nc- variable:friver

I’m pretty sure it’s not the files themselves, because we’ve been using them for months now without any problems, and even in these same runs the same files are read without error up to the point of the crash. Any ideas?

Hi Adele, I recently had a similar error running the EAC config on Setonix. It occurred at the end of the run, while preparing the restarts for the next run. I guess yours is reading, whilst mine is writing, but it’s strange either way.

FATAL from PE 0: NETCDF ERROR: NetCDF: HDF error File=RESTART/MOM.res_8.nc Field=CAv

This happened last Friday and I hadn’t seen it before. I re-ran the same time period in 5-day increments and, strangely, it didn’t occur again. I’ve got some other issues which I’m about to post about though :sweat_smile:

Are they reproducible? As in, do the same errors occur again if you re-run?

No, they’re not reproducible.

If it’s not reproducible I’d contact help@nci.org.au with details of the runs (PBS ID) and crash logs. It could be a problem with specific nodes “going funky”.

#payu saves all crash logs by default (all praise Marshall). Let me know if you need assistance to locate them.
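If it helps in the meantime, something like this will list the most recent ones (a rough sketch only: it assumes payu’s default layout of an `error_logs` directory under the experiment’s archive, which may differ for your setup):

```python
from pathlib import Path

# Assumed payu layout: crash logs collected under <experiment>/archive/error_logs
log_dir = Path("archive/error_logs")

# Print the ten most recently modified crash logs
logs = sorted(log_dir.glob("*"), key=lambda p: p.stat().st_mtime, reverse=True)
for log in logs[:10]:
    print(log)
```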

Response from NCI help (Andrey):

I can think of 3 possible reasons for the problem you are seeing.

  1. Our filesystems, as all filesystems in the world, may sometimes experience a hiccup. This will force I/O operations to be delayed. From the user’s point of view, the I/O operation will go to a wait state and then continue. Depending on the exact I/O library you are using, this may generate an error.

  2. There is a bug in the program that sometimes causes an I/O error.

  3. The quota in /scratch/x77 may be close to its limit, so the link may fail to be created.

Reason 1 looks unlikely, as a filesystem hiccup would also affect all the other files your program is using in the /g/data/ua8 directory. You can also try making an actual copy of the files somewhere in /scratch to avoid using /g/data/ua8 and see if this helps with the runs.

Reason 3 also looks unlikely, but this one is more difficult to detect during the run. To avoid it, I would strongly recommend cleaning up the /scratch/x77 space as much as possible.

Reason 2 looks most realistic to me.

Any thoughts @angus-g? This has been happening often (every second or third run) since last week, so we have been unable to begin the PanAntarctic production runs yet.

You could try replying with the logic “this has only happened in the last couple of weeks, and we have not recompiled the program, so could it be a persistent error with some of your nodes?”

Was there any indication he’d looked into whether there were any commonalities in the nodes used? Or that he’d checked PBS or system logs for the times of the runs?

Ben M is good at tracking down this sort of stuff. You could try requesting he look into it.

I agree with @Aidan there: given that it seems to crash pretty randomly (no correlation with date, etc.), only recently, and on a specific file, it seems more like a filesystem issue to me. Of course, it’s not ideal that we don’t get any traceback output from the crashes, so it might be good to have an equivalent executable available that doesn’t strip that info…

So if it’s a filesystem issue, we might be able to avoid it (and test whether that’s the issue) by copying the forcing data onto /scratch instead of just linking it there from /g/data/?
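Something like this could do the swap, if the files in INPUT/ are symlinks into /g/data (just a sketch: the directory layout and paths here are assumptions, not our actual setup):

```python
import shutil
from pathlib import Path

# Assumed layout: the run's INPUT directory contains symlinks pointing into /g/data
input_dir = Path("./INPUT")

for entry in input_dir.iterdir():
    if not entry.is_symlink():
        continue
    target = entry.resolve()
    # Only replace links that point into /g/data
    if not str(target).startswith("/g/data"):
        continue
    entry.unlink()                # remove the symlink
    shutil.copy2(target, entry)   # put a real copy in its place (on /scratch)
    print(f"copied {target} -> {entry}")
```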

Also ua8 is on gdata3, which is being decommissioned next week.

Yeah, I think that’s worth trying at least!

I’m currently tracking down I/O errors on gdata3 as well, though I think my situation is a bit different. Andrey’s response is typical Andrey: it’s your problem. This is in spite of the fact that this Input/output error comes straight out of the kernel, and a user can at best handle the error. I don’t think you’ll get much more help out of NCI, especially given these files are on gdata3. They’re not going to want to spend a lot of time debugging a filesystem that’s going away in a week.

I’m not sure how active these files are, but is there any chance they were being written to while you were reading them? That’s about the only thing I can think of that would generate this kind of error that a user can control.

If they weren’t, my guess is that the migration would be playing a part in these errors. NCI’s usual procedure for a full filesystem migration is to start running an MPI-enabled version of rsync in a loop a couple of weeks in advance, until the actual filesystem downtime, at which point they do the final sync and reconfigure the project directories so that you get the new filesystem in place of the old one. This MPI-enabled rsync puts quite a bit of stress on the filesystem, and it could be that gdata3 just isn’t coping with the extra load at the moment. Syncing your data to scratch in advance is a good idea. It may just be something we need to put up with until the migration is complete. If these errors continue after the migration, then NCI definitely needs to be pressed on it.


Yes, the response from NCI is not super useful!

Ah, sorry, turns out I was confused. ua8 is actually on gdata1a, so this won’t be related to the gdata3 migration.

I’m currently testing running with the data on scratch instead of gdata.

@angus-g do you think your suggestion of making an equivalent executable with more crash traceback output is possible? As it stands at the moment, we’re going to waste a whole lot of hours if we run the 1/40th panan now and it keeps crashing due to this issue.

You could try /scratch/x77/ahg157/MOM6, which I just built (maybe copy it somewhere less ephemeral than that). I think it should be built from the exact same code as the executable you’re currently using, but I haven’t tested it.


Update:

The runs with the forcing data on /scratch have been running fine for a couple of days now, so it definitely looks like it’s a problem reading forcing data from /g/data. Is keeping a copy of the forcing data on /scratch a long-term solution? I guess it just requires a bit of user input/knowledge to re-copy the data there if it gets quarantined after not being used for 100 days.
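For what it’s worth, something like this could flag forcing files that are getting close to that limit (a sketch only: the path is illustrative, and whether the quarantine keys off access or modification time is an assumption on my part):

```python
import time
from pathlib import Path

# Illustrative location of the forcing data copied onto /scratch
forcing_dir = Path("/scratch/x77/forcing")

cutoff_days = 90            # warn a bit before the (assumed) 100-day limit
now = time.time()

for f in forcing_dir.rglob("*.nc"):
    age_days = (now - f.stat().st_atime) / 86400   # days since last access
    if age_days > cutoff_days:
        print(f"{f} last accessed {age_days:.0f} days ago")
```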

I also ran Angus’s new executable for the config with the forcing data on /g/data/ and have an example of a crash here:
/home/157/akm157/mom6/panan-01-crash-traceback/
But I’m not entirely sure what info I’m supposed to be looking for or getting out of the traceback.