'NetCDF: Not a valid ID' errors

@dougiesquire I think I found it: “open_mfdataset parallel=True failing with netcdf4 >= 1.6.1” (Issue #7079, pydata/xarray on GitHub). I didn’t find it before as I didn’t look back far enough.

Nice find! To summarise in this thread, it looks like a work-around in netcdf4-python to deal with netcdf-c not being thread safe was removed in 1.6.1. The solution (for now) is to make sure your cluster only uses 1 thread per worker.
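
For anyone wanting to try the work-around, here is a minimal sketch of a one-thread-per-worker setup. It assumes a local cluster created with `dask.distributed.LocalCluster`; the helper names `single_threaded_cluster_kwargs` and `start_single_threaded_client` are made up for illustration and are not part of dask or hh5. The idea is to keep the same number of cores but spread them over single-threaded worker processes, so that no two threads in one process call into netcdf-c at once.

```python
import os


def single_threaded_cluster_kwargs(total_cores=None):
    """Build LocalCluster keyword arguments with one thread per worker.

    `total_cores` is a hypothetical convenience argument; when omitted,
    the number of CPUs reported by the operating system is used.
    """
    cores = total_cores or os.cpu_count() or 1
    return {
        "n_workers": cores,       # one worker process per core
        "threads_per_worker": 1,  # no two threads share a netCDF handle
        "processes": True,        # workers are separate processes
    }


def start_single_threaded_client(total_cores=None):
    """Start a local dask cluster using the settings above."""
    # Imported lazily so the helper above can be used without dask installed.
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(**single_threaded_cluster_kwargs(total_cores))
    return Client(cluster)
```

Calling `client = start_single_threaded_client(4)` before `xarray.open_mfdataset(..., parallel=True)` would then give four single-threaded workers. Memory per worker shrinks accordingly, so chunk sizes may need adjusting.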

Yep, given that it’s impractical to advise all hh5 users to update all of their dask cluster initialisations, I’ll pin netcdf4-python in our analysis3-unstable environment until the issue is resolved.

I’m just here to say WOO HOO! Nice team work @dale.roberts and @dougiesquire.

(Also I marked @dougiesquire’s answer summarising @dale.roberts’ sleuthing as the solution so it shows up at the top of the topic, hope that is ok)

@dale.roberts

People were having issues with this again last week. I see that in ‘conda/analysis3-24.04’ the version is no longer pinned and is now 1.6.5?

conda list | grep netcdf
h5netcdf 1.3.0 pyhd8ed1ab_0 conda-forge
libnetcdf 4.9.2 mpi_openmpi_ha1e512f_14 conda-forge
libpnetcdf 1.13.0 mpi_openmpi_h06d7fe7_1 conda-forge
netcdf-fortran 4.6.1 mpi_openmpi_h0a0d5bf_4 conda-forge
netcdf4 1.6.5 nompi_py310h3aa39b3_102 conda-forge

Hi @anton, apologies for the delayed response, I’m on leave for the school holidays. Unfortunately we had to unpin netcdf4-python as it was holding back several other packages, to the point where the environment could no longer be updated in place and had to be rebuilt from scratch. I didn’t anticipate the issue being unresolved nearly a year and a half later when I said:

I’ll pin netcdf4-python in our analysis3-unstable environment until the issue is resolved.

It looks like there has been little progress on debugging the issue, so I think all we can do is work around it.

I know others above have already pointed this out, but here is a comment from June 2024.

This also appears with open_mfdataset, but appears to only happen with a threading scheduler.

In any case, the reason is that netcdf4 is not thread-safe anymore since netcdf4=1.6.1 (I think). Which means this has unfortunately been around for a while and is known (see e.g. #7079), but nobody had the time / skills / persistence to figure out what exactly causes this – race conditions are just that tricky to debug.
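
To check whether an environment falls in the range discussed here, a small version test can help. This is only a sketch: the function names are hypothetical, and the `(1, 6, 1)` threshold is taken from the issue title quoted in this thread. Nothing in this thread says whether a later release fixes the problem, so the check treats every version from 1.6.1 onward as affected.

```python
import re


def parse_version(version):
    """Turn a version string such as '1.6.5' into a tuple of three ints."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep leading digits, drop 'rc1' etc.
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)  # pad short versions like '1.6' to (1, 6, 0)
    return tuple(parts)


def netcdf4_maybe_affected(version):
    """Return True if the version is 1.6.1 or later (threshold from #7079)."""
    return parse_version(version) >= (1, 6, 1)
```

In a live environment the string would come from `netCDF4.__version__`; for the `conda list` output earlier in this thread, `netcdf4_maybe_affected("1.6.5")` is true.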

I definitely don’t have the skills (or time), but this seems to be a pain point for many of the centres of gravity across the community: XCDAT, xarray_dev, Earthmover, NCAR, etc.

Where to start? Anyone know Jeff Whitaker? I might reach out to Filipe?