'NetCDF: Not a valid ID' errors

Over the last week or so I’ve been getting a lot of these errors:

RuntimeError: NetCDF: Not a valid ID

while running Jupyter notebooks and using the COSIMA cookbook for loading data. They’re somewhat random: sometimes the error is just a warning and the calculation still completes fine, other times it’s fatal, and I can’t reproduce the error reliably. There is an example script with both warning and fatal types of these errors here.

I’m pretty sure this exact script was working fine a couple of weeks ago and I don’t think I’ve changed much since then.

I could well be wrong, but this is possibly due to multiple threads trying to read data from the same file at the same time.

Two things you could try:

  • Set up your dask cluster to have one thread per process. In your case:
    client = Client(n_workers=28)
  • Specify the chunking when you first query the data, e.g. one chunk per file, though this could result in very large chunks. You can pass a chunks dictionary to cc.querying.getvar() (see the sketch below).

Note that by default, I think cosima-cookbook queries return dask-backed xarray objects chunked in the same way as the netCDF files. However, netCDF chunk sizes are often much smaller than well-sized dask chunks.
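
For concreteness, here’s a minimal sketch of both options. It assumes a 28-core node and an ACCESS-OM2-style variable; the experiment name, dimension names and chunk sizes are placeholders only, and the chunks dictionary is simply passed through to xarray by cc.querying.getvar():

    from dask.distributed import Client
    import cosima_cookbook as cc

    # Option 1: one single-threaded worker per core (28 cores assumed here)
    client = Client(n_workers=28)

    # Option 2: set the dask chunking explicitly at query time instead of
    # inheriting the (often much smaller) netCDF chunk sizes.
    session = cc.database.create_session()
    temp = cc.querying.getvar(
        '01deg_jra55v13_ryf9091',  # placeholder experiment name
        'temp',                    # placeholder variable name
        session,
        chunks={'time': 1, 'st_ocean': 7, 'yt_ocean': 300, 'xt_ocean': 3600},
    )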

Thanks @dougiesquire. I agree it does sound like something like that. @angus-g @micael , has anything changed in the COSIMA cookbook recently that might be causing this behaviour?

Nope, more likely an xarray change. Kind of looks to me like something is trying to cache the files, but a worker dies so the cache is stale and the file handle is no longer valid? Can’t really say for sure.

I did a bit more testing, and found it only occurs with analysis3-22.10 and later. With analysis3-22.07 and earlier I don’t get these errors. It seems like this is going to be quite problematic going forward for many of our current analyses?

Hi @adele157.

The last analysis3-22.10 update was on the 24th of January; quite a few packages that may or may not be relevant to this error were updated, including xarray, dask, distributed, netcdf4-python and hdf5. I’ve been scanning through issues in a few of those repos and there isn’t anything that really matches this. We’ve had another netCDF issue come up in the last few days over on CWSHELP that may also be related. Have you tried @dougiesquire’s idea? I had a go at reproducing this error and found that everything worked as expected when I used client = Client(n_workers=28). This is problematic though: the default should just work. I’ll see if reverting xarray in a test environment helps.

I’ve also been seeing these errors a lot.

In the COSIMA recipes examples we’ve been removing the n_workers keyword from the dask client so they work regardless of how many cores the users fire up. How can we remedy this?

I thought the default for Client was processes=True, in which case I’m not sure @dougiesquire’s suggestion is doing anything beyond the default. Alternatively there’s threads_per_worker which you could set to 1 for the same effect?

That’s right, but I think that by default dask will choose the number of processes to be ~sqrt(cores). E.g. in @adele157’s linked notebook the cluster comprises 7 workers, each with 4 threads. I don’t use LocalCluster that much - is this maybe a new default behaviour?
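
If it’s useful, a quick way to check what the defaults are actually giving you (a sketch; the numbers in the comments are just what I’d expect on a 28-core node):

    from dask.distributed import Client

    client = Client()              # default LocalCluster
    print(client.nthreads())       # maps each worker address to its thread count,
                                   # e.g. 7 workers with 4 threads each on 28 cores
    print(len(client.nthreads()))  # number of worker processes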

@angus-g in every test I did, if I didn’t specify n_workers it gave me 7 processes, each with 4 threads, on a full Broadwell node. Reverting xarray to 2022.12.0 did seem to remedy the problem with @adele157’s example notebook without having to specify n_workers. I might revert xarray in analysis3-unstable while we investigate further. It seems strange to me, given the huge userbase of dask + xarray + netcdf, that nothing along these lines has been reported anywhere else.

People tend to think that it’s something they did wrong and that the software is perfect.

(That might be part of the explanation for why we haven’t seen it reported elsewhere.)

Or it may be that it also involves some weird coincidence with NCI or with the way we saved ACCESS-OM2 nc files.

Yes I could have reported it a week ago, but wanted to check I wasn’t doing anything stupid first…

The errors go away when I use n_workers=28 as @dougiesquire suggested.

Oh, I think I saw this type of error when testing the notebooks from the COSIMA recipes. So maybe some of the notebooks that I marked as not working are actually fine. I’ll rerun them and double-check.

Nevertheless, this might be problematic for the automatic tests, as they use analysis3 and we might get random failures.

Interesting! It looks like it comes from nprocesses_nthreads(). Well, for an approach that stays general across CPU counts we could just do client = Client(threads_per_worker=1).
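
For the recipes, a sketch of what that would look like (dask still chooses the worker count for whatever node the user fires up; only the thread count per worker is pinned):

    from dask.distributed import Client

    # Let dask pick the number of workers for the available cores, but force a
    # single thread per worker to avoid concurrent reads of the same file.
    client = Client(threads_per_worker=1)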

xarray has now been reverted to 2022.12.0 in analysis3-unstable. I’ve run through @adele157’s example notebook a few times now with no errors. Whilst it’s not a fix, it’ll hopefully get things working until an actual fix can be put in place.

Thanks @dale.roberts! Does that mean analysis3-22.10 is still broken and should be avoided by COSIMA peeps?

analysis3-unstable is analysis3-22.10 so it should be OK.

I’m not convinced that this issue is related to the xarray version. I created two conda environments that differ only in their version of xarray: one with v2022.12.0 and one with v2023.01.0. Adele’s warnings/errors are intermittently produced using both.

@adele157 does the reversion of the analysis3 environment seem to have fixed your problem? @dale.roberts did any other dependencies change when you reverted xarray?

Here’s the diff between the environment installed on the 24th of January and the update on the 3rd of February:

67c67
<   - black=22.12.0=py39hf3d152e_0
---
>   - black=23.1.0=py39hf3d152e_0
138c138
<   - conda-build=3.23.3=py39hf3d152e_0
---
>   - conda-build=3.23.3=py39hf3d152e_1
156,157c156,157
<   - dask=2023.1.0=pyhd8ed1ab_0
<   - dask-core=2023.1.0=pyhd8ed1ab_0
---
>   - dask=2023.1.1=pyhd8ed1ab_0
>   - dask-core=2023.1.1=pyhd8ed1ab_0
167c167
<   - datashader=0.14.3=pyh1a96a4e_0
---
>   - datashader=0.14.4=pyh1a96a4e_0
179c179
<   - distributed=2023.1.0=pyhd8ed1ab_0
---
>   - distributed=2023.1.1=pyhd8ed1ab_0
190c190
<   - eccodes=2.27.1=h7f7619e_0
---
>   - eccodes=2.28.0=h7513371_0
213c213
<   - ffmpeg=5.1.2=gpl_h8dda1f0_105
---
>   - ffmpeg=5.1.2=gpl_h8dda1f0_106
261c261
<   - gh=2.21.2=ha8f183a_0
---
>   - gh=2.22.1=ha8f183a_0
291c291
<   - greenlet=2.0.1=py39h5a03fae_0
---
>   - greenlet=2.0.2=py39h227be39_0
344c344
<   - iris=3.0.4=py39hf3d152e_0
---
>   - iris=3.4.0=pyhd8ed1ab_0
493c493
<   - libva=2.16.0=h166bdaf_0
---
>   - libva=2.17.0=h0b41bf4_0
521c521
<   - mapclassify=2.4.3=pyhd8ed1ab_0
---
>   - mapclassify=2.5.0=pyhd8ed1ab_1
590c590
<   - netcdf4=1.6.2=nompi_py39hfaa66c4_100
---
>   - netcdf4=1.6.0=nompi_py39h94a714e_103
627c627
<   - packaging=21.3=pyhd8ed1ab_0
---
>   - packaging=23.0=pyhd8ed1ab_0
632c632
<   - panel=0.14.2=pyhd8ed1ab_0
---
>   - panel=0.14.3=pyhd8ed1ab_0
655c655
<   - pip=22.3.1=pyhd8ed1ab_0
---
>   - pip=23.0=pyhd8ed1ab_0
659c659
<   - plotly=5.12.0=pyhd8ed1ab_1
---
>   - plotly=5.13.0=pyhd8ed1ab_0
719c719
<   - pykrige=1.6.1=py39h3811e60_1
---
>   - pykrige=1.7.0=py39hb9d737c_1
728,729c728,729
<   - pyqt=5.15.7=py39h18e9c17_2
<   - pyqt5-sip=12.11.0=py39h5a03fae_2
---
>   - pyqt=5.15.7=py39h5c7b992_3
>   - pyqt5-sip=12.11.0=py39h227be39_3
732c732
<   - pysal=2.7.0=pyhd8ed1ab_0
---
>   - pysal=23.1=pyhd8ed1ab_0
744c744
<   - python-eccodes=1.4.2=py39h2ae25f5_1
---
>   - python-eccodes=1.5.1=py39h389d5f1_0
804,805c804,805
<   - scikit-learn=1.2.0=py39h86b2a18_0
<   - scipy=1.8.1=py39hddc5342_3
---
>   - scikit-learn=1.2.1=py39h86b2a18_0
>   - scipy=1.10.0=py39h7360e5f_0
834c834
<   - spaghetti=1.6.10=pyhd8ed1ab_0
---
>   - spaghetti=1.7.2=pyhd8ed1ab_0
858c858
<   - spopt=0.4.1=pyhd8ed1ab_0
---
>   - spopt=0.5.0=pyhd8ed1ab_0
914c914
<   - ujson=5.5.0=py39h5a03fae_1
---
>   - ujson=5.7.0=py39h227be39_0
945c945
<   - xarray=2023.1.0=pyhd8ed1ab_0
---
>   - xarray=2022.12.0=pyhd8ed1ab_0
955c955
<   - xclim=0.40.0=pyhd8ed1ab_0
---
>   - xclim=0.37.0=pyhd8ed1ab_0

So of the relevant-looking stuff there, netcdf4 was reverted to an earlier version, and dask was updated. I’ve run @adele157’s case a few times in the current analysis environment and I didn’t get those errors. I’ve kept a copy of the previous environment, and can confirm those errors are still occurring there.

Yes, from what I can tell the reversion has fixed it. I haven’t had any more errors since then.