Xarray warnings while loading data using cosima cookbook

Hi all,

While loading data for any experiment on gadi using cc.querying.getvar(), I get a bunch of warnings (screenshot also attached) like:

/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/xarray/core/dataset.py:271: UserWarning: The specified chunks separate the stored chunks along dimension "time" starting at index 1. This could degrade performance. Instead, consider rechunking after loading.
  warnings.warn(

From the warning, I understand that the way I am loading my data is not very efficient, but can this be fixed by adding a flag to cc.querying? I also tried to suppress these warnings using the warnings package, but that only works when dask is not in use; otherwise, dask simply overrides the ignore filter.

Has anyone else also come across these warnings? How did you get rid of them – they can be pretty annoying for large datasets.


Yes, definitely not just you. Super annoying! I can’t remember which conda version they started with. A fix would be great!

I don’t use cosima cookbook, so I can’t help with an actual fix for cc.querying().

However, dask controls the logging of warnings through its configuration, so I believe something like this should suppress the warning messages in your case:

import dask
import warnings

# Quieten dask's distributed logging by raising its log level to 'error'
dask.config.set({'logging.distributed': 'error'})
# Ignore all other Python warnings in this process
warnings.filterwarnings("ignore")

There’s some discussion of these warnings buried deep in this PR. To summarise, the cosima-cookbook tries to open the data with the NetCDF chunking of the variable requested, which in many cases ends up dividing the chunks of other variables in the dataset.

The workaround is to open the data using chunks={"time": "auto"} in the getvar call.

The fix is to get rid of the logic in the cosima-cookbook that sets the default chunks to the requested variable’s NetCDF chunks. This should be replaced with chunks={}, which opens each variable with its own NetCDF chunks. Perhaps something for the hackathon 4.0?


Thanks Davide, I tried suppressing these warnings using dask.config and filterwarnings, but I still see them, not sure why. Here’s how I start my dask client:

import dask
from dask.distributed import Client
client = Client()

import warnings
dask.config.set({'logging.distributed': 'error'})
warnings.filterwarnings("ignore")

Hi Dougie, thanks! I just added chunks={} within getvar call and all the warnings go away!! I had a quick look at the PR; I agree that we can attempt to fix it during the hackathon.

PS: Adding chunks={"time": "auto"} didn’t go well with xarray and dask, and it simply failed to load any data. But chunks={} works perfectly.


Hi All. For those interested, I think I’ve found the ‘pythonic’ ‘correct’ way to deal with this. The warning is coming from this function, but the catch is that it’s being executed on the dask workers and, by default, dask workers don’t inherit any logging settings from the client, including logging.captureWarnings. Dask does have a method for workers to forward their logs to the client; however, there is no built-in way to have logging.captureWarnings(True) run on all of the workers. To make this happen, a simple worker plugin needs to be created following the instructions here:

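import logging

from distributed.diagnostics.plugin import WorkerPlugin


class CaptureWarningsPlugin(WorkerPlugin):
    # Runs on each worker when the plugin is registered
    def setup(self, worker):
        logging.captureWarnings(True)

    # Runs on each worker when the plugin is removed or the worker shuts down
    def teardown(self, worker):
        logging.captureWarnings(False)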

And then after initialising the client run:

client.register_worker_plugin(CaptureWarningsPlugin())

This causes the setup method to run immediately on all of the workers. From there, if you’re not handling the py.warnings logger, you’re done: all warnings.warn output from the dask workers will disappear into the aether. If, however, you are setting up custom logging handlers, you can run:

client.forward_logging()

(See here). This has dask workers send all logs back to the client, which now includes calls from warnings.warn, to be dealt with using whatever logging configuration has already been set up.
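
To see what logging.captureWarnings does on its own, here is a minimal standard-library sketch with no dask involved. ListHandler and capture_one_warning are hypothetical names made up for this illustration. The only point is that, once capture is switched on, warnings.warn output is delivered to the py.warnings logger rather than printed:

```python
import logging
import warnings


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps log messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def capture_one_warning(text):
    """Issue one UserWarning and return what reached the 'py.warnings' logger."""
    handler = ListHandler()
    logger = logging.getLogger("py.warnings")
    logger.addHandler(handler)
    try:
        with warnings.catch_warnings():
            # Make sure the warning is not filtered out before it is shown
            warnings.simplefilter("always")
            logging.captureWarnings(True)
            try:
                warnings.warn(text, UserWarning)
            finally:
                logging.captureWarnings(False)
    finally:
        logger.removeHandler(handler)
    return handler.messages


messages = capture_one_warning("chunks split")
```

On a dask cluster the same redirection should happen on each worker once the plugin’s setup method has run, but this sketch does not exercise that part.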

As usual with this kind of thing, it’s way overcomplicated, and @dougiesquire’s solution is just fine on a case-by-case basis. However, this could be implemented directly in cosima_cookbook to silence and/or control these warnings without changing the dataset chunking.


To be clear, these warnings can be gotten rid of altogether by using chunks = {} internally within the cosima-cookbook. This will not change the default chunking of variables opened with getvar.

Currently, by default getvar looks at the netcdf chunking of the variable being requested, opens the entire dataset using that chunking, and then returns the requested variable. If there are other variables in the dataset that have different netcdf chunking to the requested variable, this can produce these warnings. Using chunks = {} opens each variable in the dataset using its own netcdf chunking.
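
A rough way to picture the mismatch described above, as a pure-Python sketch. This is not xarray’s actual check, just a simplified stand-in, and first_split_index and the chunk sizes are hypothetical values chosen for illustration. A requested chunking "separates" stored chunks when one of its interior boundaries does not land on a stored chunk boundary:

```python
from itertools import accumulate


def first_split_index(stored_chunk, requested_chunks):
    """Return the first requested chunk boundary that falls inside a stored
    chunk, or None if every boundary lines up with the stored chunking."""
    # Interior boundaries only: the end of the last chunk is the end of the data
    for boundary in accumulate(requested_chunks[:-1]):
        if boundary % stored_chunk != 0:
            return boundary
    return None


# Hypothetical file with 12 time steps: the requested variable is stored with
# time chunks of 1, while some other variable is stored with time chunks of 12.
requested_variable_chunks = [1] * 12

# Opening the whole dataset with the requested variable's chunking cuts the
# other variable's stored chunk, starting at index 1.
split_other = first_split_index(12, requested_variable_chunks)

# The requested variable itself is opened exactly along its stored chunks.
split_requested = first_split_index(1, requested_variable_chunks)
```

With chunks = {}, each variable is opened along its own stored chunks, so under this picture no boundary would fall inside a stored chunk for any variable.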


I am currently trying to run a notebook that used to run without any issues up until April (?) [I haven’t been working in ARE and with COSIMA for a while].
I am getting the same warnings even though I specify chunks when loading my data. In addition, a memory error comes up, so I can’t load the data:

MemoryError: Task 'getattr-f11b3dc1-3f9f-45bf-ad70-0e3eb059b126' has 8.15 GiB worth of input dependencies, but worker tcp://127.0.0.1:32957 has memory_limit set to 4.50 GiB.

Also, after the xarray warnings come up, more dask warnings show up in the previous cell (see screenshot).

Using chunks={} doesn’t help; I still get the memory error.

UPD: all the errors occurred using the latest conda environment (24.04). Downgrading to 23.10 solved the MemoryError, but the xarray warnings still come up.


Hey @polinash. It’s always a good idea to share the notebook in question so others can check your code.

There is a knowledge-base topic to help with sharing Jupyter notebooks easily:

There is nothing really to share, the warnings come up at the stage of loading data:

import cosima_cookbook as cc

session = cc.database.create_session('/g/data/ik11/databases/cosima_master.db')
iaf_cycle3 = '01deg_jra55v140_iaf_cycle3'

first_year = '2018'
last_year = '2018'
start_time=first_year+'-01-01'
end_time=last_year+'-12-31'
time_slice = slice(start_time, end_time)

lat_slice = slice(-78, -55)
lon_slice = slice(-280, -170)

tr_adelie = cc.querying.getvar(iaf_cycle3, 'passive_adelie', session,
                            frequency='1 monthly',
                            attrs={'cell_methods': 'time: mean'},
                            start_time=start_time, end_time=end_time,
                            chunks={'time':12, 'xt_ocean':100, 'yt_ocean':100})

tr_adelie = tr_adelie.sel(time=time_slice).sel(yt_ocean=lat_slice).sel(xt_ocean=lon_slice)

tr_ross = cc.querying.getvar(iaf_cycle3, 'passive_ross',
                            session,
                            frequency='1 monthly',
                            attrs={'cell_methods': 'time: mean'},
                            start_time=start_time, end_time=end_time,
                            chunks={'time':12, 'xt_ocean':100, 'yt_ocean':100})

tr_ross = tr_ross.sel(time=time_slice).sel(yt_ocean=lat_slice).sel(xt_ocean=lon_slice)

Also, I updated my previous comment: all the errors occurred using the latest conda environment (24.04). Downgrading to 23.10 solved the MemoryError, but the xarray warnings still come up.


But that is even better! So easy to reproduce an error with a short snippet like that. Also the details can sometimes matter a lot.

Hi @polinash. I’m having a look into this one. I’ve only scratched the surface at this point, but as far as I can tell it’s internal to Xarray, though I’m not exactly sure where. I can get past the error by launching the dask client with memory_limit=0. I did this, then tried to open the files directly with xarray.open_mfdataset (i.e. bypassing the cosima cookbook), and the same error occurred.

xarray.open_mfdataset() is pretty straightforward. If parallel=True, it wraps an open_dataset call, then a getattr call (to retrieve the dataset’s _close function), then a preprocess call as dask.delayed functions for each file, and then computes them. It’s those subsequent delayed calls that are the issue: the size of the input dependencies is suspiciously close to the size of the actual datasets being loaded. This seems to be an error, as obviously the entire dataset does not need to be loaded into memory to resolve ds._close, and setting memory_limit=0 confirms that the dask cluster uses nowhere near 8.15 GiB of memory during open_mfdataset(). So I think the next thing to do is to work out exactly what tasks are being created and sent to the workers, and what has changed between analysis3-23.10 and analysis3-24.04. I’ll keep you posted.

Here is the PR for the changes that cause this: Estimate sizes of xarray collections by fjetter · Pull Request #11166 · dask/dask · GitHub. Dask has learned how to calculate the size of Xarray collections, and as such the scheduler now assumes every operation on a Dataset or DataArray will load the whole thing into memory and will block those tasks. There is an Xarray issue with ongoing discussion here: Implement __sizeof__ on objects? · Issue #5764 · pydata/xarray · GitHub. I’m not sure whether I should add to that, or start a new issue on the xarray GitHub or on the dask GitHub. This is a Monday problem though.
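
As a toy model of the behaviour described above (explicitly not dask’s real implementation, and input_dependencies_fit is a hypothetical name), the comparison boils down to whether the estimated size of a task’s inputs exceeds the worker’s memory limit, with a limit of 0 treated as "no limit":

```python
GIB = 2**30


def input_dependencies_fit(dependency_nbytes, memory_limit):
    """Toy check: do a task's estimated inputs fit within one worker's limit?

    A memory_limit of 0 is treated as "no limit".
    """
    if memory_limit == 0:
        return True
    return sum(dependency_nbytes) <= memory_limit


# Numbers taken from the MemoryError quoted earlier in the thread
estimated_inputs = [8.15 * GIB]
fits_default = input_dependencies_fit(estimated_inputs, 4.50 * GIB)
fits_unlimited = input_dependencies_fit(estimated_inputs, 0)
```

The key word is "estimated": if the whole dataset is counted as the input to a cheap getattr call, the task is rejected even though very little memory would actually be used.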


I was reminded that timezones are a thing and that the US workday will be starting soon, so I made the issue: `open_mfdataset` fails when Dask worker memory limit is less than dataset size · Issue #9188 · pydata/xarray · GitHub.