While loading data for any experiment on gadi using cc.querying(), I get a bunch of warnings (screenshot also attached) like:

/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/python3.10/site-packages/xarray/core/dataset.py:271: UserWarning: The specified chunks separate the stored chunks along dimension "time" starting at index 1. This could degrade performance. Instead, consider rechunking after loading.
  warnings.warn(
From the warning, I understand that the way I am loading my data is not very efficient, but can this be fixed by adding a flag to cc.querying? I also tried to suppress these warnings using the warnings package, but that only works when dask is not in use; otherwise dask simply overrides the ignore filter.
Has anyone else come across these warnings? How did you get rid of them? They can be pretty annoying for large datasets.
I don't use the cosima cookbook, so I cannot help you with an actual fix for cc.querying().
However, dask controls the logging of warnings through its configuration, so I believe something like this should work to suppress the warning messages in your case:
import dask
import warnings

# Disable dask warnings by raising its logging level to 'error'
dask.config.set({'logging.distributed': 'error'})

# Disable all other warnings
warnings.filterwarnings("ignore")
There's some discussion of these warnings buried deep in this PR. To summarise, the cosima-cookbook tries to open the data with the NetCDF chunking of the requested variable, which in many cases ends up dividing the chunks of other variables in the dataset.
The workaround is to open the data using chunks={"time": "auto"} in the getvar call.
The fix is to get rid of the logic in the cosima-cookbook that sets the default chunks to the requested variable's NetCDF chunks. This should be replaced with chunks={}, which opens each variable with its own NetCDF chunks. Perhaps something for the hackathon 4.0?
Thanks Davide. I tried suppressing these warnings using dask.config and filterwarnings, but it still shows me those warnings; I'm not sure why. Here's how I start my dask client:
import dask
from dask.distributed import Client
client = Client()
Hi Dougie, thanks! I just added chunks={} within the getvar call and all the warnings went away! I had a quick look at the PR; I agree that we can attempt to fix it during the hackathon.
PS: Adding chunks={"time": "auto"} didn't go well with xarray and dask, and it simply failed to load any data. But chunks={} works perfectly.
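For anyone following along, the call looks roughly like this (the experiment and variable names are placeholders, and I'm assuming getvar forwards keyword arguments such as chunks through to xarray):

import cosima_cookbook as cc

session = cc.database.create_session()

# placeholder experiment/variable names; chunks={} keeps each variable's own NetCDF chunking
temp = cc.querying.getvar("my_experiment", "temp", session, chunks={})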
Hi All. For those interested, I think Iāve found the āpythonicā ācorrectā way to deal with this. So the warning is coming from this function, but the catch is itās being executed on the dask workers, and by default, dask workers donāt inherit any logging settings from the client, including logging.captureWarnings. Dask does have a method to have the workers forward their logs to the client to deal with, however, there is no in-built way to have logging.captureWarnings(True) run on all of the workers. So to have this happen, a simple worker plugin needs to be created following the instructions here:
import logging
from distributed.diagnostics.plugin import WorkerPlugin

class CaptureWarningsPlugin(WorkerPlugin):
    def setup(self, worker):
        # route warnings.warn output through the 'py.warnings' logger on each worker
        logging.captureWarnings(True)

    def teardown(self, worker):
        logging.captureWarnings(False)
This causes the setup method to run immediately on all of the workers. From there, if you're not handling the py.warnings logger, you're done: all warnings.warn output from the dask workers will disappear into the aether. If, however, you are setting up custom logging handlers, you can run:
client.forward_logging()
(See here). This has the dask workers send all their logs, which now include the warnings.warn output, back to the client, to be dealt with using whatever logging configuration has already been set up.
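As a rough sketch (continuing from the plugin snippet above; the file handler is only an illustrative choice), the client side could then look like:

import logging

# workers now forward their log records to the client
client.forward_logging()

# captured warnings arrive on the 'py.warnings' logger; handle them however you like,
# e.g. write them to a file instead of printing them
logging.getLogger("py.warnings").addHandler(logging.FileHandler("dask_warnings.log"))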
As usual with this kind of thing it's way overcomplicated, and @dougiesquire's solution is just fine on a case-by-case basis. However, this could be implemented directly in cosima_cookbook to silence and/or control these warnings without changing the dataset chunking.
To be clear, these warnings can be removed altogether by using chunks={} internally within the cosima-cookbook. This will not change the default chunking of variables opened with getvar.
Currently, by default getvar looks at the netcdf chunking of the variable being requested, opens the entire dataset using that chunking, and then returns the requested variable. If there are other variables in the dataset that have different netcdf chunking to the requested variable, this can produce these warnings. Using chunks = {} opens each variable in the dataset using its own netcdf chunking.
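In plain xarray terms, the difference looks roughly like this (the file path and chunk sizes are hypothetical, just to illustrate the two behaviours):

import xarray as xr

# roughly what getvar currently does: impose the requested variable's
# netCDF chunking (hypothetical sizes here) on every variable in the dataset
ds = xr.open_mfdataset("output*/ocean/ocean.nc", chunks={"time": 1, "yt_ocean": 300, "xt_ocean": 400})

# the proposed default: chunks={} opens each variable with its own netCDF chunking
ds = xr.open_mfdataset("output*/ocean/ocean.nc", chunks={})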
I am currently trying to run a notebook which I used to run without any issues up until April (?) [I haven't been working in ARE and with COSIMA for a while].
I am getting the same warnings even though I add chunks when loading my data. In addition, a memory error comes up, so I can't load the data:
MemoryError: Task 'getattr-f11b3dc1-3f9f-45bf-ad70-0e3eb059b126' has 8.15 GiB worth of input dependencies, but worker tcp://127.0.0.1:32957 has memory_limit set to 4.50 GiB.
Also, after the xarray warnings come up, more dask warnings show up in the previous cell (see screenshot).
Also, I updated my previous comment: all the errors occurred using the latest conda environment, analysis3-24.04. Downgrading to analysis3-23.10 solved the MemoryError, but the xarray warnings still come up.
Aidan (Aidan Heerdegen, ACCESS-NRI Release Team Lead):
But that is even better! It's so much easier to reproduce an error with a short snippet like that. Also, the details can sometimes matter a lot.
Hi @polinash. I'm having a look into this one. I've only scratched the surface at this point, but as far as I can tell it's internal to Xarray, though I'm not exactly sure where. I can get past the error by launching the dask client with memory_limit=0. I did this, then tried to open the files directly with xarray.open_mfdataset (i.e. bypassing the cosima cookbook), and the same error occurred.
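For reference, the workaround looks something like this (memory_limit=0 turns off dask's per-worker memory limit entirely, so use it with care):

from dask.distributed import Client

# memory_limit=0 disables the per-worker memory limit
client = Client(memory_limit=0)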
xarray.open_mfdataset() is pretty straightforward. If parallel=True, it wraps open_dataset, a getattr call (to retrieve the _close() function for each dataset) and any preprocess function as dask.delayed calls for each file, then computes them. It's those subsequent delayed calls that are the issue: the size of the input dependencies is suspiciously close to the size of the actual datasets being loaded. This seems to be an error, as obviously the entire dataset does not need to be loaded into memory to resolve ds._close(), and setting memory_limit=0 confirms that the dask cluster uses nowhere near 8.15 GiB of memory during open_mfdataset(). So I think the next thing to do is to work out exactly what tasks are being created and sent to the workers, and what has changed between analysis3-23.10 and analysis3-24.04. I'll keep you posted.
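Schematically, the parallel=True path builds something like the following (a simplified sketch of what I understand xarray does, not its exact code; the file list is a placeholder):

import dask
import xarray as xr

paths = ["file1.nc", "file2.nc"]  # placeholder file list

# wrap the per-file operations as delayed tasks (simplified)
open_ = dask.delayed(xr.open_dataset)
getattr_ = dask.delayed(getattr)

datasets = [open_(p, chunks={}) for p in paths]
closers = [getattr_(ds, "_close") for ds in datasets]  # the delayed getattr tasks named in the error

# computing these is where the oversized "input dependencies" appear
datasets, closers = dask.compute(datasets, closers)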