Hi Sougata,
When you say that your “processing” is very efficient due to chunking, that is not quite what is happening under the hood.
xarray uses Dask under the hood to store chunked DataArrays, which is what provides features like chunking and lazy evaluation.
Lazy evaluation means that most of the processing you apply to the data is not computed immediately. Instead, Dask records all the operations that need to be carried out in a task graph, which you can inspect through the underlying Dask array: yourDataArray.data.visualize() (for the low-level task graph) or yourDataArray.data.dask.visualize() (for the high-level layer graph).
All these tasks (and so the actual processing) are executed only when the data is actually computed, which in your case happens when you plot it.
That is why you are experiencing very long processing times only at plotting time.
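For illustration, here is a minimal sketch of that behaviour (the file name, variable name and dimension names are placeholders, not your actual data):

```python
import xarray as xr

# Placeholder file/variable/dimension names - replace with your own
ds = xr.open_dataset("your_data.nc", chunks={"time": 100})
da = ds["your_variable"]

# These lines only build the task graph; nothing is computed yet
anomaly = da - da.mean(dim="time")

# Inspect the graphs (requires graphviz to be installed)
anomaly.data.visualize("low_level_graph.png")        # low-level task graph
anomaly.data.dask.visualize("high_level_graph.png")  # high-level layer graph

# Only here is the whole graph actually executed, because plotting
# forces the data to be computed
anomaly.isel(time=0).plot()
```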
To better understand why it is taking so long to process your data, you can look at the total number of tasks in your Dask graph and at what drives that number.
In general, the more tasks the graph contains, the longer the processing takes, although this also depends on the specific operations you are performing. Chunking greatly influences performance too, because it directly determines how many tasks end up in the graph.
In your case, the Dask graph field in your DataArray’s repr says “552 chunks in 1246 graph layers”, which means there are 1246 “operations” (the nodes in the high-level layer graph) that need to be carried out. Each of these layers roughly corresponds to one task per chunk (552 tasks), for a total of approximately 552 × 1246 = 687,792 tasks!! (You can check the exact total with len(yourDataArray.data.dask).)
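For instance, assuming da is your Dask-backed DataArray, you can query those numbers directly:

```python
# Assuming `da` is your Dask-backed DataArray
dask_array = da.data                # the underlying Dask array

print(dask_array.npartitions)       # number of chunks (552 in your case)
print(len(dask_array.dask.layers))  # number of high-level graph layers (1246)
print(len(dask_array.dask))         # total number of low-level tasks
```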
I don’t use xarray as much anymore, so I might be wrong here, but that sounds like a lot of tasks!!
The main reason might be that your chunks are too small, so you may be able to use fewer but bigger chunks. Note that each chunk is currently only about 6 MB, which can safely be increased quite a bit without running into memory issues.
The main chunking direction (the coordinates along which you chunk) also strongly influences performance, and depending on your operations there may be better or worse ways to chunk your data.
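As a rough sketch (the dimension names and chunk sizes below are made up, so adapt them to your data; a commonly cited rule of thumb is chunks on the order of 100 MB):

```python
import xarray as xr

# Hypothetical dimension names and chunk sizes - adjust to your data.
# Assuming `da` is your DataArray; rechunking it adds extra tasks to the graph:
da_rechunked = da.chunk({"time": -1, "lat": 500, "lon": 500})

# Often better: choose bigger chunks directly when opening the file,
# so the whole graph is built with them from the start
ds = xr.open_dataset("your_data.nc",
                     chunks={"time": -1, "lat": 500, "lon": 500})
```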
As mentioned above, I don’t use xarray and dask often anymore, so other people might be able to give you more practical indications on best practices for chunking and performance.
As a starting point, you might want to check out Array Best Practices and Dask Best Practices.
Hope this helps.
Cheers
Davide