How to efficiently chunk data for faster processing and plotting?

Hi all,
I have been using the wonderful COSIMA Cookbook to access the OM2 datasets for a while. It’s great for searching and loading datasets, but I am facing a bit of an issue with chunking, as I am new to this type of coding. The chunking makes my processing very efficient, but when I try to plot the results it takes forever.

For example, below is my processed data array for coherence between two variables at every grid point. If I want to plot this data (for example using plt.contourf), I think it tries to load the whole array into memory, and that is where it takes forever. I wanted to know if there is a nicer and faster way to do this. Any help would be much appreciated.

1 Like

Hi Sougata,

When you say that your “processing” is very efficient due to chunking, that is not exactly what is happening under the hood.

xarray uses Dask under the hood to store DataArrays, which provides features like chunking and lazy evaluation.
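
As a rough sketch of what this looks like in practice (the file path, variable name, and chunk sizes below are only placeholders):

```python
import xarray as xr

# Passing `chunks` makes xarray wrap each variable in a Dask array instead of
# reading it into memory straight away.
ds = xr.open_dataset("om2_output.nc", chunks={"time": 12})
da = ds["temp"]

# Nothing has been read from disk yet: the array only knows its chunk layout.
print(da.chunks)
```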

Lazy evaluation means most of the processing you do on the data is not computed immediately. Instead, Dask stores all the operations that need to be carried out in a task graph, which can be viewed via the underlying Dask array with yourDataArray.data.visualize() (for the low-level task graph) or yourDataArray.data.dask.visualize() (for the high-level layer graph).
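
For example, something along these lines builds up a graph without computing anything, which you can then render (this assumes the graphviz package is installed and recent Dask; the variable names and the operation itself are just placeholders):

```python
# `da` stands in for a Dask-backed DataArray; the operation is only an example.
result = (da - da.mean(dim="time")).sum(dim="time")

# Low-level task graph of the underlying Dask array (needs graphviz installed).
result.data.visualize(filename="low_level_graph.png")

# High-level layer graph: one node per operation/layer.
result.data.dask.visualize(filename="high_level_graph.png")
```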

All these tasks (and so the actual processing) are only executed when the data is actually computed. In your case, that happens when you plot the data, which is why you are experiencing very long processing times only at plotting time.
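
One way to make this explicit is to compute the data yourself before plotting, so the slow step is clearly the Dask computation rather than the plotting call. A minimal sketch, assuming coherence is your processed DataArray:

```python
import matplotlib.pyplot as plt

# .compute() runs every queued Dask task and returns an in-memory (NumPy-backed)
# copy, so the expensive step happens here rather than inside the plotting call.
coherence_in_memory = coherence.compute()

# Plotting the in-memory result is then quick; for a 2D array xarray's
# .plot.contourf() is a thin wrapper around plt.contourf().
coherence_in_memory.plot.contourf()
plt.show()
```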

To better understand why it is taking so long to process your data, you can take a look at the total number of tasks in your Dask graph and at what influences that number.
In general, the more tasks your Dask graph has, the longer the processing will take, although this also depends on the specific algorithm (operation) you are performing. Chunking greatly influences performance as well, and it also determines the total number of tasks in the Dask graph.

In your case, the Dask graph field in your DataArray says “552 chunks in 1246 graph layers”, which means there are 1246 “operations” (the nodes in the high-level layer graph) that need to be carried out. Each of these operations has 552 tasks (the number of your chunks), for a total of approximately 552 * 1246 = 687,792 tasks!! (You can also check the total number of tasks with len(yourDataArray.data.dask).)
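
A quick way to inspect these numbers yourself (again with coherence standing in for your processed DataArray):

```python
# Chunk count, high-level layer count, and total low-level task count.
print(coherence.data.npartitions)       # number of chunks, e.g. 552
print(len(coherence.data.dask.layers))  # number of high-level layers, e.g. 1246
print(len(coherence.data.dask))         # total number of low-level tasks
```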

I don’t use xarray as much anymore, so I might be wrong here, but that sounds like a lot of tasks!!
The main reason might be that your chunks are too small, so you may be able to use fewer but bigger chunks. Also note that each chunk is currently only 6 MB, which can safely be increased without running into memory issues.
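
A sketch of two ways to get bigger chunks, with placeholder file, variable, and dimension names:

```python
import xarray as xr

# Option 1: let Dask pick chunk sizes when opening the data
# (its default target is roughly 128 MB per chunk).
ds = xr.open_dataset("om2_output.nc", chunks="auto")

# Option 2: rechunk an existing DataArray into fewer, larger chunks.
# The dimension names and sizes here are only illustrative.
da_rechunked = ds["temp"].chunk({"time": 100, "yt_ocean": 300, "xt_ocean": 360})
```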

The main chunking direction (the coordinates along which you chunk) also strongly influences performance, and depending on your operations there may be better or worse ways to chunk your data.
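
For instance, since a coherence calculation works along time at every grid point, keeping time in a single chunk and chunking over space instead is usually the better layout. A sketch, with placeholder names again:

```python
import xarray as xr

# Keep the full time axis in one chunk (-1) and split over the spatial dimensions,
# so each grid point's time series stays within a single chunk.
ds = xr.open_dataset(
    "om2_output.nc",
    chunks={"time": -1, "yt_ocean": 300, "xt_ocean": 360},
)

# The opposite layout (many small chunks along time) forces Dask to stitch the
# time series back together at every grid point before it can compute anything.
```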

As mentioned above, I don’t use xarray and Dask often anymore, so other people might be able to give you more practical advice on best practices for chunking and performance.
As a starting point, you might want to check out Array Best Practices and Dask Best Practices.

Hope this helps.

Cheers
Davide

3 Likes

Thanks very much @atteggiani for such a detailed and insightful answer. It cleared up many of my doubts! I will try out your suggestions :)

1 Like

(If @atteggiani’s answer clarifies your questions @sb4233, then could you click “Select if this reply solves the problem” at the bottom of the post?)

2 Likes

Hey @sb4233 - if you are working with OM2 regularly and want to explore and help document how to improve xarray / dask workflows then consider jumping over here and documenting more specifics about your use case.

We have spare compute and TBs of storage to enable analysis-ready data (ARD) workflows in the new vn19 project.

3 Likes

Thanks @Thomas-Moore for introducing vn19! It seems like a great initiative for addressing questions related to analyzing large datasets. Will try to be an active participant :)

2 Likes