How to efficiently chunk data for faster processing and plotting?

Hi all,
I have been using the wonderful COSIMA Cookbook to access the OM2 datasets for a while. It’s great for searching and loading datasets, but I am facing a bit of an issue with chunking, as I am new to this type of coding. The chunking makes my processing very efficient, but when I try to plot the results it takes forever.

For example, below is my processed data array for coherence between two variables at every grid point. If I want to plot this data (for example using plt.contourf), I think it tries to load the whole array into memory, and that is where it takes forever. I wanted to know if there is a nicer and faster way to do this. Any help would be much appreciated.

1 Like

Hi Sougata,

When you say that your “processing” is very efficient due to chunking, that is not exactly what is happening under the hood.

xarray uses Dask under the hood to store DataArrays, which provides features like chunking and lazy evaluation.
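
As a rough sketch of what this looks like in practice (the file path, variable name, and chunk sizes below are only placeholders):

```python
import xarray as xr

# Passing `chunks` makes xarray wrap each variable in a Dask array instead of
# reading it into memory straight away.
ds = xr.open_dataset("om2_output.nc", chunks={"time": 12})
da = ds["temp"]

# Nothing has been read from disk yet: the array only knows its chunk layout.
print(da.chunks)
```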

Lazy evaluation means most of the processing you do on the data is not computed immediately. Instead, Dask stores all the operations that need to be carried out in a task graph, which can be viewed via the underlying Dask array with yourDataArray.data.visualize() (for the low-level task graph) or yourDataArray.data.dask.visualize() (for the high-level layer graph).
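
For example, something along these lines builds up a graph without computing anything, which you can then render (this assumes the graphviz package is installed and recent Dask; the variable names and the operation itself are just placeholders):

```python
# `da` stands in for a Dask-backed DataArray; the operation is only an example.
result = (da - da.mean(dim="time")).sum(dim="time")

# Low-level task graph of the underlying Dask array (needs graphviz installed).
result.data.visualize(filename="low_level_graph.png")

# High-level layer graph: one node per operation/layer.
result.data.dask.visualize(filename="high_level_graph.png")
```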

All these tasks (and so the actual processing) are only executed when the data is actually computed. In your case, that happens when you plot the data, which is why you are experiencing very long processing times only at plotting time.
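
One way to make this explicit is to compute the data yourself before plotting, so the slow step is clearly the Dask computation rather than the plotting call. A minimal sketch, assuming coherence is your processed DataArray:

```python
import matplotlib.pyplot as plt

# .compute() runs every queued Dask task and returns an in-memory (NumPy-backed)
# copy, so the expensive step happens here rather than inside the plotting call.
coherence_in_memory = coherence.compute()

# Plotting the in-memory result is then quick; for a 2D array xarray's
# .plot.contourf() is a thin wrapper around plt.contourf().
coherence_in_memory.plot.contourf()
plt.show()
```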

To better understand why it is taking so long to process your data, you can take a look at the total number of tasks in your Dask graph and at what influences that number.
In general, the more tasks your Dask graph has, the longer the processing will take, although this also depends on the specific algorithm (operation) you are performing. Chunking greatly influences performance as well, and it also determines the total number of tasks in the Dask graph.

In your case, the Dask graph field in your DataArray says “552 chunks in 1246 graph layers”, which means there are 1246 “operations” (the nodes in the high-level layer graph) that need to be carried out. Each of these operations has 552 tasks (the number of your chunks), for a total of approximately 552 * 1246 = 687,792 tasks!! (You can also check the total number of tasks with len(yourDataArray.data.dask).)
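
A quick way to inspect these numbers yourself (again with coherence standing in for your processed DataArray):

```python
# Chunk count, high-level layer count, and total low-level task count.
print(coherence.data.npartitions)       # number of chunks, e.g. 552
print(len(coherence.data.dask.layers))  # number of high-level layers, e.g. 1246
print(len(coherence.data.dask))         # total number of low-level tasks
```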

I don’t use xarray as much anymore, so I might be wrong here, but that sounds like a lot of tasks!!
The main reason might be that your chunks are too small, so you may be able to use fewer but bigger chunks. Also note that each chunk is currently only 6 MB, which can safely be increased without running into memory issues.
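
A sketch of two ways to get bigger chunks, with placeholder file, variable, and dimension names:

```python
import xarray as xr

# Option 1: let Dask pick chunk sizes when opening the data
# (its default target is roughly 128 MB per chunk).
ds = xr.open_dataset("om2_output.nc", chunks="auto")

# Option 2: rechunk an existing DataArray into fewer, larger chunks.
# The dimension names and sizes here are only illustrative.
da_rechunked = ds["temp"].chunk({"time": 100, "yt_ocean": 300, "xt_ocean": 360})
```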

The main chunking direction (the coordinates along which you chunk) also strongly influences performance, and depending on your operations there may be better or worse ways to chunk your data.
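
For instance, since a coherence calculation works along time at every grid point, keeping time in a single chunk and chunking over space instead is usually the better layout. A sketch, with placeholder names again:

```python
import xarray as xr

# Keep the full time axis in one chunk (-1) and split over the spatial dimensions,
# so each grid point's time series stays within a single chunk.
ds = xr.open_dataset(
    "om2_output.nc",
    chunks={"time": -1, "yt_ocean": 300, "xt_ocean": 360},
)

# The opposite layout (many small chunks along time) forces Dask to stitch the
# time series back together at every grid point before it can compute anything.
```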

As mentioned above, I don’t use xarray and Dask often anymore, so other people might be able to give you more practical advice on best practices for chunking and performance.
As a starting point, you might want to check out Array Best Practices and Dask Best Practices.

Hope this helps.

Cheers
Davide

3 Likes

Thanks very much @atteggiani for such a detailed and insightful answer. It cleared up many of my doubts! I will try out your suggestions :)

1 Like

(If @atteggiani’s answer clarifies your questions @sb4233, then could you click “Select if this reply solves the problem” at the bottom of the post?)

2 Likes

Hey @sb4233 - if you are working with OM2 regularly and want to explore and help document how to improve xarray / dask workflows then consider jumping over here and documenting more specifics about your use case.

We have spare compute and TBs of storage to enable analysis-ready data (ARD) workflows in the new vn19 project.

3 Likes

Thanks @Thomas-Moore for introducing vn19! It seems like a great initiative for addressing questions related to analyzing large datasets. Will try to be an active participant :)

2 Likes