Description: Xarray and Dask are increasingly becoming key tools for ocean modellers. However, both vectorisation with xarray and lazy computing with dask remain an enigma to many. This session will explain what happens under the hood of these libraries and provide tips to optimise your code performance. We’ll also walk through the logical steps to transform complicated analyses from for-loops into efficient vectorised functions, using real-world oceanographic examples.
Prerequisites
Basic familiarity with Python, Xarray, Dask and how to load large datasets.
Hopefully these are helpful! I did my best to capture most topics discussed - apologies for any incorrect or unfinished thoughts.
Introduction: Xarray and Dask
They underpin nearly all analysis of climate model output
But both are a bit of an enigma
We often inherit code from others, which is difficult to troubleshoot when those lines stop working
This session: what’s going on under the hood so people know how to start troubleshooting problems
Aim: write faster/better code
What does it mean to write good/fast code?
What is “lazy” computing? What is chunking?
Most examples will use averaging; a more involved, multi-step example is worked through at the end
Why bother to write faster/better code?
Saves time
If spend 5 minutes thinking, will save 20 minutes of writing code
Don’t have to always write nicer code - can be faster to write a hack solution sometimes
Does more
If can write something that runs 10 times faster, can do 10 times more
If can save storage space, can also do more
Makes something work at all
Code demo using numpy and time libraries
Example showing how long it takes to compute 1+1 once vs running it 10^6 times in a loop
Even a small computation takes a while if run over and over
You can time different sections of code
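A minimal sketch of this kind of timing demo (my reconstruction, not the exact code from the session):

    import time

    # one addition is essentially instant
    t0 = time.perf_counter()
    1 + 1
    print(f"once: {time.perf_counter() - t0:.2e} s")

    # the same tiny computation repeated 10^6 times adds up
    t0 = time.perf_counter()
    for _ in range(10**6):
        1 + 1
    print(f"10^6 times: {time.perf_counter() - t0:.2e} s")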
Python hands the actual number-crunching off to C backends (e.g. NumPy)
There is a time cost to passing work from Python to C —> useful to replace loops with vectorisation where possible (though the vectorised call still takes time)
Example: loading the data was taking longer than the computation inside each for loop —> she swapped her loops around to minimise the number of times data had to be loaded
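A sketch of the loop-vs-vectorisation point (illustrative, not her code):

    import time
    import numpy as np

    x = np.random.rand(10**6)

    # Python loop: every iteration crosses the Python/C boundary
    t0 = time.perf_counter()
    total = 0.0
    for value in x:
        total += value
    print(f"loop:       {time.perf_counter() - t0:.3f} s")

    # vectorised: one call, the loop runs in C
    t0 = time.perf_counter()
    total = x.sum()
    print(f"vectorised: {time.perf_counter() - t0:.3f} s")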
Code demo using Xarray
Loads labeled dimensions and variables
It’s a wrapper around a NumPy or Dask array
Dask array: “a numpy array waiting to happen”
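A small sketch of what “wrapper” means here (my example, not from the session):

    import numpy as np
    import xarray as xr

    # an xarray DataArray is labelled metadata around a plain array
    temp = xr.DataArray(
        np.random.rand(4, 3),
        dims=["time", "depth"],
        name="temp",
    )
    print(type(temp.data))                     # numpy.ndarray
    # chunking swaps the underlying array for a dask array
    print(type(temp.chunk({"time": 2}).data))  # dask.array.core.Array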
Lazy computing
~ “do it later”
Computer keeps track of what you want to do, but doesn’t actually do it until you explicitly say .load() or .compute()
Dask breaks arrays up into smaller pieces
Eg summing in different chunks - sum over each chunk, and then sum over all chunks
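Sketch of the chunked, lazy sum (illustrative):

    import dask.array as da

    x = da.ones(1_000_000, chunks=250_000)  # 4 chunks
    total = x.sum()         # lazy: sums each chunk, then sums the partial sums
    print(total)            # still a lazy dask array, nothing computed yet
    print(total.compute())  # now the work actually happens -> 1000000.0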
Demo: using Dask
Need to use a client from the Dask library (best to copy/paste from her notebook)
Setting memory_limit=0 asks Dask not to kill the code when it runs out of memory - that’s left to Gadi
“client.amm.start()” - (amm = active memory manager) can sometimes save some time, so worth adding this line
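The client setup looked roughly like this (my sketch - copy the real lines from her notebook):

    from dask.distributed import Client

    client = Client(memory_limit=0)  # 0 = don't let Dask kill workers over memory; leave that to Gadi
    client.amm.start()               # active memory manager - can sometimes save time
    client                           # in a notebook, this shows a link to the dashboard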
Dask dashboard
Dashboard panels that she likes to use: Worker Memory, Progress, Task Stream and Scheduler System
Dask tells us how many chunks and graph layers are needed for any computation
Can use .visualize() - useful for small computations (e.g. 1 chunk in 3 graph layers), but not recommended for large computations. Better to just read the “Dask graph” line in the inline preview in the notebook
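For example (my sketch; .visualize() needs the graphviz package installed):

    import dask.array as da

    x = da.ones(10, chunks=10)  # a single-chunk array
    y = (x + 1).sum()
    y.visualize("graph.png")    # renders the task graph to a file; only sensible for small graphs
    y                           # in a notebook, the inline preview includes the "Dask graph" line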
It takes time to communicate across tasks
.load() computes the values and stores them in the object itself (and also returns them); .compute() returns a computed copy, leaving the original lazy
.persist() starts the computation and keeps the chunked result in the workers’ memory
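In code (ds and the variable name “temp” are made up for illustration):

    mean = ds["temp"].mean("time")  # lazy
    value = mean.compute()          # returns the computed values; `mean` itself stays lazy
    mean.load()                     # computes and stores the values in `mean` itself
    kept = mean.persist()           # starts computing, keeping the chunked result in worker memory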
If running in a Python script, the client setup needs to go inside a function or an if __name__ == "__main__": block - view the COSIMA tutorial for the specific syntax
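The pattern looks roughly like this (a sketch; see the COSIMA tutorial for the recommended version):

    from dask.distributed import Client

    def main():
        client = Client()  # client creation must not run at import time
        # ... analysis here ...

    if __name__ == "__main__":
        main()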
Demo: using Dask with ACCESS-OM2-01 data
Loaded in chunks as they are stored on disk
Chunk sizes: rule of thumb, ~100MB to 1GB, but can vary a lot as to what is the optimum
Tip: can put %time at the start of a code line (in a notebook) to time how long that line takes to run
Overhead to put pieces on each worker is a couple of milliseconds —> if tasks take a couple milliseconds, that is a sign that chunks are likely too small
If want to increase chunk size (common for netCDF)
Should not use .chunk() after opening —> end up with even more little pieces
It loads with the wrong chunks first, then rechunks, and only then does the maths
Instead, want to specify chunks when you read in the data
“parallel=True” - opens the files in parallel across the workers. Doesn’t always make a difference, but can make things faster
After loading in new chunks, the individual tasks are taking ~0.5 secs, so better for parallelizing
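Roughly like this (the file pattern and chunk sizes here are made up, not from the session):

    import xarray as xr

    ds = xr.open_mfdataset(
        "output*/ocean/ocean_daily.nc",             # hypothetical file pattern
        chunks={"time": 1, "st_ocean": 7,
                "yt_ocean": 300, "xt_ocean": 400},  # ideally multiples of the on-disk chunks
        parallel=True,                              # open the files in parallel across workers
    )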
What if we say we want no chunks when reading in the data?
Will hang for ages - each task grabs all of the ocean depth and then only uses the surface layer
If chunk sizes are too big, each worker will hold too much memory and will likely crash
Example of underlying code: salinity standard deviation
A lot of people say that the “xarray code is wrong”, but it actually comes down to how we compute variance
The last decimal places of the result can be wrong when you take the difference of two very large sums that nearly cancel: the answer is small, but not enough significant digits survive the subtraction (catastrophic cancellation in floating point)
Example of how it’s important to know a little bit of what’s going on under the hood to know how code might behave
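A small demo of the cancellation problem (my example, in float32):

    import numpy as np

    rng = np.random.default_rng(0)
    x = (1e4 + rng.standard_normal(100_000)).astype("float32")  # large mean, small spread

    naive = (x**2).mean() - x.mean()**2    # difference of two huge, nearly equal numbers
    two_pass = ((x - x.mean())**2).mean()  # subtract the mean first, then square

    print(naive, two_pass)  # naive is badly wrong (can even be negative); two_pass is ~1.0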
Black magic of chunking
Use multiples of on-disk chunking
Chunk when reading data in ideally, or sometimes if small dataset but big computation, can do after loading
Best to test with a small subset and watch the task stream —> when you run the full computation you can tell if something looks funny or different from the smaller run
Dask uses ~2X the RAM of a dataset
Sometimes Dask is not helpful - likely because overhead for using Dask is too much
Sometimes faster to just shove small pieces into a for loop
Tip if writing to disk is slow - often the problem is the computation being triggered before the write, not the write itself
Writing to Zarr can work much better/faster than writing to netCDF
Tip is to use trial and error - open the data and check its native (on-disk) chunking before deciding what chunk sizes to read it in with
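One way to peek at the on-disk chunking (the path and variable name are placeholders):

    import xarray as xr

    ds = xr.open_dataset("ocean_daily.nc")        # open without dask chunking first
    print(ds["salt"].encoding.get("chunksizes"))  # netCDF on-disk chunk shape (None if contiguous)
    print(ds["salt"].shape)                       # overall shape, to help pick multiples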