2nd May 2025 - Unlocking the Power of Xarray and Dask

Session 5: Unlocking the Power of Xarray and Dask

:disguised_face: Presenters: Jemma Jeffree @jemmajeffree
:alarm_clock: When: 11am, 2 May 2025 (Friday)
:house: Where: Zoom

:teacher: Description: Xarray and Dask are increasingly becoming key tools for ocean modellers. However, both vectorisation with xarray and lazy computing with dask remain an enigma to many. This session will explain what happens under the hood of these libraries and provide tips to optimize your code performance. We’ll also walk through the logical steps to transform complicated analyses from for-loops into efficient vectorised functions, using real-world oceanographic examples.

:laptop: Prerequisites

  1. Basic familiarity with Python, Xarray, Dask and how to load large datasets.

Not necessary for attending the session, but for reference, here are my slides and Jupyter notebooks from the session:

slides:
xarraydask-Jeffree.pdf (1.4 MB)

notebooks:

(also available at /scratch/public/jj8842_20250502 on the day of the presentation)


I’ll also note that I’m happy to take further questions here


Select feedback

Thank you Jemma! This was an awesome presentation! So much care :slightly_smiling_face: Very helpful.

Ditto, I learned so much

Thank you! Super useful :slight_smile:

Great presentation Jemma!

Thanks Jemma that was awesome!

Thank you! Great topic and presentation.


Notes taken during the session

Hopefully these are helpful! I did my best to capture most topics discussed - apologies for any incorrect or unfinished thoughts.

Introduction: Xarray and Dask

  • They underpin most of the analysis you do of climate model output
  • But both remain a bit of an enigma
  • We often get code from others, and it’s then difficult to troubleshoot when those lines of code stop working
  • This session: what’s going on under the hood, so people know how to start troubleshooting problems
    • Aim: write faster/better code
    • What does it mean to write good/fast code?
    • What is “lazy” computing? What is chunking?
  • Most examples will use averaging; at the end, a more involved example will be worked through step by step

Why bother to write faster/better code?

  1. Saves time
  • If you spend 5 minutes thinking, you can save 20 minutes of writing code
  • You don’t always have to write nicer code - sometimes it’s faster to write a hack solution
  2. Does more
  • If you can write something that runs 10 times faster, you can do 10 times more
  • If you can save storage space, you can also do more
  3. Makes something work at all

Code demo using numpy and time libraries

  • Example showing how long it takes to compute 1+1 once vs running it 10^6 times
    • Even a small computation takes a while if run over and over
  • You can time different sections of code
  • Python hands the actual work off to C under the hood
    • There is a cost each time Python passes work to C -> useful to replace loops with vectorisation where possible, though the vectorisation itself still takes time (see the sketch below)
  • Example: the time was going into loading the data, rather than the computation within each for loop -> she reordered her loops to minimise the number of times the data had to be loaded
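
Not from the session notebook, but a minimal sketch of the kind of comparison discussed above: summing 10^6 numbers one at a time in a Python for loop vs a single vectorised numpy call.

```python
import time

import numpy as np

x = np.random.rand(10**6)

# Python loop: every iteration pays the cost of crossing from Python into C
start = time.perf_counter()
total = 0.0
for value in x:
    total += value
print(f"for loop:   {time.perf_counter() - start:.4f} s")

# Vectorised: one call hands the whole array to numpy's C backend at once
start = time.perf_counter()
total = x.sum()
print(f"vectorised: {time.perf_counter() - start:.4f} s")
```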

Code demo using Xarray

  • Loads data with labelled dimensions and variables
    • It’s a wrapper around a Numpy or Dask array (see the sketch below)
    • Dask array: “a numpy array waiting to happen”
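
A small sketch (not from the session notebook) of what “wrapper” means here: an xarray DataArray holding a numpy array plus labelled dimensions and coordinates. The sea surface temperature field is hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical 1-degree monthly sea surface temperature field
sst = xr.DataArray(
    np.random.rand(12, 180, 360),
    dims=("time", "lat", "lon"),
    coords={"lat": np.arange(-89.5, 90), "lon": np.arange(0.5, 360)},
    name="sst",
)

print(type(sst.data))                # the wrapped array: numpy.ndarray here
print(sst.mean(dim=("lat", "lon")))  # operate on named dimensions, not axis numbers
```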

Lazy computing

  • ~ “do it later”
    • The computer keeps track of what you want to do, but doesn’t actually do it until you explicitly say .load() or .compute()
  • Dask breaks arrays up into smaller pieces (chunks)
    • E.g. summing in chunks: sum over each chunk, and then sum the per-chunk sums (see the sketch below)
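
A minimal sketch of lazy computing with a hypothetical dask array (the shape and chunk sizes are chosen arbitrarily):

```python
import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))  # 100 chunks

total = x.sum()  # builds a task graph: sum each chunk, then sum the chunk sums
print(total)     # still lazy - a dask array with a graph, no numbers yet

print(total.compute())  # only now does dask actually run the computation
```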

Demo: using Dask

  • Need to use a client from the Dask library (best to copy/paste from her notebook)
    • memory_limit=0 asks Dask not to kill the code for using too much memory, so it’s up to Gadi to do that
    • “client.amm.start()” - (amm = active memory manager) can sometimes save some time, so worth adding this line
  • Dask dashboard
    • Dashboard windows that she likes to use: Worker memory, progress, task stream and scheduler system
  • Dask tells us how many chunks and graph layers are needed for any computation
    • Can use .visualize() - useful for small computations (e.g. 1 chunk in 3 graph layers), but not recommended for large computations. Better to just read the “Dask graph” line in the inline preview in the notebook
  • It takes time to communicate across tasks
  • .load() computes the values and stores them back in that variable; .compute() returns the computed values but doesn’t change the original variable
    • .persist() computes the values and keeps them spread across the workers’ memory
  • If running in a Python script, the client lines need to go inside a function or an if __name__ == "__main__": guard - view the COSIMA tutorial for the specific syntax (see the sketch below)
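
A sketch of the client setup described above (the settings are the ones mentioned in the session; for the exact recommended lines, copy from her notebook or the COSIMA tutorial):

```python
from dask.distributed import Client

# The guard is needed when this runs as a script rather than in a notebook
if __name__ == "__main__":
    client = Client(memory_limit=0)  # 0 = don't let Dask kill the code; leave that to Gadi
    client.amm.start()               # active memory manager - can sometimes save time
    print(client.dashboard_link)     # open this URL to watch the dashboard
```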

Demo: using Dask with ACCESS-OM2-01 data

  • Loaded in chunks as they are stored on disk
  • Chunk sizes: rule of thumb, ~100MB to 1GB, but can vary a lot as to what is the optimum
  • Tip: can add %time at the start of a code line (in a notebook) to time how long that line takes to run
  • The overhead of putting a piece of work on a worker is a couple of milliseconds -> if the tasks themselves take a couple of milliseconds, that’s a sign the chunks are likely too small
  • If you want to increase chunk size (common for netCDF):
    • Don’t use .chunk() -> you end up with even more little pieces
      • It loads with the wrong chunks, then rechunks, and only then does the maths
      • Instead, specify chunks when you read in the data (see the sketch after this list)
  • “parallel=True” in open_mfdataset opens the files in parallel across the workers. Doesn’t always make a difference, but can make things faster
  • After loading with the new chunks, the individual tasks take ~0.5 s, which is better for parallelising
  • What if we say we want no chunks when reading in the data?
    • It will hang for ages - it grabs all of the ocean depths and then only uses the surface layer
  • If chunk sizes are too big, each worker will hold too much memory and will likely crash
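
A sketch of specifying chunks at read time instead of rechunking afterwards. The path, dimension names and chunk sizes here are hypothetical - pick multiples of the on-disk chunking for your own dataset:

```python
import xarray as xr

# Good: chunks specified when the data is read in
ds = xr.open_mfdataset(
    "/path/to/output/*.nc",  # hypothetical path
    chunks={"time": 12, "st_ocean": 7, "yt_ocean": 300, "xt_ocean": 400},
    parallel=True,           # open the files in parallel across the workers
)

# Not this: loads with the default chunks, then adds a rechunking step
# to every computation
# ds = xr.open_mfdataset("/path/to/output/*.nc").chunk({"time": 12})
```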

Example of underlying code: salinity standard deviation

  • A lot of people say that “the xarray code is wrong”, but it actually comes down to how the variance is computed
  • The last decimal places of the result can be slightly off when you take the difference of two very large sums and the result is very small (the issue is that only so many significant digits are stored) - see the sketch below
  • An example of why it’s important to know a little about what’s going on under the hood, to understand how code might behave
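
A minimal sketch of the precision issue (not the session’s example, but the same mechanism): computing variance as mean(x^2) - mean(x)^2 cancels catastrophically in float32 when the mean (~35 for salinity) is large relative to the spread.

```python
import numpy as np

# Values around 35 with tiny variations, stored in float32
x = (35.0 + 1e-3 * np.random.randn(10**6)).astype(np.float32)

one_pass = np.mean(x**2) - np.mean(x) ** 2  # difference of two large numbers
two_pass = np.mean((x - np.mean(x)) ** 2)   # subtract the mean first

print(one_pass)  # can be badly wrong, even negative
print(two_pass)  # ~1e-6, as expected
```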

Black magic of chunking

  • Use multiples of on-disk chunking
  • Ideally chunk when reading the data in; sometimes, for a small dataset with a big computation, it can be done after loading
  • Best to test with a small subset and watch the task stream -> when you do the full computation you can see if something looks funny or different from the smaller-subset run
  • Dask uses ~2x the RAM of a dataset
  • Sometimes Dask is not helpful - likely because overhead for using Dask is too much
    • Sometimes faster to just shove small pieces into a for loop
  • Tip: if writing to disk seems slow, often the problem is the computation before the write rather than the write itself
    • Writing to Zarr can work much better/faster than writing to netCDF
  • Tip: use trial and error - read the data in and view its native chunking before deciding what chunk sizes to read in with (see the sketch below)
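
A sketch of that trial-and-error workflow (the path and variable name are hypothetical; the "chunksizes" entry is what the netCDF backend records):

```python
import xarray as xr

# Open cheaply first: this reads metadata only, not the data itself
ds = xr.open_dataset("/path/to/output.nc")
print(ds["temp"].encoding.get("chunksizes"))  # the on-disk (netCDF) chunk shape

# Then reopen with read-time chunks that are multiples of the on-disk ones, e.g.
# ds = xr.open_dataset("/path/to/output.nc", chunks={"time": 12, "st_ocean": 7})
```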