Description: Xarray and Dask are increasingly becoming key tools for ocean modellers. However, both vectorisation with xarray and lazy computing with dask remain an enigma to many. This session will explain what happens under the hood of these libraries and provide tips to optimise your code performance. We’ll also walk through the logical steps to transform complicated analyses from for-loops into efficient vectorised functions, using real-world oceanographic examples.
Prerequisites
Basic familiarity with Python, Xarray, Dask and how to load large datasets.
Hopefully these are helpful! I did my best to capture most topics discussed - apologies for any incorrect or unfinished thoughts.
Introduction: Xarray and Dask
They underpin nearly all analysis of climate model output
But both are a bit of an enigma
We often inherit code from others, which is difficult to troubleshoot when those lines stop working
This session: what’s going on under the hood so people know how to start troubleshooting problems
Aim: write faster/better code
What does it mean to write good/fast code?
What is “lazy” computing? What is chunking?
Most examples will use averaging; a more involved, multi-step example is worked through at the end
Why bother to write faster/better code?
Saves time
If spend 5 minutes thinking, will save 20 minutes of writing code
Don’t have to always write nicer code - can be faster to write a hack solution sometimes
Does more
If can write something that runs 10 times faster, can do 10 times more
If can save storage space, can also do more
Makes something work at all
Code demo using numpy and time libraries
Example showing how long it takes to compute 1+1 once vs running it 10^6 times in a loop
Even a small computation takes a while if run over and over
You can time different sections of code
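A minimal sketch of this kind of timing demo (my reconstruction, not the exact code from the session):

    import time

    # one addition is essentially instant
    t0 = time.perf_counter()
    1 + 1
    print(f"once: {time.perf_counter() - t0:.2e} s")

    # the same tiny computation repeated 10^6 times adds up
    t0 = time.perf_counter()
    for _ in range(10**6):
        1 + 1
    print(f"10^6 times: {time.perf_counter() - t0:.2e} s")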
Python hands the actual number-crunching off to C backends (e.g. NumPy)
There is a time cost to passing work from Python to C —> useful to replace loops with vectorisation where possible (though the vectorised call still takes time)
Example: loading the data was taking longer than the computation inside each for loop —> she swapped her loops around to minimise the number of times data had to be loaded
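A sketch of the loop-vs-vectorisation point (illustrative, not her code):

    import time
    import numpy as np

    x = np.random.rand(10**6)

    # Python loop: every iteration crosses the Python/C boundary
    t0 = time.perf_counter()
    total = 0.0
    for value in x:
        total += value
    print(f"loop:       {time.perf_counter() - t0:.3f} s")

    # vectorised: one call, the loop runs in C
    t0 = time.perf_counter()
    total = x.sum()
    print(f"vectorised: {time.perf_counter() - t0:.3f} s")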
Code demo using Xarray
Loads labeled dimensions and variables
It’s a wrapper around a NumPy or Dask array
Dask array: “a numpy array waiting to happen”
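A small sketch of what “wrapper” means here (my example, not from the session):

    import numpy as np
    import xarray as xr

    # an xarray DataArray is labelled metadata around a plain array
    temp = xr.DataArray(
        np.random.rand(4, 3),
        dims=["time", "depth"],
        name="temp",
    )
    print(type(temp.data))                     # numpy.ndarray
    # chunking swaps the underlying array for a dask array
    print(type(temp.chunk({"time": 2}).data))  # dask.array.core.Array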
Lazy computing
~ “do it later”
Computer keeps track of what you want to do, but doesn’t actually do it until you explicitly say .load() or .compute()
Dask breaks arrays up into smaller pieces
Eg summing in different chunks - sum over each chunk, and then sum over all chunks
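Sketch of the chunked, lazy sum (illustrative):

    import dask.array as da

    x = da.ones(1_000_000, chunks=250_000)  # 4 chunks
    total = x.sum()         # lazy: sums each chunk, then sums the partial sums
    print(total)            # still a lazy dask array, nothing computed yet
    print(total.compute())  # now the work actually happens -> 1000000.0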
Demo: using Dask
Need to use a client from the Dask library (best to copy/paste from her notebook)
Setting memory_limit=0 asks Dask not to kill the code when it runs out of memory - that’s left to Gadi
“client.amm.start()” - (amm = active memory manager) can sometimes save some time, so worth adding this line
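The client setup looked roughly like this (my sketch - copy the real lines from her notebook):

    from dask.distributed import Client

    client = Client(memory_limit=0)  # 0 = don't let Dask kill workers over memory; leave that to Gadi
    client.amm.start()               # active memory manager - can sometimes save time
    client                           # in a notebook, this shows a link to the dashboard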
Dask dashboard
Dashboard panels that she likes to use: Worker Memory, Progress, Task Stream and Scheduler System
Dask tells us how many chunks and graph layers are needed for any computation
Can use .visualize() - useful for small computations (e.g. 1 chunk in 3 graph layers), but not recommended for large computations. Better to just read the “Dask graph” line in the inline preview in the notebook
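For example (my sketch; .visualize() needs the graphviz package installed):

    import dask.array as da

    x = da.ones(10, chunks=10)  # a single-chunk array
    y = (x + 1).sum()
    y.visualize("graph.png")    # renders the task graph to a file; only sensible for small graphs
    y                           # in a notebook, the inline preview includes the "Dask graph" line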
It takes time to communicate across tasks
.load() computes the values and stores them in the object itself (and also returns them); .compute() returns a computed copy, leaving the original lazy
.persist() starts the computation and keeps the chunked result in the workers’ memory
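In code (ds and the variable name “temp” are made up for illustration):

    mean = ds["temp"].mean("time")  # lazy
    value = mean.compute()          # returns the computed values; `mean` itself stays lazy
    mean.load()                     # computes and stores the values in `mean` itself
    kept = mean.persist()           # starts computing, keeping the chunked result in worker memory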
If running in a Python script, the client setup needs to go inside a function or an if __name__ == "__main__": block - view the COSIMA tutorial for the specific syntax
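The pattern looks roughly like this (a sketch; see the COSIMA tutorial for the recommended version):

    from dask.distributed import Client

    def main():
        client = Client()  # client creation must not run at import time
        # ... analysis here ...

    if __name__ == "__main__":
        main()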
Demo: using Dask with ACCESS-OM2-01 data
Loaded in chunks as they are stored on disk
Chunk sizes: rule of thumb, ~100MB to 1GB, but can vary a lot as to what is the optimum
Tip: can put %time at the start of a code line (in a notebook) to time how long that line takes to run
Overhead to put pieces on each worker is a couple of milliseconds —> if tasks take a couple milliseconds, that is a sign that chunks are likely too small
If want to increase chunk size (common for netCDF)
Should not use .chunk() after opening —> end up with even more little pieces
It loads with the wrong chunks first, then rechunks, and only then does the maths
Instead, want to specify chunks when you read in the data
“parallel=True” - opens the files in parallel across the workers. Doesn’t always make a difference, but can make things faster
After loading in new chunks, the individual tasks are taking ~0.5 secs, so better for parallelizing
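Roughly like this (the file pattern and chunk sizes here are made up, not from the session):

    import xarray as xr

    ds = xr.open_mfdataset(
        "output*/ocean/ocean_daily.nc",             # hypothetical file pattern
        chunks={"time": 1, "st_ocean": 7,
                "yt_ocean": 300, "xt_ocean": 400},  # ideally multiples of the on-disk chunks
        parallel=True,                              # open the files in parallel across workers
    )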
What if we say we want no chunks when reading in the data?
Will hang for ages - each task grabs all of the ocean depth and then only uses the surface layer
If chunk sizes are too big, each worker will hold too much memory and will likely crash
Example of underlying code: salinity standard deviation
A lot of people say that the “xarray code is wrong”, but it actually comes down to how we compute variance
The last decimal places of the result can be wrong when you take the difference of two very large sums that nearly cancel: the answer is small, but not enough significant digits survive the subtraction (catastrophic cancellation in floating point)
Example of how it’s important to know a little bit of what’s going on under the hood to know how code might behave
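A small demo of the cancellation problem (my example, in float32):

    import numpy as np

    rng = np.random.default_rng(0)
    x = (1e4 + rng.standard_normal(100_000)).astype("float32")  # large mean, small spread

    naive = (x**2).mean() - x.mean()**2    # difference of two huge, nearly equal numbers
    two_pass = ((x - x.mean())**2).mean()  # subtract the mean first, then square

    print(naive, two_pass)  # naive is badly wrong (can even be negative); two_pass is ~1.0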
Black magic of chunking
Use multiples of on-disk chunking
Chunk when reading data in ideally, or sometimes if small dataset but big computation, can do after loading
Best to test with a small subset and watch the task stream —> when you run the full computation you can tell if something looks funny or different from the smaller run
Dask uses ~2X the RAM of a dataset
Sometimes Dask is not helpful - likely because overhead for using Dask is too much
Sometimes faster to just shove small pieces into a for loop
Tip if writing to disk is slow - often the problem is the computation being triggered before the write, not the write itself
Writing to Zarr can work much better/faster than writing to netCDF
Tip is to use trial and error - open the data and check its native (on-disk) chunking before deciding what chunk sizes to read it in with
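One way to peek at the on-disk chunking (the path and variable name are placeholders):

    import xarray as xr

    ds = xr.open_dataset("ocean_daily.nc")        # open without dask chunking first
    print(ds["salt"].encoding.get("chunksizes"))  # netCDF on-disk chunk shape (None if contiguous)
    print(ds["salt"].shape)                       # overall shape, to help pick multiples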