Running complex numpy-based code in parallel with xarray/dask

Background

I have written a Python library called atmos for performing thermodynamic calculations and deriving various convective parameters. The code works with numpy arrays and is designed to perform calculations in a vectorised manner. In particular, it permits vectorised calculation of quantities such as CAPE and CIN. Other similar libraries such as MetPy can only perform these calculations on 1D vertical profiles rather than multidimensional arrays (e.g., metpy.calc.cape_cin). Vectorisation of these calculations in atmos is achieved by making extensive use of Boolean indexing (see for example the function parcel.parcel_ascent).

The problem

I want to apply atmos to some reanalysis and climate model output, which comprises large four-dimensional (time x level x latitude x longitude) arrays. To make the calculations performant, I was hoping to use xarray/dask to split the data into chunks and process these in parallel. Based on the advice of a colleague, I have been attempting to do this using xarray's apply_ufunc function. However, I'm struggling to get any performance benefit and am often running into issues where Python/Jupyter just hangs indefinitely. Historically, I've never had much success with using xarray for parallel processing, so I have tended to just revert to using numpy arrays with multiprocessing. However, the arrays I'm working with here are so big that I'm not sure this is feasible. Processing large datasets using xarray always seems to work well in the tutorials (such as this recent one from the ACCESS-NRI team), and it would be great to be able to utilise this functionality in my own work.

Next steps

Having spent the last few days struggling with this, I figured it was time to ask for some help. If anyone has any advice on the use of xr.apply_ufunc, or more generally on parallelising numpy-enabled functions, I would be extremely grateful. Even better, if someone has time, it would be great to be able to talk through the code via video call as I’m sure this would lead to a more efficient diagnosis of the issues.

Cheers,
Rob


Hi Rob,

Thanks for the clear description of the problem.

If I understand correctly, your existing code is entirely vectorised and works with numpy arrays, but you would like to apply it to larger datasets that don't fit into memory? If so, dask is a good candidate to help.

Given the code is already vectorised, we can probably get it to work with dask/xarray with minimal changes. For example, something like np.mean(x, axis=2) or z = x**2 will work whether x is a numpy array (your current code) or an xarray DataArray, and in the latter case the work is parallelised across chunks. Typically, the biggest problem with this kind of conversion is chunk sizes. However, the boolean indexing you mention might mean that the full boolean index gets loaded into memory (even if it is too big to fit), and the np.zeros lines in your parcel.parcel_ascent function could cause similar problems. So you might have to comb through and make sure nothing is secretly creating "eager" (non-lazy, in-memory) arrays.
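To make that concrete, here's a rough sketch of the sort of thing that stays lazy versus the sort of thing that forces data into memory. The file name and variable names are made up, not your actual code:

import numpy as np
import xarray as xr

# Open the data lazily, backed by dask, chunked along time
ds = xr.open_dataset("reanalysis_sample.nc", chunks={"time": 10})
t = ds["temperature"]

# Arithmetic and reductions stay lazy and parallelise across chunks
theta = t * (1000.0 / ds["pressure"]) ** 0.286   # nothing computed yet
mean_theta = theta.mean(dim="time")              # still lazy

# Boolean *indexing* like t.values[mask] would pull everything into memory;
# xr.where expresses the same logic lazily by masking rather than selecting
saturated_t = xr.where(ds["rh"] >= 100.0, t, np.nan)

# Similarly, prefer zeros_like/full_like over np.zeros(shape) so the result
# inherits the dask chunking instead of being a big eager numpy array
work = xr.zeros_like(t)

result = mean_theta.compute()   # the parallel computation happens here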

The other approach is, as you say, to use xarray's apply_ufunc to wrap around your function that uses numpy arrays. This example in the COSIMA recipes is, I think, the same style of thing you're trying to do, so it might be helpful?
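The general shape is something like this. It's only a sketch: the dataset, variable names, and the placeholder function are invented, and I haven't run it against atmos:

import numpy as np
import xarray as xr

def cape_cin_numpy(pressure, temperature, dewpoint):
    # Placeholder for a vectorised numpy function like those in atmos.
    # apply_ufunc moves the core dimension ('level') to the last axis,
    # so this function reduces over axis=-1.
    buoyant = np.maximum(temperature - dewpoint, 0.0)    # nonsense maths
    negative = np.minimum(temperature - dewpoint, 0.0)
    return buoyant.sum(axis=-1), negative.sum(axis=-1)

ds = xr.open_dataset("model_output.nc", chunks={"time": 10})

cape, cin = xr.apply_ufunc(
    cape_cin_numpy,
    ds["pressure"], ds["temperature"], ds["dewpoint"],
    input_core_dims=[["level"], ["level"], ["level"]],  # dims the function consumes
    output_core_dims=[[], []],                          # one value per column, per output
    dask="parallelized",     # apply the numpy function independently to each chunk
    output_dtypes=[float, float],
)

cape = cape.compute()   # triggers the parallel computation

Note that with dask="parallelized" the core dimension ('level' here) must not be split across chunks, so chunk only along time/lat/lon.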

I guess the other thing I'd say is that in situations where data fits into memory, well-written code working with numpy arrays will generally outperform well-written code working with xarray/dask arrays — the advantage of xarray and dask comes from being able to work with datasets too big to fit into memory. So as long as you don't get a major performance hit with xarray/dask on small samples, I don't think it's a problem. We should be able to fix the hanging issue here though, which sounds (at first glance) like chunks that are too small or too big.

I'm happy to Zoom sometime next week and talk through converting an example function; it sounds like you're most of the way there and are just getting caught up on some of the smaller technicalities. If that would help, DM me and we can find a time.

Jemma


Hi Jemma,

Many thanks for your prompt reply to my query. I have spent a bit of time over the last few days profiling the atmos code running directly on numpy arrays. In doing so, I found I was able to achieve some quite significant performance improvements by modifying a couple of key functions to operate on scalars rather than numpy arrays, with numba.vectorize used to vectorise them. I was rather surprised by this. It got me wondering whether the design philosophy I have adopted for atmos, of trying to make all the functions vectorised, is actually not the best approach. Perhaps it would be better to have functions like parcel.parcel_ascent work on 1D vertical profiles and then use numba.guvectorize to vectorise across the other array dimensions. Making such a change could be a lot of work, however, so I don't want to go down this route unless I have confidence that it would lead to significant performance gains. Do you have any thoughts on this? Is Python code compiled and vectorised using numba likely to be significantly faster than pure Python code using numpy arrays?
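To illustrate the sort of restructuring I have in mind, here is a toy sketch (made-up physics, nothing like the real parcel_ascent) of a function that works on a single profile and is broadcast over the other dimensions with numba.guvectorize:

import numpy as np
from numba import guvectorize

# Toy stand-in for a profile calculation: takes 1D vertical profiles and
# returns a single value per column. The layout string "(n),(n)->()" tells
# numba to loop over all leading dimensions and pass the last ('level')
# dimension in as 1D arrays.
@guvectorize(["void(float64[:], float64[:], float64[:])"], "(n),(n)->()")
def toy_cape(env_temperature, parcel_temperature, out):
    total = 0.0
    for k in range(env_temperature.shape[0]):
        diff = parcel_temperature[k] - env_temperature[k]
        if diff > 0.0:
            total += diff
    out[0] = total

# Arrays with the vertical dimension last; the calculation is broadcast
# over the leading (time, lat, lon) dimensions automatically
t_env = np.random.rand(24, 50, 60, 40)
t_parcel = t_env + np.random.rand(24, 50, 60, 40)

cape = toy_cape(t_env, t_parcel)   # shape (24, 50, 60)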

Given the reasonable performance I am getting now with numpy arrays, I am wondering whether I need to use xarray/dask at all. I could instead use Python's multiprocessing library to parallelise over time (e.g., over days or months), although I note that I have also run into issues with hanging when using multiprocessing in the past (particularly in Jupyter notebooks). Regardless, it would be good to get a better understanding of where I am currently going wrong with xarray so that I can make better use of it (and dask) for parallelisation in the future. With this in mind, I would definitely like to take you up on the offer of chatting further via Zoom. Unfortunately, the rest of this week is looking pretty hectic, so it would probably have to be sometime next week. I'll contact you directly to try and arrange a suitable time.
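For reference, the multiprocessing approach I have in mind is roughly the following (an untested sketch with a made-up worker function, not actual atmos code):

import multiprocessing as mp
import numpy as np

def process_one_time(args):
    # Hypothetical worker: run the atmos calculations on a single time slice
    i, temperature, pressure = args
    # ... call the relevant atmos functions here ...
    return i, temperature.mean(axis=0)   # placeholder result

if __name__ == "__main__":
    # Fake data: time x level x lat x lon
    temperature = np.random.rand(30, 40, 50, 60)
    pressure = np.random.rand(30, 40, 50, 60)

    tasks = [(i, temperature[i], pressure[i]) for i in range(temperature.shape[0])]

    # The __main__ guard matters with the 'spawn' start method
    # (and is part of why multiprocessing misbehaves in notebooks)
    with mp.Pool(processes=8) as pool:
        results = pool.map(process_one_time, tasks)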

Cheers,
Rob


Hi @robwarren:

It’s great to hear that you’re working on this stuff! Since it sounds like you’re making some progress and @jemmajeffree has some useful insights, I’m going to mark this topic with the community-help tag, for now. If you have any additional questions, or need any specific help that ACCESS-NRI might be able to provide, please reach out again by making a new topic!

All the best,

Lawrence


Yeah, me too. I wonder if it's a memory thing or actually a CPU-time thing. If by "operate on scalars rather than numpy arrays" you mean having a variable x = 5 and relying on broadcasting, rather than having x be a big numpy array full of 5s, then that makes sense.
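i.e. something like this toy comparison (my own made-up example):

import numpy as np

arr = np.random.rand(1000, 1000)

x = 5.0                        # scalar, broadcast when the operation runs
y = np.full(arr.shape, 5.0)    # a big array of 5s held in memory (~8 MB here)

out1 = arr * x   # no extra large array needed for x
out2 = arr * y   # same result, but y had to be allocated and filled first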

Agreed, my gut feel is that you wouldn't get too much out of it, but then I haven't used numba to vectorise, and I was surprised by the result above, so maybe we don't trust my gut here.

I can't see why it would be. I'd be curious whether you could work back from the faster numba version of your one test case to produce equally fast numpy code, perhaps using numpy functions you didn't realise existed until the translation forced your hand.

xarray/dask help with labelling (which can make code easier to read) and with too-big-to-fit-into-memory data (if you need to deal with subsets). Otherwise, you can do whatever you want to do with numpy or other libraries.

Agreed, there are some situations in which it's a great help.

Numba will JIT compile the entire function, so I think it should have fewer interactions with the Python interpreter, and will in general be faster.

I think the way to think of this is that the entire function that has been decorated with @numba.vectorize is now akin to a single numpy ufunc.

So something like

import numpy as np

x = np.arange(-1, 1, 0.001)
y = np.arange(-1, 1, 0.001)

r = np.sqrt(np.sin(x)**2 + np.cos(y)**2)

(ignore the nonsense maths, my brain is fried right now!)
has to interact with the Python interpreter on several separate occasions: to call np.sqrt, np.sin, np.power (which is what **2 calls under the hood), np.cos, and np.power again.

If you define something like that as a simple function, i.e.

import math
import numba
import numpy as np

@numba.vectorize
def radius(x, y):
    # math.sqrt/sin/cos work on scalars inside the compiled function
    return math.sqrt(math.sin(x)**2 + math.cos(y)**2)

x = np.arange(-1, 1, 0.001)
y = np.arange(-1, 1, 0.001)

r = radius(x, y)

there is only one interaction with the Python interpreter: the whole function has been compiled, so the only interpreter interaction is the function call itself.

In general, I think vectorising with numba will be faster than using numpy, and it will probably scale with the number of np.made_up_numpy_function calls.


Whoaaaaaa! Thank you for the new insight

Thanks for the insight, Charles. This is consistent with some stuff I was reading online about numba last night (in particular, this series of blog posts). It sounds like it may very well be worth me trying to make more extensive use of numba in atmos. Hopefully I'll have some time next week to start working on the changes.

You’re welcome!

One crucial thing I conveniently forgot is that there is an overhead associated with JIT compiling code (there's a quick timing sketch of this below). So generally, it is advised to only JIT compile a function if it:

  1. Will be called a lot (i.e. you're gonna call it hundreds of times). In this case, the saving per call multiplied by the number of calls outweighs the one-off cost of compiling.
  2. Takes a while to run. In this case, the time saving per call can outweigh the compilation overhead.
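Here's a toy sketch of what that overhead looks like in practice: the first call includes compilation, subsequent calls run the already-compiled code (exact timings will obviously vary):

import time
import numpy as np
from numba import njit

@njit
def total(x):
    s = 0.0
    for v in x:
        s += v
    return s

x = np.random.rand(1_000_000)

t0 = time.perf_counter()
total(x)                    # first call: JIT compilation + execution
t1 = time.perf_counter()
total(x)                    # second call: runs the compiled code
t2 = time.perf_counter()

print(f"first call  (compile + run): {t1 - t0:.3f} s")
print(f"second call (run only):      {t2 - t1:.6f} s")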

If you ever play around with writing Julia code, you might experience the feeling of "they told me this language is fast, why is everything so slow?!?". Julia performs this JIT compilation by default (in fact, I don't think it can be disabled; it's part of the language design), and it leads to something the Julia community tends to call the 'time to first plot' problem. It's essentially the same issue, and worth a bit of googling if you're really interested in JIT compilation.


One more tidbit before I go back to what I should be doing - numba also exposes an @cuda.jit decorator, which will let you hardware accelerate calculations using a GPU.
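As a flavour of what that looks like, here's a minimal sketch (it needs a CUDA-capable GPU, and the kernel is just toy arithmetic):

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    # Each GPU thread handles one element of the arrays
    i = cuda.grid(1)
    if i < x.size:
        out[i] = x[i] + y[i]

x = np.arange(1_000_000, dtype=np.float32)
y = 2.0 * x
out = np.zeros_like(x)

threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block

# numba copies the numpy arrays to and from the GPU around the launch
add_kernel[blocks, threads_per_block](x, y, out)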

I've been meaning to look into whether:
a. it's possible to set up an ARE session with a GPU, and
b. GPU acceleration is actually better than regular JIT-compiled code targeting the CPU for climate-y workflows.

I went to a fantastic tutorial at Scipy this year on this stuff: video recording of the tutorial here. I think there should be a github repo with the tutorial content somewhere too - I just can’t find it right now.

If you do decide to take a deep(er) dive into this stuff, please let me know what you find! Also happy to try to lend some technical expertise (if I have it, no guarantees there though :sweat_smile:) on the hardware acceleration side - I think it’s something we ought to be exploring.


Hi Rob, hi all,

This is a very similar problem to the one I was trying to address with my xarray_parcel library (GitHub - traupach/xarray_parcel: xarray-enabled atmospheric air parcel calculations). This library is a rewrite of MetPy's 1D parcel calculations to allow for parallel processing over chunked xarray data. It's not perfect, but it does achieve a significant speed-up over the MetPy functions, and it uses apply_ufunc for the parallel bits. The slow part is calculating adiabats, for which I used a slightly-less-exact lookup table, but I think you have a nice analytical solution to that problem. I also experimented with numba but didn't manage to get it to significantly speed things up.

Rob I know you’re aware of xarray_parcel so you may have already looked and found the techniques I used not to apply to your situation – but in any case I’ll be interested to follow on with the speed-ups you manage with Jemma and Charles’ insights.

Cheers,
Tim
