I have written a Python library called atmos for performing thermodynamic calculations and deriving various convective parameters. The code works with numpy arrays and is designed to perform calculations in a vectorised manner. In particular, it permits vectorised calculation of quantities such as CAPE and CIN. Other similar libraries such as MetPy can only perform these calculations on 1D vertical profiles rather than multidimensional arrays (e.g., metpy.calc.cape_cin). Vectorisation of these calculations in atmos is achieved by making extensive use of Boolean indexing (see for example the function parcel.parcel_ascent).
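As a rough illustration of the style (this is a simplified, made-up example, not the actual atmos code), a calculation treats different grid points differently via a mask rather than looping over individual profiles:
import numpy as np

# Made-up (time, level, lat, lon) data; not real atmos variables
temperature = np.random.uniform(250, 310, size=(4, 10, 20, 30))
latent_heat = np.full(temperature.shape, 2.501e6)   # latent heat of vaporisation (J/kg)
below_freezing = temperature < 273.15               # boolean mask, same shape as temperature
latent_heat[below_freezing] = 2.834e6               # switch to sublimation only at the masked points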
The problem
I want to apply atmos to some reanalysis and climate model output, which comprises large four-dimensional (time x level x latitude x longitude) arrays. To make the calculations performant, I was hoping to use xarray/dask to split the data into chunks and process these in parallel. Based on the advice of a colleague, I have been attempting to do this using xarray's apply_ufunc method. However, I'm struggling to get any performance benefit and am often running into issues where Python / Jupyter just hangs indefinitely. Historically, I've never had much success with using xarray for parallel processing, so I have tended to just revert to using numpy arrays with multiprocessing. However, the arrays I'm working with here are so big that I'm not sure this is feasible. Processing large datasets using xarray always seems to work well in the tutorials (such as this recent one from the ACCESS-NRI team) and it would be great to be able to utilise this functionality in my own work.
Next steps
Having spent the last few days struggling with this, I figured it was time to ask for some help. If anyone has any advice on the use of xr.apply_ufunc, or more generally on parallelising numpy-enabled functions, I would be extremely grateful. Even better, if someone has time, it would be great to talk through the code via video call, as I'm sure this would lead to a more efficient diagnosis of the issues.
If I understand correctly, your existing code is entirely vectorised and works with numpy arrays, but you would like to apply it to larger datasets that don't fit into memory? If so, dask is a good candidate to help.
Given the code is already vectorised, we can probably get it to work with dask/xarray with minimal changes. For example, something like np.mean(x, axis=2) or z = x**2 will work whether x is a numpy array (your current code) or an xarray DataArray, including parallelisation. Typically, the biggest problem with this kind of conversion is chunk sizes, although the boolean indexing you mention might mean that the full boolean index is loaded into memory (even though it might be too big), and in your parcel.parcel_ascent function the np.zeros lines could cause similar problems. So you might have to comb through and make sure nothing's secretly using 'eager' (non-lazy, loaded-into-memory) arrays.
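To make the eager-versus-lazy distinction concrete, here's a made-up sketch (none of these variable names are from atmos):
import numpy as np
import xarray as xr
import dask.array as da

# A chunked (dask-backed) DataArray standing in for your reanalysis data
temperature = xr.DataArray(
    da.random.random((8, 10, 20, 30), chunks=(1, 10, 20, 30)),
    dims=("time", "level", "lat", "lon"),
)

# Eager: np.zeros builds the full-size array in memory straight away
eager_buffer = np.zeros(temperature.shape)

# Lazy: stays chunked and is only materialised when you call .compute()
lazy_buffer = xr.zeros_like(temperature)

# Vectorised numpy-style operations also stay lazy on a dask-backed DataArray
anomaly = (temperature - temperature.mean("time")).compute()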
The other approach is, as you say, to use xarray's apply_ufunc to wrap around your function that uses numpy arrays. This example in the COSIMA recipes is, I think, the same style of thing you're trying to do, so might be helpful?
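Very roughly, the pattern looks like this; cape_from_profile here is a made-up placeholder for one of your numpy functions that consumes the level dimension, not anything from atmos or MetPy:
import numpy as np
import xarray as xr
import dask.array as da

def cape_from_profile(pressure, temperature):
    # Placeholder reduction over the level axis; a real function would do the
    # actual parcel calculation. apply_ufunc moves core dims to the last axis,
    # hence axis=-1.
    return np.sum(temperature * pressure, axis=-1)

pressure = xr.DataArray(
    da.random.random((8, 37, 20, 30), chunks=(1, 37, 20, 30)),
    dims=("time", "level", "lat", "lon"),
)
temperature = pressure * 30 + 270   # more placeholder data

cape = xr.apply_ufunc(
    cape_from_profile,
    pressure,
    temperature,
    input_core_dims=[["level"], ["level"]],   # the dimension the function consumes
    dask="parallelized",                      # apply per chunk, in parallel
    output_dtypes=[float],
)
result = cape.compute()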
I guess the other thing I'd say is that in situations where the data fits into memory, well-written code working with numpy arrays will generally outperform well-written code working with xarray/dask arrays; the advantage of xarray and dask comes from being able to work with datasets too big to fit into memory. So as long as you don't get a major performance hit with xarray/dask on small samples, I don't think it's a problem. We should be able to fix the hanging issue here though, which sounds (at first glance) like chunks that are either too small or too big.
I'm happy to Zoom sometime next week and talk through converting an example function; it sounds like you're most of the way there and are just getting caught up on some of the smaller technicalities. If this would help, DM me and we can find a time.
Many thanks for your prompt reply to my query. I have spent a bit of time over the last few days profiling the atmos code running directly on numpy arrays. In doing so, I found I was able to achieve some quite significant performance improvements by modifying a couple of key functions to operate on scalars rather than numpy arrays, with numba.vectorize used to vectorise them. I was rather surprised by this. It got me wondering whether the design philosophy I have adopted for atmos, of trying to make all the functions vectorised, is actually not the best approach. Perhaps it would be better to have functions like parcel.parcel_ascent work on 1D vertical profiles and then use numba.guvectorize to vectorise across the other array dimensions. Making such a change could be a lot of work, however, so I don't want to go down this route unless I have confidence that it would lead to significant performance gains. Do you have any thoughts on this? Is Python code compiled and vectorised using numba likely to be significantly faster than pure Python code using numpy arrays?
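For reference, the kind of layout I am imagining is something like the following, where column_integral is just a made-up example (not the real parcel.parcel_ascent); the inner function works on a single 1D profile and numba.guvectorize broadcasts it over the remaining dimensions:
import numpy as np
from numba import guvectorize

@guvectorize(["void(float64[:], float64[:], float64[:])"], "(n),(n)->()")
def column_integral(pressure, temperature, out):
    # Works on a single vertical profile; guvectorize loops it over the rest
    total = 0.0
    for k in range(pressure.shape[0] - 1):
        total += 0.5 * (temperature[k] + temperature[k + 1]) * (pressure[k] - pressure[k + 1])
    out[0] = total

# The profile ("level") axis must be the last axis when calling the gufunc
# (e.g. via np.moveaxis if it isn't already)
pressure = np.random.uniform(100e2, 1000e2, size=(4, 20, 30, 37))
temperature = np.random.uniform(250, 310, size=(4, 20, 30, 37))
result = column_integral(pressure, temperature)   # shape (4, 20, 30)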
Given the reasonable performance I am getting now with numpy arrays, I am wondering whether I need to use xarray / dask at all. I could instead use Python's multiprocessing library to parallelise over time (e.g., over days or months), although I note that I have also run into issues with hanging when using multiprocessing in the past (particularly in Jupyter notebooks). Regardless, it would be good to get a better understanding of where I am currently going wrong with xarray so that I can make better use of it (and dask) for parallelisation in the future. With this in mind, I would definitely like to take you up on the offer of chatting further via Zoom. Unfortunately, the rest of this week is looking pretty hectic, so it would probably have to be sometime next week. I'll contact you directly to try and arrange a suitable time.
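The multiprocessing pattern I have in mind is roughly the following (compute_cape is just a stand-in for the real atmos call, and I would run it as a script rather than in a notebook, given the hanging issues I mentioned):
import numpy as np
from multiprocessing import Pool

def compute_cape(temperature_slice):
    # Stand-in for the real atmos calculation on one (level, lat, lon) slice
    return temperature_slice.max(axis=0) - temperature_slice.min(axis=0)

def process_one_time(args):
    t, temperature_slice = args
    return t, compute_cape(temperature_slice)

if __name__ == "__main__":
    temperature = np.random.uniform(250, 310, size=(8, 37, 20, 30))  # (time, level, lat, lon)
    tasks = [(t, temperature[t]) for t in range(temperature.shape[0])]
    with Pool(processes=4) as pool:
        results = dict(pool.map(process_one_time, tasks))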
It's great to hear that you're working on this stuff! Since it sounds like you're making some progress and @jemmajeffree has some useful insights, I'm going to mark this topic with the community-help tag for now. If you have any additional questions, or need any specific help that ACCESS-NRI might be able to provide, please reach out again by making a new topic!
Yeah, me too. I wonder if it's a memory thing or actually a CPU-time thing. If by 'operate on scalars rather than numpy arrays' you mean having a variable x = 5 and broadcasting, instead of having a variable x set to some big numpy array of 5s, then that makes sense.
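i.e. something like the first of these two (made-up) versions, which never materialises the big array of 5s:
import numpy as np

data = np.random.random((1000, 1000))

result_broadcast = data + 5.0                 # the scalar is broadcast; no extra big array
offset_array = np.full(data.shape, 5.0)       # a full-size array of 5s held in memory
result_materialised = data + offset_array     # same values, more memory used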
Agreed, my gut feeling is that you wouldn't get too much out of it, but then I haven't used numba to vectorise, and I was surprised by the result above, so maybe we don't trust my gut here.
I can't see why it would be. I'd be curious whether you could work back from the faster code in your one test case to produce equally fast code with numpy, using numpy functions you didn't realise existed until the translation forced your hand.
xarray/dask help with labelling (which can make code easier to read) and with too-big-to-fit-into-memory data (if you need to deal with subsets). Otherwise, you can do whatever you wanted to do with numpy or other libraries.
Agreed, there are some situations in which it's a great help.
Numba will JIT-compile the entire function, so I think it should have fewer interactions with the Python interpreter, and will in general be faster.
I think the way to think of this is that the entire function that has been decorated with @numba.vectorize is now akin to a single numpy ufunc.
So something like
import numpy as np

x = np.arange(-1, 1, 0.001)
y = np.arange(-1, 1, 0.001)
r = np.sqrt(np.sin(x)**2 + np.cos(y)**2)
(ignore the nonsense maths, my brain is fried right now!)
has to interact with the Python interpreter on several separate occasions: to call np.sqrt, np.sin, np.power (that's what **2 does on an array), and then np.cos and np.power again.
If you define something like that as a simple function, i.e.
import numba
import numpy as np

@numba.vectorize
def radius(x, y):
    return np.sqrt(np.sin(x)**2 + np.cos(y)**2)

x = np.arange(-1, 1, 0.001)
y = np.arange(-1, 1, 0.001)
radius(x, y)
there is only one interaction with the Python interpreter: the vectorised function has all been compiled, so the single interpreter interaction is the function call itself.
In general, I think vectorising with numba will be faster than using numpy, and it will probably scale with the number of np.made_up_numpy_function calls.
Thanks for the insight, Charles. This is consistent with some stuff I was reading about numba online last night (in particular, this series of blog posts). It sounds like it may very well be worth me trying to make more extensive use of numba in atmos. Hopefully I'll have some time next week to start working on the changes.
One crucial thing I conveniently forgot is that there is an overhead associated with JIT compiling code. So generally, it is advised to only JIT compile functions that you:
Plan on reusing a lot (i.e. you're going to call them hundreds of times). In this case, the savings per function call multiplied by the number of calls outweigh the one-off cost of compiling.
Expect to take a while to run. In this case, the time savings per call can outweigh the compilation overhead.
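As a quick (made-up) illustration of that overhead: the first call to a jitted function pays the compilation cost, and subsequent calls don't.
import time
import numpy as np
from numba import vectorize

@vectorize
def shifted_square(x):
    return x * x + 1.0

data = np.random.random(1_000_000)

start = time.perf_counter()
shifted_square(data)                      # first call includes JIT compilation
first_call = time.perf_counter() - start

start = time.perf_counter()
shifted_square(data)                      # already compiled, so much faster
second_call = time.perf_counter() - start

print(f"first call: {first_call:.3f} s, second call: {second_call:.3f} s")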
If you ever play around with writing Julia code, you might experience the feeling of 'they told me this language is fast, why is everything so slow?!?'. Julia performs this JIT compilation by default (in fact, I don't think it can be disabled; it's part of the language design), and it leads to something that the Julia community tends to call the 'time to first plot' problem. It's essentially the same problem, and worth a bit of googling if you're really interested in JIT compilation.
One more tidbit before I go back to what I should be doing: numba also exposes a @cuda.jit decorator, which will let you hardware-accelerate calculations using a GPU.
I've been meaning to look into whether:
a. it's possible to set up an ARE session with a GPU, and
b. GPU acceleration will be better than regular JIT-compiled code targeting the CPU for climate-y workflows.
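For a flavour of what that looks like, here is a minimal kernel (I haven't run this exact snippet, and it needs a CUDA-capable GPU with a working CUDA toolkit):
import numpy as np
from numba import cuda

@cuda.jit
def add_arrays(x, y, out):
    i = cuda.grid(1)                      # global thread index
    if i < x.size:
        out[i] = x[i] + y[i]

x = np.arange(1_000_000, dtype=np.float64)
y = 2.0 * x
out = np.zeros_like(x)

threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
add_arrays[blocks, threads_per_block](x, y, out)   # numba handles the host/device copies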
I went to a fantastic tutorial at SciPy this year on this stuff: video recording of the tutorial here. I think there should be a GitHub repo with the tutorial content somewhere too; I just can't find it right now.
If you do decide to take a deep(er) dive into this stuff, please let me know what you find! I'm also happy to try to lend some technical expertise (if I have it, no guarantees there) on the hardware acceleration side. I think it's something we ought to be exploring.
This is a very similar problem to what I was trying to address with my xarray_parcel library (GitHub - traupach/xarray_parcel: xarray-enabled atmospheric air parcel calculations). This library is a re-write of MetPy's 1D parcel calculations to allow for parallel processing over chunked xarray data. It's not perfect, but it does achieve a significant speed-up over the MetPy functions, and it uses apply_ufunc for the parallel bits. The slow part is calculating adiabats, for which I used a slightly-less-exact lookup table, but I think you have a nice analytical solution to that problem. I also experimented with numba but didn't manage to get it to significantly speed things up.
Rob, I know you're aware of xarray_parcel, so you may have already looked and found that the techniques I used don't apply to your situation; but in any case, I'll be interested to follow along with the speed-ups you manage with Jemma and Charles' insights.