Hi! We’ve all talked about how to improve calculations that take ages, but I don’t think I’ve seen examples of that. So, here is a bit of my code that calculates time series of bottom speeds within a region for a handful of ~15yr long experiments (mom6-panan at 0.1deg resolution). I’m using monthly data, and so far I haven’t been able to complete the calculation. I’m working with 16ncpus.
Are you sure that the final calculation is actually happening in parallel?
You’re calling a NumPy function and passing the Dask array to it.
bottomsp[exp] = np.sqrt(u**2 + v**2)
The bottleneck is before that, when it’s loading u and v. After that, the np.sqrt() takes no time at all.
I don’t have a feel for the size of this data / problem, but those .load() steps at lines 38 & 40 might be where you hit memory issues?
A few other random thoughts that might be totally ignorant of what you are trying to accomplish or the context of your situation:
Maybe there is a good reason you are choosing to load the u & v objects into memory and then run a final calculation using NumPy?
Depending on data sizes and the spare storage you have, a brute-force option might be to write out and read back in temporary intermediate files (and try Zarr?) after you do some of the data reduction. I think @dougiesquire’s xstore might be useful there.
Lastly, and completely off topic: you seem to be dealing with many versions (experiments) of a similar model that have some hierarchical order? I’m wondering if something like datatree might be interesting to you. See also: Easy IPCC Part 1: Multi-Model Datatree | by Tom Nicholas | pangeo | Medium
My purpose is to plot the bottom speeds, so at some point stuff needs to be loaded into memory. I chose to load u & v because I was told it is best to load as soon as possible, because if you keep piling up xarray operations it would take forever. u & v are 3d, size [17x1000x800] approximately.
I don’t think it’s a memory issue because I am not running out of memory at all. Moreover, when complete, the bottom speed for each experiment is just 8.6 kB. It’s just the computation that is taking aaaaages. Anyway, maybe it’s not a good example for improving code. Thank you both.
@JuliaN - I don’t think there is anything wrong with your example or the idea of trying to speed it up / get it to complete the calculation. Apologies if my random thoughts are less than helpful. They are meant - of course - very constructively.
Another silly question: is this dataset available on NCI if others wanted to troubleshoot the code?
Yeah of course! Not at all, I liked your suggestions. One of the experiments is at /g/data/ik11/outputs/mom6-panan/panant-01-zstar-v13
Many have struggled with that kind of MOM output file structure before
I don’t know how many times you’ll be running calculations on such an experiment, but if you have the spare storage it might be worth making a single temporary zarr collection for all those *.ocean_month_z.nc files. Looks like 272 x 5.4GB, so roughly a 1.5TB collection before the compression you’ll get (maybe 50%?).
Hi @JuliaN. There’s no obvious reason why your calculation shouldn’t be possible to parallelise efficiently with dask. I haven’t actually looked at your data, but a couple of thoughts/questions:
Chunking strategy is really important with large calculations like these. By default open_mfdataset will use one chunk per file, which is possibly not well suited for your calculations. Something easy to try as a first pass: does adding an explicit chunks argument to your open_mfdataset() call make any noticeable difference?
Are you using the dask dashboard? If not, this could be helpful. If so, do you see calculations start and then take ages, or do calculations take ages to start when you do the load() steps? The latter could indicate a very large/complex dask graph, which can possibly be simplified with appropriate chunking and some tweaks to your code.
Relatedly, it looks like you’re using groupby to calculate the annual means. I think using coarsen should give you a “cleaner” dask graph.
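To make the groupby/coarsen comparison concrete, here is a minimal sketch on synthetic data. It assumes monthly output that starts in January and covers complete years, so that blocks of 12 consecutive time steps line up with calendar years; the variable and dimension names (`speed`, `yh`, `xh`) are made up for illustration. Both versions take an unweighted mean over the months.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for monthly model output: 3 complete years from January.
# Names (speed, yh, xh) are illustrative assumptions, not from the real files.
time = pd.date_range("2000-01-01", periods=36, freq="MS")
speed = xr.DataArray(
    np.random.default_rng(0).random((36, 4, 5)),
    dims=("time", "yh", "xh"),
    coords={"time": time},
    name="speed",
)

# groupby: one group per calendar year, found by inspecting the time labels.
annual_groupby = speed.groupby("time.year").mean("time")
annual_groupby = annual_groupby.transpose("year", "yh", "xh")

# coarsen: average fixed blocks of 12 consecutive time steps.
annual_coarsen = speed.coarsen(time=12).mean()

# With complete January-to-December years the two approaches agree.
same = np.allclose(annual_groupby.values, annual_coarsen.values)
print(same)
```

Note that coarsen only matches calendar years when the record is aligned like this; if a file set starts mid-year, the blocks would straddle two years.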
You’re grouping by and calculating a mean on time when opening the data:
data = xr.open_mfdataset(files, parallel=True, preprocess=preprocess).groupby('time.year').mean('time')
after you select the region in the preprocessing but before you’re selecting the depth
where(depth_array['z_l'] >= max_depth)
and before selecting the variables you actually need, i.e. you’re not using theta but you might still be calculating its mean.
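A rough sketch of what dropping unused variables in the preprocessing step could look like, run here on a small synthetic dataset instead of the real files. The names `vo`, `thetao`, `yh`, `xh` and the region bounds are assumptions for illustration (only `uo` and `z_l` appear in this thread), and the real velocities may sit on staggered grid coordinates, which this ignores.

```python
import numpy as np
import pandas as pd
import xarray as xr


def preprocess(ds):
    """Keep only the velocities and the region of interest.

    Variable names and region bounds are hypothetical placeholders.
    """
    ds = ds[["uo", "vo"]]  # drop everything that is not needed, e.g. temperature
    return ds.sel(yh=slice(-80, -60), xh=slice(-100, -40))


# Synthetic stand-in for one opened file: 2 years of monthly data.
time = pd.date_range("2000-01-01", periods=24, freq="MS")
shape = (24, 3, 6, 8)
rng = np.random.default_rng(0)
one_file = xr.Dataset(
    {
        name: (("time", "z_l", "yh", "xh"), rng.random(shape))
        for name in ("uo", "vo", "thetao")
    },
    coords={
        "time": time,
        "z_l": [10.0, 100.0, 1000.0],
        "yh": np.linspace(-85, -50, 6),
        "xh": np.linspace(-120, -20, 8),
    },
)

# With real files this function would be passed as
# xr.open_mfdataset(files, parallel=True, preprocess=preprocess).
reduced = preprocess(one_file)

# The annual mean now only touches uo and vo inside the selected region.
annual = reduced.groupby("time.year").mean("time")
print(list(annual.data_vars), dict(annual.sizes))
```

The point is only the ordering: select variables and region first, so the expensive time mean is never computed for data that gets thrown away.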
Time chunking is also very unfavourable for that kind of operation:
uo:_ChunkSizes = 1, 11, 121, 515 ;
And as Thomas suggested, if you need to use the mean often it might be worth just saving it as an intermediate file and then reloading it, so you do that expensive operation only once.
Ohhh great! I’ll try these suggestions, seems like a lot of improvement. Thanks!!
Incidentally, dask arrays now mostly support NEP-18 dispatching, meaning that you can call many numpy functions on dask arrays and they’ll return a dask array.
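A small sketch of what that looks like in practice, on random dask arrays whose shape and chunks are arbitrary choices for the example. Strictly speaking, ufuncs such as np.sqrt go through the related `__array_ufunc__` protocol, while functions such as np.mean use the NEP-18 `__array_function__` protocol; dask implements both, so both calls stay lazy.

```python
import dask.array as da
import numpy as np

# Lazy stand-ins for u and v; shape and chunking are arbitrary for the example.
u = da.random.random((17, 100, 80), chunks=(1, 100, 80))
v = da.random.random((17, 100, 80), chunks=(1, 100, 80))

# np.sqrt is a ufunc: dask intercepts it and returns another dask array.
speed = np.sqrt(u**2 + v**2)
speed_is_lazy = isinstance(speed, da.Array)

# np.mean is dispatched via NEP-18 and also returns a dask array.
mean_speed = np.mean(speed, axis=(1, 2))
mean_is_lazy = isinstance(mean_speed, da.Array)

# Nothing has been computed until this point.
result = mean_speed.compute()
print(speed_is_lazy, mean_is_lazy, result.shape)
```

So calling a NumPy function on a dask array does not by itself force the data into memory; that only happens on compute() or load().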
Also, depending on how far you get, pulling all the tidbits in this thread into an optimized workflow could be a task for the Cosima hackathon.
All of your suggestions worked and it is now much faster, particularly filtering variables out with the preprocessing! Thanks everyone.