Notebook that takes too long

JuliaN · 19 January 2023 23:13

Hi! We’ve all talked about how to improve calculations that take ages, but I don’t think I’ve seen examples of that. So, here is a bit of my code that calculates time series of bottom speeds within a region for a handful of ~15yr long experiments (mom6-panan at 0.1deg resolution). I’m using monthly data, and so far I haven’t been able to complete the calculation. I’m working with 16ncpus.

gist.github.com

https://gist.github.com/julia-neme/fd43107b19a0c891daeffa9e10b66bcb

bottom_speed_timeseries

import cmocean
import cosima_cookbook.distributed as ccd
import dask.distributed as dsk
import glob
import matplotlib.gridspec as gs
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

import warnings # ignore these warnings

This file has been truncated. show original

edoddridge · 20 January 2023 00:42

Are you sure that the final calculation is actually happening in parallel?

You’re calling a NumPy function and passing the Dask array to it.

bottomsp[exp] = np.sqrt(u**2 + v**2)

JuliaN · 20 January 2023 00:47

The bottleneck is before that, when its loading u and v. After that the np.sqrt() takes nothing

Thomas-Moore · 20 January 2023 01:11

I don’t have a feel for the size of this data / problem but those .load() steps at line 38 & 40 might be where you hit memory issues?

A few other random thoughts that might be totally ignorant of what you are trying to accomplish or the context of your situation:

Maybe there is a good reason you are choosing to load the u & v objects into memory and then run a final calculation using NumPy?
Depending on data sizes and the spare storage you have a brute force option might be to write out & read back in temporary intermediate files ( and try Zarr? ) after you do some of the data reduction? I think @dougiesquire 's xstore might be useful there?
lastly and completely off topic you seem to be dealing with many versions (experiments) of a similar model(s) that have some hierarchical order? Wondering if something like datatree might be interesting to you? See also: Easy IPCC Part 1: Multi-Model Datatree | by Tom Nicholas | pangeo | Medium

JuliaN · 20 January 2023 01:40

My purpose is to plot the bottom speeds, so at some point stuff needs to be loaded into memory. I chose to load u & v because I was told it is best to load as soon as possible, because if you keep piling up xarray operations it would take forever. u & v are 3d, size [17x1000x800] approximately.

I don’t think its a memory issue because I am not running out of memory at all. Moreover, when complete, each bottom speed for each experiment is just 8.6kb. Its just the computation that is taking aaaaages. Anyways, maybe its not a good example to improve code. Thank you both

Thomas-Moore · 20 January 2023 02:02

@JuliaN - I don’t think there is anything wrong with your example or the idea of trying to speed it up / get it to complete the calculation. Apologies if my random thoughts are less than helpful. They are meant - of course - very constructively.

Another silly question: is this dataset available on NCI if others wanted to troubleshoot the code?

JuliaN · 20 January 2023 02:14

Yeah of course! Not at all, I liked you suggestions One of the experiments is at /g/data/ik11/outputs/mom6-panan/panant-01-zstar-v13

Thomas-Moore · 20 January 2023 02:40

Many have struggled with that kind of MOM output file structure before
yikes!

I don’t know how many times you’ll be running calculations on such an experiment but if you have the spare storage it might be worth making a single temporary zarr collection for all those *.ocean_month_z.nc files. Looks like 272 x 5.4GB, so roughly a 1.5TB collection before the compression you’ll get (maybe 50%?).

dougiesquire · 20 January 2023 03:01

Hi @JuliaN. There’s no obvious reason why your calculation shouldn’t be possible to parallelise efficiently with dask. I haven’t actually looked at your data, but a couple of thoughts/questions:

Chunking strategy is really important with large calculations like these. By default open_mfdataset will open one file per chunk, which is possibly not well suited for your calculations. Something easy to try as a first pass: does adding chunks=“auto” to your open_mfdataset() call make any noticeable difference?
Are you using the dask dashboard? If not, this could be helpful. If so, do you see calculations start and then take ages, or do calculations take ages to start when you do the load() steps? The latter could indicate a very large/complex dask graph, which can possibly be simplified with appropriate chunking and some tweaks to your code.
Relatedly, it looks like you’re using groupby to calculate the annual means. I think using resample or coarsen should give you a “cleaner” dask graph.

Paola-CMS · 20 January 2023 03:03

Adding on:

You’re grouping by and calculating a mean on time when opening the data:

data = xr.open_mfdataset(files, parallel = True, preprocess = preprocess).groupby(‘time.year’).mean(‘time’)

after you select the region in the preprocessing but before you’re selecting the depth
where(depth_array[‘z_l’] >= max_depth)
and before selecting the variables you actually need, i.e. you’re not using theta but you might still be calculating its mean.

time chunking is also very unfavourable for that kind of operation:

uo:_ChunkSizes = 1, 11, 121, 515 ;

And as Thomas suggested if you need to use the mean often it might be worth to just save it as in intermediary file, and then reload it. So you do that expensive operation only once.

Paola

JuliaN · 20 January 2023 03:09

Ohhh great! I’ll try these suggestions, seems like a lot of improvement Thanks!!

dougiesquire · 20 January 2023 03:26

Incidentally, dask arrays now mostly support NEP-18 dispatching, meaning that you can call many numpy functions on dask arrays and they’ll return a dask array.

Also, depending on how far you get, pulling all the tidbits in this thread into an optimized workflow could be a task for the Cosima hackathon.

JuliaN · 22 January 2023 22:47

All of your suggestions worked and it is now quite faster particularly filtering variables out with the preprocessing! Thanks everyone

Topic		Replies	Views
Latest version of dask (2022.11.0) could fix many workflow issues Technical python , dask	6	432	5 January 2023
Cosima cookbook updating needs Workshops cosima-workshop-2022	7	501	14 November 2022
Improving performance of a Python-based function General python , help , vectorisation	13	394	29 October 2023
Vertically integrated OHC with ACCESS-OM2-01 -- how to handle a lot of 3D data COSIMA dask , xarray , access-om2	16	302	2 June 2023
How to efficiently chunk data for faster processing and plotting? Technical python , cosima , access-om2	5	234	29 September 2024

Notebook that takes too long

Related topics