Latest version of dask (2022.11.0) could fix many workflow issues

The latest version of dask (2022.11.0) uses a new scheduling mode by default that can significantly reduce memory usage in many typical geoscience workflows. The new default tries to address the problem of “root task overproduction”, where workers load the initial chunks of data faster than downstream tasks can process and release them. If you’ve ever inexplicably run out of memory while computing a climatology, for example, you were probably experiencing root task overproduction. Details here: Reducing memory usage in Dask workloads by 80%.

It would be great to explore/benchmark any improvements from this to COSIMA workflows as part of the COSIMA hackathon.
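For benchmarking, it may be simpler to toggle the new behaviour within a single dask version than to compare two environments. Below is a minimal sketch of that idea. It assumes the setting is exposed as the config key `distributed.scheduler.worker-saturation`, with `1.1` as the new default and `inf` restoring the old behaviour; please check those values against the dask release notes before relying on them. The helper `scheduling_config` is a hypothetical name used only for illustration.

```python
import importlib.util

# Assumption: key and values as described in the dask release notes.
# 1.1 is taken to be the new default (queuing on); "inf" the old behaviour.
SATURATION_KEY = "distributed.scheduler.worker-saturation"


def scheduling_config(queuing: bool = True) -> dict:
    """Return a dask config dict that turns scheduler-side queuing on or off."""
    return {SATURATION_KEY: 1.1 if queuing else "inf"}


# Only touch dask's config if dask is importable in this environment.
if importlib.util.find_spec("dask") is not None:
    import dask

    # Set this before creating the cluster/client so the scheduler picks it up.
    dask.config.set(scheduling_config(queuing=False))
    print(dask.config.get(SATURATION_KEY))
```

Running the same workflow once with `queuing=True` and once with `queuing=False`, each on a freshly started cluster, should give a like-for-like comparison of memory use and runtime.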


Nice! I’d be really interested to see how this pans out. It sounds like there’s the potential for some slight slowdown as a tradeoff for the reduction in memory usage, but also that it can speed things up by not spilling to disk.

Good idea. I like the idea of benchmarking the versions to see how that plays out.

I have opened an issue to get this installed in the hh5 conda environment.

Latest dask version is now available in conda/analysis3-unstable:

$ conda list dask
# packages in environment at /g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.07:
# Name                    Version                   Build  Channel
dask                      2022.11.1          pyhd8ed1ab_0    conda-forge
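To confirm from inside a notebook or script that the loaded environment really has the new default, a version check along these lines might help. This is only a sketch: `parse_calver` and `has_new_scheduling` are hypothetical helper names, and the threshold of 2022.11.0 is taken from the first post in this thread.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# First release with the new scheduling default, per the post above.
MIN_VERSION = (2022, 11, 0)


def parse_calver(v: str) -> tuple:
    """Parse a calendar version such as '2022.11.1' into (2022, 11, 1)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", v)
    if match is None:
        raise ValueError(f"unrecognised version string: {v!r}")
    return tuple(int(group) for group in match.groups())


def has_new_scheduling(v: str) -> bool:
    """Return True if version string v is at least MIN_VERSION."""
    return parse_calver(v) >= MIN_VERSION


try:
    installed = version("dask")
    print(f"dask {installed}: new default expected = {has_new_scheduling(installed)}")
except PackageNotFoundError:
    print("dask is not installed in this environment")
```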

OK, big data analysts (@navidcy @adele157 @AndyHoggANU), give it your best shot!

cc: @claireyung and @polinash

Here’s a blog post on how this development came about and what we can learn from it: “Dask.distributed and Pangeo: Better performance for everyone thanks to science / software collaboration” by Tom Nicholas (Pangeo blog on Medium, January 2023).
