@ssilvest is working on further optimizing distributed Oceananigans. We came across the following conundrum: when we distribute the model across several MPI processes, a few processes end up with only land on them (see, e.g., the schematic below for a 48x16 ranks configuration).
So the processes that correspond to regions completely over land do not contribute anything. While we don’t compute any tendencies for points on land, we still launch the corresponding MPI processes, and each of them ends up allocating a GPU (or CPU) for its region. And that GPU/CPU ends up doing no work at all!
If there were a way to avoid starting an MPI process on those regions it would help. But it seems logistically complicated, since we’d then have to skip some MPI processes, renumber the ranks, and change the way communication is done…
Question: How do we handle this in ACCESS-OM2 – if we handle it at all? Or do we just ignore it and launch MPI processes on land?
I’m afraid there’s no “simple” trick here. The solution is indeed to renumber ranks and change the way you do communications. AFAIK, this is how it’s done in both MOM5 and MOM6.
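To make the renumbering idea concrete, here is a minimal sketch in Python/mpi4py (not what MOM5/MOM6 actually do): land-only ranks are dropped into `MPI.COMM_NULL` via a communicator split, and the surviving ranks automatically get a compact 0…N-1 numbering on the new communicator. The `local_mask` helper and its random data are placeholders for whatever comes out of the bathymetry.

```python
# Minimal sketch: exclude ranks whose subdomain is all land by splitting
# COMM_WORLD, then do all model communication on the smaller "ocean"
# communicator, whose ranks are renumbered 0..n_ocean-1.
from mpi4py import MPI
import numpy as np

comm_world = MPI.COMM_WORLD
rank = comm_world.Get_rank()

# Hypothetical helper: returns this rank's slice of the ocean/land mask
# (True = ocean). In practice this would come from the bathymetry file.
def local_mask(rank):
    rng = np.random.default_rng(rank)        # placeholder data
    return rng.random((16, 16)) > 0.3

has_ocean = bool(local_mask(rank).any())

# color=0 for ranks with ocean, MPI.UNDEFINED for land-only ranks,
# which receive MPI.COMM_NULL from the split.
color = 0 if has_ocean else MPI.UNDEFINED
ocean_comm = comm_world.Split(color, key=rank)

if ocean_comm == MPI.COMM_NULL:
    # Land-only rank: skip the time-stepping loop entirely.
    # (Ideally you would not launch this rank at all.)
    pass
else:
    # All halo exchanges etc. now use ocean_comm; neighbour lookups must
    # be remapped from the grid layout to the new, compact numbering.
    print(f"world rank {rank} -> ocean rank {ocean_comm.Get_rank()}"
          f" of {ocean_comm.Get_size()}")
```

The awkward part is exactly what the question anticipates: every neighbour lookup and halo exchange has to be expressed in terms of the new numbering, which is why it touches the communication layer rather than being a one-line fix.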
But even that solution is not optimal, as you will still have large load imbalances because of the domains that intersect the coastline. To solve that, you need to allow for domains of arbitrary shape. This increases the code complexity another notch, but based on my experience I would say it’s worth it. I don’t know whether there are fluid-dynamics codes that do this, but I’ve developed a finite-differences code that does precisely that, and its parallel scalability is extremely good.
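A toy way to see the point (placeholder random mask, nothing from a real model): compare the ocean-point count per rank for a rectangular decomposition, with land-only tiles already removed, against arbitrary-shaped domains built by simply cutting the list of ocean cells into equal pieces. With a real coastline the rectangular max/mean work ratio is what eats your scalability.

```python
# Toy comparison of per-rank work: rectangular tiles vs. arbitrary-shaped
# domains made of equal numbers of ocean cells.
import numpy as np

rng = np.random.default_rng(0)
mask = rng.random((64, 128)) > 0.35          # True = ocean (placeholder)
nrx, nry = 8, 4                              # 8x4 rectangular tiles

rect = []
for j in range(nry):
    for i in range(nrx):
        tile = mask[j*64//nry:(j+1)*64//nry, i*128//nrx:(i+1)*128//nrx]
        rect.append(int(tile.sum()))
rect = [c for c in rect if c > 0]            # land-only tiles already dropped

ocean_cells = np.argwhere(mask)              # arbitrary-shaped: equal-size cuts
chunks = np.array_split(ocean_cells, len(rect))
arb = [len(c) for c in chunks]

for name, w in [("rectangular", rect), ("arbitrary-shaped", arb)]:
    print(f"{name:17s} max/mean work = {max(w) / (sum(w)/len(w)):.2f}")
```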
The FMS-based models, i.e. ACCESS-OM2 and MOM6, rely on land-mask preprocessing to avoid starting a process whose domain would lie entirely on land. There’s a lot of (probably significantly overcomplicated) logic within FMS itself to then handle the domain connectivity.
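For anyone curious what that preprocessing amounts to, here is a rough sketch of the step the FMS `check_mask` tool performs: scan the tiles of the chosen layout, find the ones with no wet points, and write them to a mask table so the model can be launched on fewer PEs. The depth-file name and layout below are placeholders, and the exact `mask_table` format is from memory, so treat the details as approximate.

```python
# Sketch of FMS-style land-mask preprocessing: list land-only tiles of the
# layout so those PEs are never started.
import numpy as np

depth = np.load("topog_depth.npy")          # hypothetical 2D depth array, 0 = land
nyg, nxg = depth.shape
layout = (48, 16)                           # tiles in x, y (as in the schematic above)

masked = []
for j in range(layout[1]):
    for i in range(layout[0]):
        tile = depth[j*nyg//layout[1]:(j+1)*nyg//layout[1],
                     i*nxg//layout[0]:(i+1)*nxg//layout[0]]
        if not (tile > 0).any():            # no wet points in this tile
            masked.append((i + 1, j + 1))   # FMS tile indices are 1-based

with open("mask_table", "w") as f:          # approximate mask_table layout
    f.write(f"{len(masked)}\n")
    f.write(f"{layout[0]},{layout[1]}\n")
    for i, j in masked:
        f.write(f"{i},{j}\n")

print(f"{len(masked)} of {layout[0]*layout[1]} tiles are land-only")
```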
There are indeed unstructured-mesh models, like FVCOM and MPAS-Ocean. I also think that would have been a potentially nicer way to go about things, especially given the wealth of libraries for handling domain decomposition on unstructured meshes, etc. I do wonder how it would impact the ease and efficiency of analysis, however…
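As a toy illustration of why the unstructured route sidesteps the land problem: land cells never enter the mesh graph, so every partition a graph partitioner produces is ocean-only by construction. The sketch below uses networkx’s Kernighan–Lin bisection as a stand-in for a real partitioner (METIS, Zoltan, …), and the mask is a random placeholder.

```python
# Build a toy dual graph of ocean cells only, then bisect it. Land cells are
# simply absent, so no rank can end up all-land.
import numpy as np
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

rng = np.random.default_rng(1)
mask = rng.random((20, 20)) > 0.3            # True = ocean cell (placeholder)

G = nx.Graph()
for j in range(20):
    for i in range(20):
        if not mask[j, i]:
            continue                         # land cells never enter the graph
        G.add_node((j, i))
        if i > 0 and mask[j, i - 1]:
            G.add_edge((j, i), (j, i - 1))   # connect ocean neighbours
        if j > 0 and mask[j - 1, i]:
            G.add_edge((j, i), (j - 1, i))

part_a, part_b = kernighan_lin_bisection(G)
print("ocean cells per part:", len(part_a), len(part_b))
```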
CICE has the ability to subdivide the computational domain horizontally into tiles (termed “blocks”), and then parallelise by allocating several blocks to each CPU. This can improve load balancing if a similar number of ice-containing and ice-free blocks is allocated to each CPU. Land-only blocks aren’t allocated to a CPU at all. See Craig et al. (2015).
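A much-simplified sketch of that idea (not CICE’s actual distribution code, which offers several schemes as described in Craig et al. 2015; the block size, mask, and round-robin dealing here are placeholders for illustration): split the grid into small blocks, drop land-only blocks, and deal the rest round-robin so each CPU gets a mix of blocks from different parts of the domain.

```python
# CICE-style block decomposition sketch: keep only blocks containing ocean,
# then assign block k to CPU k % ncpu.
import numpy as np

rng = np.random.default_rng(2)
mask = rng.random((96, 96)) > 0.3            # True = ocean (placeholder)
bs = 16                                      # block size
ncpu = 8

blocks = []
for j in range(0, 96, bs):
    for i in range(0, 96, bs):
        if mask[j:j+bs, i:i+bs].any():       # land-only blocks are skipped
            blocks.append((j, i))

assignment = {blk: k % ncpu for k, blk in enumerate(blocks)}
per_cpu = [sum(1 for c in assignment.values() if c == p) for p in range(ncpu)]
print("ocean blocks per CPU:", per_cpu)
```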