We ran the Score-P (‘scorep’) profiler on an MPI test run of the CABLE-POP model (aka the Canberra version). The profiler measures how much time the model spends in each subroutine. It showed that more than 80% of the run time was spent on communication between the master and the workers; in particular, the MPI_Recv call appears to be very time-consuming.
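For reference, the pattern being described (rank 0 collating results from the workers with blocking receives) looks roughly like the sketch below. This is not the actual CABLE-POP MPI driver, just a minimal C illustration of why a profiler attributes the master's waiting time to MPI_Recv; the field size, tag and time loop are placeholders.

```c
/* Minimal master-worker sketch (NOT CABLE's MPI driver) showing why a
 * profiler can attribute most of the runtime to MPI_Recv: rank 0 sits in
 * a blocking receive while it waits for each worker's fields.
 * Build with e.g.  mpicc recv_sketch.c -o recv_sketch
 * (or "scorep mpicc ..." to instrument it with Score-P).                 */
#include <mpi.h>

#define NFIELD 1000   /* placeholder size of one worker's output fields */

int main(int argc, char **argv)
{
    int rank, nprocs;
    double buf[NFIELD];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int step = 0; step < 10; ++step) {          /* pretend time loop */
        if (rank == 0) {
            /* Master: one blocking receive per worker per step.  The time
             * spent here grows with the worker count and with any load
             * imbalance among the workers.                               */
            for (int src = 1; src < nprocs; ++src)
                MPI_Recv(buf, NFIELD, MPI_DOUBLE, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* ... master would collate / write the fields here ... */
        } else {
            /* Worker: stand-in for real local work, then send results back. */
            for (int i = 0; i < NFIELD; ++i)
                buf[i] = rank + 0.001 * i;
            MPI_Send(buf, NFIELD, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```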
Several years ago I was profiling CABLE; I think it was even pre-CMIP5, given the reason I eventually put it down. Even without POP, the MPI drivers were the main culprits (as you also found). This didn't surprise me a great deal: the I/O at the front end is where the model spends a huge chunk of its time, and then it keeps coming back to the head node to collate all the fields. It would be nice to have a reliable measure of how many extra cores still improve performance before everything just gets swamped by network traffic (a back-of-envelope take on this is sketched below).
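One very crude first estimate is Amdahl's law: treat the head-node work (I/O plus collating the fields) as a fixed non-parallel fraction f and see how quickly the speedup curve flattens as workers are added. The sketch below is only that back-of-envelope calculation; the f = 0.8 is just the communication share reported by the Score-P run used as a stand-in, not a measured serial fraction, and in reality the communication cost itself grows with the worker count, so this is an optimistic upper bound.

```c
/* Back-of-envelope scaling estimate (Amdahl's law), not a measurement of
 * CABLE itself.  If a fraction f of the runtime cannot be parallelised,
 * the best possible speedup on n cores is
 *     S(n) = 1 / (f + (1 - f) / n)
 * so extra cores stop paying off once (1 - f)/n drops well below f.      */
#include <stdio.h>

int main(void)
{
    const double f = 0.8;                 /* assumed non-parallel fraction */
    const int cores[] = {1, 2, 4, 8, 16, 32, 64};

    printf("cores  speedup  efficiency\n");
    for (size_t i = 0; i < sizeof cores / sizeof cores[0]; ++i) {
        int n = cores[i];
        double s = 1.0 / (f + (1.0 - f) / n);
        printf("%5d  %7.2f  %9.2f\n", n, s, s / n);
    }
    return 0;
}
```

With f = 0.8 the predicted speedup tops out near 1.25x no matter how many cores are added, which is consistent with communication dominating the profile.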
So, given I didn't really learn anything new there, I profiled the serial model. The model was spending a huge chunk of its time in “canopy”, which is no surprise as there are two major loops in it. It would be nice to improve on this, but we've never had the time. Coupled in ACCESS, the LSM only accounts for one to a few percent of the runtime, hence I put the issue down. The imbalance is likely even worse now that the UM has adopted ENDGAME dynamics, doubled the atmospheric resolution, etc.