See this:
TL;DR: set your inner domain CPUs to 16,12.
Changing
{% set UM_ATM_OMP=1 %}
to
{% set UM_ATM_OMP=2 %}
is an optimisation trick: it allows the UM to use OMP (OpenMP, i.e. Open Multi-Processing) to parallelise loops using shared memory between two cores. This gives a slight performance increase, but it doesn't fix the domain decomposition problem you hit when you try to use too many processors on a small-ish domain.
The NPROC restriction exists because each process needs a minimum grid size, e.g. the halo of cells going to the north MPI rank can't overlap with the halo of cells going to the south MPI rank.
Are you only using 450x450 for your d0198 nest?
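To make the decomposition limit concrete, here is a rough Python sketch, assuming a 450x450 grid and reading "16,12" as 16 processes east-west by 12 north-south. The halo width and the "sub-domain must be wider than the halo" rule are simplified assumptions for illustration, not the UM's actual check, and the function names are made up.

```python
# Rough sketch only: HALO and the overlap rule below are illustrative
# assumptions, not values or logic taken from the UM source.
HALO = 10  # placeholder halo width in grid cells (assumed)


def local_extent(n_points, n_ranks):
    """Smallest number of grid points a rank gets along one axis."""
    return n_points // n_ranks


def decomposition_ok(nx, ny, nproc_x, nproc_y, halo=HALO):
    """True if every rank's sub-domain is wider than the halo on both axes,
    so the halo sent one way cannot overlap the halo sent the other way."""
    return (local_extent(nx, nproc_x) > halo
            and local_extent(ny, nproc_y) > halo)


def total_cores(nproc_x, nproc_y, omp_threads):
    """Cores requested: one per OpenMP thread per MPI rank."""
    return nproc_x * nproc_y * omp_threads


if __name__ == "__main__":
    for px, py in [(16, 12), (64, 48)]:
        print(px, py,
              local_extent(450, px), local_extent(450, py),
              decomposition_ok(450, 450, px, py),
              total_cores(px, py, omp_threads=2))
```

With these assumed numbers, 16x12 leaves each rank a sub-domain comfortably wider than the halo, while 64x48 does not. Raising the thread count only multiplies the core count; it leaves the per-rank grid size unchanged, which is why UM_ATM_OMP=2 doesn't get around the limit.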