Seeking help from MPI+GPU pros!

Hi all,

I am currently running a multi-GPU ocean model (ClimaOcean) and am getting very poor scaling as I increase the number of GPUs. I am investigating the performance with NVIDIA Nsight Systems (nsys) profiles, but am struggling to pinpoint the cause of the poor scaling from them.

I thought I would reach out here on the off-chance that there is an MPI+GPU pro in the community who can help me diagnose the cause of the poor scaling - and interpret the NVIDIA profiles.

Hi Taimor, someone who might be able to help is @JorgeG94, who works at NCI and is an expert on this subject.


Taimor, could you also share some of your results? e.g. screenshots of your nsys profiling output and scaling results.

A common culprit is GPU communication going through the CPU instead of directly GPU-to-GPU.

Hello! If you’ve already collected profiles using the NVIDIA profilers, it’d be great if you could share them with me somehow. My email is jorge.galvezvallejo@anu.edu.au in case uploading things here is dumb.

From first principles, bad scaling comes down to one of two things: not enough work, or horrendous communication overhead. The first is easier to test: if you increase the problem size (create more work) you might see better scaling.

I am also assuming here you mean strong scaling.

For communication, you could look into whether you’re using a GPU-aware MPI library, and whether your code actually uses GPU-aware communication. Otherwise, whenever two GPUs communicate, the data has to be copied from device to host and then from the host to the other device you’re interested in.

That could kill your performance.
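
As a quick check, here is a minimal sketch (assuming 2 ranks with one GPU each, not your ClimaOcean setup; the filename is made up) of how you could ask MPI.jl whether the library is CUDA-aware and try a direct device-to-device send:

```julia
# Minimal sketch, assuming 2 MPI ranks with one GPU each (not the ClimaOcean run itself).
# Run with something like: mpiexec -n 2 julia --project gpu_aware_check.jl
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Reports whether the underlying MPI library was built with CUDA support
rank == 0 && @info "CUDA-aware MPI?" MPI.has_cuda()

# Naive one-device-per-rank binding, just for this test
CUDA.device!(rank % length(CUDA.devices()))

N = 2^20
if rank == 0
    buf = CUDA.rand(Float64, N)
    MPI.Send(buf, comm; dest=1, tag=0)    # a CuArray passed directly; needs GPU-aware MPI
elseif rank == 1
    buf = CUDA.zeros(Float64, N)
    MPI.Recv!(buf, comm; source=0, tag=0)
end

MPI.Finalize()
```

Roughly speaking, if `MPI.has_cuda()` comes back false, or the nsys timeline shows device-to-host and host-to-device copies bracketing every exchange, the communication is being staged through the CPU.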

Amazing! Thanks for the help, everyone!

The nsys files are here. The latest attempt is the two-rank run, with filenames matching *2GPUs*UCX*nsys. Annoyingly, I need to provide access via email, but I have done so for those who have responded here.

The GitHub discussion regarding this issue is here. We have determined that the issue may be unique to Gadi, as the scaling is quite robust on the MIT machine.

Ah yes, Gadi could be the issue! Can you provide me with the environment you used to build? As in compiler version, MPI version, etc.?

Since this is a Julia app, the way you’re binding to (or piggybacking on) the system MPI might be the main issue.

julia> MPI.versioninfo()
MPIPreferences:
  binary:  system
  abi:     OpenMPI
  libmpi:  libmpi
  mpiexec: mpiexec

Package versions
  MPI.jl:             0.20.23
  MPIPreferences.jl:  0.1.11

Library information:
  libmpi:  libmpi
  libmpi dlpath:  /apps/openmpi/4.1.7/lib/libmpi.so
  MPI version:  3.1.0
  Library version:  
    Open MPI v4.1.7, package: Open MPI apps@gadi-cpu-clx-1415.gadi.nci.org.au Distribution, ident: 4.1.7, repo rev: v4.1.7, Oct 31, 2024
  MPI launcher: mpiexec
  MPI launcher path: /apps/openmpi/4.1.7/bin/mpiexec

julia> CUDA.versioninfo()
CUDA toolchain: 
- runtime 12.6, local installation
- driver 575.57.8 for 12.9
- compiler 12.9

CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 12.6.0)
- NVML: 12.0.0+575.57.8

Julia packages: 
- CUDA: 5.8.5
- CUDA_Driver_jll: 13.0.1+0
- CUDA_Compiler_jll: 0.2.1+0
- CUDA_Runtime_jll: 0.19.1+0
- CUDA_Runtime_Discovery: 1.0.0

Toolchain:
- Julia: 1.10.10
- LLVM: 15.0.7

Environment:
- JULIA_CUDA_MEMORY_POOL: none 
- JULIA_CUDA_USE_BINARYBUILDER: false

Preferences:
- CUDA_Runtime_jll.version: 12.6
- CUDA_Runtime_jll.local: true

3 devices:
  0: Tesla V100-SXM2-32GB (sm_70, 31.729 GiB / 32.000 GiB available)
  1: Tesla V100-SXM2-32GB (sm_70, 31.729 GiB / 32.000 GiB available)
  2: Tesla V100-SXM2-32GB (sm_70, 31.729 GiB / 32.000 GiB available)

Can you somehow try Open MPI 5.x? I’m at a conference, so I won’t be able to look at your profiles until tonight!!
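
If a newer Open MPI module is available on Gadi, re-pointing MPI.jl at it should be something like this sketch (the module name below is a guess, check what `module avail openmpi` offers):

```julia
# Sketch only: re-point MPI.jl at a newer system Open MPI.
# In the shell first (module name is a guess; check `module avail openmpi`):
#   module load openmpi/5.0.5
# Then, in Julia:
using MPIPreferences
MPIPreferences.use_system_binary()   # picks up libmpi and mpiexec from the loaded module
# Restart Julia afterwards and confirm with MPI.versioninfo()
```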

Thanks @JorgeG94 for looking into this!

The profiles are big! The conference internet is downloading them very slowly.

On Gadi I had some trouble getting multi-GPU profiles without setting export TMPDIR=$MYSCRATCH/tmp, where $MYSCRATCH is just the path to your scratch directory. I see that you have n_gpus nsys-rep files. This is cool, but a better way to get more information is to do:

nsys profile --stats=true mpirun -np 2 ./a.out

This will produce a single profile with both GPUs on the same timeline, which is much more useful for diagnosing this problem. If you try this on Gadi you will probably run into an issue where the profiler complains and crashes; that’s when setting the TMPDIR variable helps. If that does not work…then I can raise a question internally.

I could try to reproduce it too if you can provide me with some configurations and a sample run so that I can have a go at it.

I’m leaving today for a 2.5-week holiday, but @navidcy can hopefully continue to push this, as it would be great to get these models running on Gadi!

Happy to Zoom about this @JorgeG94, perhaps when you are back from the conference?

Oh god. This escaped me completely. We can meet next week and @taimoorsohail will be back from holidays.

Sorry about this!!! :frowning: