Seeking help from MPI+GPU pros!

Hi all,

I am currently running a multi-GPU ocean model (ClimaOcean) and am getting very poor scaling as I increase the number of GPUs. I am investigating the performance with NVIDIA Nsight Systems (nsys) profiles, but am struggling to pinpoint the cause of the poor scaling from them.

I thought I would reach out here on the off-chance that there is an MPI+GPU pro in the community who can help me diagnose the cause of the poor scaling - and interpret the NVIDIA profiles.

Hi Taimor, someone who might be able to help is @JorgeG94, who works at NCI and is an expert on this subject.


Taimor, could you also share some of your results? e.g. screenshots of your nsys profiling output and scaling results.

A common culprit is GPU communication going through the CPU instead of directly GPU-to-GPU.

Hello! If you’ve already collected profiles using the NVIDIA profilers, it’d be great if you could share them with me somehow. My email is jorge.galvezvallejo@anu.edu.au in case uploading things here is dumb.

From first principles, bad scaling comes down to one of two things: not enough work, or horrendous communication overhead. The first is easier to test: if you increase the problem size (create more work) you might see better scaling.

I am also assuming here you mean strong scaling.

For communication, you could look into whether you’re using a GPU-aware MPI library, and whether your code actually uses GPU-aware communication. Otherwise, whenever two GPUs communicate, the data has to be copied from device to host and then from the host to the other device you’re interested in.

That could kill your performance.
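
As a quick check, here is a minimal sketch (assuming 2 ranks with one GPU each, not your ClimaOcean setup; the filename is made up) of how you could ask MPI.jl whether the library is CUDA-aware and try a direct device-to-device send:

```julia
# Minimal sketch, assuming 2 MPI ranks with one GPU each (not the ClimaOcean run itself).
# Run with something like: mpiexec -n 2 julia --project gpu_aware_check.jl
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Reports whether the underlying MPI library was built with CUDA support
rank == 0 && @info "CUDA-aware MPI?" MPI.has_cuda()

# Naive one-device-per-rank binding, just for this test
CUDA.device!(rank % length(CUDA.devices()))

N = 2^20
if rank == 0
    buf = CUDA.rand(Float64, N)
    MPI.Send(buf, comm; dest=1, tag=0)    # a CuArray passed directly; needs GPU-aware MPI
elseif rank == 1
    buf = CUDA.zeros(Float64, N)
    MPI.Recv!(buf, comm; source=0, tag=0)
end

MPI.Finalize()
```

Roughly speaking, if `MPI.has_cuda()` comes back false, or the nsys timeline shows device-to-host and host-to-device copies bracketing every exchange, the communication is being staged through the CPU.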

Amazing! Thanks for the help, everyone!

The nsys files are here. The latest attempt is the two-rank run, with filenames matching *2GPUs*UCX*nsys. Annoyingly, I need to provide access via email, but I have done so for those who have responded here.

The GitHub discussion regarding this issue is here. We have determined that the issue may be unique to Gadi, as the scaling is quite robust on the MIT machine.

Ah yes, Gadi could be the issue! Can you provide me with the environment you used to build? As in compiler version, MPI version, etc.?

Since this is a Julia app, the way you’re binding to (or piggybacking on) the system MPI might be the main issue.

julia> MPI.versioninfo()
MPIPreferences:
  binary:  system
  abi:     OpenMPI
  libmpi:  libmpi
  mpiexec: mpiexec

Package versions
  MPI.jl:             0.20.23
  MPIPreferences.jl:  0.1.11

Library information:
  libmpi:  libmpi
  libmpi dlpath:  /apps/openmpi/4.1.7/lib/libmpi.so
  MPI version:  3.1.0
  Library version:  
    Open MPI v4.1.7, package: Open MPI apps@gadi-cpu-clx-1415.gadi.nci.org.au Distribution, ident: 4.1.7, repo rev: v4.1.7, Oct 31, 2024
  MPI launcher: mpiexec
  MPI launcher path: /apps/openmpi/4.1.7/bin/mpiexec

julia> CUDA.versioninfo()
CUDA toolchain: 
- runtime 12.6, local installation
- driver 575.57.8 for 12.9
- compiler 12.9

CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 12.6.0)
- NVML: 12.0.0+575.57.8

Julia packages: 
- CUDA: 5.8.5
- CUDA_Driver_jll: 13.0.1+0
- CUDA_Compiler_jll: 0.2.1+0
- CUDA_Runtime_jll: 0.19.1+0
- CUDA_Runtime_Discovery: 1.0.0

Toolchain:
- Julia: 1.10.10
- LLVM: 15.0.7

Environment:
- JULIA_CUDA_MEMORY_POOL: none 
- JULIA_CUDA_USE_BINARYBUILDER: false

Preferences:
- CUDA_Runtime_jll.version: 12.6
- CUDA_Runtime_jll.local: true

3 devices:
  0: Tesla V100-SXM2-32GB (sm_70, 31.729 GiB / 32.000 GiB available)
  1: Tesla V100-SXM2-32GB (sm_70, 31.729 GiB / 32.000 GiB available)
  2: Tesla V100-SXM2-32GB (sm_70, 31.729 GiB / 32.000 GiB available)

Can you somehow try Open MPI 5.x? I’m at a conference, so I won’t be able to look at your profiles until tonight!!
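
If a newer Open MPI module is available on Gadi, re-pointing MPI.jl at it should be something like this sketch (the module name below is a guess, check what `module avail openmpi` offers):

```julia
# Sketch only: re-point MPI.jl at a newer system Open MPI.
# In the shell first (module name is a guess; check `module avail openmpi`):
#   module load openmpi/5.0.5
# Then, in Julia:
using MPIPreferences
MPIPreferences.use_system_binary()   # picks up libmpi and mpiexec from the loaded module
# Restart Julia afterwards and confirm with MPI.versioninfo()
```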

Thanks @JorgeG94 for looking into this!

The profiles are big! The conference internet is downloading them very slowly.

On Gadi I had some trouble getting multi-GPU profiles without setting export TMPDIR=$MYSCRATCH/tmp, where $MYSCRATCH is just the path to your scratch directory. I see that you have n_gpus nsys-rep files. This is cool, but a better way to get more information is to do:

nsys profile --stats=true mpirun -np 2 ./a.out

This will produce a single profile with both GPUs on the same timeline, which is much more useful for diagnosing this problem. If you try this on Gadi you will probably run into an issue where the profiler complains and crashes; that’s when setting the TMPDIR variable helps. If that does not work…then I can raise a question internally.

I could try to reproduce it too if you can provide me with some configurations and a sample run so that I can have a go at it.

I’m leaving today for a 2.5-week holiday, but @navidcy can hopefully continue to push this, as it would be great to get these models running on Gadi!

Happy to Zoom about this @JorgeG94, perhaps when you are back from the conference?

Oh god. This escaped me completely. We can meet next week and @taimoorsohail will be back from holidays.

Sorry about this!!! :frowning: