I am currently running a multi-GPU ocean model (ClimaOcean) and am getting very poor scaling as I increase the number of GPUs. I am investigating the performance using NVIDIA Nsight Systems (nsys) profiles, but am struggling to pin down the cause of the poor scaling from them.
I thought I would reach out here on the off-chance that there is an MPI+GPU pro in the community who can help me diagnose the cause of the poor scaling - and interpret the NVIDIA profiles.
Hello! If you’ve already collected profiles using the NVIDIA profilers, it’d be great if you could share them with me somehow. My email is jorge.galvezvallejo@anu.edu.au in case uploading things here isn’t practical.
From first principles, poor scaling usually comes down to one of two things: not enough work per GPU, or excessive communication overhead. The first is easier to test: if you increase the problem size (i.e. create more work per GPU), you might see better scaling.
I am also assuming here you mean strong scaling.
As for communication, you could check whether you’re using a GPU-aware MPI library, and whether your code actually uses GPU-aware communication. Otherwise, whenever two GPUs communicate, the data has to be copied from device to host and then from host to the other device.
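As a quick sanity check, something like the following should tell you whether the MPI build on the system has CUDA support. This is just a sketch assuming an Open MPI + UCX stack (which the UCX in your filenames suggests); the exact modules and tool availability on Gadi may differ:

```
# Check whether Open MPI was built with CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

# If the stack uses UCX, check that CUDA transports (cuda_copy, cuda_ipc, ...) are available
ucx_info -d | grep -i cuda
```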
The nsys files are here. The latest attempt is the two-rank run, with filenames matching *2GPUs*UCX*nsys. Annoyingly I need to provide access via email, but I have done it for those who have responded here.
The GitHub discussion regarding this issue is here. We have determined that this issue may be unique to Gadi, as the scaling is quite robust on the MIT machine.
The profiles are big! The conference internet is downloading them very slowly.
On Gadi I had some trouble getting multi-GPU profiles without setting export TMPDIR=$MYSCRATCH/tmp, where $MYSCRATCH is just the path to your scratch directory. I see that you have one nsys-rep file per GPU. That’s useful, but you can get more information if you profile the whole MPI launch at once:
nsys profile --stats=true mpirun -np 2 ./a.out
This gives you a single profile with both GPUs on the same timeline, which is much more useful for diagnosing this kind of problem. If you try this on Gadi you will probably hit the issue where the profiler complains and crashes; that’s where setting the TMPDIR variable helps. If that does not work, then I can raise a question on the MPI side.
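Putting the two together, a minimal sketch of what I mean (the output name and tmp path are placeholders; adapt them to your PBS job script):

```
# Point the profiler's temporary files at scratch to avoid the crash on Gadi
export TMPDIR=$MYSCRATCH/tmp
mkdir -p $TMPDIR

# Profile the whole MPI launch so both ranks end up in one report
nsys profile --stats=true -o climaocean_2gpu mpirun -np 2 ./a.out
```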
I could try to reproduce it too if you can provide me with some configurations and a sample run so that I can have a go at it.