There are several groups working on MPI implementations capable of transferring data directly from GPU memory, made possible by the introduction of Unified Virtual Addressing (UVA) in CUDA 4.0. The MVAPICH group was the first to officially release a version with CUDA support.
Being able to pass GPU pointers directly to MPI functions greatly simplifies programming on clusters. For example, if the programmer needs to send data from a GPU on system A to another GPU on system B, instead of the sequence:
- Transfer data from GPU memory to host memory on system A
- Transfer data from host memory on system A to host memory on system B, for example using MPI_Send/Recv
- Transfer data from host memory to GPU memory on system B
a GPU-aware MPI stack lets the programmer issue a single MPI_Send on system A and a single MPI_Recv on system B, passing the GPU pointers directly. The library is also capable of optimizing the transfers under the hood via pipelining (this could be programmed explicitly too, but having the library take care of it is much more convenient).
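To make the difference concrete, here is a minimal sketch of the send side on system A. The subroutine name and arguments (a_d, a_h, dest) are purely illustrative and not part of any library; the staged path is shown commented out, while the GPU-aware path passes the device array straight to MPI_Send:

subroutine send_to_gpu_peer(a_d, a_h, n, dest, comm, ierr)
  use mpi
  implicit none
  integer, intent(in)         :: n, dest, comm
  integer, device, intent(in) :: a_d(n)   ! data resident in GPU memory
  integer, intent(inout)      :: a_h(n)   ! host staging buffer (staged path only)
  integer, intent(out)        :: ierr

  ! Staged path (what we would do without GPU support in the MPI library):
  ! a_h = a_d
  ! call MPI_Send(a_h, n, MPI_INTEGER, dest, 0, comm, ierr)

  ! GPU-aware path: a single call with the device array
  call MPI_Send(a_d, n, MPI_INTEGER, dest, 0, comm, ierr)
end subroutine send_to_gpu_peer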
In this blog, I am going to explain how to use the CUDA-enabled MVAPICH from CUDA Fortran.
After downloading the tar file from the MVAPICH web site, we need to configure the installation. Due to compatibility issues between CUDA and the PGI C compiler, we are going to use gcc for the C compiler and PGI Fortran for the Fortran one.
We need to specify the location of the CUDA include files and libraries (in this case, they are in the standard location /usr/local/cuda) and the installation path for MVAPICH (I am installing on a cluster where all the apps are located in /share/apps).
FC=pgfortran F77=pgfortran FCFLAGS=-fast FFLAGS=-fast ./configure \
  --prefix=/share/apps/mvapich2-gpu \
  --enable-cuda \
  --with-cuda-include=/usr/local/cuda/include \
  --with-cuda-libpath=/usr/local/cuda/lib64
The next steps are to run "make" and then "make install" (for this last step, depending on the location of the installed software, you may need root privileges). You will also need to add the location of the binaries (in this case /share/apps/mvapich2-gpu/bin) to your PATH.
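For reference, the remaining commands look like this, using the same installation prefix passed to configure above (adjust the path if you installed elsewhere):

make
make install
export PATH=/share/apps/mvapich2-gpu/bin:$PATH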
We are now ready to write a CUDA Fortran code that uses MPI to transfer data between two GPUs. Each process initializes two device arrays, a_d and b_d, and a host array a, filling them with values that depend on the rank. Processor 0 then sends a_d to processor 1; after processor 1 receives the data in b_d, it copies the result back to the host array a and prints the values.
program mpi_test_gpu
  use mpi
  implicit none
  integer, allocatable :: a(:)
  integer, device, allocatable :: a_d(:), b_d(:)
  integer :: N, ierr, rank, num_procs, status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierr)

  N = 4
  allocate (a(N), a_d(N), b_d(N))

  ! Fill the device arrays with rank-dependent values and the host array with a sentinel
  a_d = (rank+1)*10
  b_d = (rank-1)*100
  a   = -999

  ! Rank 0 sends its device array a_d directly to rank 1,
  ! which receives it straight into its device array b_d
  if (rank == 0) then
     call MPI_Send(a_d, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
  else
     call MPI_Recv(b_d, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
  end if

  ! Copy the received device data back to the host and print it
  if (rank == 1) a = b_d
  print *, "Rank=", rank, "A=", a

  deallocate (a, a_d, b_d)
  call MPI_Finalize(ierr)
end program mpi_test_gpu
If the code is in a file named mpi_test_gpu.cuf, we can generate an executable with the following command:
mpif90 -O3 -o mpi_test_gpu mpi_test_gpu.cuf
We are now ready to run with the command mpirun_rsh. We need to set a special environment variable, MV2_USE_CUDA=1, to enable the new GPU path (or you can add
export MV2_USE_CUDA=1 to your .bashrc to avoid typing it every time).
We are going to use two nodes, c0-0 and c0-1, connected by InfiniBand.
mpirun_rsh -np 2 c0-0 c0-1 MV2_USE_CUDA=1 ./mpi_test_gpu
Rank= 0 A= -999 -999 -999 -999
Rank= 1 A= 10 10 10 10
As expected, the array a on rank 1 contains the value 10, which was initially stored in a_d on rank 0.
MVAPICH also allows sending data from GPU memory to host memory and vice versa.
For example, we could replace the lines:
! Receive data into GPU array b_d from processor 0
call MPI_Recv(b_d, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
...
! Copy GPU array b_d to CPU array a
if (rank == 1) a = b_d
with a direct receive into the host array:
! Receive data directly into CPU array a from processor 0
call MPI_Recv(a, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
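The opposite direction also works. As a quick sketch (not part of the original example, reusing the same variables), rank 0 could send its host array a and rank 1 could receive it directly into the device array b_d:

! Sketch: send from CPU array a on processor 0 ...
if (rank == 0) call MPI_Send(a, N, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
! ... and receive it directly into GPU array b_d on processor 1
if (rank == 1) call MPI_Recv(b_d, N, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)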