There is a lot of confusion around MPI codes that use GPUs and InfiniBand, and around what needs to be done to fix the problems that arise from the interaction between the CUDA runtime and the InfiniBand software stack (OFED and MPI).
Let's start with a simple program using 2 MPI processes that:
- allocate data on the CPU and GPU
- initialize the data on the CPU
- copy the data to the GPU
- transfer the host data from one process to the other
The code reports the bandwidth of the host-to-device transfer and the bandwidth achieved over the network.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <sys/time.h>
#include <mpi.h>
#define NREPEAT 10
#define NBYTES 10.e6
int main (int argc, char *argv[])
{
int rank, size, n, len;
void *a_h, *a_d;
struct timeval time[2];
double bandwidth;
char name[MPI_MAX_PROCESSOR_NAME];
MPI_Status status;
MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);
MPI_Get_processor_name(name, &len);
printf("Process %d is on %s\n", rank, name);
printf("Using regular memory \n");
a_h = malloc(NBYTES);
cudaMalloc( (void **) &a_d, NBYTES);
/* Test host -> device bandwidth. */
MPI_Barrier(MPI_COMM_WORLD);
gettimeofday(&time[0], NULL);
for (n=0; n<NREPEAT; n++)
{
cudaMemcpy(a_d, a_h, NBYTES, cudaMemcpyHostToDevice);
}
gettimeofday(&time[1], NULL);
/* Elapsed time in seconds, then bandwidth in MB/sec (1 MB = 1e6 bytes). */
bandwidth = time[1].tv_sec - time[0].tv_sec;
bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);
bandwidth = NBYTES*NREPEAT/1.e6/bandwidth;
printf("Host->device bandwidth for process %d: %f MB/sec\n",rank,bandwidth);
/* Test MPI send/recv bandwidth. */
MPI_Barrier(MPI_COMM_WORLD);
gettimeofday(&time[0], NULL);
for (n=0; n<NREPEAT; n++)
{
if (rank == 0)
MPI_Send(a_h, NBYTES/sizeof(int), MPI_INT, 1, 0, MPI_COMM_WORLD);
else
MPI_Recv(a_h, NBYTES/sizeof(int), MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
gettimeofday(&time[1], NULL);
bandwidth = time[1].tv_sec - time[0].tv_sec;
bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);
bandwidth = NBYTES*NREPEAT/1.e6/bandwidth;
if (rank == 0)
printf("MPI send/recv bandwidth: %f MB/sec\n", bandwidth);
cudaFree(a_d);
free(a_h);
MPI_Finalize();
return 0;
}
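The listing above omits error checking to keep it short; when chasing problems like the ones described below, it helps to wrap every CUDA call. A minimal sketch of such a wrapper (not part of the original program):
/* Abort the whole MPI job with a readable message if a CUDA call fails. */
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            MPI_Abort(MPI_COMM_WORLD, 1);                             \
        }                                                             \
    } while (0)
/* Example: CUDA_CHECK( cudaMalloc( (void **) &a_d, NBYTES) ); */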
The code is compiled with mpicc, adding the CUDA include path and linking against the CUDA runtime:
mpicc -o mpi_malloc mpi_malloc.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart
Running this code on a cluster whose nodes are connected by QDR InfiniBand adapters generates output similar to this:
#mpirun -np 2 -host c0-0,c0-1 mpi_malloc
Process 0 is on compute-0-0.local
Using regular memory
Process 1 is on compute-0-1.local
Using regular memory
Host->device bandwidth for process 0: 4699.248120 MB/sec
Host->device bandwidth for process 1: 4323.950361 MB/sec
MPI send/recv bandwidth: 2467.369044 MB/sec
Up to now, everything worked as expected, using a standard malloc to allocate the host memory. To improve the bandwidth over the PCIe bus, and, more importantly, to allow transfers to and from the GPU to overlap with kernel execution, we would like to use pinned (page-locked) memory. The modifications to the previous code are minimal: the malloc call is replaced by cudaMallocHost and the free call by cudaFreeHost, as in the sketch below. After changing the code and recompiling, we observe a problem when we run it: the initial prints show that pinned memory improves the host-to-device bandwidth, but the program never completes.
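A minimal sketch of the modified lines (the rest of the program is unchanged):
printf("Using pinned memory \n");
cudaMallocHost( (void **) &a_h, NBYTES);   /* was: a_h = malloc(NBYTES); */
...
cudaFreeHost(a_h);                         /* was: free(a_h); */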
#mpirun -np 2 -host c0-0,c0-1 mpi_pinned
Process 1 is on compute-0-1.local
Using pinned memory
Process 0 is on compute-0-0.local
Using pinned memory
Host->device bandwidth for process 0: 5927.330923 MB/sec
Host->device bandwidth for process 1: 5909.117769 MB/sec
If we attach a debugger to the process running on node c0-0, we will see that the code is stuck in MPI.
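One way to do this, assuming gdb and pidof are available on the compute node, is to log in to c0-0 and attach to the hung process by PID:
ssh c0-0
gdb -p $(pidof mpi_pinned)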
0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
3175 btl_openib_component.c: No such file or directory.
in btl_openib_component.c
(gdb) where
#0 0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
#1 0x00002b5172536394 in opal_progress () at runtime/opal_progress.c:207
#2 0x00002b51751335ce in mca_pml_ob1_send (buf=0x13365420, count=46912503140448, datatype=0x0, dst=1, tag=16000000,
sendmode=MCA_PML_BASE_SEND_SYNCHRONOUS, comm=0x6544a0) at pml_ob1_isend.c:125
#3 0x00002b51720520b3 in PMPI_Send (buf=0x13365420, count=-1424633760, type=0x0, dest=1, tag=16000000, comm=0x0) at psend.c:72
#4 0x0000000000404d1d in main () at ./mpi_pinned.c:69
The backtrace shows that rank 0 is spinning inside Open MPI's openib (InfiniBand) transport. The hang comes from the interaction of the two memory-pinning mechanisms: the CUDA driver pins the cudaMallocHost buffer, the InfiniBand driver also needs to pin (register) the same pages for RDMA, and the two registrations conflict, so the transfer never completes. There are two ways to get the pinned-memory version working:
- Disable RDMA in MPI
- Make the InfiniBand driver and CUDA runtime compatible
The first option is to disable RDMA in Open MPI's openib transport, limiting it to send/receive semantics:
mpirun -np 2 -host c0-0,c0-1 -mca btl_openib_flags 1 mpi_pinned
Process 1 is on compute-0-1.local
Using pinned memory
Process 0 is on compute-0-0.local
Using pinned memory
Host->device bandwidth for process 0: 5907.023451 MB/sec
Host->device bandwidth for process 1: 5877.858109 MB/sec
MPI send/recv bandwidth: 2713.041591 MB/sec
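If typing the flag on every command line is inconvenient, Open MPI also lets the same MCA parameter be set through the environment or in a per-user parameter file (one possible setup, assuming a default Open MPI installation):
export OMPI_MCA_btl_openib_flags=1                          # per shell/job, via the environment
echo "btl_openib_flags = 1" >> ~/.openmpi/mca-params.conf   # or persistently, per user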
The second option is to make the InfiniBand driver and the CUDA runtime compatible, so that both can work with the same pinned buffers. With that in place, the unmodified command line works and delivers even better send/recv bandwidth:
mpirun -np 2 -host c0-0,c0-1 mpi_pinned
Process 0 is on compute-0-0.local
Using pinned memory
Process 1 is on compute-0-1.local
Using pinned memory
Host->device bandwidth for process 0: 5904.930617 MB/sec
Host->device bandwidth for process 1: 5901.445854 MB/sec
MPI send/recv bandwidth: 3150.300854 MB/sec
This second solution works with all the MPI implementations out there and is very simple to use. So, forget about GPU Direct 1.0 and use this new method!