Tuesday, August 16, 2011

CUDA, MPI and Infiniband

There is a lot of confusion around MPI codes that use GPUs and Infiniband, and around what needs to be done to fix the problems that arise from the interaction between the CUDA runtime and the Infiniband software stack (OFED and MPI).

Let's start with a simple program using 2 MPI processes that:

  • allocate data on the CPU and GPU
  • initialize the data on the CPU
  • copy the data to the GPU
  • transfer the host data from one process to the other

The code is going to report the bandwidth of the transfer to the GPU and the bandwidth achieved by the network.


#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <sys/time.h>
#include <mpi.h>

#define NREPEAT 10
#define NBYTES 10.e6

int main (int argc, char *argv[])
{
int rank, size, n, len;
void *a_h, *a_d;
struct timeval time[2];
double bandwidth;
char name[MPI_MAX_PROCESSOR_NAME];
MPI_Status status;

MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);

MPI_Get_processor_name(name, &len);
printf("Process %d is on %s\n", rank, name);

printf("Using regular memory \n");
a_h = malloc(NBYTES);

cudaMalloc( (void **) &a_d, NBYTES);

/* Test host -> device bandwidth. */
MPI_Barrier(MPI_COMM_WORLD);

gettimeofday(&time[0], NULL);
for (n = 0; n < NREPEAT; n++)
{
cudaMemcpy(a_d, a_h, NBYTES, cudaMemcpyHostToDevice);
}
gettimeofday(&time[1], NULL);

bandwidth  = time[1].tv_sec - time[0].tv_sec;            /* elapsed seconds ...   */
bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);  /* ... plus microseconds */
bandwidth = NBYTES*NREPEAT/1.e6/bandwidth;

printf("Host->device bandwidth for process %d: %f MB/sec\n",rank,bandwidth);

/* Test MPI send/recv bandwidth. */
MPI_Barrier(MPI_COMM_WORLD);

gettimeofday(&time[0], NULL);
for (n = 0; n < NREPEAT; n++)
{
if (rank == 0)
MPI_Send(a_h, NBYTES/sizeof(int), MPI_INT, 1, 0, MPI_COMM_WORLD);
else
MPI_Recv(a_h, NBYTES/sizeof(int), MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
gettimeofday(&time[1], NULL);

bandwidth  = time[1].tv_sec - time[0].tv_sec;            /* elapsed seconds ...   */
bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);  /* ... plus microseconds */
bandwidth = NBYTES*NREPEAT/1.e6/bandwidth;

if (rank == 0)
printf("MPI send/recv bandwidth: %f MB/sec\n", bandwidth);

cudaFree(a_d);
free(a_h);

MPI_Finalize();
return 0;
}


Since there are no CUDA kernels, there is no need to use nvcc. We can use mpicc (which, for the moment, we assume has been built with gcc), taking care to indicate the directories for the CUDA include files and CUDA libraries:

mpicc -o mpi_malloc mpi_malloc.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart
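
If the CUDA libraries are not in the default dynamic-linker search path on the nodes, the program also needs to be able to find libcudart at run time. Assuming the standard /usr/local/cuda install used above, adding a line like the following to the shell startup file is enough:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH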

Running this code on a cluster with nodes connected by QDR Infiniband adapters will generate output similar to this one:

#mpirun -np 2 -host c0-0,c0-1 mpi_malloc

Process 0 is on compute-0-0.local
Using regular memory
Process 1 is on compute-0-1.local
Using regular memory
Host->device bandwidth for process 0: 4699.248120 MB/sec
Host->device bandwidth for process 1: 4323.950361 MB/sec
MPI send/recv bandwidth: 2467.369044 MB/sec

Up to now, everything worked as expected: we were using a standard malloc to allocate the host memory. To improve the bandwidth over the PCI-e bus and, more importantly, to allow overlap of transfers to/from the GPU with kernel execution, we would like to use pinned memory. The modifications to the previous code are minimal: the malloc call needs to be replaced by cudaMallocHost and the free call by cudaFreeHost, as sketched below. After changing the code and recompiling, we observe a problem when we try to run it. The code starts, and from the initial prints we can see that pinned memory gives us an improvement in bandwidth, but it never completes.
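
For reference, here is a minimal sketch of the change (assuming the rest of mpi_malloc.c stays the same and the new file is called mpi_pinned.c):

printf("Using pinned memory \n");

/* replaces a_h = malloc(NBYTES); */
cudaMallocHost( (void **) &a_h, NBYTES);

/* at the end of main, replaces free(a_h); */
cudaFreeHost(a_h);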


#mpirun -np 2 -host c0-0,c0-1 mpi_pinned

Process 1 is on compute-0-1.local
Using pinned memory
Process 0 is on compute-0-0.local
Using pinned memory
Host->device bandwidth for process 0: 5927.330923 MB/sec
Host->device bandwidth for process 1: 5909.117769 MB/sec

If we attach a debugger to the process running on node c0-0, we will see that the code is stuck in MPI.
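
One way to do this (a sketch, assuming gdb is available on the node and the ranks were started as above) is to log in to the node and attach to the process by pid:

ssh c0-0
gdb -p $(pgrep mpi_pinned)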




0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
3175 btl_openib_component.c: No such file or directory.
in btl_openib_component.c
(gdb) where
#0 0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
#1 0x00002b5172536394 in opal_progress () at runtime/opal_progress.c:207
#2 0x00002b51751335ce in mca_pml_ob1_send (buf=0x13365420, count=46912503140448, datatype=0x0, dst=1, tag=16000000,
sendmode=MCA_PML_BASE_SEND_SYNCHRONOUS, comm=0x6544a0) at pml_ob1_isend.c:125
#3 0x00002b51720520b3 in PMPI_Send (buf=0x13365420, count=-1424633760, type=0x0, dest=1, tag=16000000, comm=0x0) at psend.c:72
#4 0x0000000000404d1d in main () at ./mpi_pinned.c:69


Without going into details, the problem is caused by the way the CUDA runtime marks the pages it allocates as pinned memory and by the way the Infiniband driver handles RDMA. At this point we have two solutions:

  1. Disable RDMA in MPI
  2. Make the Infiniband driver and CUDA runtime compatible
The first solution is very simple with OpenMPI: we just need to pass an additional flag (-mca btl_openib_flags 1) to mpirun, at the cost of lower bandwidth over IB. Other MPI implementations will require a different switch or a recompilation with RDMA disabled.

mpirun -np 2 -host c0-0,c0-1 -mca btl_openib_flags 1 mpi_pinned

Process 1 is on compute-0-1.local
Using pinned memory
Process 0 is on compute-0-0.local
Using pinned memory
Host->device bandwidth for process 0: 5907.023451 MB/sec
Host->device bandwidth for process 1: 5877.858109 MB/sec
MPI send/recv bandwidth: 2713.041591 MB/sec
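
With OpenMPI, the same MCA parameter can also be set through the environment instead of on every command line (using OpenMPI's standard OMPI_MCA_ prefix convention), for example:

export OMPI_MCA_btl_openib_flags=1
mpirun -np 2 -host c0-0,c0-1 mpi_pinned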


Before CUDA 4.0, the second solution was to install GPU Direct, a patch for the Linux kernel plus special NVIDIA and Mellanox drivers that eliminated the incompatibility.
With CUDA 4.0 we have a new option: an environment variable that, when set to 1, changes the internal behavior of pinned memory allocation in the CUDA driver and removes the source of the incompatibility with the Infiniband driver. If we set CUDA_NIC_INTEROP to 1 (for example by adding the line "export CUDA_NIC_INTEROP=1" to our .bashrc file) and run the pinned version again, we will see that the code is able to complete, and we also get better bandwidth since RDMA is now working.


mpirun -np 2 -host c0-0,c0-1 mpi_pinned

Process 0 is on compute-0-0.local
Using pinned memory
Process 1 is on compute-0-1.local
Using pinned memory
Host->device bandwidth for process 0: 5904.930617 MB/sec
Host->device bandwidth for process 1: 5901.445854 MB/sec
MPI send/recv bandwidth: 3150.300854 MB/sec
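
If you prefer not to touch the .bashrc files on the nodes, with OpenMPI the variable can also be forwarded to the remote processes directly from the command line (using mpirun's standard -x option for exporting environment variables):

export CUDA_NIC_INTEROP=1
mpirun -np 2 -host c0-0,c0-1 -x CUDA_NIC_INTEROP mpi_pinned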

This solution works with all the MPI implementations out there and is very simple to use. So, forget about GPU Direct 1.0 and use this new method!