Tuesday, August 16, 2011

CUDA, MPI and InfiniBand

There is a lot of confusion around MPI codes that use GPUs and InfiniBand, and around what needs to be done to fix the problems that arise from the interaction between the CUDA runtime and the InfiniBand software stack (OFED and MPI).

Let's start with a simple program using 2 MPI processes that:

  • allocate data on the CPU and GPU
  • initialize the data on the CPU
  • copy the data to the GPU
  • transfer the host data from one process to the other

The code is going to report the bandwidth of the transfer to the GPU and the bandwidth achieved by the network.


#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <sys/time.h>
#include <mpi.h>

#define NREPEAT 10
#define NBYTES 10.e6

int main (int argc, char *argv[])
{
int rank, size, n, len;
void *a_h, *a_d;
struct timeval time[2];
double bandwidth;
char name[MPI_MAX_PROCESSOR_NAME];
MPI_Status status;

MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);

MPI_Get_processor_name(name, &len);
printf("Process %d is on %s\n", rank, name);

printf("Using regular memory \n");
a_h = malloc(NBYTES);

cudaMalloc( (void **) &a_d, NBYTES);

/* Test host -> device bandwidth. */
MPI_Barrier(MPI_COMM_WORLD);

gettimeofday(&time[0], NULL);
for (n=0; n<NREPEAT; n++)
{
cudaMemcpy(a_d, a_h, NBYTES, cudaMemcpyHostToDevice);
}
gettimeofday(&time[1], NULL);

bandwidth = time[1].tv_sec - time[0].tv_sec;
bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);
bandwidth = NBYTES*NREPEAT/1.e6/bandwidth;

printf("Host->device bandwidth for process %d: %f MB/sec\n",rank,bandwidth);

/* Test MPI send/recv bandwidth. */
MPI_Barrier(MPI_COMM_WORLD);

gettimeofday(&time[0], NULL);
for (n=0; n<NREPEAT; n++)
{
if (rank == 0)
MPI_Send(a_h, NBYTES/sizeof(int), MPI_INT, 1, 0, MPI_COMM_WORLD);
else
MPI_Recv(a_h, NBYTES/sizeof(int), MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
gettimeofday(&time[1], NULL);

bandwidth = time[1].tv_sec - time[0].tv_sec;
bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);
bandwidth = NBYTES*NREPEAT/1.e6/bandwidth;

if (rank == 0)
printf("MPI send/recv bandwidth: %f MB/sec\n", bandwidth);

cudaFree(a_d);
free(a_h);

MPI_Finalize();
return 0;
}


Since there are no CUDA kernels, there is no need to use nvcc. We can use mpicc (which for the moment we assume has been built with gcc), taking care to point it to the directories containing the CUDA include files and CUDA libraries:

mpicc -o mpi_malloc mpi_malloc.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart
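
If the CUDA libraries are not already in the dynamic linker's search path on the compute nodes (for example via ldconfig), the runtime location of libcudart also needs to be made known. One common way, assuming the standard installation path used above, is:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH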

Running this code on a cluster whose nodes are connected by QDR InfiniBand adapters will generate output similar to this:

#mpirun -np 2 -host c0-0,c0-1 mpi_malloc

Process 0 is on compute-0-0.local
Using regular memory
Process 1 is on compute-0-1.local
Using regular memory
Host->device bandwidth for process 0: 4699.248120 MB/sec
Host->device bandwidth for process 1: 4323.950361 MB/sec
MPI send/recv bandwidth: 2467.369044 MB/sec

Up to now, everything worked as expected. We were using a standard malloc to allocate the host memory. In order to improve the bandwidth over the PCI-e bus and, more importantly, to allow overlap of transfers to/from the GPU with kernel execution, we would like to use pinned memory. The modifications to the previous code are minimal: the malloc call needs to be replaced by cudaMallocHost and the free call by cudaFreeHost, as shown in the sketch below. After changing the code and recompiling, we observe a problem when we try to run it: the initial prints show that pinned memory gives us an improvement in bandwidth, but the code never completes.
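
A minimal sketch of the change (the rest of mpi_pinned.c is identical to mpi_malloc.c; error checking is omitted):

printf("Using pinned memory \n");
cudaMallocHost( (void **) &a_h, NBYTES);   /* replaces a_h = malloc(NBYTES); */
...
cudaFreeHost(a_h);                         /* replaces free(a_h); */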


#mpirun -np 2 -host c0-0,c0-1 mpi_pinned

Process 1 is on compute-0-1.local
Using pinned memory
Process 0 is on compute-0-0.local
Using pinned memory
Host->device bandwidth for process 0: 5927.330923 MB/sec
Host->device bandwidth for process 1: 5909.117769 MB/sec

If we attach a debugger to the process running on node c0-0, we will see that the code is stuck in MPI.
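
For example, we can log into the node and attach gdb to the hanging rank (assuming gdb and pgrep are available on c0-0 and mpi_pinned is the only matching process there):

ssh c0-0
gdb -p $(pgrep mpi_pinned)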




0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
3175 btl_openib_component.c: No such file or directory.
in btl_openib_component.c
(gdb) where
#0 0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
#1 0x00002b5172536394 in opal_progress () at runtime/opal_progress.c:207
#2 0x00002b51751335ce in mca_pml_ob1_send (buf=0x13365420, count=46912503140448, datatype=0x0, dst=1, tag=16000000,
sendmode=MCA_PML_BASE_SEND_SYNCHRONOUS, comm=0x6544a0) at pml_ob1_isend.c:125
#3 0x00002b51720520b3 in PMPI_Send (buf=0x13365420, count=-1424633760, type=0x0, dest=1, tag=16000000, comm=0x0) at psend.c:72
#4 0x0000000000404d1d in main () at ./mpi_pinned.c:69


Without going into details, the problem is caused by the way the CUDA runtime marks the pages it allocates as pinned memory and the way the InfiniBand driver handles RDMA. At this point we have two solutions:

  1. Disable RDMA in MPI
  2. Make the InfiniBand driver and the CUDA runtime compatible

The first solution is very simple with OpenMPI: we just need to pass an additional flag (-mca btl_openib_flags 1) to mpirun, at the cost of lower InfiniBand bandwidth. Other MPI implementations will require a different switch or a recompilation with RDMA disabled.

mpirun -np 2 -host c0-0,c0-1 -mca btl_openib_flags 1 mpi_pinned

Process 1 is on compute-0-1.local
Using pinned memory
Process 0 is on compute-0-0.local
Using pinned memory
Host->device bandwidth for process 0: 5907.023451 MB/sec
Host->device bandwidth for process 1: 5877.858109 MB/sec
MPI send/recv bandwidth: 2713.041591 MB/sec


Before CUDA 4.0, the second solution was to install GPU Direct: a patch for the Linux kernel plus special NVIDIA and Mellanox drivers that eliminated the incompatibility.
With CUDA 4.0 we have a new option. There is an environment variable that, if set to 1, changes the internal behavior of pinned memory allocation in the CUDA driver, removing the source of incompatibility with the InfiniBand driver. If we set CUDA_NIC_INTEROP to 1 (for example by adding the line "export CUDA_NIC_INTEROP=1" to our .bashrc file) and run the pinned version again, we will see that the code completes, and we also get better bandwidth since RDMA is now working.


mpirun -np 2 -host c0-0,c0-1 mpi_pinned

Process 0 is on compute-0-0.local
Using pinned memory
Process 1 is on compute-0-1.local
Using pinned memory
Host->device bandwidth for process 0: 5904.930617 MB/sec
Host->device bandwidth for process 1: 5901.445854 MB/sec
MPI send/recv bandwidth: 3150.300854 MB/sec
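
As a side note, with OpenMPI the variable can also be exported for a single run, instead of being set in .bashrc, by using mpirun's -x option (a sketch, assuming the same two nodes as above):

mpirun -np 2 -host c0-0,c0-1 -x CUDA_NIC_INTEROP=1 mpi_pinned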

This solution works with all the MPI implementations out there and it is very simple to use. So, forget about GPU Direct 1.0 and use this new method!!!

4 comments:

  1. The mpi_malloc run couldn't finish; I don't know the reason.

    $ mpirun -np 2 -host head,node_1 mpi_malloc
    Process 0 is on head
    Using regular memory
    Process 1 is on node_1
    Using regular memory


    I tried to modify the code into mpi_pinned.c.
    When running it, I got the error below:

    $ mpirun -np 2 -host head,node_1 -mca btl_openib_flags 1 mpi_pinned
    Process 0 is on head
    Using pinned memory
    Process 1 is on node_1
    Using pinned memory
    --------------------------------------------------------------------------
    mpirun noticed that process rank 0 with PID 8891 on node head exited on signal 11 (Segmentation fault).

    Could you advise!

    zhswho@gmail.com

  2. This comment has been removed by the author.

  3. the code has typos in the for loops and bandwidth calculation
