Saturday, November 12, 2011

MPI communications from GPU memory

Several groups are working on MPI implementations capable of transferring data directly from GPU memory, made possible by the introduction of Unified Virtual Addressing (UVA) in CUDA 4.0. The MVAPICH group is the first to officially release a version with CUDA support.

Being able to pass GPU pointers directly to MPI functions greatly simplifies programming on clusters. For example, if the programmer needs to send data from a GPU on system A to another GPU on system B, instead of the sequence:
  • Transfer data from GPU memory to host memory on system A
  • Transfer data from host memory on system A to host memory on system B, for example using MPI_Send/Recv
  • Transfer data from host memory to GPU memory on system B
the programmer can simply issue MPI_Send/Recv with buffers located in GPU memory.
A GPU-aware MPI stack is also capable of optimizing the transfers under the hood via pipelining (this could be programmed explicitly too, but having the library take care of it is much more convenient).
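To make the contrast concrete, here is a minimal sketch of the two approaches in CUDA C (the names a_d, b_d and a_h are placeholders for two device buffers of N integers and a host staging buffer; the full CUDA Fortran version of the direct approach is developed below):

/* Without a GPU-aware MPI: stage the transfer through host memory. */
if (rank == 0) {
  cudaMemcpy(a_h, a_d, N*sizeof(int), cudaMemcpyDeviceToHost); /* GPU -> host on A */
  MPI_Send(a_h, N, MPI_INT, 1, 0, MPI_COMM_WORLD);             /* host A -> host B */
} else {
  MPI_Recv(a_h, N, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
  cudaMemcpy(b_d, a_h, N*sizeof(int), cudaMemcpyHostToDevice); /* host -> GPU on B */
}

/* With a GPU-aware MPI: pass the device pointers directly. */
if (rank == 0)
  MPI_Send(a_d, N, MPI_INT, 1, 0, MPI_COMM_WORLD);
else
  MPI_Recv(b_d, N, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);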

In this blog, I am going to explain how to use the CUDA-enabled MVAPICH from CUDA Fortran.

After downloading the tar file from the MVAPICH web site, we need to configure the installation. Due to compatibility issues between CUDA and the PGI C compiler, we are going to use gcc for the C compiler and PGI Fortran for the Fortran one.

We need to specify the location of the CUDA include files and libraries (in this case, the standard location /usr/local/cuda) and the installation path for MVAPICH (I am installing on a cluster where all the apps are located in /share/apps):


FC=pgfortran F77=pgfortran FCFLAGS=-fast FFLAGS=-fast ./configure \
  --prefix=/share/apps/mvapich2-gpu \
  --enable-cuda \
  --with-cuda-include=/usr/local/cuda/include \
  --with-cuda-libpath=/usr/local/cuda/lib64

The next steps are to run "make" and then "make install" (for this last step, depending on where the software is being installed, you may need root privileges). You will also need to add the location of the binaries (in this case /share/apps/mvapich2-gpu/bin) to your PATH.

We are now ready to write a CUDA Fortran code that uses MPI to transfer data between two GPUs. Each process allocates two device arrays, a_d and b_d, and fills them with values that depend on the rank. Then processor 0 sends a_d to processor 1. Once rank 1 has received the data in b_d, it copies the result back to the host array a and prints the values.

program mpi_test_gpu
use mpi
integer, allocatable:: a(:)
integer, device, allocatable:: a_d(:),b_d(:)
integer:: N, ierr, rank, num_procs, status(MPI_STATUS_SIZE)

call MPI_Init (ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierr)

! Allocate host and device arrays and fill the device arrays
! with rank-dependent values
N=4
allocate (a(N),a_d(N),b_d(N))
a_d=(rank+1)*10
b_d=(rank-1)*100

a=-999
if ( rank == 0) then
! Send data from GPU array a_d to processor 1
call MPI_Send(a_d,N,MPI_INT,1,0,MPI_COMM_WORLD, ierr)
else
! Receive data to GPU array b_d from processor 0
call MPI_Recv(b_d,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)
end if

! Copy GPU array b_d to CPU array a
if (rank == 1) a=b_d

print *,"Rank=",rank,"A=",a

deallocate (a,a_d,b_d)

call MPI_Finalize ( ierr )
end program mpi_test_gpu

If the code is in a file named mpi_test_gpu.cuf, we can generate an executable with the following command:

mpif90 -O3 -o mpi_test_gpu mpi_test_gpu.cuf

We are now ready to run with the command mpirun_rsh. We need to pass a special environment variable, MV2_USE_CUDA=1, to enable the new GPU path (or you can add export MV2_USE_CUDA=1 to your .bashrc to avoid typing it every time).
We are going to use two nodes, c0-0 and c0-1, connected by Infiniband.

mpirun_rsh -np 2 c0-0 c0-1 MV2_USE_CUDA=1 ./mpi_test_gpu
Rank= 0 A= -999 -999 -999 -999
Rank= 1 A= 10 10 10 10

As expected, rank 1 now contains the value 10, which was the value initially stored in a_d on rank 0.
MVAPICH also allows sending data from GPU memory to host memory and vice versa.
For example, we could replace the lines:

! Receive data to GPU array b_d from processor 0
call MPI_Recv(b_d,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)
...
! Copy GPU array b_d to CPU array a
if (rank == 1) a=b_d

with a receive directly into the host array a:
! Receive data to CPU array a from processor 0
call MPI_Recv(a,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)

Tuesday, August 16, 2011

CUDA, MPI and Infiniband

There is a lot of confusion around MPI codes that use GPUs and Infiniband, and around what needs to be done to fix the problems arising from the interaction of the CUDA runtime with the Infiniband software stack (OFED and MPI).

Let's start with a simple program using 2 MPI processes that:

  • allocate data on the CPU and GPU
  • initialize the data on the CPU
  • copy the data to the GPU
  • transfer the host data from one process to the other

The code is going to report the bandwidth of the transfer to the GPU and the bandwidth achieved by the network.


#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <sys/time.h>
#include <mpi.h>

#define NREPEAT 10
#define NBYTES 10.e6

int main (int argc, char *argv[])
{
int rank, size, n, len, numbytes;
void *a_h, *a_d;
struct timeval time[2];
double bandwidth;
char name[MPI_MAX_PROCESSOR_NAME];
MPI_Status status;

MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);

MPI_Get_processor_name(name, &len);
printf("Process %d is on %s\n", rank, name);

printf("Using regular memory \n");
a_h = malloc(NBYTES);

cudaMalloc( (void **) &a_d, NBYTES);

/* Test host -> device bandwidth. */
MPI_Barrier(MPI_COMM_WORLD);

gettimeofday(&time[0], NULL);
for (n=0; n<NREPEAT; n++)
{
cudaMemcpy(a_d, a_h, NBYTES, cudaMemcpyHostToDevice);
}
gettimeofday(&time[1], NULL);

bandwidth  = time[1].tv_sec - time[0].tv_sec;            /* elapsed seconds */
bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);
bandwidth  = NBYTES*NREPEAT/1.e6/bandwidth;              /* MB/sec */

printf("Host->device bandwidth for process %d: %f MB/sec\n",rank,bandwidth);

/* Test MPI send/recv bandwidth. */
MPI_Barrier(MPI_COMM_WORLD);

gettimeofday(&time[0], NULL);
for (n=0; n<NREPEAT; n++)
{
if (rank == 0)
MPI_Send(a_h, NBYTES/sizeof(int), MPI_INT, 1, 0, MPI_COMM_WORLD);
else
MPI_Recv(a_h, NBYTES/sizeof(int), MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
gettimeofday(&time[1], NULL);

bandwidth  = time[1].tv_sec - time[0].tv_sec;            /* elapsed seconds */
bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);
bandwidth  = NBYTES*NREPEAT/1.e6/bandwidth;              /* MB/sec */

if (rank == 0)
printf("MPI send/recv bandwidth: %f MB/sec\n", bandwidth);

cudaFree(a_d);
free(a_h);

MPI_Finalize();
return 0;
}


Since there are no CUDA kernels, there is no need to use nvcc. We can use mpicc (which for the moment we assume has been built with gcc), taking care to indicate the directories for the CUDA include files and libraries:

mpicc -o mpi_malloc mpi_malloc.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart

Running this code on a cluster whose nodes are connected by QDR Infiniband adapters will generate output similar to this:

#mpirun -np 2 -host c0-0,c0-1 mpi_malloc

Process 0 is on compute-0-0.local
Using regular memory
Process 1 is on compute-0-1.local
Using regular memory
Host->device bandwidth for process 0: 4699.248120 MB/sec
Host->device bandwidth for process 1: 4323.950361 MB/sec
MPI send/recv bandwidth: 2467.369044 MB/sec

Up to now, everything worked as expected, since we were using a standard malloc to allocate the host memory. In order to improve the bandwidth over the PCI-e bus and, more importantly, to allow overlap of transfers to/from the GPU with kernel execution, we would like to use pinned memory. The modifications to the previous code are minimal: the malloc calls need to be replaced by cudaMallocHost and the free calls by cudaFreeHost. After changing the code and recompiling, once we try to run it we observe a problem: the code starts, and from the initial prints we can see that the pinned memory is giving us an improvement in bandwidth, but it never completes.
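For reference, this is a sketch of the modification (everything else in mpi_pinned.c is unchanged from mpi_malloc.c):

printf("Using pinned memory \n");
cudaMallocHost( (void **) &a_h, NBYTES);   /* was: a_h = malloc(NBYTES); */
/* ... same host->device copies, timing and MPI send/recv as before ... */
cudaFreeHost(a_h);                         /* was: free(a_h); */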


#mpirun -np 2 -host c0-0,c0-1 mpi_pinned

Process 1 is on compute-0-1.local
Using pinned memory
Process 0 is on compute-0-0.local
Using pinned memory
Host->device bandwidth for process 0: 5927.330923 MB/sec
Host->device bandwidth for process 1: 5909.117769 MB/sec

If we attach a debugger to the process running on node c0-0, we will see that the code is stuck in MPI.




0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
3175 btl_openib_component.c: No such file or directory.
in btl_openib_component.c
(gdb) where
#0 0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
#1 0x00002b5172536394 in opal_progress () at runtime/opal_progress.c:207
#2 0x00002b51751335ce in mca_pml_ob1_send (buf=0x13365420, count=46912503140448, datatype=0x0, dst=1, tag=16000000,
sendmode=MCA_PML_BASE_SEND_SYNCHRONOUS, comm=0x6544a0) at pml_ob1_isend.c:125
#3 0x00002b51720520b3 in PMPI_Send (buf=0x13365420, count=-1424633760, type=0x0, dest=1, tag=16000000, comm=0x0) at psend.c:72
#4 0x0000000000404d1d in main () at ./mpi_pinned.c:69


Without going into details, the problem is caused by the way the CUDA runtime marks pages allocated as pinned memory and by the way the Infiniband driver handles RDMA. At this point we have two solutions:

  1. Disable RDMA in MPI
  2. Make the Infiniband driver and CUDA runtime compatible
The first solution is very simple with OpenMPI: we just need to pass an additional flag (-mca btl_openib_flags 1) to mpirun, at the cost of lower bandwidth over IB. Other MPI implementations will require a different switch or a recompilation with RDMA disabled.

mpirun -np 2 -host c0-0,c0-1 -mca btl_openib_flags 1 mpi_pinned

Process 1 is on compute-0-1.local
Using pinned memory
Process 0 is on compute-0-0.local
Using pinned memory
Host->device bandwidth for process 0: 5907.023451 MB/sec
Host->device bandwidth for process 1: 5877.858109 MB/sec
MPI send/recv bandwidth: 2713.041591 MB/sec


Before CUDA 4.0, the second solution was to install GPU Direct, a patch for the Linux kernel and special NVIDIA and Mellanox drivers to eliminate the incompatibility.
With CUDA 4.0 we have a new option. There is an environment variable, CUDA_NIC_INTEROP, that, when set to 1, changes the internal behavior of the pinned memory allocation in the CUDA driver, removing the source of incompatibility with the Infiniband driver. If we set CUDA_NIC_INTEROP to 1 (for example by adding the line "export CUDA_NIC_INTEROP=1" to our .bashrc file) and run the pinned version again, we will see that the code completes and we also get better bandwidth, since RDMA is now working.


mpirun -np 2 -host c0-0,c0-1 mpi_pinned

Process 0 is on compute-0-0.local
Using pinned memory
Process 1 is on compute-0-1.local
Using pinned memory
Host->device bandwidth for process 0: 5904.930617 MB/sec
Host->device bandwidth for process 1: 5901.445854 MB/sec
MPI send/recv bandwidth: 3150.300854 MB/sec

This solution works with all the MPI implementations out there and it is very simple to use. So, forget about GPU Direct 1.0 and use this new method!!!

Thursday, June 2, 2011

Calling Thrust from CUDA Fortran

CUDA 4.0 ships with the Thrust library, a C++ template library for the GPU that offers several useful algorithms (sorting, prefix sum, reduction). In the previous post I explained how to configure CUDA Fortran to use the 4.0 toolkit. Now I am going to show how to call Thrust from CUDA Fortran, in particular how to sort an array.

On the Thrust web page there are a lot of examples and documentation. The basic idea of Thrust is to have containers, which manage host and device memory and simplify data transfers; iterators, which act like pointers and keep track of memory spaces; and algorithms, which are applied to containers.

This is a simple Thrust code to sort an array of random data.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void)
{
// define a vector of 16M int on the host
thrust::host_vector<int> h_vec(1 << 24);

// generate 16M random numbers on the host
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// transfer data to the device
thrust::device_vector<int> d_vec = h_vec;

// sort data on the device
thrust::sort(d_vec.begin(), d_vec.end());

// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

return 0;
}

An important feature, which we will use to call Thrust from CUDA Fortran, is the conversion of Thrust objects to raw pointers and vice versa. This Thrust code snippet converts a device container to a standard C pointer that we could use to call a CUDA C kernel:

// allocate device vector
thrust::device_vector<int> d_vec(4);

// obtain raw pointer to device vector's memory
int * ptr = thrust::raw_pointer_cast(&d_vec[0]);


The basic idea is to write a wrapper around the Thrust algorithms that handles standard C pointers, and then use the ISO C binding to call the wrapper from Fortran. Since we want to sort an array, let's write a wrapper for the Thrust sort algorithm.

// Filename: csort.cu
// nvcc -c -arch sm_13 csort.cu

#include <thrust/device_vector.h>
#include <thrust/sort.h>

extern "C" {

//Sort for integer arrays
void sort_int_wrapper( int *data, int N)
{
// Wrap raw pointer with a device_ptr
thrust::device_ptr<int> dev_ptr(data);
// Use device_ptr in Thrust sort algorithm
thrust::sort(dev_ptr, dev_ptr+N);
}

//Sort for float arrays
void sort_float_wrapper( float *data, int N)
{
thrust::device_ptr<float> dev_ptr(data);
thrust::sort(dev_ptr, dev_ptr+N);
}

//Sort for double arrays
void sort_double_wrapper( double *data, int N)
{
thrust::device_ptr<double> dev_ptr(data);
thrust::sort(dev_ptr, dev_ptr+N);
}

}


We can compile the code using:

nvcc -c -arch sm_13 csort.cu

This will generate an object file, csort.o, that we will use later in the linking stage of the CUDA Fortran code.

The other missing piece is the interface to these C functions.
We will define a generic interface, thrustsort, that will call the correct sort function depending on the kind of the data (integer, single precision or double precision):

module thrust

interface thrustsort
subroutine sort_int( input,N) bind(C,name="sort_int_wrapper")
use iso_c_binding
integer(c_int),device:: input(*)
integer(c_int),value:: N
end subroutine

subroutine sort_float( input,N) bind(C,name="sort_float_wrapper")
use iso_c_binding
real(c_float),device:: input(*)
integer(c_int),value:: N
end subroutine

subroutine sort_double( input,N) bind(C,name="sort_double_wrapper")
use iso_c_binding
real(c_double),device:: input(*)
integer(c_int),value:: N
end subroutine

end interface

end module thrust

At this point we have all we need to write the CUDA Fortran code:

program testsort
use thrust
real, allocatable :: cpuData(:)
real, allocatable, device :: gpuData(:)
integer:: N=10
allocate(cpuData(N))
allocate(gpuData(N))

do i=1,N
cpuData(i)=random(i)
end do
cpuData(5)=100.

print *,"Before sorting", cpuData

gpuData=cpuData

call thrustsort(gpuData,size(gpuData))

cpuData=gpuData

print *,"After sorting", cpuData
end program

If we save the module in a file thrust_module.cuf and the program in sample_sort.cuf, we are ready to compile and execute:

$ pgf90 -rc=rc4.0 -Mcuda=cc20 -O3 thrust_module.cuf sample_sort.cuf csort.o
thrust_module.cuf:
sample_sort.cuf:

$ ./a.out
Before sorting 4.1630346E-02 0.9124327 0.7832350 0.6540373
100.0000 0.3956419 0.2664442 0.1372465
8.0488138E-03 0.8788511

After sorting 8.0488138E-03 4.1630346E-02 0.1372465 0.2664442
0.3956419 0.6540373 0.7832350 0.8788511
0.9124327 100.0000

The code is very simple:
  • declare two arrays, cpuData and gpuData.
  • allocate them using the standard allocate
  • copy cpuData from the host to gpuData on the GPU with a simple assignment
  • call the Thrust sort routine
  • copy sorted array back to the host
  • print the sorted array
Now that we have verified that everything is working as expected, we can modify the code to do some timing using cudaEvents.

program timesort
use cudafor
use thrust
implicit none
real, allocatable :: cpuData(:)
real, allocatable, device :: gpuData(:)
integer:: i,N=100000000

! CUDA events for timing
type ( cudaEvent ) :: startEvent , stopEvent
real :: time, random
integer :: istat

! Create events
istat = cudaEventCreate ( startEvent )
istat = cudaEventCreate ( stopEvent )

! Allocate arrays
allocate(cpuData(N))
allocate(gpuData(N))

do i=1,N
cpuData(i)=random(i)
end do

print *,"Sorting array of ",N, " single precision"

gpuData=cpuData

istat = cudaEventRecord ( startEvent , 0)
call thrustsort(gpuData,size(gpuData))

istat = cudaEventRecord ( stopEvent , 0)
istat = cudaEventSynchronize ( stopEvent )
istat = cudaEventElapsedTime ( time , startEvent , stopEvent )

cpuData=gpuData

print *," Sorted array in:",time," (ms)"

!Print the first five elements and the last five.
print *,"After sorting", cpuData(1:5),cpuData(N-4:N)
end program

We can sort a vector of 100M elements in 0.222 seconds on a Tesla M2050 (with ECC on) when the data are resident in GPU memory.

pgf90 -O3 -rc=rc4.0 -Mcuda=cc20 thrust_module.cuf time_sort.cuf csort.o -o time_sort
thrust_module.cuf:
time_sort.cuf:

$ ./time_sort
Sorting array of 100000000 single precision
Sorted array in: 222.1711 (ms)
After sorting 7.0585919E-09 1.0318221E-08 1.9398616E-08 3.1738640E-08
4.4078664E-08 0.9999999 0.9999999 1.000000
1.000000 1.000000
./a.out
Sorting array of 100000000 single precision
Sorted array in: 225.0452 (ms)
After sorting 7.0585919E-09 1.0318221E-08 1.9398616E-08 3.1738640E-08
4.4078664E-08 0.9999999 0.9999999 0.9999999
1.000000 1.000000

Tuesday, May 10, 2011

Using CUDA 4.0 from CUDA Fortran

It is possible to use CUDA 4.0 RC2 with CUDA Fortran.

Assuming that the CUDA 4.0 toolkit is installed in the location /usr/local/cuda, you will need to create a file rc4.0 containing the following lines:

set CUDAROOT=/usr/local/cuda;
set CUDAVERSION=4.0;


When you compile your .cuf files, you will need to pass this rc file with the -rc flag, and add the -L flag if you are using libraries from the 4.0 toolkit:

pgf90 -rc=rc4.0 -Mcuda=cc20,nofma myfile.cuf -L/usr/local/cuda/lib64 -lcufft -lcurand


You can check whether the compiler is picking up the new toolkit by running ldd on the executable.