Thursday, June 2, 2011

Calling Thrust from CUDA Fortran

CUDA 4.0 ships with the Thrust library, a C++ template library for CUDA that offers several useful parallel algorithms (sorting, prefix sums, reductions). In the previous post I explained how to configure CUDA Fortran to use the 4.0 toolkit. Now I am going to show how to call Thrust from CUDA Fortran, in particular how to sort an array.

On the Thrust web page there are a lot of examples and documentation. The basic idea of Thrust is to have containers, which manage host and device memory and simplify data transfers; iterators, which act like pointers and keep track of memory spaces; and algorithms, which are applied to containers.

This is a simple Thrust code to sort an array of random data.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
  // define a vector of 16M int on the host
  thrust::host_vector<int> h_vec(1 << 24);

  // generate 16M random numbers on the host
  thrust::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer data to the device
  thrust::device_vector<int> d_vec = h_vec;

  // sort data on the device
  thrust::sort(d_vec.begin(), d_vec.end());

  // transfer data back to the host
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

  return 0;
}


An important feature, which we will use to call Thrust from CUDA Fortran, is the conversion of Thrust objects to raw pointers and vice versa. This Thrust code snippet converts a device container to a standard C pointer that we could use to call a CUDA C kernel:

// allocate device vector
thrust::device_vector<int> d_vec(4);

// obtain raw pointer to device vector's memory
int *ptr = thrust::raw_pointer_cast(&d_vec[0]);


The basic idea is to write a wrapper for the Thrust algorithms that handles standard C pointers, and then use the ISO C binding to call the wrapper from Fortran. Since we want to sort an array, let's write a wrapper for the Thrust sort algorithm.

// Filename: csort.cu
// nvcc -c -arch sm_13 csort.cu

#include <thrust/device_vector.h>
#include <thrust/sort.h>

extern "C" {

// Sort for integer arrays
void sort_int_wrapper(int *data, int N)
{
  // Wrap raw pointer with a device_ptr
  thrust::device_ptr<int> dev_ptr(data);
  // Use device_ptr in Thrust sort algorithm
  thrust::sort(dev_ptr, dev_ptr + N);
}

// Sort for float arrays
void sort_float_wrapper(float *data, int N)
{
  thrust::device_ptr<float> dev_ptr(data);
  thrust::sort(dev_ptr, dev_ptr + N);
}

// Sort for double arrays
void sort_double_wrapper(double *data, int N)
{
  thrust::device_ptr<double> dev_ptr(data);
  thrust::sort(dev_ptr, dev_ptr + N);
}

}


We can compile the code with:

nvcc -c -arch sm_13 csort.cu

This generates an object file, csort.o, that we will use later in the linking stage of the CUDA Fortran build.

The other missing piece is the Fortran interface to these C functions.
We will define a generic interface, thrustsort, that calls the correct sort function depending on the kind of the data (integer, single precision, or double precision):

module thrust

interface thrustsort

  subroutine sort_int(input, N) bind(C, name="sort_int_wrapper")
    use iso_c_binding
    integer(c_int), device :: input(*)
    integer(c_int), value  :: N
  end subroutine

  subroutine sort_float(input, N) bind(C, name="sort_float_wrapper")
    use iso_c_binding
    real(c_float), device :: input(*)
    integer(c_int), value :: N
  end subroutine

  subroutine sort_double(input, N) bind(C, name="sort_double_wrapper")
    use iso_c_binding
    real(c_double), device :: input(*)
    integer(c_int), value  :: N
  end subroutine

end interface

end module thrust

At this point we have all we need to write the CUDA Fortran code:

program testsort
  use thrust
  implicit none
  real, allocatable :: cpuData(:)
  real, allocatable, device :: gpuData(:)
  integer :: N = 10

  allocate(cpuData(N))
  allocate(gpuData(N))

  ! fill the host array with random numbers
  call random_number(cpuData)
  cpuData(5) = 100.

  print *, "Before sorting", cpuData

  ! copy the data to the GPU
  gpuData = cpuData

  ! sort on the GPU
  call thrustsort(gpuData, size(gpuData))

  ! copy the sorted data back
  cpuData = gpuData

  print *, "After sorting", cpuData
end program

If we save the module in a file thrust_module.cuf and the program in sample_sort.cuf, we are ready to compile and execute:

$ pgf90 -rc=rc4.0 -Mcuda=cc20 -O3 thrust_module.cuf sample_sort.cuf csort.o
thrust_module.cuf:
sample_sort.cuf:

$ ./a.out
Before sorting 4.1630346E-02 0.9124327 0.7832350 0.6540373
100.0000 0.3956419 0.2664442 0.1372465
8.0488138E-03 0.8788511

After sorting 8.0488138E-03 4.1630346E-02 0.1372465 0.2664442
0.3956419 0.6540373 0.7832350 0.8788511
0.9124327 100.0000

The code is very simple:
  • declare two arrays, cpuData and gpuData
  • allocate them with the standard allocate
  • copy cpuData from the host to gpuData on the GPU with a simple assignment
  • call the Thrust sort routine
  • copy the sorted array back to the host
  • print the sorted array
Now that we have verified that everything is working as expected, we can modify the code to time the sort using CUDA events.

program timesort
  use cudafor
  use thrust
  implicit none
  real, allocatable :: cpuData(:)
  real, allocatable, device :: gpuData(:)
  integer :: N = 100000000

  ! CUDA events for timing
  type(cudaEvent) :: startEvent, stopEvent
  real :: time
  integer :: istat

  ! Create events
  istat = cudaEventCreate(startEvent)
  istat = cudaEventCreate(stopEvent)

  ! Allocate arrays
  allocate(cpuData(N))
  allocate(gpuData(N))

  ! Fill the host array with random numbers
  call random_number(cpuData)

  print *, "Sorting array of ", N, " single precision"

  gpuData = cpuData

  istat = cudaEventRecord(startEvent, 0)
  call thrustsort(gpuData, size(gpuData))
  istat = cudaEventRecord(stopEvent, 0)
  istat = cudaEventSynchronize(stopEvent)
  istat = cudaEventElapsedTime(time, startEvent, stopEvent)

  cpuData = gpuData

  print *, " Sorted array in:", time, " (ms)"

  ! Print the first five elements and the last five
  print *, "After sorting", cpuData(1:5), cpuData(N-4:N)
end program

We can sort a vector of 100M single-precision elements in 0.222 seconds on a Tesla M2050 with ECC on, when the data are resident in GPU memory.

$ pgf90 -O3 -rc=rc4.0 -Mcuda=cc20 thrust_module.cuf time_sort.cuf csort.o -o time_sort
thrust_module.cuf:
time_sort.cuf:

$ ./time_sort
Sorting array of 100000000 single precision
Sorted array in: 222.1711 (ms)
After sorting 7.0585919E-09 1.0318221E-08 1.9398616E-08 3.1738640E-08
4.4078664E-08 0.9999999 0.9999999 1.000000
1.000000 1.000000
$ ./time_sort
Sorting array of 100000000 single precision
Sorted array in: 225.0452 (ms)
After sorting 7.0585919E-09 1.0318221E-08 1.9398616E-08 3.1738640E-08
4.4078664E-08 0.9999999 0.9999999 0.9999999
1.000000 1.000000