CUDA Musing: June 2011

CUDA 4.0 ships with the Thrust library, a standard template library for GPU that offers several useful algorithms ( sorting, prefix sum, reduction). In the previous post I explained how to configure CUDA Fortran to use the 4.0 toolkit. Now I am going to show how to call Thrust from CUDA Fortran, in particular how to sort an array.

On the Thrust web page, there are a lot of examples and documentation. The basic idea of Thrust is to have containers, that manage host and device memory and simplify data transfers, iterators, that act like pointers and keep track of memory spaces, and algorithms, that are applied to containers.

This is a simple Thrust code to sort an array of random data.

#include <thrust/host_vector.h > 
#include <thrust/device_vector.h> 
#include <thrust/sort.h>

 int main(void) {
 // define a vector of 16M int on the host
 thrust::host_vector h_vec(1 << 24); 

// generate 16M random numbers on the host
thrust::generate(h_vec.begin(), h_vec.end(), rand);

 // transfer data to the device
 thrust::device_vector d_vec = h_vec; 

// sort data on the device
 thrust::sort(d_vec.begin(), d_vec.end()); 

// transfer data back to host
 thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin()); 
return 0;

An important feature, that we will use to call Thrust from CUDA Fortran, is the conversion of Thrust objects to raw pointers or vice versa. This Thrust code snippet will convert a device container to a standard C pointer that we could use to call a CUDA C kernel:

// allocate device vector
 thrust::device_vector d_vec(4); 
// obtain raw pointer to device vector’s memory
 int * ptr = thrust::raw_pointer_cast(&d_vec[0]);

The basic idea is to write a wrapper to the Thrust algorithms that will handle standard C pointer and then use the Iso C Binding to call the wrapper. Since we want to sort an array, let's write a wrapper for the sort algorithm in Thrust.

// Filename: csort.cu
// nvcc  -c -arch sm_13 csort.cu 
#include <thrust/device_vector.h> 
#include <thrust/device_vector.h>
#include <thrust/sort.h>

extern "C" {
//Sort for  integer arrays
void sort_int_wrapper( int *data, int N)
 {
  // Wrap raw pointer with a device_ptr
    thrust::device_ptr <int>  dev_ptr(data);
  // Use device_ptr in Thrust sort algorithm
    thrust::sort(dev_ptr, dev_ptr+N);
}

//Sort for  float arrays
 void sort_float_wrapper( float *data, int N)
 {
   thrust::device_ptr <float>  dev_ptr(data);
   thrust::sort(dev_ptr, dev_ptr+N);
 }

//Sort for  double arrays
 void sort_double_wrapper( double *data, int N)
 {
  thrust::device_ptr <double> dev_ptr(data);
  thrust::sort(dev_ptr, dev_ptr+N);
 }

}

We can compile the code using

nvcc -c -arch sm_13 csort.cu

This will generate an object file, csort.o that we will use later on in the linking stage of the CUDA Fortran code.

The other missing piece is the interface to these C functions.

We will define a generic interface thrustsort, that depending on the kind of data (integer, single precision or double precision) will call the correct sort function:

module thrust

interface thrustsort
 subroutine sort_int( input,N) bind(C,name="sort_int_wrapper")
  use iso_c_binding
  integer(c_int),device:: input(*)
  integer(c_int),value:: N
 end subroutine

 subroutine sort_float( input,N) bind(C,name="sort_float_wrapper")
  use iso_c_binding
  real(c_float),device:: input(*)
  integer(c_int),value:: N
 end subroutine

 subroutine sort_double( input,N) bind(C,name="sort_double_wrapper")
  use iso_c_binding
  real(c_double),device:: input(*)
  integer(c_int),value:: N
 end subroutine

end interface

end module thrust

At this point we have all we need to write the CUDA Fortran code:

program testsort
use thrust
real, allocatable              :: cpuData(:)
real, allocatable, device :: gpuData(:)
integer:: N=10
allocate(cpuData(N))
allocate(gpuData(N))

do i=1,N
 cpuData(i)=random(i)
end do
cpuData(5)=100.

print *,"Before sorting", cpuData

gpuData=cpuData

call thrustsort(gpuData,size(gpuData))

cpuData=gpuData

print *,"After sorting", cpuData
end program

If we save the module in a file module_thrust.cuf and the program in simplesort.cuf, we are ready to compile and execute:

$ pgf90 -rc=rc4.0 -Mcuda=cc20 -O3 thrust_module.cuf sample_sort.cuf csort.o
thrust_module.cuf:
sample_sort.cuf:

$ ./a.out 
 Before sorting   4.1630346E-02   0.9124327       0.7832350       0.6540373    
     100.0000       0.3956419       0.2664442       0.1372465     
   8.0488138E-03   0.8788511    

 After sorting   8.0488138E-03   4.1630346E-02   0.1372465       0.2664442     
   0.3956419       0.6540373       0.7832350       0.8788511     
   0.9124327        100.0000

The code is very simple:

declare two arrays, cpuData and gpuData.
allocate them using the standard allocate
copy cpuData from the host to gpuData on the GPU with a simple assignment
call the Thrust sort routine
copy sorted array back to the host
print the sorted array

Now that we have verified that everything is working as expected, we can modify the code to do some timing using cudaEvents.

program timesort
use cudafor
use thrust
implicit none
real, allocatable :: cpuData(:)
real, allocatable, device :: gpuData(:)
integer:: i,N=100000000

! cuda events for elapsing
type ( cudaEvent ) :: startEvent , stopEvent
real :: time, random
integer :: istat

! Create events
istat = cudaEventCreate ( startEvent )
istat = cudaEventCreate ( stopEvent )

! Allocate arrays
allocate(cpuData(N))
allocate(gpuData(N))

do i=1,N
 cpuData(i)=random(i)
end do

print *,"Sorting array of ",N, " single precision"

gpuData=cpuData

istat = cudaEventRecord ( startEvent , 0)
call thrustsort(gpuData,size(gpuData))

istat = cudaEventRecord ( stopEvent , 0)
istat = cudaEventSynchronize ( stopEvent )
istat = cudaEventElapsedTime ( time , startEvent , stopEvent )

cpuData=gpuData

print *," Sorted array in:",time," (ms)"

!Print the first five elements and the last five.
print *,"After sorting", cpuData(1:5),cpuData(N-4:N)
end program

We can sort a vector of 100M elements in .222 second on a Tesla M2050 with ECC on when the data are resident in GPU memory.

pgf90 -O3 -rc=rc4.0 -Mcuda=cc20 thrust_module.cuf time_sort.cuf csort.o -o time_sort
thrust_module.cuf:
time_sort.cuf:

$ ./time_sort
 Sorting array of     100000000  single precision
  Sorted array in:    222.1711      (ms)
 After sorting   7.0585919E-09   1.0318221E-08   1.9398616E-08   3.1738640E-08 
   4.4078664E-08   0.9999999       0.9999999        1.000000     
    1.000000        1.000000   
./a.out 
 Sorting array of     100000000  single precision
  Sorted array in:    225.0452      (ms)
 After sorting   7.0585919E-09   1.0318221E-08   1.9398616E-08   3.1738640E-08 
   4.4078664E-08   0.9999999       0.9999999       0.9999999     
    1.000000        1.000000

CUDA Musing

Thursday, June 2, 2011

Calling Thrust from CUDA Fortran

Search This Blog

Followers

Blog Archive

About Me