Wednesday, November 21, 2012

Using Thrust on CARMA


Thurst is an excellent library for CUDA development.
Unfortunately, Thrust is not present in the CARMA Toolkit but it is easy to install.

On the x86 development system, we are going to pull down the latest source from Thrust using git.
If git is not installed, we can easily add to the system with:

  sudo apt-get install git

and then clone the git repository

  git clone https://github.com/thrust/thrust.git


We are now ready to cross-compile. Remember that Thrust is a template library, everything is build from include files.
Using our standard Makefile, we just need to add the directory in which the Thrust include files are ( in this case /home/ubuntu/thrust). 
We also want to restrict the code generation to arch sm_21 ( the CARMA kit has a Q1000m GPU with 2.1 compute capabilities) to reduce the compilation time.
We are going to use one of the examples shipping with Thrust, monte_carlo.cu

############################
#  Makefile for cross-compile #
############################
all : monte_carlo

CUDA_HOME=/usr/local/cuda
CC=/arm-linux-gnueabi-gcc
NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32
THRUST_LOC=/home/ubuntu/thrust

monte_carlo : monte_carlo.cu
        $(NVCC)  -O3 -arch sm_21 -o monte_carlo -I$(THRUST_LOC) monte_carlo.cu

clean:
        rm monte_carlo

Once we generate the executable, we can copy it on the CARMA 

  scp monte_carlo ubuntu@carma:~

and execute it. We will see the number pi printed with 2 digits ( 3.14).
If you want to see more digits, you can change the source code and set the precision to 6 instead of the original 2

  std::cout << std::setprecision(6);


Monday, October 29, 2012

Setting up a CARMA kit

I just received a brand new CARMA kit and I am going to post all the steps I did to get a working set-up.

Let's start with the x86 development system. I am using a virtual machine on my Mac as my development system.

I started by installing a fresh Ubuntu 11.04 distro and then proceed to :
  • Update the packages: 
    • sudo apt-get update
  • Install the basic developer tools: 
    • sudo apt-get install build-essential
  • Install the 32bit development libraries ( CARMA is 32bit ):
    • sudo apt-get install ia32-libs
  • Install the ARM cross compilers: 
    • sudo apt-get install gcc-4.5-arm-linux-gnueabi g++-4.5-arm-linux-gnueabi
  • Install Fortran for both x86 and ARM (real developers use Fortran....):
    • sudo apt-get install gfortran-4.5-*
  • Install the CUDA Toolkit (available from http://www.seco.com/carmakit under the downloads tab): 
    • sudo sh cuda-linux-ARMv7-rel-4.2.10-13489154.run
  • Edit .bashrc to add nvcc to the path. With your favorite editor add a line at the end of the file:
    • export PATH=/usr/local/cuda/bin:$PATH
  • Source the .bashrc to refresh the path ( it will be automatically executed the next time you login or open a terminal):
    • . .bashrc
We can check that nvcc is now in our path, invoking the compiler with the -V flag to check the version


max@ubuntu:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Tue_Jul_17_14:48:12_PDT_2012
Cuda compilation tools, release 4.2, V0.2.1221

We are now ready to compile our first CUDA code, a comparison between multiplications on CPU and GPU.


#include "stdio.h"

__global__ void kernel(int i, float *d_n)
{
*d_n *= 1.02f;
}

void main(){
 float n = 1.0f, *d_n;
 float n_ref = 1.0f;
 int i;
 cudaMalloc((void **)&d_n, sizeof(float));
 for(i = 1; i <= 10; i++)
 {
  cudaMemcpy(d_n, &n, sizeof(float), cudaMemcpyHostToDevice);
  kernel <<< 1, 1 >>> (i, d_n);
  cudaMemcpy(&n, d_n, sizeof(float), cudaMemcpyDeviceToHost);
  printf("%d\t\t%42.41f\t%42.41f\n", i, n,n_ref*=1.02f);
 }
}


We are going to use a Makefile similar to the one posted in the previous blog.


max@ubuntu:~$ cat Makefile 
############################
#  Makefile for cross-compile #
############################
all : gpu_test

CUDA_HOME=/usr/local/cuda
CC=/arm-linux-gnueabi-gcc
NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32

gpu_test : gpu_test.cu
$(NVCC)  -o gpu_test gpu_test.cu 

clean:
rm gpu_test



When we type make, we should see a similar output


max@ubuntu:~$ make
/usr/local/cuda/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32  -o gpu_test gpu_test.cu 
/usr/lib/gcc/arm-linux-gnueabi/4.5.2/../../../../arm-linux-gnueabi/bin/ld: warning: libc.so, needed by /usr/arm-linux-gnueabi/lib//libgcc_s.so.1, not found (try using -rpath or -rpath-link)



Don't worry about the warning. This is caused by a bogus DT_NEEDED entry in the shared libgcc file /usr/arm-linux-gnueabi/lib/libgcc_s.so.1. "readelf -a" shows:
 0x00000001 (NEEDED) Shared library: [libc.so]
Before we could use the machine for any real CUDA development, there is an extra step that we will need to perform.  The CUDA Toolkit is missing the libcuda.so ( it usually comes with the driver on the x86 platform, don't ask me why it was not included in the ARM toolkit), we will not be able to link any CUDA code before we bring this library to the x86. We will do this step once we have the CARMA up and running.


Unpack the CARMA, plugin keyboard and mouse, plus the HDMI cable in the middle connector.
Plug in the power and ethernet cable and you are ready to go.
The first boot may be slow, the system is building the NVIDIA driver. It is a blind boot, there is no console output until the GUI comes up, so you need to have a little bit of patience.

Once the CARMA system boots, it will auto-login and start a terminal. It should also pick up an IP address ( use ifconfig to find out the IP). The default username/password is ubuntu/ubuntu.

We are ready  to check if our cross-compilation worked. 
From inside the virtual machine, we will copy the file gpu_test to the CARMA ( ipconfig is reporting 
172.16.174.185 ):

   scp gpu_test ubuntu@172.16.174.185 :~

Either from the CARMA terminal or from a remote shell, we can run gpu_test and check that the CPU and GPU results are the same.

ubuntu@tegra-ubuntu:~$ ./gpu_test 
1 1.01999998092651367187500000000000000000000 1.01999998092651367187500000000000000000000
2 1.04039990901947021484375000000000000000000 1.04039990901947021484375000000000000000000
3 1.06120789051055908203125000000000000000000 1.06120789051055908203125000000000000000000
4 1.08243203163146972656250000000000000000000 1.08243203163146972656250000000000000000000
5 1.10408067703247070312500000000000000000000 1.10408067703247070312500000000000000000000
6 1.12616229057312011718750000000000000000000 1.12616229057312011718750000000000000000000
7 1.14868545532226562500000000000000000000000 1.14868545532226562500000000000000000000000
8 1.17165911197662353515625000000000000000000 1.17165911197662353515625000000000000000000
9 1.19509232044219970703125000000000000000000 1.19509232044219970703125000000000000000000
10 1.21899414062500000000000000000000000000000 1.21899414062500000000000000000000000000000

The CARMA filesystem is quite bare, let's add few useful packages:
  • Install Fortran:
    • sudo apt-get install gfortran
We need to install OpenMPI from source, the default packages don't seem to work.
The latest source (1.6.2) has support for ARM, the installation is very simple but it will take a while.

Get the latest stable version 
wget http://www.open-mpi.org/software/ompi/v1.6/downloads/openmpi-1.6.2.tar.gz

unpack it ( tar xvfz openmpi-1.6.2.tar.gz) and change the directory ( cd openmpi-1.6.2  )

We are now ready to build and install
./configure
sudo make -j 4 install

Add /usr/local/bin to your PATH and /usr/local/lib to your LD_LIBRARY_PATH





Sunday, September 30, 2012

Compiling for CARMA

In few days, CARMA will be finally available to the general public. If you are not familiar with the CARMA project, it is the first ARM platform supporting CUDA.
It has a Tegra 3 with 4 cores and 2 GB of memory, ethernet, USB ports and a Quadro 1000M GPU (GF108 with 2 GB of memory, 96 CUDA cores, compute capability 2.1).
It has full OpenGL and CUDA support, but at the moment, no CUDA compiler.

The developer needs to cross-compile from a Linux x86 machine. This blog shows how easy it is to cross-compile once we follow some simple instructions. I strongly suggest that you start with an Ubuntu machine, the cross-compiler are easily available under this platform.

The first thing to do, it is to install the cross-compilers:

sudo apt-get install g++-arm-linux-gnueabi gcc-arm-linux-gnueabi

At this point, we will have the cross-compilers installed under  /usr/bin/arm-linux-gnueabi-gcc and  /usr/bin/arm-linux-gnueabi-g++.

The second step is to install the CUDA Toolkit for ARM on the x86. If you choose the default location,
the installer will create a directory /usr/local/cuda.

If you need to use other libraries for ARM, you will also need to copy the libraries and corresponding header files from CARMA to the x86 machine.  You can place them under /usr/local/arm_lib and /usr/local/arm_include or you can just put them under /usr/local/cuda/lib and /usr/local/cuda/include (my preference will be for the first option to not pollute the CUDA installation).

We are now ready to compile our code, taking care of using the cross compiler and the special nvcc in the CARMA toolkit.  The following makefile will show how to compile a simple c++ code that calls a CUBLAS function and a simple CUDA code.


############################
#  Makefile for cross-compile #
############################
all : dgemm_cublas simple_cuda

CUDA_HOME=/usr/local/cuda
CC=/arm-linux-gnueabi-gcc
NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32


# For a standard c++ code, we use CC and the CUDA ARM libraries
dgemm_cublas : gemm_test.cpp
$(CC)   gemm_test.cpp -I$(CUDA_HOME)/include -o dgemm_cublas -L/$(CUDA_HOME)/lib -lcudart -lcublas

# For a standard CUDA code, we just invoke nvcc
simple_cuda: file.cu
$(NVCC) -o simple_cuda file.cu

clean :
rm -f *.o dgemm_cublas simple_cuda


Once we generate the executable, since they are for ARM, we will not be able to execute them until we move them on CARMA.




Saturday, November 12, 2011

MPI communications from GPU memory

There are several groups working on MPI implementations capable of transferring data directly from GPU memory, as a result of the introduction of the Unified Virtual Addressing (UVA) in CUDA 4.0. The MVAPICH group is the first one to officially release a version with CUDA support.

Being able to pass directly GPU pointers to MPI functions, greatly simplify the programming on clusters. For example, if the programmer needs to send data from GPU on system A to another GPU on system B, instead of the sequence:
  • Transfer data from GPU memory to host memory on system A
  • Transfer data from host memory on system A to host memory on system B, for example using MPI_Send/Recv
  • Transfer data from host memory to GPU memory on system B
could just issue the MPI_Send/Recv with the buffers located on GPU memory.
A GPU-aware MPI stack is also capable of optimizing the transfers under the hood via pipelining ( this could be explicitly programmed too, but having the library taking care of it is much more convenient).

In this blog, I am going to explain how to use the CUDA-enabled MVAPICH from CUDA Fortran.

After downloading the tar file from the MVAPICH web site, we need to configure the installation. Due to compatibility issues between CUDA and the PGI C compiler, we are going to use gcc for the C compiler and PGI Fortran for the Fortran one.

We need to specify the location of the CUDA include files and libraries ( in this case, they are located in the standard location /usr/local/cuda ) and the path for MVAPICH ( I am installing on a cluster where all the apps are located in /share/apps).


FC=pgfortran F77=pgfortran FCFLAGS=-fast FFLAGS=-fast ./configure
--prefix=/share/apps/mvapich2-gpu
--enable-cuda
--with-cuda-include=/usr/local/cuda/include
--with-cuda-libpath=/usr/local/cuda/lib64

The next steps are to run "make" and then "make install" ( for this last step, depending on the location of the installed software, you may need to have root privileges). You will also need to add the location of the binaries ( in this case /share/apps/mvapich2-gpu/bin ) to your path.

We are now ready to write a CUDA Fortran code that uses MPI to transfer data between two GPUs. Each process initializes two arrays a_d and b_d, fill them with some values depending on the rank. Then, processor 0 sends a_d to processor 1. After 1 receives the data in b_d transfer the results back to the host array a and print the values.

program mpi_test_gpu
use mpi
integer, allocatable:: a(:)
integer, device,allocatable:: a_d(:),b_d(:)
integer:: N, ierr, rank, num_procs, status(MPI_Status_size)

call MPI_Init (ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierr)

N=4
allocate (a(N),a_d(N),b_d(N))
a_d=(rank+1)*10
b_d=(rank-1)*100

a=-999
if ( rank == 0) then
call MPI_Send(a_d,N,MPI_INT,1,0,MPI_COMM_WORLD, ierr)
else
call MPI_Recv(b_d,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)
end if

if (rank == 1) a=b_d

print *,"Rank=",rank,"A=",a

deallocate (a,a_d,b_d)

call MPI_Finalize ( ierr )
end program mpi_test_gpu

If the code is in a file with name mpi_test_gpu.cuf, we can generate an executable with the following command:

mpif90 -O3 -o mpi_test_gpu mpi_test_gpu.cuf

We are now ready to run with the command mpirun_rsh. We need to pass a special flag, MV2_USE_CUDA=1, to enable the new GPU path ( or you can add
export MV2_USE_CUDA=1 to your .bashrc to avoid to type it every time).
We are going to use two nodes, c0-0 and c0-1, connected by Infiniband.

mpirun_rsh -np 2 c0-0 c0-1 MV2_USE_CUDA=1 ./mpi_test_gpu
Rank= 0 A= -999 -999 -999 -999
Rank= 1 A= 10 10 10 10

As expected, rank 1 contains the values 10, that was the value initially stored in a_d on rank 0.
MVAPICH also allows to send data from GPU to host memory and vice versa.
For example we could replace the lines:

! Receive data to GPU array b_d from processor 0
call MPI_Recv(b_d,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)
...
! Copy GPU array b_d to CPU array a
if (rank == 1) a=b_d

directly with
! Receive data to CPU array a from processor 0
call MPI_Recv(a,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)

Tuesday, August 16, 2011

CUDA, MPI and Infiniband

There is a lot of confusion around MPI codes that are using GPUs and Infiniband and what needs to be done to fix some problems occurring from the interaction of the CUDA runtime and Infiniband software stack ( OFED and MPI).

Let's start with a simple program using 2 MPI processes that:

  • allocate data on the CPU and GPU
  • initialize the data on the CPU
  • copy the data on the GPU
  • transfer the host data from one process to the other

The code is going to report the bandwidth of the transfer to the GPU and the bandwidth achieved by the network.


#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <sys/time.h>
#include <mpi.h>

#define NREPEAT 10
#define NBYTES 10.e6

int main (int argc, char *argv[])
{
int rank, size, n, len, numbytes;
void *a_h, *a_d;
struct timeval time[2];
double bandwidth;
char name[MPI_MAX_PROCESSOR_NAME];
MPI_Status status;

MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);

MPI_Get_processor_name(name, &len);
printf("Process %d is on %s\n", rank, name);

printf("Using regular memory \n");
a_h = malloc(NBYTES);

cudaMalloc( (void **) &a_d, NBYTES);

/* Test host -> device bandwidth. */
MPI_Barrier(MPI_COMM_WORLD);

gettimeofday(&time[0], NULL);
for (n=0; n<NREPEAT; n )
{
cudaMemcpy(a_d, a_h, NBYTES, cudaMemcpyHostToDevice);
}
gettimeofday(&time[1], NULL);

bandwidth = time[1].tv_sec - time[0].tv_sec;
bandwidth = 1.e-6*(time[1].tv_usec - time[0].tv_usec);
bandwidth = NBYTES*NREPEAT/1.e6/bandwidth;

printf("Host->device bandwidth for process %d: %f MB/sec\n",rank,bandwidth);

/* Test MPI send/recv bandwidth. */
MPI_Barrier(MPI_COMM_WORLD);

gettimeofday(&time[0], NULL);
for (n=0; n<NREPEAT; n )
{
if (rank == 0)
MPI_Send(a_h, NBYTES/sizeof(int), MPI_INT, 1, 0, MPI_COMM_WORLD);
else
MPI_Recv(a_h, NBYTES/sizeof(int), MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}
gettimeofday(&time[1], NULL);

bandwidth = time[1].tv_sec - time[0].tv_sec;
bandwidth = 1.e-6*(time[1].tv_usec - time[0].tv_usec);
bandwidth = NBYTES*NREPEAT/1.e6/bandwidth;

if (rank == 0)
printf("MPI send/recv bandwidth: %f MB/sec\n", bandwidth);

cudaFree(a_d);
free(a_h);

MPI_Finalize();
return 0;
}


Since there are no CUDA kernels, there is no need to use nvcc. We can use mpicc ( that for the moment we assume has been compiled with gcc), taking care of indicating the directories for the CUDA include files and CUDA libraries:

mpicc -o mpi_malloc mpi_malloc.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart

Running this code on a cluster with nodes connected by QDR Infiniband adapters, will generate an output similar to this one:

#mpirun -np 2 -host c0-0,c0-1 mpi_malloc

Process 0 is on compute-0-0.local
Using regular memory
Process 1 is on compute-0-1.local
Using regular memory
Host->device bandwidth for process 0: 4699.248120 MB/sec
Host->device bandwidth for process 1: 4323.950361 MB/sec
MPI send/recv bandwidth: 2467.369044 MB/sec

Up to now, everything worked as expected. We were using a standard malloc to allocate the host memory. In order to improve the bandwidth of the PCI-e bus, and more important to allow overlap of transfers to/from the GPU with kernel executions, we would like to use pinned memory. The modifications to the previous code are minimal, the malloc calls need to be replaced by cudaMallocHost and the free calls by cudaFreeHost. After changing the code and recompiling, once we try to run it, we observe a problem. The code starts and from the initial prints we can see that the pinned memory is giving us an improvement in bandwidth, but it never completes.


#mpirun -np 2 -host c0-0,c0-1 mpi_pinned

Process 1 is on compute-0-1.local
Using pinned memory
Process 0 is on compute-0-0.local
Using pinned memory
Host->device bandwidth for process 0: 5927.330923 MB/sec
Host->device bandwidth for process 1: 5909.117769 MB/sec

If we attach a debugger to the process running on node c0-0, we will see that the code is stuck in MPI.




0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
3175 btl_openib_component.c: No such file or directory.
in btl_openib_component.c
(gdb) where
#0 0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
#1 0x00002b5172536394 in opal_progress () at runtime/opal_progress.c:207
#2 0x00002b51751335ce in mca_pml_ob1_send (buf=0x13365420, count=46912503140448, datatype=0x0, dst=1, tag=16000000,
sendmode=MCA_PML_BASE_SEND_SYNCHRONOUS, comm=0x6544a0) at pml_ob1_isend.c:125
#3 0x00002b51720520b3 in PMPI_Send (buf=0x13365420, count=-1424633760, type=0x0, dest=1, tag=16000000, comm=0x0) at psend.c:72
#4 0x0000000000404d1d in main () at ./mpi_pinned.c:69


Without going into details, the problem is caused by the way the CUDA runtime marks pages allocated with pinned memory and the way in which the Infiniband driver handles RDMA. At this point we have two solutions:

  1. Disable RDMA in MPI
  2. Make the Infiniband driver and CUDA runtime compatible
The first solution is very simple for OpenMPI, we just need to pass an additional flag ( -mca btl_openib_flags 1 )to mpirun at a cost of lower bandwidth for IB. Other MPI implementations will require a different switch or a recompilation with RDMA disabled

mpirun -np 2 -host c0-0,c0-1 -mca btl_openib_flags 1 mpi_pinned

Process 1 is on compute-0-1.local
Using pinned memory
Process 0 is on compute-0-0.local
Using pinned memory
Host->device bandwidth for process 0: 5907.023451 MB/sec
Host->device bandwidth for process 1: 5877.858109 MB/sec
MPI send/recv bandwidth: 2713.041591 MB/sec


Before CUDA 4.0, the second solution was to install GPU Direct, a patch for the Linux kernel and special NVIDIA and Mellanox drivers to eliminate the incompatibility.
With CUDA 4.0 we have a new option. There is an environment variable that if set to 1 change the internal behavior of the pinned memory allocation in the CUDA driver, removing the source of incompatibility with the Infiniband driver. If we set CUDA_NIC_INTEROP to 1 ( for example adding the line "export CUDA_NIC_INTEROP=1" to our .bashrc file) , if we try to run again the pinned version, we will see that the code is able to complete and we also get a better bandwidth since RDMA is now working.


mpirun -np 2 -host c0-0,c0-1 mpi_pinned

Process 0 is on compute-0-0.local
Using pinned memory
Process 1 is on compute-0-1.local
Using pinned memory
Host->device bandwidth for process 0: 5904.930617 MB/sec
Host->device bandwidth for process 1: 5901.445854 MB/sec
MPI send/recv bandwidth: 3150.300854 MB/sec

This solution works with all the MPI implementations out there and it is very simple to use. So, forget about GPU Direct 1.0 and use this new method!!!

Thursday, June 2, 2011

Calling Thrust from CUDA Fortran

CUDA 4.0 ships with the Thrust library, a standard template library for GPU that offers several useful algorithms ( sorting, prefix sum, reduction). In the previous post I explained how to configure CUDA Fortran to use the 4.0 toolkit. Now I am going to show how to call Thrust from CUDA Fortran, in particular how to sort an array.

On the Thrust web page, there are a lot of examples and documentation. The basic idea of Thrust is to have containers, that manage host and device memory and simplify data transfers, iterators, that act like pointers and keep track of memory spaces, and algorithms, that are applied to containers.

This is a simple Thrust code to sort an array of random data.

#include <thrust/host_vector.h >

#include <thrust/device_vector.h>

#include <thrust/sort.h>


int main(void) {

// define a vector of 16M int on the host

thrust::host_vector h_vec(1 << 24);


// generate 16M random numbers on the host

thrust::generate(h_vec.begin(), h_vec.end(), rand);


// transfer data to the device

thrust::device_vector d_vec = h_vec;


// sort data on the device

thrust::sort(d_vec.begin(), d_vec.end());


// transfer data back to host

thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

return 0;


An important feature, that we will use to call Thrust from CUDA Fortran, is the conversion of Thrust objects to raw pointers or vice versa. This Thrust code snippet will convert a device container to a standard C pointer that we could use to call a CUDA C kernel:

// allocate device vector

thrust::device_vector d_vec(4);

// obtain raw pointer to device vector’s memory

int * ptr = thrust::raw_pointer_cast(&d_vec[0]);


The basic idea is to write a wrapper to the Thrust algorithms that will handle standard C pointer and then use the Iso C Binding to call the wrapper. Since we want to sort an array, let's write a wrapper for the sort algorithm in Thrust.

// Filename: csort.cu
// nvcc -c -arch sm_13 csort.cu

#include <thrust/device_vector.h>

#include <thrust/device_vector.h>

#include <thrust/sort.h>


extern "C" {

//Sort for integer arrays

void sort_int_wrapper( int *data, int N)

{

// Wrap raw pointer with a device_ptr

thrust::device_ptr <int> dev_ptr(data);

// Use device_ptr in Thrust sort algorithm

thrust::sort(dev_ptr, dev_ptr+N);

}


//Sort for float arrays

void sort_float_wrapper( float *data, int N)

{

thrust::device_ptr <float> dev_ptr(data);

thrust::sort(dev_ptr, dev_ptr+N);

}


//Sort for double arrays

void sort_double_wrapper( double *data, int N)

{

thrust::device_ptr <double> dev_ptr(data);

thrust::sort(dev_ptr, dev_ptr+N);

}


}


We can compile the code using
nvcc -c -arch sm_13 csort.cu
This will generate an object file, csort.o that we will use later on in the linking stage of the CUDA Fortran code.

The other missing piece is the interface to these C functions.
We will define a generic interface thrustsort, that depending on the kind of data (integer, single precision or double precision) will call the correct sort function:

module thrust

interface thrustsort
subroutine sort_int( input,N) bind(C,name="sort_int_wrapper")
use iso_c_binding
integer(c_int),device:: input(*)
integer(c_int),value:: N
end subroutine

subroutine sort_float( input,N) bind(C,name="sort_float_wrapper")
use iso_c_binding
real(c_float),device:: input(*)
integer(c_int),value:: N
end subroutine

subroutine sort_double( input,N) bind(C,name="sort_double_wrapper")
use iso_c_binding
real(c_double),device:: input(*)
integer(c_int),value:: N
end subroutine

end interface

end module thrust

At this point we have all we need to write the CUDA Fortran code:

program testsort
use thrust
real, allocatable :: cpuData(:)
real, allocatable, device :: gpuData(:)
integer:: N=10
allocate(cpuData(N))
allocate(gpuData(N))

do i=1,N
cpuData(i)=random(i)
end do
cpuData(5)=100.

print *,"Before sorting", cpuData

gpuData=cpuData

call thrustsort(gpuData,size(gpuData))

cpuData=gpuData

print *,"After sorting", cpuData
end program

If we save the module in a file module_thrust.cuf and the program in simplesort.cuf, we are ready to compile and execute:

$ pgf90 -rc=rc4.0 -Mcuda=cc20 -O3 thrust_module.cuf sample_sort.cuf csort.o
thrust_module.cuf:
sample_sort.cuf:

$ ./a.out
Before sorting 4.1630346E-02 0.9124327 0.7832350 0.6540373
100.0000 0.3956419 0.2664442 0.1372465
8.0488138E-03 0.8788511

After sorting 8.0488138E-03 4.1630346E-02 0.1372465 0.2664442
0.3956419 0.6540373 0.7832350 0.8788511
0.9124327 100.0000

The code is very simple:
  • declare two arrays, cpuData and gpuData.
  • allocate them using the standard allocate
  • copy cpuData from the host to gpuData on the GPU with a simple assignment
  • call the Thrust sort routine
  • copy sorted array back to the host
  • print the sorted array
Now that we have verified that everything is working as expected, we can modify the code to do some timing using cudaEvents.

program timesort
use cudafor
use thrust
implicit none
real, allocatable :: cpuData(:)
real, allocatable, device :: gpuData(:)
integer:: i,N=100000000

! cuda events for elapsing
type ( cudaEvent ) :: startEvent , stopEvent
real :: time, random
integer :: istat

! Create events
istat = cudaEventCreate ( startEvent )
istat = cudaEventCreate ( stopEvent )

! Allocate arrays
allocate(cpuData(N))
allocate(gpuData(N))

do i=1,N
cpuData(i)=random(i)
end do

print *,"Sorting array of ",N, " single precision"

gpuData=cpuData

istat = cudaEventRecord ( startEvent , 0)
call thrustsort(gpuData,size(gpuData))

istat = cudaEventRecord ( stopEvent , 0)
istat = cudaEventSynchronize ( stopEvent )
istat = cudaEventElapsedTime ( time , startEvent , stopEvent )

cpuData=gpuData

print *," Sorted array in:",time," (ms)"

!Print the first five elements and the last five.
print *,"After sorting", cpuData(1:5),cpuData(N-4:N)
end program

We can sort a vector of 100M elements in .222 second on a Tesla M2050 with ECC on when the data are resident in GPU memory.

pgf90 -O3 -rc=rc4.0 -Mcuda=cc20 thrust_module.cuf time_sort.cuf csort.o -o time_sort
thrust_module.cuf:
time_sort.cuf:

$ ./time_sort
Sorting array of 100000000 single precision
Sorted array in: 222.1711 (ms)
After sorting 7.0585919E-09 1.0318221E-08 1.9398616E-08 3.1738640E-08
4.4078664E-08 0.9999999 0.9999999 1.000000
1.000000 1.000000
./a.out
Sorting array of 100000000 single precision
Sorted array in: 225.0452 (ms)
After sorting 7.0585919E-09 1.0318221E-08 1.9398616E-08 3.1738640E-08
4.4078664E-08 0.9999999 0.9999999 0.9999999
1.000000 1.000000

Tuesday, May 10, 2011

Using CUDA 4.0 from CUDA Fortran

It is possible to use CUDA 4.0 RC2 with CUDA Fortran.

Assuming that the CUDA 4.0 toolkit is installed in the location /usr/local/cuda, you will need to create a file rc4.0 containing the following lines:

set CUDAROOT=/usr/local/cuda;
set CUDAVERSION=4.0;


When you compile your .cuf files, you will need to pass this rc file with the -rc flag and add the -L flag if you are using libraries from the 4.0 toolking

pgf90 -rc=rc4.0 -Mcuda=cc20,nofma myfile.cuf -L/usr/local/cuda/lib64 -lcufft -lcurand


You can check if the compiler is picking up the new toolkit running ldd on the executable.