Wednesday, November 21, 2012

Using Thrust on CARMA


Thrust is an excellent library for CUDA development.
Unfortunately, Thrust is not present in the CARMA Toolkit, but it is easy to install.

On the x86 development system, we are going to pull down the latest Thrust source using git.
If git is not installed, we can easily add it to the system with:

  sudo apt-get install git

and then clone the git repository:

  git clone https://github.com/thrust/thrust.git


We are now ready to cross-compile. Remember that Thrust is a template library: everything is built from include files.
Using our standard Makefile, we just need to add the directory that contains the Thrust include files (in this case /home/ubuntu/thrust).
We also want to restrict the code generation to arch sm_21 (the CARMA kit has a Quadro 1000M GPU with compute capability 2.1) to reduce the compilation time.
We are going to use one of the examples shipping with Thrust, monte_carlo.cu:

############################
#  Makefile for cross-compile #
############################
all : monte_carlo

CUDA_HOME=/usr/local/cuda
CC=/usr/bin/arm-linux-gnueabi-gcc
NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32
THRUST_LOC=/home/ubuntu/thrust

# Recipe lines must start with a tab character, not spaces
monte_carlo : monte_carlo.cu
	$(NVCC) -O3 -arch sm_21 -o monte_carlo -I$(THRUST_LOC) monte_carlo.cu

clean:
	rm -f monte_carlo

Once we generate the executable, we can copy it to the CARMA

  scp monte_carlo ubuntu@carma:~

and execute it. We will see the number pi printed with 2 digits (3.14).
If you want to see more digits, you can change the source code and set the precision to 6 instead of the original 2:

  std::cout << std::setprecision(6);


Monday, October 29, 2012

Setting up a CARMA kit

I just received a brand new CARMA kit and I am going to post all the steps I took to get a working set-up.

Let's start with the x86 development system. I am using a virtual machine on my Mac as my development system.

I started by installing a fresh Ubuntu 11.04 distro and then proceeded to:
  • Update the packages: 
    • sudo apt-get update
  • Install the basic developer tools: 
    • sudo apt-get install build-essential
  • Install the 32-bit development libraries (CARMA is 32-bit):
    • sudo apt-get install ia32-libs
  • Install the ARM cross compilers: 
    • sudo apt-get install gcc-4.5-arm-linux-gnueabi g++-4.5-arm-linux-gnueabi
  • Install Fortran for both x86 and ARM (real developers use Fortran....):
    • sudo apt-get install gfortran-4.5-*
  • Install the CUDA Toolkit (available from http://www.seco.com/carmakit under the downloads tab): 
    • sudo sh cuda-linux-ARMv7-rel-4.2.10-13489154.run
  • Edit .bashrc to add nvcc to the path. With your favorite editor add a line at the end of the file:
    • export PATH=/usr/local/cuda/bin:$PATH
  • Source the .bashrc to refresh the path (it will be executed automatically the next time you log in or open a terminal):
    • . .bashrc
We can check that nvcc is now in our path by invoking the compiler with the -V flag, which prints the version:


max@ubuntu:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Tue_Jul_17_14:48:12_PDT_2012
Cuda compilation tools, release 4.2, V0.2.1221
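If you script this set-up, the .bashrc edit is the step most likely to be repeated by accident. Here is a minimal sketch of an idempotent version. It assumes the toolkit was installed in the default /usr/local/cuda location used above, and add_cuda_path is a hypothetical helper name, not something shipped with the toolkit.

```shell
#!/bin/sh
# add_cuda_path is a hypothetical helper, not part of the CUDA Toolkit.
# It appends the PATH line to a shell startup file only if the exact
# line is not already there, so running it twice is harmless.
add_cuda_path() {
    rc_file="$1"
    # Single quotes keep $PATH literal; it is expanded when the file is sourced.
    line='export PATH=/usr/local/cuda/bin:$PATH'
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -Fxq "$line" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}

# Example: add_cuda_path "$HOME/.bashrc" && . "$HOME/.bashrc"
```

After sourcing the file, nvcc -V should report the same version as above.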

We are now ready to compile our first CUDA code, a comparison between multiplications on CPU and GPU.


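#include <stdio.h>

// Multiply the value stored on the device by 1.02 (the argument i is unused).
__global__ void kernel(int i, float *d_n)
{
  *d_n *= 1.02f;
}

int main()
{
  float n = 1.0f, *d_n;   // n: value updated on the GPU, d_n: its device copy
  float n_ref = 1.0f;     // reference value updated on the CPU
  int i;
  cudaMalloc((void **)&d_n, sizeof(float));
  for (i = 1; i <= 10; i++)
  {
    cudaMemcpy(d_n, &n, sizeof(float), cudaMemcpyHostToDevice);
    kernel <<< 1, 1 >>> (i, d_n);
    cudaMemcpy(&n, d_n, sizeof(float), cudaMemcpyDeviceToHost);
    // Print the iteration, the GPU result and the CPU result
    printf("%d\t\t%42.41f\t%42.41f\n", i, n, n_ref *= 1.02f);
  }
  cudaFree(d_n);
  return 0;
}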


We are going to use a Makefile similar to the one posted in the previous blog post.


max@ubuntu:~$ cat Makefile 
############################
#  Makefile for cross-compile #
############################
all : gpu_test

CUDA_HOME=/usr/local/cuda
CC=/usr/bin/arm-linux-gnueabi-gcc
NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32

# Recipe lines must start with a tab character, not spaces
gpu_test : gpu_test.cu
	$(NVCC) -o gpu_test gpu_test.cu

clean:
	rm -f gpu_test



When we type make, we should see output similar to this:


max@ubuntu:~$ make
/usr/local/cuda/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32  -o gpu_test gpu_test.cu 
/usr/lib/gcc/arm-linux-gnueabi/4.5.2/../../../../arm-linux-gnueabi/bin/ld: warning: libc.so, needed by /usr/arm-linux-gnueabi/lib//libgcc_s.so.1, not found (try using -rpath or -rpath-link)



Don't worry about the warning. It is caused by a bogus DT_NEEDED entry in the shared libgcc file /usr/arm-linux-gnueabi/lib/libgcc_s.so.1, as "readelf -a" shows:
 0x00000001 (NEEDED) Shared library: [libc.so]

Before we can use the machine for any real CUDA development, there is an extra step that we need to perform. The CUDA Toolkit is missing libcuda.so (on the x86 platform it usually comes with the driver; don't ask me why it was not included in the ARM toolkit), and we will not be able to link any CUDA code until we bring this library over to the x86 machine. We will do this step once we have the CARMA up and running.


Unpack the CARMA, plug in the keyboard and mouse, and connect the HDMI cable to the middle connector.
Plug in the power and ethernet cables and you are ready to go.
The first boot may be slow because the system is building the NVIDIA driver. It is a blind boot: there is no console output until the GUI comes up, so you need to have a little bit of patience.

Once the CARMA system boots, it will auto-login and start a terminal. It should also pick up an IP address (use ifconfig to find out the IP). The default username/password is ubuntu/ubuntu.

We are ready to check if our cross-compilation worked.
From inside the virtual machine, we will copy the file gpu_test to the CARMA (ifconfig is reporting 172.16.174.185):

   scp gpu_test ubuntu@172.16.174.185:~

Either from the CARMA terminal or from a remote shell, we can run gpu_test and check that the CPU and GPU results are the same.

ubuntu@tegra-ubuntu:~$ ./gpu_test 
1 1.01999998092651367187500000000000000000000 1.01999998092651367187500000000000000000000
2 1.04039990901947021484375000000000000000000 1.04039990901947021484375000000000000000000
3 1.06120789051055908203125000000000000000000 1.06120789051055908203125000000000000000000
4 1.08243203163146972656250000000000000000000 1.08243203163146972656250000000000000000000
5 1.10408067703247070312500000000000000000000 1.10408067703247070312500000000000000000000
6 1.12616229057312011718750000000000000000000 1.12616229057312011718750000000000000000000
7 1.14868545532226562500000000000000000000000 1.14868545532226562500000000000000000000000
8 1.17165911197662353515625000000000000000000 1.17165911197662353515625000000000000000000
9 1.19509232044219970703125000000000000000000 1.19509232044219970703125000000000000000000
10 1.21899414062500000000000000000000000000000 1.21899414062500000000000000000000000000000
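Since every test follows the same copy-then-run pattern, it can be wrapped in a small script on the x86 side. This is only a sketch: deploy_and_run and run_cmd are hypothetical helper names, it assumes the default ubuntu account on the board, and it assumes that ssh access to the CARMA works the same way scp does. Setting DRY_RUN=1 prints the commands without executing them.

```shell
#!/bin/sh
# deploy_and_run and run_cmd are hypothetical helper names, not CARMA tools.

# Execute a command, or only print it when DRY_RUN=1.
run_cmd() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy a cross-compiled binary to the board and run it there.
#   $1: IP address or hostname of the CARMA
#   $2: path of the ARM executable on the x86 machine
deploy_and_run() {
    host="$1"
    bin="$2"
    run_cmd scp "$bin" "ubuntu@$host:~" &&
        run_cmd ssh "ubuntu@$host" "./$(basename "$bin")"
}

# Example: deploy_and_run 172.16.174.185 gpu_test
```

Running DRY_RUN=1 deploy_and_run 172.16.174.185 gpu_test first is a cheap way to check the commands before touching the board.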

The CARMA filesystem is quite bare, so let's add a few useful packages:
  • Install Fortran:
    • sudo apt-get install gfortran
We need to install OpenMPI from source because the default packages don't seem to work.
The latest source (1.6.2) has support for ARM. The installation is very simple, but it will take a while.

Get the latest stable version:
wget http://www.open-mpi.org/software/ompi/v1.6/downloads/openmpi-1.6.2.tar.gz

Unpack it (tar xvfz openmpi-1.6.2.tar.gz) and change into the directory (cd openmpi-1.6.2).

We are now ready to build and install:
./configure
sudo make -j 4 install

Add /usr/local/bin to your PATH and /usr/local/lib to your LD_LIBRARY_PATH
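As a sketch, these are the lines I would append to ~/.bashrc on the CARMA, assuming OpenMPI was installed under the default /usr/local prefix (which is what ./configure uses when no --prefix is given). The ${VAR:+...} form avoids a trailing colon when LD_LIBRARY_PATH starts out empty; an empty entry would make the loader search the current directory.

```shell
# OpenMPI installed from source under /usr/local (default prefix assumed)
export PATH=/usr/local/bin:$PATH
# Append the old value only if it is set and non-empty
export LD_LIBRARY_PATH=/usr/local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

# Quick sanity check: report the OpenMPI version if mpirun is reachable
if command -v mpirun >/dev/null 2>&1; then
    mpirun --version
else
    echo "mpirun not found in PATH" >&2
fi
```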





Sunday, September 30, 2012

Compiling for CARMA

In a few days, CARMA will finally be available to the general public. If you are not familiar with the CARMA project, it is the first ARM platform supporting CUDA.
It has a Tegra 3 with 4 cores and 2 GB of memory, ethernet, USB ports and a Quadro 1000M GPU (GF108 with 2 GB of memory, 96 CUDA cores, compute capability 2.1).
It has full OpenGL and CUDA support, but at the moment, no CUDA compiler.

The developer needs to cross-compile from a Linux x86 machine. This post shows how easy it is to cross-compile once we follow some simple instructions. I strongly suggest that you start with an Ubuntu machine, since the cross-compilers are readily available on this platform.

The first thing to do is to install the cross-compilers:

sudo apt-get install g++-arm-linux-gnueabi gcc-arm-linux-gnueabi

At this point, we will have the cross-compilers installed under /usr/bin/arm-linux-gnueabi-gcc and /usr/bin/arm-linux-gnueabi-g++.

The second step is to install the CUDA Toolkit for ARM on the x86. If you choose the default location,
the installer will create a directory /usr/local/cuda.

If you need to use other libraries for ARM, you will also need to copy the libraries and corresponding header files from CARMA to the x86 machine. You can place them under /usr/local/arm_lib and /usr/local/arm_include, or you can just put them under /usr/local/cuda/lib and /usr/local/cuda/include (I prefer the first option, which avoids polluting the CUDA installation).

We are now ready to compile our code, taking care to use the cross-compiler and the special nvcc in the CARMA toolkit. The following Makefile shows how to compile a simple C++ code that calls a CUBLAS function and a simple CUDA code.


############################
#  Makefile for cross-compile #
############################
all : dgemm_cublas simple_cuda

CUDA_HOME=/usr/local/cuda
CXX=/usr/bin/arm-linux-gnueabi-g++
NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32


# Recipe lines must start with a tab character, not spaces
# For a standard C++ code, we use the ARM g++ cross-compiler and the CUDA ARM libraries
dgemm_cublas : gemm_test.cpp
	$(CXX) gemm_test.cpp -I$(CUDA_HOME)/include -o dgemm_cublas -L$(CUDA_HOME)/lib -lcudart -lcublas

# For a standard CUDA code, we just invoke nvcc
simple_cuda: file.cu
	$(NVCC) -o simple_cuda file.cu

clean :
	rm -f *.o dgemm_cublas simple_cuda


Once we generate the executables, we will not be able to run them on the x86 machine because they are built for ARM. We first need to move them to the CARMA.