Friday, November 27, 2015

Building TensorFlow for Jetson TK1

Google recently released TensorFlow, an open source software library for numerical computation using data flow graphs. 

TensorFlow has a GPU  backend built on CUDA, so I wanted to install it on a Jetson TK1. Even if the system did not meet the requirements ( CUDA 7.0 is not available and the GPU is a compute capability 3.2), I decided to give it a try anyway.  This blog reports all the steps required to build TensorFlow from source, it is quite challenging but it can be done. Including all the prerequisites, the whole build will take several hours ( if you just want to try Tensorflow, you can download the wheel file I generated and do a pip install. The file is at https://drive.google.com/file/d/0B1uGKNpQ7xNqZ2pvSmc3SlZJS2c/view?usp=sharing ). 

TensorFlow is under active development and the coding is using a lot of advanced C++ features that really push the compiler, these instructions worked with the version available on 11/26 but new 

The first challenge is to build Bazel, another software developed at Google used as building system for TensorFlow. Bazel requires a protobuf version newer than the one presents in the Ubuntu 14.04 repos, so the first step will be to install protobuf 3 from source, since there are no prebuilt binary for ARM32.

Java 8:
The first step is to install Java8, but this is quite simple since Oracle provides a package:

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

Protobuf:
In order to build protobuf and bazel, we will need several other packages. The exact list will  depend on the status of your Jetson,  but you will need at least these ones:

$ sudo apt-get install git zip unzip autoconf automake libtool curl zlib1g-dev  

After downloading  the latest source from github:

$ git clone https://github.com/google/protobuf.git

you need to first generate the configuration file and then run make:

$ cd protobuf
$ ./autogen.sh 
$ ./configure --prefix=/usr
$ make -j 4
$ sudo make install

Protoc will be installed in /usr/lib and /usr/bin, this will be important when we run bazel since it tries to use a sandbox and will not find the libraries in /usr/local/lib.

You should see this output, if you have followed all the steps:

ubuntu@tegra-ubuntu:~/protobuf$ protoc --version
libprotoc 3.0.0

We also need to build the java interface for protobuf, that will require Maven.
Luckily maven is available from the repos, so we can just issue a:

$ sudo apt-get install maven

Go to the subdirectory java inside protobuf and type:
$ mvn package

Once the build is completes, there will be  a protobuf-java-3.0.0-beta-1.jar inside the target subdirectory.

Bazel:

We are now ready to tackle Bazel.
The first step is to download the source code for Bazel ( using the 0.1.0 version, that it is known to work with Tensorflow). 

$ git clone https://github.com/bazelbuild/bazel.git
$ cd bazel
$git checkout tags/0.1.0

Before compiling, we need to copy the protoc binary we just built as third_party/protobuf/protoc-linux-arm32.exe.
We also need to copy the jar file from protobuf in the same directory. Bazel is expecting an alpha-3 version, but we have built a  beta-1.
There is probably a better way of doing this, but just copying the file and rename it did the trick for me.


$ cp /usr/bin/protoc   third_party/protobuf/protoc-linux-arm32.exe
$ cp ~/protobuf/java/target/protobuf-java-3.0.0-beta-1.jar  third_party/protobuf/protobuf-java-3.0.0-alpha-3.jar

We are now ready to compile bazel. 

$ ./compile.sh

At the end of the compilation, the bazel binary will be in the output directory. You can add this directory
to your path or copy the binary in /usr/local/bin

TensorFlow

We are now ready to tackle the TensorFlow build for GPU. Just be sure to have CUDA 6.5 and CUDNN 6.5 installed on your Jetson TK1. 
You will also need some files from the CUDA 7.0 package ( cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb ) that you can download from
the NVIDIA web site ( it is the one for Jetson TX1).
While Jetson TK1 cannot run the 7.0 runtime, since the driver shipped with the system does not support it, it is still  possible to run the CUDA 7.0 compiler. We need the 7.0 compiler because some of the TensorFlow source files will generate an internal compiler error with the 6.5 nvcc. 
All the libraries and runtime will be the standard 6.5 ones. 


On my system I have also enabled some swap space. You can plug a USB memory stick,  create a swap file and mount it with
$ sudo mkswap /dev/sda
$ sudo swapon /dev/sda 

The first step to build TensorFlow is to clone the github repository:
$ git clone -recurse-submodules https://github.com/tensorflow/tensorflow 

and install other dependencies:
$ sudo apt-get install python-numpy swig python-dev

TensorFlow is expecting a 64bit system and has a bunch of library paths and libraries hard-coded in the files.
Before starting the installation, we will need to modify several files.  We will need to change all the reference from lib64 to lib and change the 7.0 libraries to 6.5.  We can find all the files with the strings and apply all the changes with these commands:

$ cd tensorflow
$ grep -Rl "lib64"| xargs sed -i 's/lib64/lib/g'
$ grep -Rl "so.7.0"| xargs sed -i 's/so\.7\.0/so\.6\.5/g'


TensorFlow officially supports Cuda devices with 3.5 and 5.2 compute capabilities. We want to target a gpu with compute capabilities 3.2. 
This can be done through TensorFlow unofficial settings with "configure" via the TF_UNOFFICIAL_SETTING variable.
When prompted, specify that you only want a 3.2 compute capability device.

$ TF_UNOFFICIAL_SETTING=1 ./configure

# Same as the official settings above

WARNING: You are configuring unofficial settings in TensorFlow. Because some
external libraries are not backward compatible, these settings are largely
untested and unsupported.

Please specify a list of comma-separated Cuda compute capabilities you want to
build with. You can find the compute capability of your device at:
https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases
your build time and binary size. [Default is: "3.5,5.2"]: 3.2

Setting up Cuda include
Setting up Cuda lib
Setting up Cuda bin
Setting up Cuda nvvm
Configuration finished


After the configure, bazel has copied or symlinked all the binaries and libraries needed for the build in  the third_party/gpus/cuda subdirectory .
It is now time to replace the cuda compiler with the one from the 7.0 toolchain.

We want to extract (not install) the files from the  cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb package with the following commands:

$ dpkg -x cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb /tmp/cuda_repo
$ cd /tmp/cuda_repo/var/cuda-repo-7-0-local
$ dpkg -x cuda-core-7-0_7.0-71_armhf.deb /tmp/cuda7.0
$ rm -fr /tmp/cuda_repo

$ cd ~tensorflow/third_party/gpus/cuda
$ rm -fr bin nvvm
$ cp -R  /tmp/cuda7.0/usr/local/cuda-7.0/bin bin
$ cp -R /tmp/cuda7.0/usr/local/cuda-7.0/nvvm nvvm
$ rm -fr /tmp/cuda7.0

At this point, bazel is ready to use the 7.0 toolchain to compile Tensorflow.

We still need to add the ARM target to the build. 
This can be done adding the following lines to the third_party/gpus/crosstool/CROSSTOOL file:

default_toolchain {
  cpu: "arm"
  toolchain_identifier: "local_linux"
}                                                                                                                                                                                                                                                

Before starting the build, we need to edit few files to avoid compiler crashes and avoid double instantiations 
(on ARM v7, Eigen::DenseIndex is  typedefed to int):

third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorDimensions.h
tensorflow/core/kernels/conv_ops_gpu_2.cu.cc
tensorflow/core/kernels/conv_ops_gpu_3.cu.cc
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
tensorflow/core/kernels/adjust_contrast_op.h


third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorDimensions.h:  
the compiler is crashing when evaluating the code inside the ifdef at line 312. We can just take the alternative path.
         Change line 312 to something like:
#ifdef EIGEN_HAS_VARIADIC_TEMPLATES_TK1

tensorflow/core/kernels/conv_ops_gpu_2.cu.cc:
To avoid double instantiation, guard the second functor for InflateAnsShuffle with:
/* On ARMv7 Eigen::DenseIndex is typedefed to int */
#ifndef __arm__
template struct functor::InflatePadAndShuffle
                                              Eigen::DenseIndex>;
#endif
We also need to comment the tensor.h include ( will crash the compiler)
//#include "tensorflow/core/public/tensor.h"

tensorflow/core/kernels/conv_ops_gpu_3.cu.cc:
To avoid double instantiation, guard the second functor  for ShuffleAndReverse with:
/* On ARMv7 Eigen::DenseIndex is typedefed to int */
#ifndef __arm__
template struct functor::ShuffleAndReverse
                                           Eigen::DenseIndex>;
#endif

tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:
ARMv7 has no numa_node file. It should return 0 not -1, otherwise TensorFlow will crash at runtime:
FILE *file = fopen(filename.c_str(), "r");
  if (file == nullptr) {
    LOG(ERROR) << "could not open file to read NUMA node: " << filename;
#ifdef __arm__
    // There is no numa_node on Jetson TK1
    return 0;
#else
    return kUnknownNumaNode;
#endif


tensorflow/core/kernels/adjust_contrast_op.h:
The compiler is crashing on some initializations, we need to rewrite them in a simpler way:

//MF Eigen::array scalar_broadcast{{batch, height, width, channels}};
    Eigen::array scalar_broadcast;
    scalar_broadcast[0] = batch;
    scalar_broadcast[1] = height;
    scalar_broadcast[2] = width;
    scalar_broadcast[3] = channels;
#if !defined(EIGEN_HAS_INDEX_LIST)
//MF Eigen::array reduction_axis{{1, 2}};
//MF Eigen::array scalar{{1, 1, 1, 1}};
//MF Eigen::array broadcast_dims{{1, height, width, 1}};
//MF Eigen::Tensor::Dimensions reshape_dims{{batch, 1, 1, channels}};
     Eigen::array reduction_axis;
      reduction_axis[0]=1;
      reduction_axis[1]=2;
     Eigen::array scalar;
      scalar[0]=1;
      scalar[1]=1;
      scalar[2]=1;
      scalar[3]=1;
     Eigen::array broadcast_dims;
      broadcast_dims[0]=1;
      broadcast_dims[1]=height;
      broadcast_dims[2]=width;
      broadcast_dims[3]=1;
     Eigen::Tensor::Dimensions reshape_dims;
      reshape_dims[0]=batch;
      reshape_dims[1]=1;
      reshape_dims[2]=1;
      reshape_dims[3]=channels;
#else

The source code is now ready. Jeston TK1 has only 2GB of memory and bazel will try to compile several files at the same time.
We want to avoid this, so we will pass a local_resource flag that will use only 2GB and half core (don't ask, if you specify one it will still try
to compile two files at the same time). This build will take a long time:

$bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/cc:tutorials_example_trainer

If you get some failures during the build, keep trying, bazel scheduling seems to be non-deterministic and the Tensorflow code is really stressing the
compiler.

Once the build is completed, we can test the code:

$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

You should see a similar output:

# Lots of output. This tutorial iteratively calculates the major eigenvalue of
# a 2x2 matrix, on GPU. The last few lines look like this.
000009/000005 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000006/000001 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000009/000009 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]


We are now ready to create the pip package and install it:
# To build with GPU support:
$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package

$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

# The name of the .whl file will depend on your platform.
$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.5.0-cp27-none-linux_armv7l.whl

Congratulation, TensorFlow is now installed on your system.

We can also try a more interesting example of image classification:
bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/examples/label_image/...

$ wget https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip -O tensorflow/examples/label_image/data/inception5h.zip
$ unzip tensorflow/examples/label_image/data/inception5h.zip -d tensorflow/examples/label_image/data/
$ mv tensorflow/examples/label_image/data/tensorflow_inception_graph.pb tensorflow/examples/label_image/data/googlenet_graph.pb
$ mv tensorflow/examples/label_image/data/imagenet_comp_graph_label_strings.txt tensorflow/examples/label_image/data/googlenet_labels.txt 

And run it with:
$ bazel-bin/tensorflow/examples/label_image/label_image
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 1
E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:890] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties: 
name: GK20A
major: 3 minor: 2 memoryClockRate (GHz) 0.852
pciBusID 0000:00:00.0
Total memory: 1.85GiB
Free memory: 218.46MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Allocating 18.46MiB bytes.
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:52] GPU 0 memory begins at 0xa45ea000 extends to 0xa585f000
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 1.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 2.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 4.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 8.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 16.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 32.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 64.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 128.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 256.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 512.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 1.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 2.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 4.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 8.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 16.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 32.00MiB
I tensorflow/core/common_runtime/direct_session.cc:60] Direct session inter op parallelism threads: 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)
I tensorflow/examples/label_image/main.cc:221] military uniform (866): 0.902268
I tensorflow/examples/label_image/main.cc:221] bow tie (817): 0.05407
I tensorflow/examples/label_image/main.cc:221] suit (794): 0.0113196
I tensorflow/examples/label_image/main.cc:221] bulletproof vest (833): 0.0100269
I tensorflow/examples/label_image/main.cc:221] bearskin (849): 0.00649747




Saturday, October 12, 2013

CUDA 5.5 and Xcode 5

The latest Xcode 5 update seems to have broken nvcc.
If you try to compile a  CUDA program, you will see a similar error:

%nvcc -c qr.cu
clang: error: unsupported option '-dumpspecs'
clang: error: no input files

There is a simple workaround.

%nvcc -ccbin=/usr/bin/clang -c qr.cu

A more convenient way of adding this, it is to define an alias for nvcc.
You can add this line to your .bash_profile

alias nvcc='nvcc -ccbin=/usr/bin/clang'

or just define it in your shell,

alias 'nvcc=nvcc -ccbin=/usr/bin/clang'



Wednesday, September 11, 2013

Calling CUDA Fortran kernels from MATLAB

The latest MATLAB versions, starting from 2010b,  have a very cool feature that enables calling CUDA C kernels from MATLAB code.
This is much better and simpler than writing MEX files to call CUDA code ( being the original author of the first CUDA MEX files and of the NVIDIA white-paper, I am speaking from experience) and it is a very powerful tool.

Let's take a very simple CUDA C code, add.cu,  that adds a scalar to a vector:


__global__ void add(double * in, double a, int N)  {
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < N) {
       in[idx] += a;
   }
}

The full documentation is available at
http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html
I am just going to summarize the required steps:

  • Generate a PTX  file from the kernel source
    • nvcc -ptx -arch sm_20 add.cu
  • Construct the kernel object from the PTX file
    • k=parallel.gpu.CUDAKernel('add.ptx','add.cu');
  • Set up the block and grid configuration, for example 28 blocks of 256 threads each:
    • k.ThreadBlockSize=[256 1 1]
    • k.GridSize=[28 1 1]
  • Execute the kernel.
    • o = feval(k,rand(10,1),2.,10)
    • The gpu array o contains the output of the kernel

It is possible to do the same with CUDA Fortran.
First of all, we will need to rewrite the code in CUDA Fortran (shameless plug, if you want
to learn more about  CUDA Fortran there is a very good book you can pre-order from Amazon,
"CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming"). This is the equivalent code :

attributes(global) subroutine add(a, b, N)
    implicit none
    double precision, intent(inout) :: a(*)
    double precision, value :: b
    integer , value :: N
    integer :: i

    i = threadIdx%x+(blockIdx%x-1)*blockDim%x
    if ( i <=N) a(i) = a(i)+b

 end subroutine add

For the generation of the PTX file, instead of invoking nvcc, we will call pgf90 with the right
flags to generate the PTX file:

               pgf90 -c -Mcuda=keepptx,cc20  addf.cuf
The keepptx flag will generate the PTX file for compute capabilities 2.0, addf.n001.ptx.
If the compute capabilities are missing or if you specify multiple targets, the PGI compiler will generate different PTX files,  you will need to inspect the ptx files to check the compute capabilities, the ordering is just an enumeration. We can perform this step from a OS shell or from inside MATLAB.
In order to invoke the compiler from the MATLAB prompt, we need to load the proper bash variables issuing the command:

               setenv('BASH_ENV','~/.bash_profile');

and then invoking the pgf90 invocation preceded by an exclamation point. The exclamation point indicates that the rest of the input line is issued as a command to the operating system.

               !pgf90 -c -Mcuda=keepptx,cc20  addf.cuf


In order to load the PTX file in MATLAB, we need to slightly change the syntax.
When loading the PTX file generated by CUDA C, we were passing both the PTX file name and
the original CUDA C file. In this way, MATLAB will automatically discover the prototype of the function. There are other ways, in which we explicitly pass the prototype signature to parallel.gpu.CUDAKernel. 

This is what we need to load the PTX file generated from CUDA Fortran.

       kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ');

Once we have created the kernel object kf, the calling sequence is the same one we used before.
We will set up the block and grid configuration, for example 28 blocks of 256 threads each:

    • kf.ThreadBlockSize=[256 1 1]
    • kf.GridSize=[28 1 1]
and execute the kernel.
    • of = feval(kf,rand(10,1),2.,10)

This is the full sequence of the MATLAB code with a verbose output to check all the intermediate steps:

% Create a 1D array of doubles with 10 elements
i1=gpuArray(rand(10,1))
% Create the kernel object from the PTX file with explicit prototype
kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ')
% Set execution configuration
kf.ThreadBlockSize=[256 1 1]
kf.GridSize=[28 1 1]
% Execute the kernel
of=feval(kf,i1,10.,10)


An important point for the CUDA Fortran kernels is that you cannot use Fortran assumed-shape arguments, which require the compiler to build and pass the descriptor as an extra argument.


Now that we understand all the steps, let's move to something more complex and discuss few more points.
We are going to implement a kernel to compute the sum of an array using a single pass with atomic lock
( the implementation and accuracy of parallel sum are discussed in details in Chapter 5 of the before mentioned book).
The kernel is embedded in a module, since we are using a global variable for the lock. 
There is no limitation in the number of elements that the routine can handle, aside from the fact that we are using 32 bit size integers 
for the addressing , each thread will process multiple data if needed.

This is the code:

module sumgpu

  implicit none
  integer, parameter :: fp_kind = kind(0.0d0)
  integer, device::  lock=0

contains

  attributes(global) subroutine sum(input,totalsum,N)
    real(fp_kind), intent(in) :: input(N)
    real(fp_kind) :: totalsum(1)
    integer,value :: N
    real(fp_kind), shared, dimension(256) :: psum
    integer :: i,index, inext
    real(fp_kind) :: lsum

    index=threadIdx%x+(BlockIdx%x-1)*BlockDim%x

    lsum = 0._fp_kind
    do i=index,N,BlockDim%x*GridDim%x
       lsum = lsum+ input(i)
    end do

    ! Local reduction per block
    index=threadIdx%x

    psum(index)=lsum
    call syncthreads()

    inext=blockDim%x/2
    do while ( inext >=1 )
       if (index <=inext) psum(index)=psum(index)+psum(index+inext)
       inext = inext /2
       call syncthreads()
    end do

    ! Final reduction among block with atomic lock
    if (index == 1) then
       do while ( atomiccas(lock,0,1) == 1)
       end do
       totalsum(1)=totalsum(1)+psum(1)
       call threadfence()
       lock =0
    end if

  end subroutine sum

end module sumgpu

If we generate and load the module as seen before, we can observe the following:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')

kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 2
       NumRHSArguments: 3
         ArgumentTypes: {'inout double vector'  'inout double vector'  'in int32 scalar'}


The entry point is now sumgpu_sum_, even if the subroutine was named sum. This is a consequence of being embedded in a module.
When  the CUDA Fortran compiler generate the PTX file, it renames the subroutine entry as a concatenation of the module name, the subroutine name and a trailing underscore.
While this is not important when the module contains a single subroutine, it is  crucial for situations in which multiple entry points are defined. If the module had  multiple subroutines,  we would have received an error when trying to load the PTX file:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')
Error using handleKernelArgs (line 61)
Found more than one entry point in the PTX code.  Possible names are:
sumgpu_sum_
sumgpu_sum2_


In this case, we would have to modify the command syntax and add an extra argument at the end of the list that specify the entry point.

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int','sumgpu_sum_')
kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 2
       NumRHSArguments: 3
         ArgumentTypes: {'inout double vector'  'inout double vector'  'in int32 scalar'}

The command now completes correctly.  However, with the  prototype signature we specified, the first array that in the original code was
with intent(in), since it is only an input to the subroutine is now marked as 'inout double vector'.  This is not a major problem, but we will
need to remember when using the object in MATLAB to specify two vectors as output on the left hand side.
We can fix the problem, changing the prototype signature to:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int','sumgpu_sum_')

kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 1
       NumRHSArguments: 3
         ArgumentTypes: {'in double vector'  'inout double vector'  'in int32 scalar'}

where we have replaced the 'double *' to 'const double *' to reflect that the array is read-only. 
We are now ready to run the code:

%Generate an array of 1024 elements on the CPU
a=rand(1024,1);
% Copy the array to a GPU array ag
ag=gpuArray(a);
%Generate the kernel object and setup the execution configuration
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');
kf.ThreadBlockSize=[256 1 1];
kf.GridSize=[28 1 1];
% Initialize the sum to zero
sumg=gpuArray.zeros(1,'double');
% Invoke the kernel
disp('CUDA Fortran kernel:')
sumg=feval(kf,ag,sumg,1024)
% Recompute the sum using the intrinsic MATLAB function
disp('Intrinsic MATLAB sum on GPU:')
sum_matlab=sum(ag)
%Check the result
disp('Difference:')
sumg-sum_matlab

obtaining the following output:
CUDA Fortran kernel:

sumg =

  509.2181

Intrinsic MATLAB sum on GPU:

sum_matlab =

  509.2181

Difference:

ans =

     0

Now that we are confident that the code is running properly and giving the correct results, we can do some performance testing.
We will generate 50 millions random number directly on the GPU and then compute their sum.

%Set up random number generation on the GPU
seed=0;
gpu_stream = parallel.gpu.RandStream('CombRecursive','Seed',seed);
parallel.gpu.RandStream.setGlobalStream(gpu_stream);
N=50000000;
%Generate the random numbers directly on the GPU
ag=gpuArray.randn(N,1);
%Generate the kernel object and setup the execution configuration
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');
kf.ThreadBlockSize=[256 1 1];
kf.GridSize=[128 1 1];
% Initialize the sum to zero
sumg=gpuArray.zeros(1,'double');
% Invoke the kernel and time the execution
tic;sumg=feval(kf,ag,sumg,N);toc
% Invoke the intrinsic sum and time the execution
tic;sum(ag);toc

The output indicates that this version is slightly faster than the native sum, that is however more convenient to use.
Elapsed time is 0.000357 seconds.
Elapsed time is 0.000393 seconds.

The real goal of using CUDA Fortran kernels is not to reimplement the intrinsic functions but to implement new capabilities or just re-use
standalone code that was already written in a very productive environment such as MATLAB.




Monday, July 15, 2013

Enabling CUDA Multi Process Service (MPS) with multiple GPUs.

(Edited 10/21/13 to use MPS control daemon instead of MPS server)

CUDA 5.5 has a new interesting feature, called CUDA Multi Process Service (MPS), for GPUs with compute capability 3.5.

CUDA MPS, formerly known as CUDA Proxy,  is a feature that allows multiple CUDA processes to share a single GPU context. NVIDIA officially supports configurations with a single GPU, but it is possible to run it  on systems with multiple GPUs creating the MPS servers manually.
This post will show how to enable this feature when multiple GPUs are present in a system.
It is an unsupported but working configuration.

The first thing to do it is to create a MPS control daemon for each GPU.
We will use CUDA_VISIBLE_DEVICES to select each GPU and create two directories in /tmp for each MPS control daemon.  one for the pipe, the other for the log. By default, CUDA MPS will try to create a log directory in /var/log, requiring the control daemon to be executed with root privileges. By selecting a log directory in /tmp ( or any other directory of your choice that is accessible from normal users), we don't need root privileges to start the control daemons.

#!\bin\bash


# Number of gpus with compute_capability 3.5  per server
NGPUS=2

# Start the MPS server for each GPU
for ((i=0; i< $NGPUS; i++))

do
 mkdir /tmp/mps_$i
 mkdir /tmp/mps_log_$i
 export CUDA_VISIBLE_DEVICES=$i
 export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i
 export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_$i
 nvidia-cuda-mps-control -d
end do

Once we have set up the control daemons  we need to point the CUDA executable we want to run to the right MPS control daemon.  This is done in a non-standard way.
Instead of using the CUDA_VISIBLE_DEVICES variable, as normally done with CUDA, we will need to set CUDA_VISIBLE_DEVICES to 0 and select the explicit MPS pipe we want to use by specifying the proper CUDA_MPS_PIPE_DIRECTORY.


To start two instances of a.out on GPU 0 using proxy, we will type:

export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
./a.out
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
./a.out

The execution script is a little more complex if we are running a MPI application.
In this case, we will need to find a way to detect how many MPI processes are running on a node.
OpenMPI has a variable that will tell us this info, other MPI implementations offer similar environment
variables.

This script shows how to run local process 0 and 2 on GPU 0 and 1 and 3 on GPU 1.

#!/bin/bash
#run script for MPI
export CUDA_VISIBLE_DEVICES=0
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
case ${lrank} in
[0])
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
    ./executable
    ;;
[1])
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_1
    ./executable
    ;;

[2])
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
    ./executable
    ;;
[3])
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_1
    ./executable
    ;;
esac

Once the execution is completed, we need to clean up the MPS control daemons  if other users are supposed to run on the system.

#!/bin/bash

# Stop the MPS control daemon for each GPU and clean up /tmp

for ((i=0; i< $NGPUS; i++))
do
echo $i
 export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i
 echo "quit" | nvidia-cuda-mps-control
 rm -fr /tmp/mps_$i
 rm -fr /tmp/mps_log_$i
done

The creation and clean-up could be combined in a single script.