TensorFlow 0.8 on Jetson TK1 (CUDA Musing, 2016-06-17, by Massimiliano)<div style="text-align: justify;">
This post gives updated instructions on how to build TensorFlow 0.8 on a Jetson TK1, now that NVIDIA has released a new compiler that can handle the variadic templates without internal compiler errors.</div>
<br />
If you just want to install the prebuilt wheel, here is a direct link: <a href="https://drive.google.com/file/d/0B1uGKNpQ7xNqa2RRYlMtZXZ6WVk/view?usp=sharing" target="_blank">tensorflow-0.8.0-cp27-none-linux_armv7l.whl</a><br />
<br />
I am going to use the same approach highlighted in the previous post: keep the CUDA 6.5 runtime and cuDNN v2, but compile the code with the newer 7.0 compiler.<br />
<div>
<br /></div>
<br />
<b>Install the 7.0.76 compiler:</b><br />
<br />
Before starting, you will need to download the new compiler. NVIDIA does not make the link easy to find (they would like you to use JetPack, but I don't like to reformat a working system if not absolutely needed), but you can download the .deb package directly on your Jetson with:<br />
<div style="line-height: normal; min-height: 14px;">
<br /></div>
<br />
<br />
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">wget http://developer.download.nvidia.com/embedded/L4T/r24_Release_v1.0/CUDA/cuda-repo-l4t-7-0-local_7.0-76_armhf.deb</span></div>
<br />
Now we can install it as usual:<br />
<br />
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">sudo dpkg -i cuda-repo-l4t-7-0-local_7.0-76_armhf.deb </span></div>
<div style="color: #323333; line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace; letter-spacing: 0.0px;">sudo apt-get update</span></div>
<div style="line-height: normal; min-height: 14px;">
<span style="font-family: "courier new" , "courier" , monospace;"><span style="color: #323333; letter-spacing: 0px;">sudo apt-get install cuda-toolkit-7-0</span><span style="letter-spacing: 0.0px;"></span></span></div>
<div style="line-height: normal; min-height: 14px;">
<span style="color: #323333; letter-spacing: 0px;"><br /></span></div>
<div style="line-height: normal; min-height: 14px;">
At this point we need to restore the standard 6.5 toolchain as the default (we just want the 7.0 compiler to generate the object files), since the current driver on the Jetson TK1 will only work with the 6.5 runtime. Go to the /usr/local directory, remove the cuda symlink (which now points to cuda-7.0), and make a new one for 6.5: </div>
<div style="line-height: normal; min-height: 14px;">
<br /></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">ubuntu@tegra-ubuntu:/usr/local$ sudo rm cuda</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">ubuntu@tegra-ubuntu:/usr/local$ sudo ln -s cuda-6.5/ cuda</span></div>
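The swap only touches the symlink, not the toolkits themselves. As an illustration, the same sequence can be rehearsed in a throwaway scratch directory (the paths below are stand-ins for /usr/local, where the real commands need sudo):

```shell
# Throwaway stand-in for /usr/local: only the symlink changes, the
# cuda-6.5 and cuda-7.0 directories stay untouched.
rm -rf /tmp/cuda-demo
mkdir -p /tmp/cuda-demo/cuda-6.5 /tmp/cuda-demo/cuda-7.0
cd /tmp/cuda-demo
ln -s cuda-7.0 cuda       # state after installing the 7.0 toolkit
rm cuda                   # removes only the link, not the toolkit
ln -s cuda-6.5/ cuda      # restore 6.5 as the default
readlink cuda             # prints cuda-6.5/
```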
<div style="line-height: normal; min-height: 14px;">
<br /></div>
<div style="line-height: normal; min-height: 14px;">
<span style="font-family: inherit; font-size: small;">You should see this output:</span><br />
<span style="font-family: inherit; font-size: small;"><br /></span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">ubuntu@tegra-ubuntu:~$ nvcc -V</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">nvcc: NVIDIA (R) Cuda compiler driver</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">Copyright (c) 2005-2014 NVIDIA Corporation</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">Built on Fri_Dec_12_11:12:07_CST_2014</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">Cuda compilation tools, release 6.5, V6.5.35</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">ubuntu@tegra-ubuntu:~$ /usr/local/cuda-7.0/bin/nvcc -V</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">nvcc: NVIDIA (R) Cuda compiler driver</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">Copyright (c) 2005-2015 NVIDIA Corporation</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">Built on Mon_Feb_22_15:38:26_CST_2016</span></div>
<div style="line-height: normal; min-height: 14px;">
<span style="font-family: "courier new" , "courier" , monospace; font-size: 10px;">Cuda compilation tools, release 7.0, V7.0.74</span></div>
<div>
<br /></div>
<div>
<b>Install protobuf and Bazel:</b></div>
<div>
For protobuf you can follow the instructions from the previous blog post (the only change is the location of protobuf-java-3.0.0-beta-x.jar, now in the java/core/target subdirectory).</div>
<div>
The procedure for Bazel is also similar; the only change required is the version. TensorFlow 0.8 requires Bazel 0.1.4, so after cloning Bazel you will need to check out the proper tag:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">$ git clone https://github.com/bazelbuild/bazel.git</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">$ cd bazel</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">$ git checkout tags/0.1.4</span></div>
</div>
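Checking out a tag like this leaves the working tree pinned at that exact release. The pattern can be rehearsed offline with a throwaway repository (the repo and commit below are stand-ins, not the real Bazel sources):

```shell
# Stand-in repository: tag a commit, then pin the checkout to that tag,
# mirroring the bazel clone/checkout sequence.
rm -rf /tmp/tag-demo && mkdir -p /tmp/tag-demo && cd /tmp/tag-demo
git init -q repo && cd repo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "release"
git tag 0.1.4
git checkout -q tags/0.1.4   # detached HEAD at the tagged release
git describe --tags          # prints 0.1.4
```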
<div>
<div style="line-height: normal;">
<div style="font-family: monaco;">
<br /></div>
<div style="font-family: monaco;">
<b style="font-family: -webkit-standard;">Install TensorFlow 0.8:</b></div>
The first thing to do is to check out the source code and select the proper version:<br />
<br />
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
$ git clone --recurse-submodules https://github.com/tensorflow/tensorflow</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
$ cd tensorflow</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
$ git checkout r0.8</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<br /></div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<br /></div>
</div>
</div>
<div>
<div style="color: #333333; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">TensorFlow is expecting a 64bit system, w</span><span style="letter-spacing: 0px;">e will need to change all the reference from lib64 to lib. We can find all the files with the strings and apply all the changes with these commands:</span></div>
<div style="color: #333333; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #333333; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0px;">$ cd tensorflow</span></div>
<div style="background-color: #ebebeb; color: #333333; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0px;">$ grep -Rl "lib64"| xargs sed -i 's/lib64/lib/g'</span></div>
<div style="color: #333333; font-family: Georgia, serif; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0px;"><b></b></span><br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">TensorFlow officially supports Cuda devices with 3.5 and 5.2 compute capabilities. We want to target a gpu with compute capabilities 3.2. </span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">This can be done through TensorFlow unofficial settings with "configure" via the TF_UNOFFICIAL_SETTING variable.</span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">When prompted, specify that you only want a 3.2 compute capability device.</span></div>
</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
<div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
ubuntu@tegra-ubuntu:~/tensorflow$ TF_UNOFFICIAL_SETTING=1 ./configure</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify the location of python. [Default is /usr/bin/python]: </div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Do you wish to build TensorFlow with GPU support? [y/N] y</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
GPU support will be enabled for TensorFlow</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify which gcc nvcc should use as the host compiler. [Default is /usr/bin/gcc]: </div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: </div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify the location where CUDA toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: </div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify the Cudnn version you want to use. [Leave empty to use system default]: </div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify the location where cuDNN library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: </div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify a list of comma-separated Cuda compute capabilities you want to build with.</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please note that each additional compute capability significantly increases your build time and binary size.</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
[Default is: "3.5,5.2"]: 3.2</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Setting up Cuda include</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Setting up Cuda lib</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Setting up Cuda bin</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Setting up Cuda nvvm</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Configuration finished</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
Now that the initial set up is done, it is time to change the compiler used by Bazel.</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
ubuntu@tegra-ubuntu:~/tensorflow$ cd third_party/gpus/cuda/</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ rm -fr bin nvvm</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ cp -R /usr/local/cuda-7.0/bin/ bin</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ cp -R /usr/local/cuda-7.0/nvvm/ nvvm</div>
</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
Before starting the build (which is going to take a very long time), we will need to modify a few files.</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<div style="color: #323333; font-family: Georgia; line-height: normal;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/core/kernels/conv_ops_gpu_2.cu.cc:</b></span></div>
<div style="color: #323333; font-family: Georgia; line-height: normal;">
<span style="letter-spacing: 0.0px;"> To avoid double instantiation, guard the second functor for InflatePadAndShuffle with:</span></div>
<div style="color: #5330e1; font-family: Monaco; font-size: 10px; line-height: normal;">
/* On ARMv7 Eigen::DenseIndex is typedefed to int */</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#ifndef __arm__</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #34bd26;">template</span> <span style="color: #34bd26;">struct</span> functor::InflatePadAndShuffle<gpudevice span="" style="color: #34bd26;">float</gpudevice></div>
</div>
</div>
, <span style="color: #c33720;">4</span>,
<br />
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
Eigen::DenseIndex>;</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#endif<span style="color: #323333; font-family: "georgia"; font-size: 16px; letter-spacing: 0px; line-height: normal;"> </span></div>
<div style="color: #323333; line-height: normal;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/core/kernels/conv_ops_gpu_3.cu.cc:</b></span></div>
<div style="color: #323333; font-family: Georgia; line-height: normal;">
<span style="letter-spacing: 0.0px;"> To avoid double instantiation, guard the second functor for ShuffleAndReverse with:</span></div>
<div style="color: #5330e1; font-family: Monaco; font-size: 10px; line-height: normal;">
/* On ARMv7 Eigen::DenseIndex is typedefed to int */</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#ifndef __arm__</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #34bd26;">template</span> <span style="color: #34bd26;">struct</span> functor::ShuffleAndReverse<gpudevice span="" style="color: #34bd26;">float</gpudevice></div>
, <span style="color: #c33720;">4</span>,
<br />
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
Eigen::DenseIndex>;</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#endif</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<div style="color: #323333; line-height: normal;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:</b></span></div>
<div style="color: #323333; font-family: Georgia; line-height: normal;">
<span style="letter-spacing: 0.0px;"> ARMv7 has no numa_node file. It should return 0 not -1, otherwise TensorFlow will crash at runtime. You can use the modification from the previous post or the following code:</span></div>
<div style="color: #323333; font-family: Georgia; line-height: normal;">
<span style="letter-spacing: 0.0px;"><br /></span></div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #34bd26;">static</span> <span style="color: #34bd26;">int</span> TryToReadNumaNode(<span style="color: #34bd26;">const</span> string &pci_bus_id, <span style="color: #34bd26;">int</span> device_ordinal) {</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#ifdef __arm__</div>
<div style="color: #c33720; font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: black;"> LOG(INFO) << </span>"ARMV7 does not support NUMA - returning NUMA node zero"<span style="color: black;">;</span></div>
<div style="color: #ce7924; font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: black;"> </span>return<span style="color: black;"> </span><span style="color: #c33720;">0</span><span style="color: black;">;</span></div>
<div style="color: #323333; font-family: Georgia; line-height: normal;">
</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#else</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
........</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #ce7924;">return</span> kUnknownNumaNode;</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#endif</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
}</div>
</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div>
<b style="color: #323333; font-family: Georgia; font-size: 16px;">tensorflow/core/</b><span style="color: #323333; font-family: "georgia";"><b>common_runtime/gpu/process_state.cc:</b></span></div>
<div>
<span style="color: #323333; font-family: "georgia";">this is a new memory allocator, that is going to cause a floating point exception unless you change the following code:</span></div>
<div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #ce7924;"><br /></span></div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #ce7924;">if</span> (kCudaHostMemoryUseBFC) {</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
allocator =</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#ifdef __arm__</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #ce7924;">new</span> BFCAllocator(<span style="color: #ce7924;">new</span> CUDAHostAllocator(se), <span style="color: #c33720;">1LL</span> << <span style="color: #c33720;">31</span>,</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #c33720;">true</span> <span style="color: #5330e1;">/*allow_growth*/</span>, <span style="color: #c33720;">"cuda_host_bfc"</span> <span style="color: #5330e1;">/*name*/</span>);</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#else</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #ce7924;">new</span> BFCAllocator(<span style="color: #ce7924;">new</span> CUDAHostAllocator(se), <span style="color: #c33720;">1LL</span> << <span style="color: #c33720;">36</span> <span style="color: #5330e1;">/*64GB max*/</span>,</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #c33720;">true</span> <span style="color: #5330e1;">/*allow_growth*/</span>, <span style="color: #c33720;">"cuda_host_bfc"</span> <span style="color: #5330e1;">/*name*/</span>);</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#endif</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
} <span style="color: #ce7924;">else</span> {</div>
</div>
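For context, my reading of why the smaller pool is needed (the 64 GB figure comes from the comment in the original code; the 4 GB ceiling is the usual ARMv7 process address space): a 32-bit build cannot represent, let alone reserve, a 1 &lt;&lt; 36 byte pool. The arithmetic, as a quick shell check:

```shell
# 64-bit shell arithmetic; sizes expressed in GB for readability.
echo $(( (1 << 36) / (1 << 30) ))   # default pool: 64 GB
echo $(( (1 << 32) / (1 << 30) ))   # 32-bit address space: 4 GB
echo $(( (1 << 31) / (1 << 30) ))   # patched pool: 2 GB, fits
```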
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
We are now ready to build. The only thing left to do is to remove the check that disables the use of variadic templates in Eigen. I have not found a clean way to do it (someone with better Bazel skills may have a better idea); my solution is to start the build and then wait for the first failure:</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="background-color: #fafafa; color: #36474f; font-family: "helvetica neue";">$bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures -s --config=cuda //tensorflow/cc:tutorials_example_trainer</span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a;">
<span style="background-color: white; color: #333333; font-family: "georgia" , serif; text-align: justify;">If on your first compile of tensorflow you get the following error:</span><br />
<br style="color: #333333; font-family: Georgia, serif; text-align: justify;" />
<span style="font-family: inherit;"><span style="background-color: white; color: #333333; text-align: justify;">ERROR: /home/ubuntu/tensorflow/tensorflow/cc/BUILD:61:1: error loading package 'tensorflow/core': Extension file not found. Unable to load package for '//google/protobuf:protobuf.bzl': BUILD file not found on package path and referenced by '//tensorflow/cc:tutorials_example_trainer'.</span><br style="color: #333333; text-align: justify;" /><br style="color: #333333; text-align: justify;" /><span style="background-color: white; color: #333333; text-align: justify;">You need to init update in the tensorflow repository to get the google/protobuf clone using:</span><br style="color: #333333; text-align: justify;" /><br style="color: #333333; text-align: justify;" /><span style="background-color: white; color: #333333; text-align: justify;">git submodule update --init </span></span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br />
At this point, I can edit the file Macros.h in Eigen.</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
This file is located in the .cache directory:</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
ubuntu@tegra-ubuntu:~/.cache$ find . -name Macros.h -print</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
./bazel/_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/external/eigen_archive/eigen-eigen-3f653ace7d28/Eigen/src/Core/util/Macros.h</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
The nvcc check needs to be eliminated:</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
-#if !defined(__NVCC__) || !defined(EIGEN_ARCH_ARM_OR_ARM64)</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
 #define EIGEN_HAS_VARIADIC_TEMPLATES 1</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
 #endif</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
-#endif</div>
</div>
</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
We can now restart the build and it will go through. </div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
After you are done, you can test it with:</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div>
$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu<br />
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">You should see a similar output:</span></div>
<div style="color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0px;"></span><br /></div>
# Lots of output. This tutorial iteratively calculates the major eigenvalue of<br />
# a 2x2 matrix, on GPU. The last few lines look like this.<br />
000009/000005 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]<br />
000006/000001 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]<br />
000009/000009 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
<div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">We are now ready to create the pip package and install it:</span></div>
# To build with GPU support:<br />
$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package<br />
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg<br />
# The name of the .whl file will depend on your platform.<br />
$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.8.0-cp27-none-linux_armv7l.whl</div>
<div>
<br />
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">Congratulation, TensorFlow is now installed on your system.</span></div>
</div>
</div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
Most of the tests pass, but the image classification example gives wrong results. Now that the community can build it and play with it, hopefully someone will find the source of the error(s).</div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div>
<span style="color: #36474f; font-family: "helvetica"; font-size: 16px;">I downloaded the python files from TensorFlow-Tutorial and they seem to work:</span></div>
<div>
<br /></div>
<div>
git clone https://github.com/nlintz/TensorFlow-Tutorials.git</div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
Building TensorFlow for Jetson TK1 (2015-11-27)<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Google recently released TensorFlow, an open source software library for numerical computation using data flow graphs. </span></span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><span style="letter-spacing: 0px;">TensorFlow has a GPU backend built on CUDA, so I wanted to install it on a Jetson TK1. Even if the system did not meet the requirements ( CUDA 7.0 is not available and the GPU is a compute capability 3.2), I decided to give it a try anyway. This blog reports all the steps required to build TensorFlow from source, it is quite challenging but it can be done. Including all the prerequisites, the whole build will take several hours ( if you just want to try Tensorflow, you can download </span>the<span style="letter-spacing: 0px;"> wheel file I generated and do a pip install. The </span>file is at https://drive.google.com/file/d/0B1uGKNpQ7xNqZ2pvSmc3SlZJS2c/view?usp=sharing )<span style="letter-spacing: 0px;">.</span><span style="font-size: 16px; letter-spacing: 0px;"> </span></span><br />
<div style="font-size: 16px; min-height: 19px;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><span style="letter-spacing: 0.0px;"></span><br /></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">TensorFlow is under active development and the code uses a lot of advanced C++ features that really push the compiler. These instructions worked with the version available on 11/26, but newer versions may require additional changes.</span></span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The first challenge is to build Bazel, another tool developed at Google, used as the build system for TensorFlow. Bazel requires a protobuf version newer than the one present in the Ubuntu 14.04 repos, so the first step will be to build protobuf 3 from source, since there are no prebuilt binaries for ARM32.</span></span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 19px;">
<span style="letter-spacing: 0.0px;"><b>Java 8:</b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">The first step is to install Java 8; this is quite simple since Oracle provides a package:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #f9f9f9; color: #323333; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo add-apt-repository ppa:webupd8team/java</span></div>
<div style="background-color: #f9f9f9; color: #323333; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo apt-get update</span></div>
<div style="background-color: #f9f9f9; color: #323333; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo apt-get install oracle-java8-installer</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 19px;">
<span style="letter-spacing: 0.0px;"><b>Protobuf:</b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">In order to build protobuf and Bazel, we will need several other packages. The exact list will depend on the state of your Jetson, but you will need at least these:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo apt-get install git zip unzip autoconf automake libtool curl zlib1g-dev </span></div>
<div style="font-size: 16px;">
<span style="font-family: "helvetica neue"; letter-spacing: 0px;"><br />
</span><span style="letter-spacing: 0.0px;">After downloading the latest source from github:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ git clone https://github.com/google/protobuf.git</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">you need to first generate the configuration file and then run make:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cd protobuf</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ ./autogen.sh </span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ ./configure </span><span style="color: #323333; letter-spacing: 0.0px;">--prefix=/usr</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ make -j 4</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo make install</span></div>
<div style="font-size: 16px;">
<span style="font-family: "helvetica neue"; letter-spacing: 0px;"><br />
</span><span style="letter-spacing: 0.0px;">The protoc binary will be installed in /usr/bin and the libraries in /usr/lib; this is important because Bazel runs its build steps in a sandbox and would not find the libraries in /usr/local/lib.</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">If you have followed all the steps, you should see this output:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">ubuntu@tegra-ubuntu:~/protobuf$ protoc --version</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">libprotoc 3.0.0</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">We also need to build the Java interface for protobuf, which requires Maven.</span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">Luckily, Maven is available from the repos, so we can simply run:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo apt-get install maven</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">Go to the java subdirectory inside protobuf and run:</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ mvn package</span></div>
<div style="font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">Once the build completes, there will be a protobuf-java-3.0.0-beta-1.jar inside the target subdirectory.</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 19px;">
<span style="letter-spacing: 0.0px;"><b>Bazel:</b></span></div>
<div style="font-size: 19px; min-height: 23px;">
<span style="letter-spacing: 0.0px;"><b></b></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">We are now ready to tackle Bazel.</span></div>
<div style="background-color: #f9f9f9; color: #323333; font-size: 16px;">
<span style="letter-spacing: 0.0px;">The first step is to download the source code for Bazel (using version 0.1.0, which is known to work with TensorFlow). </span></div>
<div style="background-color: #f9f9f9; color: #323333; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ git clone https://github.com/bazelbuild/bazel.git</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cd bazel</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ git checkout tags/0.1.0</span></div>
<div style="background-color: #f9f9f9; font-family: Courier; font-size: 13px; min-height: 16px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Before compiling, we need to copy the protoc binary we just built as third_party/protobuf/protoc-linux-arm32.exe.</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We also need to copy the jar file from protobuf into the same directory. Bazel expects an alpha-3 version, but we have built a beta-1.</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">There is probably a better way of doing this, but simply copying and renaming the file did the trick for me.</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cp /usr/bin/protoc third_party/protobuf/protoc-linux-arm32.exe</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cp ~/protobuf/java/target/protobuf-java-3.0.0-beta-1.jar third_party/protobuf/protobuf-java-3.0.0-alpha-3.jar</span></div>
<div style="background-color: #f9f9f9; color: #323333; font-size: 13px; min-height: 16px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We are now ready to compile bazel. </span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ ./compile.sh</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">At the end of the compilation, the bazel binary will be in the output directory. You can add this directory</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">to your PATH or copy the binary to /usr/local/bin.</span></div>
<div style="background-color: #f9f9f9; color: #323333; font-size: 13px; min-height: 16px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 19px;">
<span style="letter-spacing: 0.0px;"><b>TensorFlow</b></span></div>
<div style="font-size: 19px; min-height: 23px;">
<span style="letter-spacing: 0.0px;"><b></b></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We are now ready to tackle the TensorFlow build for GPU. Just be sure to have CUDA 6.5 and CUDNN 6.5 installed on your Jetson TK1. </span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">You will also need some files from the CUDA 7.0 package (cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb) that you can download from</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">the NVIDIA web site (it is the one for the Jetson TX1).</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">While the Jetson TK1 cannot run the 7.0 runtime (the driver shipped with the system does not support it), it is still possible to use the CUDA 7.0 compiler. We need the 7.0 compiler because some of the TensorFlow source files trigger an internal compiler error with the 6.5 nvcc. </span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">All the libraries and runtime will be the standard 6.5 ones. </span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">On my system I have also enabled some swap space. You can plug in a USB memory stick, create a swap area on it, and enable it with:</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo mkswap /dev/sda</span></div>
<div style="background-color: #ebebeb; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo swapon /dev/sda </span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">The first step to build TensorFlow is to clone the github repository:</span></div>
<div style="background-color: #ebebeb; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ git clone --recurse-submodules https://github.com/tensorflow/tensorflow </span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">and install other dependencies:</span></div>
<div style="background-color: #ebebeb; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo apt-get install python-numpy swig python-dev</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">TensorFlow expects a 64-bit system and has a number of library paths and library names hard-coded in its files.</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Before starting the build, we need to modify several files: change all references from lib64 to lib, and change the 7.0 library versions to 6.5. We can find all the files containing these strings and apply the changes with these commands:</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cd tensorflow</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ grep -Rl "lib64"| xargs sed -i 's/lib64/lib/g'</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ grep -Rl "so.7.0"| xargs sed -i 's/so\.7\.0/so\.6\.5/g'</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
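As a quick sanity check before running the two substitutions on the whole tree, you can try them on a sample line (the path below is hypothetical, just for illustration):

```shell
# Illustration only: apply the same two sed rewrites to a made-up sample line.
tmp=$(mktemp)
echo '/usr/local/cuda/lib64/libcudart.so.7.0' > "$tmp"
sed -i 's/lib64/lib/g' "$tmp"          # lib64 -> lib
sed -i 's/so\.7\.0/so\.6\.5/g' "$tmp"  # .so.7.0 -> .so.6.5
cat "$tmp"   # -> /usr/local/cuda/lib/libcudart.so.6.5
rm -f "$tmp"
```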
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"><b></b></span><br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">TensorFlow officially supports CUDA devices with 3.5 and 5.2 compute capabilities. We want to target a GPU with compute capability 3.2. </span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">This can be done through TensorFlow's unofficial settings, by running "configure" with the TF_UNOFFICIAL_SETTING variable set.</span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">When prompted, specify that you only want a 3.2 compute capability device.</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ TF_UNOFFICIAL_SETTING=1 ./configure</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"># Same as the official settings above</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">WARNING: You are configuring unofficial settings in TensorFlow. Because some</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">external libraries are not backward compatible, these settings are largely</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">untested and unsupported.</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Please specify a list of comma-separated Cuda compute capabilities you want to</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">build with. You can find the compute capability of your device at:</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">https://developer.nvidia.com/cuda-gpus.</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Please note that each additional compute capability significantly increases</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">your build time and binary size. [Default is: "3.5,5.2"]: 3.2</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Setting up Cuda include</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Setting up Cuda lib</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Setting up Cuda bin</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Setting up Cuda nvvm</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Configuration finished</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">After the configure step, bazel has copied or symlinked all the binaries and libraries needed for the build into the third_party/gpus/cuda subdirectory.</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">It is now time to replace the CUDA compiler with the one from the 7.0 toolchain.</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We want to extract (not install) the files from the cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb package with the following commands:</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ dpkg -x cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb /tmp/cuda_repo</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cd /tmp/cuda_repo/var/cuda-repo-7-0-local</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ dpkg -x cuda-core-7-0_7.0-71_armhf.deb /tmp/cuda7.0</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ rm -fr /tmp/cuda_repo</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cd ~/tensorflow/third_party/gpus/cuda</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ rm -fr bin nvvm</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cp -R /tmp/cuda7.0/usr/local/cuda-7.0/bin bin</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cp -R /tmp/cuda7.0/usr/local/cuda-7.0/nvvm nvvm</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ rm -fr /tmp/cuda7.0</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">At this point, bazel is ready to use the 7.0 toolchain to compile TensorFlow.</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We still need to add the ARM target to the build. </span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">This can be done by adding the following lines to the </span><span style="letter-spacing: 0.0px; text-decoration: underline;">third_party/gpus/crosstool/CROSSTOOL file:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">default_toolchain {</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> cpu: "arm"</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> toolchain_identifier: "local_linux"</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">} </span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Before starting the build, we need to edit a few files to avoid compiler crashes and double instantiations </span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">(on ARMv7, Eigen::DenseIndex is typedefed to int):</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorDimensions.h</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">tensorflow/core/kernels/conv_ops_gpu_2.cu.cc</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">tensorflow/core/kernels/conv_ops_gpu_3.cu.cc</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">tensorflow/stream_executor/cuda/cuda_gpu_executor.cc</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">tensorflow/core/kernels/adjust_contrast_op.h</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;"><b>third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorDimensions.h: </b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span class="Apple-tab-span" style="white-space: pre;"> </span>The compiler crashes when evaluating the code inside the ifdef at line 312 of this file, so we take the alternative path instead.</span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"> Change line 312 to something like:</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#ifdef EIGEN_HAS_VARIADIC_TEMPLATES_TK1</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/core/kernels/conv_ops_gpu_2.cu.cc:</b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span class="Apple-tab-span" style="white-space: pre;"> </span>To avoid double instantiation, guard the second functor for InflatePadAndShuffle with:</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">/* On ARMv7 Eigen::DenseIndex is typedefed to int */</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#ifndef __arm__</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">template struct functor::InflatePadAndShuffle&lt;GPUDevice, float, 4,</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">                                               Eigen::DenseIndex&gt;;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#endif</span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span class="Apple-tab-span" style="white-space: pre;"> </span>We also need to comment out the tensor.h include (it will crash the compiler):</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">//#include "tensorflow/core/public/tensor.h"</span></div>
<div style="font-size: 16px; min-height: 19px;">
<br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/core/kernels/conv_ops_gpu_3.cu.cc:</b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span class="Apple-tab-span" style="white-space: pre;"> </span>To avoid double instantiation, guard the second functor for ShuffleAndReverse with:</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">/* On ARMv7 Eigen::DenseIndex is typedefed to int */</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#ifndef __arm__</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">template struct functor::ShuffleAndReverse&lt;GPUDevice, float, 4,</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">                                            Eigen::DenseIndex&gt;;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#endif</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:</b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span class="Apple-tab-span" style="white-space: pre;"> </span>ARMv7 has no numa_node file. The function should return 0, not -1, otherwise TensorFlow will crash at runtime:</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">FILE *file = fopen(filename.c_str(), "r");</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> if (file == nullptr) {</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> LOG(ERROR) << "could not open file to read NUMA node: " << filename;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#ifdef __arm__</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> // There is no numa_node on Jetson TK1</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> return 0;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#else</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> return kUnknownNumaNode;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#endif</span></div>
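<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">As a sanity check (my own addition, not part of the original patch), you can probe the sysfs entry the code tries to read; the path below is the one that appears in the runtime log further down, and the snippet simply reports whether it is present:</span></div>

```shell
# Probe the sysfs entry TensorFlow reads to discover the NUMA node.
# On Jetson TK1 it is absent (hence the patch above); on a desktop
# Linux machine it usually contains a small integer such as 0 or -1.
f=/sys/bus/pci/devices/0000:00:00.0/numa_node
if [ -r "$f" ]; then result=$(cat "$f"); else result="no numa_node entry"; fi
echo "$result"
```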
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/core/kernels/adjust_contrast_op.h:</b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span class="Apple-tab-span" style="white-space: pre;"> </span>The compiler crashes on some of the brace initializations, so we need to rewrite them in a simpler way:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">//MF Eigen::array&lt;int, 4&gt; scalar_broadcast{{batch, height, width, channels}};</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> Eigen::array&lt;int, 4&gt; scalar_broadcast;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar_broadcast[0] = batch;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar_broadcast[1] = height;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar_broadcast[2] = width;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar_broadcast[3] = channels;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#if !defined(EIGEN_HAS_INDEX_LIST)</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">//MF Eigen::array&lt;int, 2&gt; reduction_axis{{1, 2}};</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">//MF Eigen::array&lt;int, 4&gt; scalar{{1, 1, 1, 1}};</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">//MF Eigen::array&lt;int, 4&gt; broadcast_dims{{1, height, width, 1}};</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">//MF Eigen::Tensor&lt;int, 4&gt;::Dimensions reshape_dims{{batch, 1, 1, channels}};</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> Eigen::array&lt;int, 2&gt; reduction_axis;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> reduction_axis[0]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> reduction_axis[1]=2;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> Eigen::array&lt;int, 4&gt; scalar;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar[0]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar[1]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar[2]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar[3]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> Eigen::array&lt;int, 4&gt; broadcast_dims;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> broadcast_dims[0]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> broadcast_dims[1]=height;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> broadcast_dims[2]=width;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> broadcast_dims[3]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> Eigen::Tensor&lt;int, 4&gt;::Dimensions reshape_dims;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> reshape_dims[0]=batch;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> reshape_dims[1]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> reshape_dims[2]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> reshape_dims[3]=channels;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#else</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">The source code is now ready. The Jetson TK1 has only 2GB of memory, and bazel will try to compile several files at the same time.</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We want to avoid this, so we will pass a --local_resources flag that limits the build to 2GB and half a core (don't ask: even if you specify a full core, it will still try</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">to compile two files at the same time). This build will take a long time:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #fafafa; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/cc:tutorials_example_trainer</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">If you get failures during the build, keep trying: bazel's scheduling seems to be non-deterministic, and the TensorFlow code really stresses the</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">compiler.</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Once the build is completed, we can test the code:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu</span></div>
<div style="color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">You should see a similar output:</span></div>
<div style="color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"># Lots of output. This tutorial iteratively calculates the major eigenvalue of</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"># a 2x2 matrix, on GPU. The last few lines look like this.</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">000009/000005 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">000006/000001 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">000009/000009 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]</span></div>
<div style="color: #36474f; font-size: 14px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-size: 14px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We are now ready to create the pip package and install it:</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"># To build with GPU support:</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"># The name of the .whl file will depend on your platform.</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.8.0-cp27-none-linux_armv7l.whl</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Congratulations, TensorFlow is now installed on your system.</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We can also try a more interesting example of image classification:</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/examples/label_image/...</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #f7f7f7; color: #323333; font-family: Consolas; font-size: 14px;">
<span style="letter-spacing: 0.0px;">$ wget https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip -O tensorflow/examples/label_image/data/inception5h.zip</span></div>
<div style="background-color: #f7f7f7; color: #323333; font-family: Consolas; font-size: 14px;">
<span style="letter-spacing: 0.0px;">$ unzip tensorflow/examples/label_image/data/inception5h.zip -d tensorflow/examples/label_image/data/</span></div>
<div style="background-color: #f7f7f7; color: #323333; font-family: Consolas; font-size: 14px;">
<span style="letter-spacing: 0.0px;">$ mv tensorflow/examples/label_image/data/tensorflow_inception_graph.pb tensorflow/examples/label_image/data/googlenet_graph.pb</span></div>
<div style="background-color: #f7f7f7; color: #323333; font-family: Consolas; font-size: 14px;">
<span style="letter-spacing: 0.0px;">$ mv tensorflow/examples/label_image/data/imagenet_comp_graph_label_strings.txt tensorflow/examples/label_image/data/googlenet_labels.txt </span></div>
<div style="color: #36474f; font-size: 14px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">And run it with:</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ bazel-bin/tensorflow/examples/label_image/label_image</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 1</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:890] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties: </span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">name: GK20A</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">major: 3 minor: 2 memoryClockRate (GHz) 0.852</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">pciBusID 0000:00:00.0</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Total memory: 1.85GiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Free memory: 218.46MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0 </span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0: Y </span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Allocating 18.46MiB bytes.</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:52] GPU 0 memory begins at 0xa45ea000 extends to 0xa585f000</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 1.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 2.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 4.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 8.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 16.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 32.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 64.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 128.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 256.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 512.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 1.00MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 2.00MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 4.00MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 8.00MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 16.00MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 32.00MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/direct_session.cc:60] Direct session inter op parallelism threads: 1</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/examples/label_image/main.cc:221] military uniform (866): 0.902268</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/examples/label_image/main.cc:221] bow tie (817): 0.05407</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/examples/label_image/main.cc:221] suit (794): 0.0113196</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/examples/label_image/main.cc:221] bulletproof vest (833): 0.0100269</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/examples/label_image/main.cc:221] bearskin (849): 0.00649747</span></div>
<div style="color: #36474f; font-size: 14px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<br />
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com12tag:blogger.com,1999:blog-5417613544042855153.post-25922001834339875922013-10-12T16:18:00.001-07:002013-10-16T17:09:07.497-07:00CUDA 5.5 and Xcode 5The latest Xcode 5 update seems to have broken nvcc.<br />
<div>
If you try to compile a CUDA program, you will see a similar error:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">%nvcc -c qr.cu</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">clang: error: unsupported option '-dumpspecs'</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">clang: error: no input files</span></div>
</div>
<div>
<br /></div>
<div>
There is a simple workaround.</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">%nvcc -ccbin=/usr/bin/clang</span> <span style="font-family: 'Courier New', Courier, monospace;">-c qr.cu</span></div>
<div>
<br /></div>
</div>
<div>
A more convenient way of adding this is to define an alias for nvcc.</div>
<div>
You can add this line to your .bash_profile</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">alias nvcc='nvcc -ccbin=/usr/bin/clang'</span></div>
<div>
<br /></div>
<div>
or just define it in your shell,</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">alias 'nvcc=nvcc -ccbin=/usr/bin/clang'</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<br /></div>
Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com0tag:blogger.com,1999:blog-5417613544042855153.post-39651845393046034412013-09-11T12:39:00.002-07:002013-09-23T16:27:45.357-07:00Calling CUDA Fortran kernels from MATLABThe latest MATLAB versions, starting from 2010b, have a very cool feature that enables calling CUDA C kernels from MATLAB code.<br />
This is much better and simpler than writing MEX files to call CUDA code (being the original author of the first CUDA MEX files and of the NVIDIA white paper, I am speaking from experience) and it is a very powerful tool.<br />
<br />
Let's take a very simple CUDA C code, add.cu, that adds a scalar to a vector:<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">__global__ void add(double * in, double a, int N) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> int idx = blockIdx.x * blockDim.x + threadIdx.x;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> if (idx &lt; N) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> in[idx] += a;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> }</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<br />
<div>
The full documentation is available at<br />
<a href="http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html" target="_blank">http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html</a><br />
I am just going to summarize the required steps:<br />
<br /></div>
<ul>
<li>Generate a PTX file from the kernel source</li>
<ul>
<li><i>nvcc -ptx -arch sm_20 add.cu</i></li>
</ul>
<li><i><span style="font-style: normal;">Construct the kernel object from the PTX file</span></i></li>
<ul>
<li><i><span style="font-style: normal;"><i>k=parallel.gpu.CUDAKernel('add.ptx','add.cu');</i></span></i></li>
</ul>
<li>Set up the block and grid configuration, for example 28 blocks of 256 threads each:</li>
<ul>
<li><i>k.ThreadBlockSize=[256 1 1]</i></li>
<li><i>k.GridSize=[28 1 1]</i></li>
</ul>
<li>Execute the kernel.</li>
<ul>
<li><i>o = feval(k,rand(10,1),2.,10)</i></li>
<li>The gpuArray o contains the output of the kernel</li>
</ul>
</ul>
<br />
It is possible to do the same with CUDA Fortran.<br />
First of all, we will need to rewrite the code in CUDA Fortran (shameless plug, if you want<br />
to learn more about CUDA Fortran there is a very good book you can pre-order from Amazon,<br />
"<span style="background-color: white; font-family: Times, 'Times New Roman', serif;"><a href="http://www.amazon.com/CUDA-Fortran-Scientists-Engineers-Programming/dp/0124169708" target="_blank">CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming</a>"). This is the equivalent code :</span><br />
<br />
<span style="font-family: Courier New, Courier, monospace;">attributes(global) subroutine add(a, b, N)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> implicit none</span><br />
<span style="font-family: Courier New, Courier, monospace;"> double precision, intent(inout) :: a(*)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> double precision, value :: b</span><br />
<span style="font-family: Courier New, Courier, monospace;"> integer , value :: N</span><br />
<span style="font-family: Courier New, Courier, monospace;"> integer :: i</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> i = threadIdx%x+(blockIdx%x-1)*blockDim%x</span><br />
<span style="font-family: Courier New, Courier, monospace;"> if ( i &lt;= N ) a(i) = a(i)+b</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> end subroutine add</span><br />
<div>
<br /></div>
For the generation of the PTX file, instead of invoking nvcc, we will call pgf90 with the right<br />
flags to generate the PTX file:<br />
<br />
pgf90 -c -Mcuda=keepptx,cc20 addf.cuf<br />
The keepptx flag will generate the PTX file for compute capability 2.0, addf.n001.ptx.<br />
If the compute capability is omitted, or if you specify multiple targets, the PGI compiler will generate several PTX files; the numbering is just an enumeration, so you will need to inspect each file to check its compute capability. We can perform this step from an OS shell or from inside MATLAB.<br />
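A quick way to inspect them (a sketch, assuming the naming scheme above) is to look at the .target directive that every PTX file carries near its top:<br />

```shell
# Each PTX file records its architecture in a ".target" directive.
# We create a stand-in addf.n001.ptx mimicking the compiler output so
# the command is self-contained; on a real build, just run the grep.
printf '.version 3.2\n.target sm_20\n' > addf.n001.ptx
grep -H '^\.target' addf.*.ptx   # prints: addf.n001.ptx:.target sm_20
```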
In order to invoke the compiler from the MATLAB prompt, we need to load the proper bash variables by issuing the command:<br />
<br />
setenv('BASH_ENV','~/.bash_profile');<br />
<br />
and then invoke the pgf90 command preceded by an exclamation point. The exclamation point indicates that the rest of the input line is issued as a command to the operating system.<br />
<br />
!pgf90 -c -Mcuda=keepptx,cc20 addf.cuf<br />
<br />
<br />
<div style="text-align: left;">
In order to load the PTX file in MATLAB, we need to slightly change the syntax.</div>
<div style="text-align: left;">
When loading the PTX file generated by CUDA C, we were passing both the PTX file name and</div>
<div style="text-align: left;">
the original CUDA C file. In this way, MATLAB will automatically discover the prototype of the function. Alternatively, we can explicitly pass the prototype signature to parallel.gpu.CUDAKernel. </div>
<br />
This is what we need to load the PTX file generated from CUDA Fortran.<br />
<br />
kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ');<br />
<br />
Once we have created the kernel object kf, the calling sequence is the same one we used before.<br />
We will set up the block and grid configuration, for example 28 blocks of 256 threads each:<br />
<br />
<ul>
<li>kf.ThreadBlockSize=[256 1 1]</li>
<li>kf.GridSize=[28 1 1]</li>
</ul>
and execute the kernel.<br />
<ul>
<li>of = feval(kf,rand(10,1),2.,10)</li>
</ul>
<div>
<br /></div>
<div>
This is the full sequence of the MATLAB code with a verbose output to check all the intermediate steps:</div>
<div>
<br /></div>
<div>
% Create a 1D array of doubles with 10 elements</div>
<div>
i1=gpuArray(rand(10,1))</div>
<div>
% Create the kernel object from the PTX file with explicit prototype<br />
kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ')</div>
<div>
% Set execution configuration<br />
kf.ThreadBlockSize=[256 1 1]<br />
kf.GridSize=[28 1 1]</div>
<div>
% Execute the kernel<br />
of=feval(kf,i1,10.,10)</div>
<br />
<br />
An important point for CUDA Fortran kernels is that you cannot use Fortran assumed-shape arguments, since these require the compiler to build and pass an array descriptor as an extra argument.<br />
<br />
<br />
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
Now that we understand all the steps, let's move to something more complex and discuss a few more points.</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
We are going to implement a kernel that computes the sum of an array in a single pass, using an atomic lock</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
(the implementation and accuracy of parallel sums are discussed in detail in Chapter 5 of the aforementioned book).</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
The kernel is embedded in a module, since we are using a global variable for the lock. </div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
There is no limit on the number of elements the routine can handle, aside from the use of 32-bit integers</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
for the addressing: each thread will process multiple elements if needed.</div>
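Before looking at the CUDA Fortran source, the overall strategy (a grid-stride partial sum per thread, a binary tree reduction per block, and a serialized combination of block results, which is the role played by the atomic lock) can be emulated in plain 0-based Python. This is only an illustrative sketch of the algorithm, not the GPU code:

```python
def gpu_style_sum(data, blocks=28, threads=256):
    """Emulate the kernel: grid-stride loops, per-block tree reduction,
    then a serialized accumulation standing in for the atomic lock."""
    total = 0.0
    nthreads = blocks * threads
    for b in range(blocks):
        psum = [0.0] * threads
        for t in range(threads):
            i = t + b * threads
            # grid-stride loop: this "thread" handles elements i, i+nthreads, ...
            for j in range(i, len(data), nthreads):
                psum[t] += data[j]
        # binary tree reduction within the block (threads must be a power of 2)
        inext = threads // 2
        while inext >= 1:
            for t in range(inext):
                psum[t] += psum[t + inext]
            inext //= 2
        total += psum[0]  # on the GPU this update is guarded by the lock
    return total
```

The same power-of-two assumption on the block size applies to the reduction loop in the real kernel.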
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
This is the code:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
module sumgpu</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
implicit none</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
integer, parameter :: fp_kind = kind(0.0d0)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
integer, device:: lock=0</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
contains</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
attributes(global) subroutine sum(input,totalsum,N)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
real(fp_kind), intent(in) :: input(N)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
real(fp_kind) :: totalsum(1)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
integer,value :: N</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
real(fp_kind), shared, dimension(256) :: psum</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
integer :: i,index, inext</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
real(fp_kind) :: lsum</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
index=threadIdx%x+(BlockIdx%x-1)*BlockDim%x</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
lsum = 0._fp_kind</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
do i=index,N,BlockDim%x*GridDim%x</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
lsum = lsum+ input(i)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
end do</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
! Local reduction per block</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
index=threadIdx%x</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
psum(index)=lsum</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
call syncthreads()</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
inext=blockDim%x/2</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
do while ( inext >=1 )</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
if (index <=inext) psum(index)=psum(index)+psum(index+inext)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
inext = inext /2</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
call syncthreads()</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
end do</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
! Final reduction among block with atomic lock</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
if (index == 1) then</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
do while ( atomiccas(lock,0,1) == 1)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
end do</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
totalsum(1)=totalsum(1)+psum(1)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
call threadfence()</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
lock =0</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
end if</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
end subroutine sum</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
end module sumgpu</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
If we generate and load the module as seen before, we can observe the following:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf = </div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
CUDAKernel with properties:</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ThreadBlockSize: [1 1 1]</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
MaxThreadsPerBlock: 1024</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
GridSize: [1 1 1]</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
SharedMemorySize: 0</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
EntryPoint: 'sumgpu_sum_'</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
MaxNumLHSArguments: 2</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
NumRHSArguments: 3</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ArgumentTypes: {'inout double vector' 'inout double vector' 'in int32 scalar'}</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
The entry point is now sumgpu_sum_, even though the subroutine was named sum. This is a consequence of being embedded in a module.</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
When the CUDA Fortran compiler generates the PTX file, it renames the subroutine entry point as the concatenation of the module name, the subroutine name, and a trailing underscore.</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
While this is not important when the module contains a single subroutine, it is crucial for situations in which multiple entry points are defined. If the module had multiple subroutines, we would have received an error when trying to load the PTX file:</div>
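When in doubt, the candidate entry-point names can be read directly from the .entry directives in the PTX file; a minimal sketch (the file name matches the example below):

```python
import re

def ptx_entry_points(ptx_text):
    """Return the kernel entry-point names declared in a PTX module."""
    return re.findall(r"\.entry\s+([A-Za-z_$][\w$]*)", ptx_text)

if __name__ == "__main__":
    with open("sumSingleBlock.n001.ptx") as f:
        print(ptx_entry_points(f.read()))
```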
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')</div>
<div style="color: #b92d5d; font-family: 'Courier New'; font-size: 13px;">
Error using handleKernelArgs (line 61)</div>
<div style="color: #b92d5d; font-family: 'Courier New'; font-size: 13px;">
Found more than one entry point in the PTX code. Possible names are:</div>
<div style="color: #b92d5d; font-family: 'Courier New'; font-size: 13px;">
sumgpu_sum_</div>
<div style="color: #b92d5d; font-family: 'Courier New'; font-size: 13px;">
sumgpu_sum2_</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
In this case, we would have to modify the command syntax, adding an extra argument at the end of the list that specifies the entry point.</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int','<span style="color: #b92d5d;">sumgpu_sum_'</span>)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf = </div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
CUDAKernel with properties:</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ThreadBlockSize: [1 1 1]</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
MaxThreadsPerBlock: 1024</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
GridSize: [1 1 1]</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
SharedMemorySize: 0</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
EntryPoint: 'sumgpu_sum_'</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
MaxNumLHSArguments: 2</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
NumRHSArguments: 3</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ArgumentTypes: {'inout double vector' 'inout double vector' 'in int32 scalar'}</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
The command now completes correctly. However, with the prototype signature we specified, the first array, which in the original code was</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
declared with intent(in) since it is only an input to the subroutine, is now marked as 'inout double vector'. This is not a major problem, but we would</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
need to remember, when using the object in MATLAB, to collect two vectors as output on the left-hand side.</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
We can fix this by changing the prototype signature to:</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int','sumgpu_sum_')</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf = </div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
CUDAKernel with properties:</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ThreadBlockSize: [1 1 1]</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
MaxThreadsPerBlock: 1024</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
GridSize: [1 1 1]</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
SharedMemorySize: 0</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
EntryPoint: 'sumgpu_sum_'</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
MaxNumLHSArguments: 1</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
NumRHSArguments: 3</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ArgumentTypes: {'in double vector' 'inout double vector' 'in int32 scalar'}</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
where we have replaced 'double *' with 'const double *' to reflect that the array is read-only. </div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
We are now ready to run the code:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
%Generate an array of 1024 elements on the CPU</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
a=rand(1024,1);</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Copy the array to a GPU array ag</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ag=gpuArray(a);</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
%Generate the kernel object and setup the execution configuration</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf.ThreadBlockSize=[256 1 1];</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf.GridSize=[28 1 1];</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Initialize the sum to zero</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
sumg=gpuArray.zeros(1,'double');</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Invoke the kernel</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
disp('CUDA Fortran kernel:')</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
sumg=feval(kf,ag,sumg,1024)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Recompute the sum using the intrinsic MATLAB function</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
disp('Intrinsic MATLAB sum on GPU:')</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
sum_matlab=sum(ag)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
%Check the result</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
disp('Difference:')</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
sumg-sum_matlab</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
obtaining the following output:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
CUDA Fortran kernel:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
sumg =</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
509.2181</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
Intrinsic MATLAB sum on GPU:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
sum_matlab =</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
509.2181</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
Difference:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
ans =</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
0</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
Now that we are confident that the code is running properly and giving the correct results, we can do some performance testing.</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
We will generate 50 million random numbers directly on the GPU and then compute their sum.</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
%Set up random number generation on the GPU</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
seed=0;</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
gpu_stream = parallel.gpu.RandStream('CombRecursive','Seed',seed);</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
parallel.gpu.RandStream.setGlobalStream(gpu_stream);</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
N=50000000;</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
%Generate the random numbers directly on the GPU</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ag=gpuArray.randn(N,1);</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
%Generate the kernel object and setup the execution configuration</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf.ThreadBlockSize=[256 1 1];</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf.GridSize=[128 1 1];</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Initialize the sum to zero</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
sumg=gpuArray.zeros(1,'double');</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Invoke the kernel and time the execution</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
tic;sumg=feval(kf,ag,sumg,N);toc</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Invoke the intrinsic sum and time the execution</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
tic;sum(ag);toc</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
The output indicates that this version is slightly faster than the native sum, which is, however, more convenient to use.</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
Elapsed time is 0.000357 seconds.</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
Elapsed time is 0.000393 seconds.</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
The real goal of using CUDA Fortran kernels is not to reimplement the intrinsic functions, but to add new capabilities or to reuse</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
existing standalone code while working in a very productive environment such as MATLAB.</div>
<div>
<br /></div>
<br />
<br />
<br />Enabling CUDA Multi Process Service (MPS) with multiple GPUs.<br />(Edited 05/09/2016)<br />
<span style="font-family: Helvetica; orphans: 2; text-align: -webkit-auto; widows: 2;">CUDA 7 introduced MPS support for multi GPU nodes.</span><br />
<span style="font-family: Helvetica; orphans: 2; text-align: -webkit-auto; widows: 2;">CUDA_VISIBLE_DEVICES should not be used to handle GPU affinity when a CUDA-aware MPI is used, because of issues with CUDA IPC.</span><br />
<span style="font-family: Helvetica; orphans: 2; text-align: -webkit-auto; widows: 2;"><br /></span>
(Edited 10/21/13 to use MPS control daemon instead of MPS server)<br />
<br />
CUDA 5.5 has a new interesting feature, called CUDA Multi Process Service (MPS), for GPUs with compute capability 3.5.<br />
<br />
CUDA MPS, formerly known as CUDA Proxy, is a feature that allows multiple CUDA processes to share a single GPU context. NVIDIA officially supports configurations with a single GPU, but it is possible to run it on systems with multiple GPUs creating the MPS servers manually.<br />
This post will show how to enable this feature when multiple GPUs are present in a system.<br />
It is an unsupported but working configuration.<br />
<br />
The first thing to do it is to create a MPS control daemon for each GPU.<br />
We will use CUDA_VISIBLE_DEVICES to select each GPU and create two directories in /tmp for each MPS control daemon: one for the pipe, the other for the log. By default, CUDA MPS will try to create a log directory in /var/log, requiring the control daemon to be executed with root privileges. By selecting a log directory in /tmp (or any other directory of your choice that is accessible to normal users), we don't need root privileges to start the control daemons.<br />
<br />
#!/bin/bash<br />
<br />
<br />
# Number of gpus with compute_capability 3.5 per server<br />
NGPUS=2<br />
<br />
# Start the MPS control daemon for each GPU<br />
for ((i=0; i< $NGPUS; i++))<br />
<br />
do<br />
mkdir /tmp/mps_$i<br />
mkdir /tmp/mps_log_$i<br />
export CUDA_VISIBLE_DEVICES=$i<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i<br />
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_$i<br />
nvidia-cuda-mps-control -d<br />
done<br />
<br />
Once we have set up the control daemons we need to point the CUDA executable we want to run to the right MPS control daemon. This is done in a non-standard way.<br />
Instead of using the CUDA_VISIBLE_DEVICES variable, as normally done with CUDA, we will need to set CUDA_VISIBLE_DEVICES to 0 and select the explicit MPS pipe we want to use by specifying the proper CUDA_MPS_PIPE_DIRECTORY.<br />
<br />
<br />
To start two instances of a.out on GPU 0 using MPS, we will type:<br />
<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0<br />
./a.out<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0<br />
./a.out<br />
<br />
The execution script is a little more complex if we are running a MPI application.<br />
In this case, we will need to find a way to detect how many MPI processes are running on a node.<br />
OpenMPI has a variable that will tell us this info, other MPI implementations offer similar environment<br />
variables.<br />
<br />
This script shows how to run local processes 0 and 2 on GPU 0, and processes 1 and 3 on GPU 1.<br />
<br />
#!/bin/bash<br />
#run script for MPI<br />
export CUDA_VISIBLE_DEVICES=0<br />
lrank=$OMPI_COMM_WORLD_LOCAL_RANK<br />
case ${lrank} in<br />
[0])<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0<br />
./executable<br />
;;<br />
[1])<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_1<br />
./executable<br />
;;<br />
[2])<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0<br />
./executable<br />
;;<br />
[3])<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_1<br />
./executable<br />
;;<br />
esac<br />
<br />
Once the execution is completed, we need to clean up the MPS control daemons if other users are supposed to run on the system.<br />
<br />
#!/bin/bash<br />
<br />
# Stop the MPS control daemon for each GPU and clean up /tmp<br />
NGPUS=2<br />
for ((i=0; i< $NGPUS; i++))<br />
do<br />
echo $i<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i<br />
echo "quit" | nvidia-cuda-mps-control<br />
rm -fr /tmp/mps_$i<br />
rm -fr /tmp/mps_log_$i<br />
done<br />
<br />
The creation and clean-up could be combined in a single script.<br />
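The case statement above simply assigns local ranks to GPUs round-robin, so for larger nodes the same mapping can be computed instead of enumerated. A sketch of that mapping (the function name is hypothetical):

```python
def pipe_dir_for_rank(lrank, ngpus=2):
    """Round-robin mapping from an MPI local rank to the MPS pipe
    directory created for the corresponding GPU."""
    return "/tmp/mps_%d" % (lrank % ngpus)
```

With ngpus=2 this reproduces the case statement: ranks 0 and 2 share /tmp/mps_0, while ranks 1 and 3 share /tmp/mps_1.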
Using Thrust on CARMA<br />
<div style="font-family: Helvetica; font-size: 12px;">
Thrust is an excellent library for CUDA development.</div>
<div style="font-family: Helvetica; font-size: 12px;">
Unfortunately, Thrust is not present in the CARMA Toolkit but it is easy to install.</div>
<div style="font-family: Helvetica; font-size: 12px;">
<br /></div>
<div style="font-family: Helvetica; font-size: 12px;">
On the x86 development system, we are going to pull down the latest Thrust source using git.</div>
<div style="font-family: Helvetica; font-size: 12px;">
If git is not installed, we can easily add it to the system with:</div>
<div style="font-family: Helvetica; font-size: 12px; min-height: 14px;">
<br /></div>
<div style="font-size: 12px;">
<span style="font-family: Courier New, Courier, monospace;"> sudo apt-get install git</span></div>
<div style="font-family: Helvetica; font-size: 12px; min-height: 14px;">
<br /></div>
<div style="font-family: Helvetica; font-size: 12px; min-height: 14px;">
and then clone the git repository</div>
<div style="font-family: Helvetica; font-size: 12px; min-height: 14px;">
<br /></div>
<div style="font-size: 12px;">
<span style="font-family: Courier New, Courier, monospace;"> git clone https://github.com/thrust/thrust.git</span></div>
<div style="font-family: Helvetica; font-size: 12px;">
<br /></div>
<div style="font-family: Helvetica; font-size: 12px;">
<br /></div>
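Because Thrust is header-only, the clone above is the whole installation; a small sketch (the helper function is hypothetical, the path comes from the post) to confirm the headers are usable by reading the version macro:

```shell
# Report the THRUST_VERSION macro from a cloned Thrust tree, or "unknown"
# if the headers are not where we expect them.
thrust_version() {
    awk '/#define THRUST_VERSION /{print $3}' "$1/thrust/version.h" 2>/dev/null \
        || echo "unknown"
}
thrust_version /home/ubuntu/thrust
```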
<div style="font-family: Helvetica; font-size: 12px;">
We are now ready to cross-compile. Remember that Thrust is a template library: everything is built from include files.</div>
<div style="font-family: Helvetica; font-size: 12px;">
Using our standard Makefile, we just need to add the directory containing the Thrust include files (in this case /home/ubuntu/thrust). </div>
<div style="font-family: Helvetica; font-size: 12px;">
We also want to restrict code generation to arch sm_21 (the CARMA kit has a Quadro 1000M GPU with compute capability 2.1) to reduce compilation time.</div>
<div style="font-family: Helvetica; font-size: 12px;">
We are going to use one of the examples shipped with Thrust, monte_carlo.cu.</div>
<div style="font-family: Helvetica; font-size: 12px;">
<br /></div>
<div style="font-size: 12px;">
</div>
<span style="font-family: Courier New, Courier, monospace;">############################</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Makefile for cross-compile #</span><br />
<span style="font-family: Courier New, Courier, monospace;">############################</span><br />
<span style="font-family: Courier New, Courier, monospace;">all : monte_carlo</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">CUDA_HOME=/usr/local/cuda</span><br />
<span style="font-family: Courier New, Courier, monospace;">CC=/usr/bin/arm-linux-gnueabi-gcc</span><br />
<span style="font-family: Courier New, Courier, monospace;">NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32</span><br />
<span style="font-family: Courier New, Courier, monospace;">THRUST_LOC=/home/ubuntu/thrust</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">monte_carlo : monte_carlo.cu</span><br />
<span style="font-family: Courier New, Courier, monospace;"> $(NVCC) -O3 -arch sm_21 -o monte_carlo -I$(THRUST_LOC) monte_carlo.cu</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">clean:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> rm monte_carlo</span><br />
<div style="font-family: Helvetica;">
<br /></div>
<div style="font-family: Helvetica;">
Once we generate the executable, we can copy it to the CARMA</div>
<div style="font-family: Helvetica;">
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> scp monte_carlo ubuntu@carma:~</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Helvetica;">and execute it. We will see the number pi printed with two decimal digits (3.14).</span></div>
<div>
<span style="font-family: Helvetica;">If you want to see more digits, you can change the source code and set the precision to 6 instead of the original 2</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> std::cout << std::setprecision(6);</span></div>
<div style="font-family: Helvetica;">
<br /></div>
<br />
Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com0tag:blogger.com,1999:blog-5417613544042855153.post-26239870516277909852012-10-29T16:51:00.000-07:002012-10-31T00:51:29.403-07:00Setting up a CARMA kitI just received a brand new CARMA kit and I am going to post all the steps I did to get a working set-up.<br />
<br />
Let's start with the x86 development system. I am using a virtual machine on my Mac as my development system.<br />
<br />
I started by installing a fresh Ubuntu 11.04 distro and then proceeded to:<br />
<ul>
<li>Update the packages: </li>
<ul>
<li><i>sudo apt-get update</i></li>
</ul>
<li>Install the basic developer tools: </li>
<ul>
<li><i>sudo apt-get install build-essential</i></li>
</ul>
<li>Install the 32bit development libraries ( CARMA is 32bit ):</li>
<ul>
<li><i>sudo apt-get install ia32-libs</i></li>
</ul>
<li>Install the ARM cross compilers: </li>
<ul>
<li><em style="background-color: #f7f7f7; font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 13px; margin: 0px; padding: 0px; text-align: left;"><span lang="EN-GB" style="font-size: 10pt; margin: 0px; padding: 0px;">sudo apt-get install gcc-4.5-arm-linux-gnueabi g++-4.5-arm-linux-gnueabi</span></em></li>
</ul>
<li><span style="font-family: Times, Times New Roman, serif;">Install Fortran for both x86 and ARM (real developers use Fortran....):</span></li>
<ul>
<li><span style="font-family: Verdana, Arial, Helvetica, sans-serif; font-size: x-small;">sudo apt-get install gfortran-4.5-*</span></li>
</ul>
<li><span style="font-family: Verdana, Arial, Helvetica, sans-serif; font-size: x-small;">I</span><span style="font-family: Times, Times New Roman, serif;">nstall the CUDA Toolkit (available from http://www.seco.com/carmakit under the downloads tab): </span></li>
<ul>
<li><span style="font-family: Verdana, Arial, Helvetica, sans-serif; font-size: x-small;">sudo sh </span>cuda-linux-ARMv7-rel-4.2.10-13489154.run</li>
</ul>
<li>Edit .bashrc to add nvcc to the path. With your favorite editor add a line at the end of the file:</li>
<ul>
<li>export PATH=/usr/local/cuda/bin:$PATH</li>
</ul>
<li>Source the .bashrc to refresh the path (it will be automatically executed the next time you log in or open a terminal):</li>
<ul>
<li>. .bashrc</li>
</ul>
</ul>
We can check that nvcc is now in our path by invoking the compiler with the -V flag to check the version:<br />
<br />
<br />
max@ubuntu:~$ nvcc -V<br />
nvcc: NVIDIA (R) Cuda compiler driver<br />
Copyright (c) 2005-2012 NVIDIA Corporation<br />
Built on Tue_Jul_17_14:48:12_PDT_2012<br />
Cuda compilation tools, release 4.2, V0.2.1221<br />
<br />
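Before typing make, it can also help to confirm that both nvcc and the cross compiler are reachable; a small sketch (the check_tool helper is hypothetical, the tool names come from the installation steps above):

```shell
# Report whether a tool is visible on the PATH (a sketch, not part of the
# post's original workflow).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (check your PATH or the apt packages above)"
    fi
}
check_tool nvcc
check_tool arm-linux-gnueabi-gcc-4.5
```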
We are now ready to compile our first CUDA code, a comparison of repeated multiplications on the CPU and on the GPU.<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">#include "stdio.h"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">__global__ void kernel(int i, float *d_n)</span><br />
<span style="font-family: Courier New, Courier, monospace;">{</span><br />
<span style="font-family: Courier New, Courier, monospace;">*d_n *= 1.02f;</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">int main(){</span><br />
<span style="font-family: Courier New, Courier, monospace;"> float n = 1.0f, *d_n;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> float n_ref = 1.0f;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> int i;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> cudaMalloc((void **)&d_n, sizeof(float));</span><br />
<span style="font-family: Courier New, Courier, monospace;"> for(i = 1; i <= 10; i++)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> cudaMemcpy(d_n, &n, sizeof(float), cudaMemcpyHostToDevice);</span><br />
<span style="font-family: Courier New, Courier, monospace;"> kernel <<< 1, 1 >>> (i, d_n);</span><br />
<span style="font-family: Courier New, Courier, monospace;"> cudaMemcpy(&n, d_n, sizeof(float), cudaMemcpyDeviceToHost);</span><br />
<span style="font-family: Courier New, Courier, monospace;"> printf("%d\t\t%42.41f\t%42.41f\n", i, n, n_ref*=1.02f);</span><br />
<span style="font-family: Courier New, Courier, monospace;"> }</span><br />
<span style="font-family: Courier New, Courier, monospace;"> cudaFree(d_n);</span><br />
<span style="font-family: Courier New, Courier, monospace;"> return 0;</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<br />
<br />
We are going to use a Makefile similar to the one posted in the previous post.<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">max@ubuntu:~$ cat Makefile </span><br />
<span style="font-family: Courier New, Courier, monospace;">############################</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Makefile for cross-compile #</span><br />
<span style="font-family: Courier New, Courier, monospace;">############################</span><br />
<span style="font-family: Courier New, Courier, monospace;">all : gpu_test</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">CUDA_HOME=/usr/local/cuda</span><br />
<span style="font-family: Courier New, Courier, monospace;">CC=/usr/bin/arm-linux-gnueabi-gcc</span><br />
<span style="font-family: Courier New, Courier, monospace;">NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">gpu_test : gpu_test.cu</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>$(NVCC) -o gpu_test gpu_test.cu </span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">clean:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>rm gpu_test</span><br />
<br />
<br />
<br />
When we type make, we should see output similar to this:<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">max@ubuntu:~$ make</span><br />
<span style="font-family: Courier New, Courier, monospace;">/usr/local/cuda/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32 -o gpu_test gpu_test.cu </span><br />
<span style="font-family: Courier New, Courier, monospace;">/usr/lib/gcc/arm-linux-gnueabi/4.5.2/../../../../arm-linux-gnueabi/bin/ld: warning: libc.so, needed by /usr/arm-linux-gnueabi/lib//libgcc_s.so.1, not found (try using -rpath or -rpath-link)</span><br />
<br />
<br />
<br />
<div style="background-color: white; color: #333333; line-height: 18px; margin-bottom: 1.2em; max-width: 45em; padding: 0px; text-align: left; width: auto;">
<span style="font-family: Times, Times New Roman, serif;">Don't worry about the warning. This is caused by a bogus DT_NEEDED entry in the shared libgcc file /usr/arm-linux-gnueabi/lib/libgcc_s.so.1. "readelf -a" shows:<br /> 0x00000001 (NEEDED) Shared library: [libc.so]</span></div>
Before we can use the machine for any real CUDA development, there is an extra step to perform. The CUDA Toolkit is missing libcuda.so (on x86 it usually comes with the driver; don't ask me why it was not included in the ARM toolkit), so we will not be able to link any CUDA code until we bring this library to the x86 machine. We will do this step once we have the CARMA up and running.<br />
<br />
<br />
<div style="font-size: 16px;">
Unpack the CARMA, plug in a keyboard and mouse, plus the HDMI cable in the middle connector.</div>
<div style="font-size: 16px;">
Plug in the power and ethernet cable and you are ready to go.</div>
<div style="font-size: 16px;">
The first boot may be slow, since the system is building the NVIDIA driver. It is a blind boot: there is no console output until the GUI comes up, so you need a little patience.</div>
<div style="font-size: 16px; min-height: 19px;">
<br /></div>
<div style="font-size: 16px;">
Once the CARMA system boots, it will auto-login and start a terminal. It should also pick up an IP address (use ifconfig to find out which one). The default username/password is ubuntu/ubuntu.</div>
<div style="font-size: 16px; min-height: 19px;">
<br /></div>
<div style="font-size: 16px;">
We are ready to check if our cross-compilation worked. </div>
<div style="font-size: 16px;">
From inside the virtual machine, we will copy the file gpu_test to the CARMA (ifconfig is reporting 172.16.174.185):</div>
<div style="font-size: 16px;">
<br /></div>
<div style="font-size: 16px;">
<i>scp gpu_test ubuntu@172.16.174.185:~</i></div>
<div style="font-size: 16px; min-height: 19px;">
<br /></div>
<div style="font-size: 16px;">
Either from the CARMA terminal or from a remote shell, we can run gpu_test and check that the CPU and GPU results are the same.</div>
<div style="font-size: 16px; min-height: 19px;">
<br /></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">ubuntu@tegra-ubuntu:~$ ./gpu_test </span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">1<span class="Apple-tab-span" style="white-space: pre;"> </span>1.01999998092651367187500000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.01999998092651367187500000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">2<span class="Apple-tab-span" style="white-space: pre;"> </span>1.04039990901947021484375000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.04039990901947021484375000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">3<span class="Apple-tab-span" style="white-space: pre;"> </span>1.06120789051055908203125000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.06120789051055908203125000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">4<span class="Apple-tab-span" style="white-space: pre;"> </span>1.08243203163146972656250000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.08243203163146972656250000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">5<span class="Apple-tab-span" style="white-space: pre;"> </span>1.10408067703247070312500000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.10408067703247070312500000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">6<span class="Apple-tab-span" style="white-space: pre;"> </span>1.12616229057312011718750000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.12616229057312011718750000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">7<span class="Apple-tab-span" style="white-space: pre;"> </span>1.14868545532226562500000000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.14868545532226562500000000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">8<span class="Apple-tab-span" style="white-space: pre;"> </span>1.17165911197662353515625000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.17165911197662353515625000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">9<span class="Apple-tab-span" style="white-space: pre;"> </span>1.19509232044219970703125000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.19509232044219970703125000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">10<span class="Apple-tab-span" style="white-space: pre;"> </span>1.21899414062500000000000000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.21899414062500000000000000000000000000000</span></div>
<div style="font-size: 16px; min-height: 19px;">
<br /></div>
<div style="font-size: 16px;">
The CARMA filesystem is quite bare, so let's add a few useful packages:</div>
<ul>
<li style="font-size: 16px; margin: 0px;">Install Fortran:</li>
<ul>
<li style="font-size: 16px; margin: 0px;"><i>sudo apt-get install gfortran</i></li>
</ul>
</ul>
<div style="font-size: 16px;">
We need to install OpenMPI from source; the default packages don't seem to work.</div>
<div style="font-size: 16px;">
The latest source (1.6.2) has support for ARM; the installation is very simple, but it will take a while.</div>
<div style="font-size: 16px; min-height: 19px;">
<br />
Get the latest stable version </div>
<div style="font-size: 16px;">
wget http://www.open-mpi.org/software/ompi/v1.6/downloads/openmpi-1.6.2.tar.gz</div>
<div style="font-size: 16px;">
<br />
unpack it (tar xvfz openmpi-1.6.2.tar.gz) and change into the directory (cd openmpi-1.6.2).<br />
<br />
We are now ready to configure, build and install:<br />
./configure</div>
<div style="font-size: 16px;">
make -j 4<br />
sudo make install<br />
<br />
Add /usr/local/bin to your PATH and /usr/local/lib to your LD_LIBRARY_PATH<br />
<br />
<br />
<br /></div>
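The two environment additions can be written out as the exact lines to append to ~/.bashrc on the CARMA (the paths assume the default /usr/local prefix chosen by ./configure above):

```shell
# Make the freshly installed OpenMPI binaries and libraries visible
export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH:-}
```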
<br />
<br />Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com2tag:blogger.com,1999:blog-5417613544042855153.post-32012779379255222682012-09-30T17:43:00.001-07:002012-10-26T13:53:08.253-07:00Compiling for CARMA<div style="text-align: justify;">
In a few days, <a href="http://www.seco.com/en/item/carma-devkit/" target="_blank">CARMA</a> will finally be available to the general public. If you are not familiar with the CARMA project, it is the first ARM platform supporting CUDA.</div>
<div style="text-align: justify;">
It has a Tegra 3 with 4 cores and 2 GB of memory, ethernet, USB ports and a Quadro 1000M GPU (GF108 with 2 GB of memory, 96 CUDA cores, compute capability 2.1).</div>
<div style="text-align: justify;">
It has full OpenGL and CUDA support but, at the moment, no native CUDA compiler.</div>
<br />
The developer needs to cross-compile from a Linux x86 machine. This post shows how easy it is to cross-compile once we follow some simple instructions. I strongly suggest that you start with an Ubuntu machine; the cross-compilers are easily available on this platform.<br />
<br />
The first thing to do is to install the cross-compilers:<br />
<br />
sudo apt-get install g++-arm-linux-gnueabi gcc-arm-linux-gnueabi<br />
<br />
At this point, we will have the cross-compilers installed under /usr/bin/arm-linux-gnueabi-gcc and /usr/bin/arm-linux-gnueabi-g++.<br />
<br />
The second step is to install the CUDA Toolkit for ARM on the x86. If you choose the default location,<br />
the installer will create a directory /usr/local/cuda.<br />
<br />
If you need to use other libraries for ARM, you will also need to copy the libraries and corresponding header files from CARMA to the x86 machine. You can place them under /usr/local/arm_lib and /usr/local/arm_include or you can just put them under /usr/local/cuda/lib and /usr/local/cuda/include (my preference is the first option, to avoid polluting the CUDA installation).<br />
<br />
We are now ready to compile our code, taking care to use the cross compiler and the special nvcc in the CARMA toolkit. The following Makefile shows how to compile a simple C++ code that calls a CUBLAS function, and a simple CUDA code.<br />
<br />
<br />
############################<br />
# Makefile for cross-compile #<br />
############################<br />
all : dgemm_cublas simple_cuda<br />
<br />
CUDA_HOME=/usr/local/cuda<br />
CC=/usr/bin/arm-linux-gnueabi-gcc<br />
NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32<br />
<br />
<br />
# For a standard c++ code, we use CC and the CUDA ARM libraries<br />
dgemm_cublas : gemm_test.cpp<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>$(CC) gemm_test.cpp -I$(CUDA_HOME)/include -o dgemm_cublas -L$(CUDA_HOME)/lib -lcudart -lcublas<br />
<br />
# For a standard CUDA code, we just invoke nvcc<br />
simple_cuda: file.cu<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>$(NVCC) -o simple_cuda file.cu<br />
<br />
clean :<span class="Apple-tab-span" style="white-space: pre;"> </span><br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>rm -f *.o dgemm_cublas simple_cuda<br />
<br />
<br />
Once we generate the executable, since they are for ARM, we will not be able to execute them until we move them on CARMA.<br />
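A quick way to confirm that cross-compilation really produced ARM binaries before copying them over; a sketch (the executable names come from the Makefile above; reading the ELF header byte directly avoids depending on the wording of the 'file' utility):

```shell
# e_machine is the 2-byte field at offset 18 of the ELF header; a low byte of
# 40 (0x28) means 32-bit ARM, while x86-64 binaries carry 62 (0x3E).
check_arm() {
    m=$(od -An -t u1 -j 18 -N 1 "$1" 2>/dev/null | tr -d ' ')
    if [ "$m" = "40" ]; then
        echo "$1: ARM"
    else
        echo "$1: not ARM"
    fi
}
check_arm dgemm_cublas
check_arm simple_cuda
```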
<div>
<br /></div>
<br />
<br />
<br />Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com11tag:blogger.com,1999:blog-5417613544042855153.post-71062370380583580782011-11-12T09:17:00.000-08:002011-11-12T14:28:18.088-08:00MPI communications from GPU memory<div style="text-align: justify;">There are several groups working on MPI implementations capable of transferring data directly from GPU memory, as a result of the introduction of the Unified Virtual Addressing (UVA) in CUDA 4.0. The MVAPICH group is the first one to officially release a version with <a href="http://mvapich.cse.ohio-state.edu/overview/mvapich2">CUDA support</a>.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Being able to pass directly GPU pointers to MPI functions, greatly simplify the programming on clusters. For example, if the programmer needs to send data from GPU on system A to another GPU on system B, instead of the sequence:</div><div><ul><li>Transfer data from GPU memory to host memory on system A</li><li>Transfer data from host memory on system A to host memory on system B, for example using MPI_Send/Recv</li><li>Transfer data from host memory to GPU memory on system B</li></ul>could just issue the MPI_Send/Recv with the buffers located on GPU memory.</div><div>A GPU-aware MPI stack is also capable of optimizing the transfers under the hood via pipelining ( this could be explicitly programmed too, but having the library taking care of it is much more convenient).</div><div><br /></div><div style="text-align: justify;">In this blog, I am going to explain how to use the CUDA-enabled MVAPICH from CUDA Fortran. </div><div style="text-align: justify;"><br /></div><div>After downloading the <a href="http://mvapich.cse.ohio-state.edu/download/mvapich2/">tar file</a> from the MVAPICH web site, we need to configure the installation. 
Due to compatibility issues between CUDA and the PGI C compiler, we are going to use gcc for the C compiler and PGI Fortran for the Fortran one.</div><div><br /></div><div>We need to specify the location of the CUDA include files and libraries ( in this case, they are located in the standard location /usr/local/cuda ) and the path for MVAPICH ( I am installing on a cluster where all the apps are located in /share/apps).</div><div><div><br /></div></div><div><br /></div><div><code></code></div><div><code> FC=pgfortran F77=pgfortran FCFLAGS=-fast FFLAGS=-fast ./configure </code></div><div><code>--prefix=/share/apps/mvapich2-gpu </code></div><div><code>--enable-cuda </code></div><div><code>--with-cuda-include=/usr/local/cuda/include </code></div><div><code>--with-cuda-libpath=/usr/local/cuda/lib64</code></div><div></div><div><br /></div><div>The next steps are to run "make" and then "make install" ( for this last step, depending on the location of the installed software, you may need to have root privileges). You will also need to add the location of the binaries ( in this case /share/apps/mvapich2-gpu/bin ) to your path.</div><div><br /></div><div> We are now ready to write a CUDA Fortran code that uses MPI to transfer data between two GPUs. Each process initializes two arrays a_d and b_d, fill them with some values depending on the rank. Then, processor 0 sends a_d to processor 1. 
After 1 receives the data in b_d transfer the results back to the host array a and print the values.</div><div><br /></div><div><code></code></div><div><div><code>program mpi_test_gpu</code></div><div><code>use mpi</code></div><div><code>integer, allocatable:: a(:)</code></div><div><code>integer, device,allocatable:: a_d(:),b_d(:)</code></div><div><code>integer:: N, ierr, rank, num_procs, status(MPI_Status_size)</code></div><div><code><br /></code></div><div><code>call MPI_Init (ierr)</code></div><div><code>call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)</code></div><div><code>call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierr)</code></div><div><code><br /></code></div><div><code>N=4</code></div><div><code>allocate (a(N),a_d(N),b_d(N))</code></div><div><code>a_d=(rank+1)*10</code></div><div><code>b_d=(rank-1)*100</code></div><div><code><br /></code></div><div><code>a=-999</code></div><div><code>if ( rank == 0) then</code></div><div><code> call MPI_Send(a_d,N,MPI_INT,1,0,MPI_COMM_WORLD, ierr)</code></div><div><code>else </code></div><div><code> call MPI_Recv(b_d,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)</code></div><div><code>end if</code></div><div><code><br /></code></div><div><code>if (rank == 1) a=b_d</code></div><div><code><br /></code></div><div><code>print *,"Rank=",rank,"A=",a</code></div><div><code><br /></code></div><div><code>deallocate (a,a_d,b_d)</code></div><div><code><br /></code></div><div><code>call MPI_Finalize ( ierr )</code></div><div><code>end program mpi_test_gpu</code></div></div><div></div><div><br /></div><div>If the code is in a file with name mpi_test_gpu.cuf, we can generate an executable with the following command:</div><div><br /></div><div><code></code></div><div><code>mpif90 -O3 -o mpi_test_gpu mpi_test_gpu.cuf</code></div><div></div><div><br /></div><div style="text-align: justify;">We are now ready to run with the command mpirun_rsh. 
We need to pass a special flag, MV2_USE_CUDA=1, to enable the new GPU path ( or you can add </div><div style="text-align: justify;">export MV2_USE_CUDA=1 to your .bashrc to avoid to type it every time).</div><div style="text-align: justify;">We are going to use two nodes, c0-0 and c0-1, connected by Infiniband.</div><div><br /></div><div><code></code></div><div><div><code>mpirun_rsh -np 2 c0-0 c0-1 MV2_USE_CUDA=1 ./mpi_test_gpu</code></div><div><code> Rank= 0 A= -999 -999 -999 -999</code></div><div><code> Rank= 1 A= 10 10 10 10</code></div></div><div></div><div><br /></div><div>As expected, rank 1 contains the values 10, that was the value initially stored in a_d on rank 0.</div><div>MVAPICH also allows to send data from GPU to host memory and vice versa. </div><div>For example we could replace the lines:</div><div><br /></div><div><code></code></div><div><code>! Receive data to GPU array b_d from processor 0</code></div><div><div><code> call MPI_Recv(b_d,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)</code></div><div><code>...</code></div><div><code>! Copy GPU array b_d to CPU array a</code></div><div><code>if (rank == 1) a=b_d</code></div></div><div></div><div><br /></div><div>directly with</div><div><code></code></div><div><code>! 
Receive data to CPU array a from processor 0</code></div><div><div><code>call MPI_Recv(a,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)</code></div></div><div></div><div><br /></div>Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com2tag:blogger.com,1999:blog-5417613544042855153.post-28941636195680117952011-08-16T17:14:00.001-07:002011-08-16T19:03:42.091-07:00CUDA, MPI and Infiniband<p style="text-align: justify;">There is a lot of confusion around MPI codes that are using GPUs and Infiniband and what needs to be done to fix some problems occurring from the interaction of the CUDA runtime and Infiniband software stack ( OFED and MPI).</p><p style="text-align: justify;">Let's start with a simple program using 2 MPI processes that:</p><ul><li>allocate data on the CPU and GPU</li><li>initialize the data on the CPU</li><li>copy the data on the GPU</li><li>transfer the host data from one process to the other</li></ul><p style="text-align: justify;">The code is going to report the bandwidth of the transfer to the GPU and the bandwidth achieved by the network.</p><code>
<br />#include <stdio.h>
<br />#include <stdlib.h>
<br />#include <cuda.h>
<br />#include <cuda_runtime.h>
<br />#include <sys/time.h>
<br />#include <mpi.h>
<br />
<br />#define NREPEAT 10
<br />#define NBYTES 10.e6
<br />
<br />int main (int argc, char *argv[])
<br />{
<br />int rank, size, n, len, numbytes;
<br />void *a_h, *a_d;
<br />struct timeval time[2];
<br />double bandwidth;
<br />char name[MPI_MAX_PROCESSOR_NAME];
<br />MPI_Status status;
<br />
<br />MPI_Init (&argc, &argv);
<br />MPI_Comm_rank (MPI_COMM_WORLD, &rank);
<br />MPI_Comm_size (MPI_COMM_WORLD, &size);
<br />
<br />MPI_Get_processor_name(name, &len);
<br />printf("Process %d is on %s\n", rank, name);
<br />
<br />printf("Using regular memory \n");
<br />a_h = malloc(NBYTES);
<br />
<br />cudaMalloc( (void **) &a_d, NBYTES);
<br />
<br />/* Test host -> device bandwidth. */
<br />MPI_Barrier(MPI_COMM_WORLD);
<br />
<br />gettimeofday(&time[0], NULL);
<br />for (n=0; n<NREPEAT; n++)
<br />{
<br />cudaMemcpy(a_d, a_h, NBYTES, cudaMemcpyHostToDevice);
<br />}
<br />gettimeofday(&time[1], NULL);
<br />
<br />bandwidth = time[1].tv_sec - time[0].tv_sec;
<br />bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);
<br />bandwidth = NBYTES*NREPEAT/1.e6/bandwidth;
<br />
<br />printf("Host->device bandwidth for process %d: %f MB/sec\n",rank,bandwidth);
<br />
<br />/* Test MPI send/recv bandwidth. */
<br />MPI_Barrier(MPI_COMM_WORLD);
<br />
<br />gettimeofday(&time[0], NULL);
<br />for (n=0; n<NREPEAT; n++)
<br />{
<br />if (rank == 0)
<br />MPI_Send(a_h, NBYTES/sizeof(int), MPI_INT, 1, 0, MPI_COMM_WORLD);
<br />else
<br />MPI_Recv(a_h, NBYTES/sizeof(int), MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
<br />}
<br />gettimeofday(&time[1], NULL);
<br />
<br />bandwidth = time[1].tv_sec - time[0].tv_sec;
<br />bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);
<br />bandwidth = NBYTES*NREPEAT/1.e6/bandwidth;
<br />
<br />if (rank == 0)
<br />printf("MPI send/recv bandwidth: %f MB/sec\n", bandwidth);
<br />
<br />cudaFree(a_d);
<br />free(a_h);
<br />
<br />MPI_Finalize();
<br />return 0;
<br />}<span class="Apple-style-span" style="font-family:Georgia, serif;font-size:130%;">
<br /></span></code><p></p><div style="text-align: justify;">
<br /></div><div style="text-align: justify;">Since there are no CUDA kernels, there is no need to use nvcc. We can use mpicc (which for the moment we assume has been compiled with gcc), taking care of indicating the directories for the CUDA include files and CUDA libraries:</div><p></p><pre>mpicc -o mpi_malloc mpi_malloc.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart</pre><p style="text-align: justify;">Running this code on a cluster with nodes connected by QDR Infiniband adapters will generate output similar to this one:</p><pre>#mpirun -np 2 -host c0-0,c0-1 mpi_malloc
<br />Process 0 is on compute-0-0.local
<br />Using regular memory
<br />Process 1 is on compute-0-1.local
<br />Using regular memory
<br />Host->device bandwidth for process 0: 4699.248120 MB/sec
<br />Host->device bandwidth for process 1: 4323.950361 MB/sec
<br />MPI send/recv bandwidth: 2467.369044 MB/sec
<br /></pre>
<br /><p div="" style="text-align: justify;">Up to now, everything worked as expected: we were using a standard malloc to allocate the host memory. In order to improve the bandwidth over the PCI-e bus and, more importantly, to allow overlap of transfers to/from the GPU with kernel executions, we would like to use pinned memory. The modifications to the previous code are minimal: the malloc calls need to be replaced by cudaMallocHost and the free calls by cudaFreeHost. After changing the code and recompiling, we observe a problem when we try to run it. The code starts, and from the initial prints we can see that the pinned memory is giving us an improvement in bandwidth, but the run never completes.</p>
<br /><pre>#mpirun -np 2 -host c0-0,c0-1 mpi_pinned
<br />Process 1 is on compute-0-1.local
<br />Using pinned memory
<br />Process 0 is on compute-0-0.local
<br />Using pinned memory
<br />Host->device bandwidth for process 0: 5927.330923 MB/sec
<br />Host->device bandwidth for process 1: 5909.117769 MB/sec</pre>
<br /><p> If we attach a debugger to the process running on node c0-0, we will see that the code is stuck in MPI.
<br /></p>
<br /><pre><span class="Apple-style-span" style="font-family:Georgia, serif;font-size:130%;"><span class="Apple-style-span" style="white-space: normal;">
<br /></span></span>0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
<br />3175 btl_openib_component.c: No such file or directory.
<br /> in btl_openib_component.c
<br />(gdb) where
<br />#0 0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
<br />#1 0x00002b5172536394 in opal_progress () at runtime/opal_progress.c:207
<br />#2 0x00002b51751335ce in mca_pml_ob1_send (buf=0x13365420, count=46912503140448, datatype=0x0, dst=1, tag=16000000,
<br />sendmode=MCA_PML_BASE_SEND_SYNCHRONOUS, comm=0x6544a0) at pml_ob1_isend.c:125
<br />#3 0x00002b51720520b3 in PMPI_Send (buf=0x13365420, count=-1424633760, type=0x0, dest=1, tag=16000000, comm=0x0) at psend.c:72
<br />#4 0x0000000000404d1d in main () at ./mpi_pinned.c:69
<br /></pre>
<br /><p>
<br /></p><div style="text-align: justify;">Without going into details, the problem is caused by the way the CUDA runtime marks pages allocated as pinned memory and the way in which the Infiniband driver handles RDMA. At this point we have two solutions:</div><p></p><ol><li>Disable RDMA in MPI</li><li>Make the Infiniband driver and CUDA runtime compatible</li></ol>The first solution is very simple with OpenMPI: we just need to pass an additional flag (-mca btl_openib_flags 1) to mpirun, at the cost of lower bandwidth over Infiniband. Other MPI implementations will require a different switch or a recompilation with RDMA disabled.<div>
<br /><pre>mpirun -np 2 -host c0-0,c0-1 -mca btl_openib_flags 1 mpi_pinned
<br />Process 1 is on compute-0-1.local
<br />Using pinned memory
<br />Process 0 is on compute-0-0.local
<br />Using pinned memory
<br />Host->device bandwidth for process 0: 5907.023451 MB/sec
<br />Host->device bandwidth for process 1: 5877.858109 MB/sec
<br />MPI send/recv bandwidth: 2713.041591 MB/sec
<br /></pre><p>
<br /></p><div style="text-align: justify;">Before CUDA 4.0, the second solution was to install GPU Direct, a patch for the Linux kernel plus special NVIDIA and Mellanox drivers that eliminated the incompatibility. </div><div style="text-align: justify;">With CUDA 4.0 we have a new option: an environment variable that, when set to 1, changes the internal behavior of pinned memory allocation in the CUDA driver, removing the source of incompatibility with the Infiniband driver. If we set CUDA_NIC_INTEROP to 1 (for example, by adding the line "export CUDA_NIC_INTEROP=1" to our .bashrc file) and run the pinned version again, we will see that the code completes, and we also get better bandwidth since RDMA is now working.<p></p>
<br /><pre>mpirun -np 2 -host c0-0,c0-1 mpi_pinned
<br />Process 0 is on compute-0-0.local
<br />Using pinned memory
<br />Process 1 is on compute-0-1.local
<br />Using pinned memory
<br />Host->device bandwidth for process 0: 5904.930617 MB/sec
<br />Host->device bandwidth for process 1: 5901.445854 MB/sec
<br />MPI send/recv bandwidth: 3150.300854 MB/sec
<br /></pre>
<br /><p> This solution works with all the MPI implementations out there and it is very simple to use. So, forget about GPU Direct 1.0 and use this new method!!!</p></div></div>Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com4tag:blogger.com,1999:blog-5417613544042855153.post-85857803125225561192011-06-02T11:59:00.001-07:002011-06-07T11:25:13.763-07:00Calling Thrust from CUDA Fortran<div style="text-align: justify;">CUDA 4.0 ships with the <a href="http://code.google.com/p/thrust/">Thrust</a> library, a standard template library for GPU that offers several useful algorithms ( sorting, prefix sum, reduction). In the previous post I explained how to configure CUDA Fortran to use the 4.0 toolkit. Now I am going to show how to call Thrust from CUDA Fortran, in particular how to sort an array.</div><div><br /></div><div style="text-align: justify;">On the <a href="http://code.google.com/p/thrust/">Thrust</a> web page, there are a lot of examples and documentation. 
The basic idea of Thrust is to have <i>containers</i>, which manage host and device memory and simplify data transfers, <i>iterators</i>, which act like pointers and keep track of memory spaces, and <i>algorithms</i>, which are applied to containers.</div><div><br /></div><div>This is a simple Thrust code to sort an array of random data.</div><div><br /></div><div><code></code></div><code><div><pre>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void) {
  // define a vector of 16M int on the host
  thrust::host_vector<int> h_vec(1 << 24);

  // generate 16M random numbers on the host
  thrust::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer data to the device
  thrust::device_vector<int> d_vec = h_vec;

  // sort data on the device
  thrust::sort(d_vec.begin(), d_vec.end());

  // transfer data back to host
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

  return 0;
}
</pre></div></code><div></div><div><br /></div><div style="text-align: justify;">An important feature, which we 
will use to call Thrust from CUDA Fortran, is the conversion of Thrust objects to raw pointers and vice versa. This Thrust code snippet converts a device container to a standard C pointer that we could use to call a CUDA C kernel:</div><div><br /></div><div><code></code></div><code><div><pre>
// allocate device vector
thrust::device_vector<int> d_vec(4);

// obtain raw pointer to device vector's memory
int * ptr = thrust::raw_pointer_cast(&d_vec[0]);
</pre></div></code><div></div><div><br /></div><div style="text-align: justify;">The basic idea is to write a wrapper to the Thrust algorithms that handles standard C pointers, and then use the ISO C binding to call the wrapper. 
Since we want to sort an array, let's write a wrapper for the sort algorithm in Thrust.</div><div><br /></div><div><code></code></div><code><div><pre>
// Filename: csort.cu
// nvcc -c -arch sm_13 csort.cu
#include <thrust/device_vector.h>
#include <thrust/sort.h>

extern "C" {

//Sort for integer arrays
void sort_int_wrapper( int *data, int N)
{
  // Wrap raw pointer with a device_ptr
  thrust::device_ptr<int> dev_ptr(data);
  // Use device_ptr in Thrust sort algorithm
  thrust::sort(dev_ptr, dev_ptr+N);
}

//Sort for float arrays
void sort_float_wrapper( float *data, int N)
{
  thrust::device_ptr<float> dev_ptr(data);
  thrust::sort(dev_ptr, dev_ptr+N);
}

//Sort for double arrays
void sort_double_wrapper( double *data, int N)
{
  thrust::device_ptr<double> dev_ptr(data);
  thrust::sort(dev_ptr, dev_ptr+N);
}

}
</pre></div></code><div></div><div><br /></div><div>We can compile the code using</div><div><code></code></div><code><div> nvcc -c -arch sm_13 csort.cu </div></code><div></div><div>This will generate an object file, csort.o, that we will use later on in the linking stage of the CUDA Fortran code.</div><div><br /></div><div>The other missing piece is the interface to these C functions.</div><div>We will define a generic interface thrustsort that, depending on the kind of data (integer, single precision or double precision), will call the correct sort function:</div><div><br
/></div><div><div><code></code></div><code><div>module thrust</div><div><br /></div><div>interface thrustsort</div><div> subroutine sort_int( input,N) bind(C,name="sort_int_wrapper")</div><div> use iso_c_binding</div><div> integer(c_int),device:: input(*)</div><div> integer(c_int),value:: N</div><div> end subroutine</div><div><br /></div><div> subroutine sort_float( input,N) bind(C,name="sort_float_wrapper")</div><div> use iso_c_binding</div><div> real(c_float),device:: input(*)</div><div> integer(c_int),value:: N</div><div> end subroutine</div><div><br /></div><div> subroutine sort_double( input,N) bind(C,name="sort_double_wrapper")</div><div> use iso_c_binding</div><div> real(c_double),device:: input(*)</div><div> integer(c_int),value:: N</div><div> end subroutine</div><div><br /></div><div>end interface</div><div><br /></div><div>end module thrust</div></code><div></div></div><div><br /></div><div>At this point we have all we need to write the CUDA Fortran code:</div><div><br /></div><div><code></code></div><code><div><div>program testsort</div><div>use thrust</div><div>real, allocatable :: cpuData(:)</div><div>real, allocatable, device :: gpuData(:)</div><div>integer:: N=10</div><div>allocate(cpuData(N))</div><div>allocate(gpuData(N))</div><div><br /></div><div>do i=1,N</div><div> cpuData(i)=random(i)</div><div>end do</div><div>cpuData(5)=100.</div><div><br /></div><div>print *,"Before sorting", cpuData</div><div><br /></div><div>gpuData=cpuData</div><div><br /></div><div>call thrustsort(gpuData,size(gpuData))</div><div><br /></div><div>cpuData=gpuData</div><div><br /></div><div>print *,"After sorting", cpuData</div><div>end program</div></div></code><div></div><div><br /></div><div>If we save the module in a file thrust_module.cuf and the program in sample_sort.cuf, we are ready to compile and execute:</div><div><br /></div><div><code></code></div><code><div><div>$ pgf90 -rc=rc4.0 -Mcuda=cc20 -O3 thrust_module.cuf sample_sort.cuf 
csort.o</div><div>thrust_module.cuf:</div><div>sample_sort.cuf:</div><div><br /></div><div>$ ./a.out </div><div> Before sorting 4.1630346E-02 0.9124327 0.7832350 0.6540373 </div><div> 100.0000 0.3956419 0.2664442 0.1372465 </div><div> 8.0488138E-03 0.8788511 </div><div><br /></div><div> After sorting 8.0488138E-03 4.1630346E-02 0.1372465 0.2664442 </div><div> 0.3956419 0.6540373 0.7832350 0.8788511 </div><div> 0.9124327 100.0000 </div></div></code><div></div><div><br /></div><div>The code is very simple:</div><div><ul><li>declare two arrays, cpuData and gpuData.</li><li>allocate them using the standard <i>allocate</i></li><li>copy cpuData from the host to gpuData on the GPU with a simple assignment</li><li>call the Thrust sort routine</li><li>copy sorted array back to the host</li><li>print the sorted array</li></ul></div><div>Now that we have verified that everything is working as expected, we can modify the code to do some timing using cudaEvents.</div><div><br /></div><div><code></code></div><code><div><div>program timesort</div><div>use cudafor</div><div>use thrust</div><div>implicit none</div><div>real, allocatable :: cpuData(:)</div><div>real, allocatable, device :: gpuData(:)</div><div>integer:: i,N=100000000</div><div><br /></div><div>! cuda events for elapsing</div><div>type ( cudaEvent ) :: startEvent , stopEvent</div><div>real :: time, random</div><div>integer :: istat</div><div><br /></div><div>! Create events</div><div>istat = cudaEventCreate ( startEvent )</div><div>istat = cudaEventCreate ( stopEvent )</div><div><br /></div><div>! 
Allocate arrays</div><div>allocate(cpuData(N))</div><div>allocate(gpuData(N))</div><div><br /></div><div>do i=1,N</div><div> cpuData(i)=random(i)</div><div>end do</div><div><br /></div><div>print *,"Sorting array of ",N, " single precision"</div><div><br /></div><div>gpuData=cpuData</div><div><br /></div><div>istat = cudaEventRecord ( startEvent , 0)</div><div>call thrustsort(gpuData,size(gpuData))</div><div><br /></div><div>istat = cudaEventRecord ( stopEvent , 0)</div><div>istat = cudaEventSynchronize ( stopEvent )</div><div>istat = cudaEventElapsedTime ( time , startEvent , stopEvent )</div><div><br /></div><div>cpuData=gpuData</div><div><br /></div><div>print *," Sorted array in:",time," (ms)"</div><div><br /></div><div>!Print the first five elements and the last five.</div><div>print *,"After sorting", cpuData(1:5),cpuData(N-4:N)</div><div>end program</div></div></code><div></div><div><br /></div><div>We can sort a vector of 100M elements in 0.222 seconds on a Tesla M2050 with ECC on, when the data are resident in GPU memory.</div><div><br /></div><div><code></code></div><code><div><div>pgf90 -O3 -rc=rc4.0 -Mcuda=cc20 thrust_module.cuf time_sort.cuf csort.o -o time_sort</div><div>thrust_module.cuf:</div><div>time_sort.cuf:</div><div><br /></div><div>$ ./time_sort</div><div> Sorting array of 100000000 single precision</div><div> Sorted array in: 222.1711 (ms)</div><div> After sorting 7.0585919E-09 1.0318221E-08 1.9398616E-08 3.1738640E-08 </div><div> 4.4078664E-08 0.9999999 0.9999999 1.000000 </div><div> 1.000000 1.000000 </div></div><div><div>./a.out </div><div> Sorting array of 100000000 single precision</div><div> Sorted array in: 225.0452 (ms)</div><div> After sorting 7.0585919E-09 1.0318221E-08 1.9398616E-08 3.1738640E-08 </div><div> 4.4078664E-08 0.9999999 0.9999999 0.9999999 </div><div> 1.000000 1.000000 
</div></div></code><div></div>Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com2tag:blogger.com,1999:blog-5417613544042855153.post-29301399502563719682011-05-10T12:57:00.000-07:002011-05-10T13:06:07.757-07:00Using CUDA 4.0 from CUDA FortranIt is possible to use CUDA 4.0 RC2 with CUDA Fortran.<br /><br />Assuming that the CUDA 4.0 toolkit is installed in the location /usr/local/cuda, you will need to create a file rc4.0 containing the following lines:<br /><blockquote><br />set CUDAROOT=/usr/local/cuda;<br />set CUDAVERSION=4.0;<br /></blockquote><br /><br />When you compile your .cuf files, you will need to pass this rc file with the -rc flag, and add the -L flag if you are using libraries from the 4.0 toolkit:<br /><br /><blockquote> pgf90 -rc=rc4.0 -Mcuda=cc20,nofma myfile.cuf -L/usr/local/cuda/lib64 -lcufft -lcurand</blockquote><br /><br />You can check whether the compiler is picking up the new toolkit by running ldd on the executable.Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com2tag:blogger.com,1999:blog-5417613544042855153.post-8188472489432925102010-07-19T22:11:00.000-07:002010-08-31T17:20:01.528-07:00Using zero copy from Fortran<div style="text-align: justify;">This time I am going to show how to use the zero copy feature in CUDA C from a generic Fortran 90 compiler. Since I am not going to use CUDA Fortran, we will need to use the ISO C binding feature available in pretty much all the Fortran 90 compilers (PGI, Intel, g95, gfortran, just to cite a few).</div><br /><div style="text-align: justify;">The basic idea is to use the original CUDA C functions to allocate host arrays that are page-locked (aka pinned) and with the right attributes to be used by the zero copy feature of CUDA. 
If you are not familiar with the zero-copy feature in CUDA C, it allows compute kernels to share host system memory and provides zero-copy support for direct access to host system memory when running on many newer CUDA-enabled graphics processors. There is no need to do cudaMemcpy.</div><br />To declare the mapped array, we will need to perform the following steps:<br /><ol><li><i>Set the device flag for mapping host memory:</i> this is achieved with a call to the cudaSetDeviceFlags with the flag cudaDeviceMapHost.</li><li><i>Allocate the host mapped arrays</i>: this is achieved with cudaHostAlloc with the flag cudaHostAllocMapped.</li><li><i>Get the device pointers to the mapped memory</i>. These are the pointers that we will pass to the CUDA kernels. This is achieved with calls to cudaHostGetDevicePointer.</li></ol><br />Since we are using a standard Fortran 90 compiler, we can't use the built in allocator ( it has no knowledge of pinned memory). We need to do a couple of extra steps: call the CUDA allocator in C, and then pass the C pointer to Fortran using the function C_F_Pointer provided by the iso C bindings.<br /><div><br /></div><div>Let's start with a module that declares the interfaces to the CUDA runtime functions that we will need: cudaHostAlloc, cudaFree and cudaSetDeviceFlag</div><br /><code><br /><pre><br />!<br />! Module to interface the CUDA runtime functions<br />!<br /><br />module cuda_runtime<br /><br /> integer,parameter:: cudaHostAllocPortable=1, &<br /> cudaHostAllocMapped= 2, &<br /> cudaDeviceMapHost=8<br /><br /> interface<br />!<br />! cudaHostAlloc<br />!<br /> integer function cudaHostAlloc(buffer, size ,flag) bind(C,name="cudaHostAlloc") <br /> use iso_c_binding<br /> implicit none <br /> type (C_PTR) :: buffer <br /> integer (C_SIZE_T), value :: size<br /> integer (C_INT), value :: flag<br /> end function cudaHostAlloc<br />!<br />! 
cudaFreeHost<br />!<br /> integer function cudaFreeHost(buffer) bind(C,name="cudaFreeHost") <br /> use iso_c_binding <br /> implicit none <br /> type (C_PTR), value :: buffer<br /> end function cudaFreeHost<br />!<br />! cudaSetDeviceFlag<br />!<br /> integer function cudaSetDeviceFlags(flag) bind(C,name="cudaSetDeviceFlags")<br /> use iso_c_binding<br /> implicit none <br /> integer (C_INT), value :: flag<br /> end function cudaSetDeviceFlags<br /> <br /> end interface<br />end module cuda_runtime<br /></pre><br /></code><br /><br /><div style="text-align: justify;">Now that we have a working interface to the CUDA runtime, let's write a simple Fortran program that compute the exponential of each element of a double precision array, both on the CPU and GPU. </div><div style="text-align: justify;">A is the input array, C is the output array from the GPU computation. Since we want to use the zero copy features on these two, we will allocate them with cudaHostAlloc. B is an array that we will use to compute a reference solution on the CPU. We will use the standard Fortran allocator for this one.</div><br /><code><br /><pre><br />!<br />! main.f90<br />!<br />program main<br /><br /> use iso_c_binding<br /> use cuda_runtime<br /> implicit none<br /><br /> integer, parameter :: fp_kind = kind(0.0d0) ! Double precision<br /> <br /> real(fp_kind) ,pointer, dimension (:) :: A,C<br /> real(fp_kind) ,allocatable, dimension (:) :: B<br /> type(C_PTR)::cptr_A,cptr_C<br /><br /> integer i, N, seed<br /> integer err<br /><br />! Number of elements in the arrays<br /> N=10<br /><br />! Initialize the random number generator<br /> seed=1<br /> call random_seed(seed)<br /><br />! 
Allocate A and C using cudaHostAlloc and then map the C pointer to Fortran arrays<br /><br /> write(*,*)'Allocate host memory'<br /> err=cudaSetDeviceFlags(cudaDeviceMapHost)<br /> if (err > 0) print *,"Error in setting cudaSetDeviceFlags=",err<br /><br /> err = cudaHostAlloc(cptr_A,N*sizeof(1.0_fp_kind),cudaHostAllocMapped)<br /> if (err > 0) print *,"Error in allocating A with cudaHostAlloc =",err<br /> call c_f_pointer(cptr_A,A,(/N/))<br /><br /> err = cudaHostAlloc(cptr_C,N*sizeof(1.0_fp_kind),cudaHostAllocMapped)<br /> if (err > 0) print *,"Error in allocating C with cudaHostAlloc =",err<br /> call c_f_pointer(cptr_C,C,(/N/))<br /><br />! From this point on, we can use A and C as normal Fortran arrays<br /><br /><br />! Allocate B using the standard allocate call<br /> allocate(B(N))<br /><br />! Initialize A with random numbers<br /> call random_number(A)<br /><br /><br />! compute the reference solution on the CPU<br /> write(*,*)'computation on CPU'<br /> do i = 1, N<br /> B(i) = dexp(A(i))<br /> enddo<br /><br />! same computation on the GPU<br /> write(*,*)'computation on GPU'<br /> call gexp(A,C,N)<br /><br />! Print the computed quantities<br /> do i = 1, N<br /> write (*,'(i2,1x,4(g14.8))'),i,A(i),B(i),C(i),C(i)-B(i)<br /> enddo<br /><br />! Release memory<br /> deallocate(B)<br /> err = cudaFreeHost (cptr_A)<br /> err = cudaFreeHost (cptr_C)<br /><br />end program Main<br /></pre><br /></code><br />Since we are using standard Fortran, we will need to write the computation on the GPU using CUDA C. 
When interfacing C and Fortran, it is important to remember that while arguments in C are passed by values, in Fortran they are passed by reference.<br /><code><br /><pre><br />/*<br /> kernel_code.cu<br />*/<br />#include <stdio.h><br /><br />// Device code<br />__global__ void CUDAexp(double* b, double* c, int N) {<br /> int index = threadIdx.x+blockDim.x*blockIdx.x;<br /> if( index < N) c[index] = exp(b[index]);<br />}<br /><br /><br />extern "C" void gexp_(double *a, double *d, int* N1)<br />{<br /> double *b,*c;<br /> int N=*N1;<br /> cudaError_t statusb,statusc,err;<br /><br /><br /> statusb=cudaHostGetDevicePointer((void **)&b, (void *) a, 0);<br /> statusc=cudaHostGetDevicePointer((void **)&c, (void *) d, 0);<br /><br /> if (statusb != 0 || statusc !=0) {<br /> printf("Error when locating memory to arrays on device!\n");<br /> printf("%s\n",cudaGetErrorString(statusb));<br /> printf("%s\n",cudaGetErrorString(statusc));<br /> }<br /><br /> // Call the CUDA kernel, just one block for this simple example.<br /> CUDAexp<<<1,N>>>(b,c, N);<br /><br /> err=cudaGetLastError();<br /> if(err != 0) printf("Error in kernel execution\n");<br /> <br /> // This is very important to retrieve the correct values<br /> cudaThreadSynchronize();<br />}<br /></pre><br /></code><br />Now that we have all the files, let's write a simple makefile:<br /><code><br /><pre><br />all: TestZeroCopy<br /><br />TestZeroCopy: main.f90 kernel_code.o<br /> ifort -o TestZeroCopy main.f90 kernel_code.o -L/usr/local/cuda/lib64 -lcudart -lstdc++<br /><br />kernel_code.o: kernel_code.cu<br /> nvcc -c -O3 -arch sm_13 kernel_code.cu<br /><br />clean:<br /> rm kernel_code.o TestZeroCopy cuda_runtime.mod <br /></pre><br /></code><br /><br />Compiling and running the code will show the following output:<br /><pre><br />$./TestZeroCopy <br /><br /> Allocate host memory<br /> computation on CPU<br /> computation on GPU<br /> 1 0.39208682E-06 1.0000004 1.0000004 0.0000000 <br /> 2 0.25480443E-01 1.0258078 
1.0258078 0.0000000 <br /> 3 0.35251616 1.4226426 1.4226426 0.0000000 <br /> 4 0.66691448 1.9482168 1.9482168 0.0000000 <br /> 5 0.96305553 2.6196888 2.6196888 0.44408921E-15<br /> 6 0.83828820 2.3124052 2.3124052 -.44408921E-15<br /> 7 0.33535504 1.3984368 1.3984368 -.22204460E-15<br /> 8 0.91532720 2.4975923 2.4975923 0.0000000 <br /> 9 0.79586368 2.2163544 2.2163544 -.44408921E-15<br />10 0.83269314 2.2995033 2.2995033 0.44408921E-15<br /></pre>Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com1tag:blogger.com,1999:blog-5417613544042855153.post-47061613807706308802010-05-24T10:09:00.001-07:002010-05-24T10:35:15.316-07:00Calling CUFFT from Cuda FortranThis example shows how to call CUFFT from CUDA Fortran.<br />We are still going to use iso_c_binding to wrap the CUFFT functions, like we did for CUBLAS.<br /><br />There are a few points to outline in the wrapper:<br /><br />CUFFT uses plans (opaque objects) to store information about the transforms and their auxiliary arrays. We will treat a plan as an integer in Fortran. The calls that create and destroy a plan will generate all the proper information; the integer is just a handle to the opaque object.<br /><br />CUFFT uses several constants (CUFFT_C2C, CUFFT_FORWARD, just to name a few). Some of them are defined as hex numbers.<br />Remember that to express a hex number in Fortran, you need to remove the 0x prefix and use the Z notation: <br />CUFFT_R2C=0x2a will be defined as CUFFT_R2C=Z'2a' in Fortran.<br /><br />To keep the code simple, we just show the wrappers for the creation and destruction of a plan (cufftPlan1d and cufftDestroy) and for the execution of complex-to-complex transforms in both single (cufftExecC2C) and double (cufftExecZ2Z) precision. Adding wrappers for the other plan creation and execution functions is very simple.<br /><br /><code><br />!<br />!
Define the INTERFACE to the NVIDIA CUFFT routines<br />!<br /><br />module cufft<br /><br /> integer, public :: CUFFT_FORWARD = -1<br /> integer, public :: CUFFT_INVERSE = 1<br /> integer, public :: CUFFT_R2C = Z'2a' ! Real to Complex (interleaved)<br /> integer, public :: CUFFT_C2R = Z'2c' ! Complex (interleaved) to Real<br /> integer, public :: CUFFT_C2C = Z'29' ! Complex to Complex, interleaved<br /> integer, public :: CUFFT_D2Z = Z'6a' ! Double to Double-Complex<br /> integer, public :: CUFFT_Z2D = Z'6c' ! Double-Complex to Double<br /> integer, public :: CUFFT_Z2Z = Z'69' ! Double-Complex to Double-Complex<br /><br /><br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br />!<br />! cufftPlan1d(cufftHandle *plan, int nx,cufftType type,int batch)<br />!<br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br /><br /><br /> interface cufftPlan1d<br /> subroutine cufftPlan1d(plan, nx, type, batch) bind(C,name='cufftPlan1d') <br /> use iso_c_binding<br /> integer(c_int):: plan<br /> integer(c_int),value:: nx, batch,type<br /> end subroutine cufftPlan1d<br /> end interface cufftPlan1d<br /> <br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br />!<br />! cufftDestroy(cufftHandle plan)<br />!<br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br /><br /> interface cufftDestroy<br /> subroutine cufftDestroy(plan) bind(C,name='cufftDestroy') <br /> use iso_c_binding<br /> integer(c_int),value:: plan<br /> end subroutine cufftDestroy<br /> end interface cufftDestroy<br /><br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br />!<br />! cufftExecC2C(cufftHandle plan,<br />! cufftComplex *idata,<br />! cufftComplex *odata,<br />! 
int direction)<br />!<br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br /> <br /> interface cufftExecC2C<br /> subroutine cufftExecC2C(plan, idata, odata, direction) &<br /> & bind(C,name='cufftExecC2C') <br /> use iso_c_binding<br /> use precision<br /> integer(c_int),value:: direction<br /> integer(c_int),value:: plan<br /> complex(fp_kind),device:: idata(*),odata(*)<br /> end subroutine cufftExecC2C<br /> end interface cufftExecC2C<br /><br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br />!<br />! cufftExecZ2Z(cufftHandle plan,<br />! cufftDoubleComplex *idata,<br />! cufftDoubleComplex *odata,<br />! int direction);<br />!<br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br /> interface cufftExecZ2Z<br /> subroutine cufftExecZ2Z(plan, idata, odata, direction) &<br /> & bind(C,name='cufftExecZ2Z') <br /> use iso_c_binding<br /> use precision<br /> integer(c_int),value:: direction<br /> integer(c_int),value:: plan<br /> complex(fp_kind),device:: idata(*),odata(*)<br /> end subroutine cufftExecZ2Z<br /> end interface cufftExecZ2Z<br /><br />end module cufft<br /><br /></code><br /><br />With the cufft wrapper and the precision module, we have all we need to write a simple program that performs a forward transform out of place, followed by an inverse transform in place. Since the output of CUFFT is not normalized, we should<br />see the final array equal to the initial one scaled by the length of the transform.<br /><br /><code><br />program fft_test<br />use precision<br />use cufft<br />complex(fp_kind) ,allocatable:: a(:),b(:)<br />complex(fp_kind),device,allocatable:: a_d(:),b_d(:)<br /><br />integer:: n<br />integer:: plan<br /><br />n=8<br /><br />! allocate arrays on the host<br />allocate (a(n),b(n))<br /><br />! allocate arrays on the device<br />allocate (a_d(n))<br />allocate (b_d(n))<br /><br />!initialize arrays on host<br />a=1<br /><br />!copy arrays to device<br />a_d=a<br /><br /><br />!
Print initial array<br />print *, "Array A:"<br />print *, a<br /><br /><br /><br />! Initialize the plan<br /> call cufftPlan1D(plan,n,CUFFT_Z2Z,1)<br /><br />! Execute FFTs<br /> call cufftExecZ2Z(plan,a_d,b_d,CUFFT_FORWARD)<br /><br /> call cufftExecZ2Z(plan,b_d,b_d,CUFFT_INVERSE)<br /><br /><br />! Copy results back to host<br /> b=b_d<br /><br />! Print result array<br />print *, "Array B"<br />print *, b<br /><br />!release memory on the host<br />deallocate (a,b)<br /><br />!release memory on the device<br />deallocate (a_d,b_d)<br /><br />! Destroy the plan<br /> call cufftDestroy(plan)<br /><br />end program fft_test<br /><br /></code><br /><br />To compile the new example, we will repeat what we did for the CUBLAS example. This time, instead of linking CUBLAS, we will link CUFFT.<br /><br /><code><br />pgf90 -Mcuda -o test_fft test_fft.cuf -L/usr/local/cuda/lib64 -lcufft<br /></code><br /><br />If we execute the code, we should see this output:<br /><verbatim><br />./test_fft<br /> <br />Array A:<br /> (1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000) <br /> (1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000) <br /> (1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000) <br /> (1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000)<br /><br /> Array B<br /> (8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000) <br /> (8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000) <br /> (8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000) <br /> (8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000)<br /><br /></verbatim><br /><br />As expected, the output is the input multiplied by the length of the transform (8 in this
case).Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com6tag:blogger.com,1999:blog-5417613544042855153.post-28948802236584733332010-05-18T14:57:00.000-07:002010-05-18T15:31:11.013-07:00Calling CUBLAS from CUDA FortranThis is a simple example that shows how to call a CUBLAS function (SGEMM or DGEMM) from CUDA Fortran.<br /><br /><br />Let's start by defining a couple of modules that we will use in the example.<br />The first one defines the precision we are going to use:<br /><br /><code><br />module precision<br />! Precision control<br /><br /> integer, parameter, public :: Single = kind(0.0) ! Single precision<br /> integer, parameter, public :: Double = kind(0.0d0) ! Double precision<br /><br /> integer, parameter, public :: fp_kind = Double<br /> !integer, parameter, public :: fp_kind = Single<br /><br />end module precision<br /></code><br /><br /><br />Selecting Single or Double for fp_kind will allow us to use the same code for single and double precision.<br /><br />CUBLAS, a BLAS library for CUDA, has a C interface. We are going to use iso_c_binding and the interface construct to be able to call the functions in this library directly from Fortran.<br /><br /><code><br />module cublas<br />!<br />! Define the INTERFACE to the NVIDIA C code cublasSgemm and cublasDgemm<br />!<br /> interface cuda_gemm<br />!<br />! void cublasSgemm (char transa, char transb, int m, int n,<br />! int k, float alpha, const float *A, int lda,<br />!
const float *B, int ldb, float beta, float *C, int ldc)<br />!<br /> subroutine cuda_sgemm(cta, ctb, m, n, k,&<br /> alpha, A, lda, B, ldb, beta, c, ldc) bind(C,name='cublasSgemm')<br /> use iso_c_binding<br /> character(1,c_char),value :: cta, ctb<br /> integer(c_int),value :: m,n,k,lda,ldb,ldc<br /> real(c_float),value :: alpha,beta<br /> real(c_float), device, dimension(lda,*) :: A<br /> real(c_float), device, dimension(ldb,*) :: B<br /> real(c_float), device, dimension(ldc,*) :: C<br /> end subroutine cuda_sgemm<br /><br />!<br />! void cublasDgemm (char transa, char transb, int m, int n,<br />! int k, double alpha, const double *A, int lda,<br />! const double *B, int ldb, double beta, double *C, int ldc)<br />!<br /> subroutine cuda_dgemm(cta, ctb, m, n, k,&<br /> alpha, A, lda, B, ldb, beta, c, ldc) bind(C,name='cublasDgemm')<br /> use iso_c_binding<br /> character(1,c_char),value :: cta, ctb<br /> integer(c_int),value :: m,n,k,lda,ldb,ldc<br /> real(c_double),value :: alpha,beta<br /> real(c_double), device, dimension(lda,*) :: A<br /> real(c_double), device, dimension(ldb,*) :: B<br /> real(c_double), device, dimension(ldc,*) :: C<br /> end subroutine cuda_dgemm<br /><br /> end interface<br /><br />end module cublas<br /><br /></code><br /><br />At this point we have all we need to write a simple example that will allocate the matrices A, B and C on the CPU and GPU, initialize them on the CPU, copy them to the GPU, call the appropriate GEMM (depending on the precision selected), and transfer the result back to the CPU.<br /><br /><code><br />program gemm_test<br />use precision<br />use cublas<br />real(fp_kind) ,allocatable:: a(:,:),b(:,:),c(:,:)<br />real(fp_kind),device,allocatable:: a_d(:,:),b_d(:,:),c_d(:,:)<br />real(fp_kind):: alpha,beta<br />integer:: n,m,k<br /><br />n=4<br />m=4<br />k=4<br />alpha=1._fp_kind<br />beta=2._fp_kind<br /><br />!
allocate arrays on the host<br />allocate (a(m,k))<br />allocate (b(k,n))<br />allocate (c(m,n))<br /><br />! allocate arrays on the device<br />allocate (a_d(m,k))<br />allocate (b_d(k,n))<br />allocate (c_d(m,n))<br /><br />!initialize arrays on host<br />a=1<br />b=2<br />c=3<br /><br />!copy arrays to device<br />a_d=a<br />b_d=b<br />c_d=c<br /><br /><br />print *, "Matrix A:"<br />print *, a<br /><br />print *, "Matrix B:"<br />print *, b<br />print *, "Matrix C:"<br />print *, c<br /><br />call cuda_gemm ('N','N',m,n,k,alpha,a_d,m,b_d,k,beta,c_d,m)<br /><br />c=c_d <br />print *, "Matrix C = alpha A*B+ beta C"<br />print *, c<br /><br />!release memory on the host<br />deallocate (a,b,c)<br /><br />!release memory on the device<br />deallocate (a_d,b_d,c_d)<br /><br />end program gemm_test<br /><br /></code><br /><br />We will need to compile this code with the CUDA Fortran compiler from Portland Group.<br /><br />You should copy the code into a file test_gemm.cuf. It is important to use the right suffix, since we are using the device qualifier that is specific to CUDA Fortran. You can choose any name you want, but you need to remember to use the .cuf suffix. <br /><br />We are now ready to compile. We could create a Makefile, but for this simple example we can just invoke the compiler from the command line. 
We need to use the -Mcuda flag and then give the location and the name of the library (cublas) we want to link against.<br /><br /><code><br /> pgf90 -Mcuda -o test_gemm test_gemm.cuf -L/usr/local/cuda/lib64 -lcublas<br /></code><br /><br />When you run the generated executable (test_gemm), you should see an output similar to this one:<br /><br /><code><br />Matrix A:<br /> 1.000000000000000 1.000000000000000 1.000000000000000 <br /> 1.000000000000000 1.000000000000000 1.000000000000000 <br /> 1.000000000000000 1.000000000000000 1.000000000000000 <br /> 1.000000000000000 1.000000000000000 1.000000000000000 <br /> 1.000000000000000 1.000000000000000 1.000000000000000 <br /> 1.000000000000000 <br /> Matrix B:<br /> 2.000000000000000 2.000000000000000 2.000000000000000 <br /> 2.000000000000000 2.000000000000000 2.000000000000000 <br /> 2.000000000000000 2.000000000000000 2.000000000000000 <br /> 2.000000000000000 2.000000000000000 2.000000000000000 <br /> 2.000000000000000 2.000000000000000 2.000000000000000 <br /> 2.000000000000000 <br /> Matrix C:<br /> 3.000000000000000 3.000000000000000 3.000000000000000 <br /> 3.000000000000000 3.000000000000000 3.000000000000000 <br /> 3.000000000000000 3.000000000000000 3.000000000000000 <br /> 3.000000000000000 3.000000000000000 3.000000000000000 <br /> 3.000000000000000 3.000000000000000 3.000000000000000 <br /> 3.000000000000000 <br /> Matrix C = alpha A*B+ beta C<br /> 14.00000000000000 14.00000000000000 14.00000000000000 <br /> 14.00000000000000 14.00000000000000 14.00000000000000 <br /> 14.00000000000000 14.00000000000000 14.00000000000000 <br /> 14.00000000000000 14.00000000000000 14.00000000000000 <br /> 14.00000000000000 14.00000000000000 14.00000000000000 <br /></code><br /><br />If we want to rerun the code in single precision, we only need to select fp_kind=Single in the module precision and recompile.<br />The code has been written in such a way that all the definitions are precision-agnostic. 
Yes, Fortran 90 is quite powerful and elegant.Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com1