CUDA Musing

Friday, June 17, 2016

TensorFlow 0.8 on Jetson TK1

This post gives updated instructions on how to build TensorFlow 0.8 on Jetson TK1 now that NVIDIA has released a new compiler that can handle the variadic templates without compiler internal errors.

If you just want to try to install the whl file, this is a direct link, tensorflow-0.8.0-cp27-none-linux_armv7l.whl

I am going to use the same approach highlighted in the previous post, basically use the CUDA runtime 6.5 and CUDDN v2 but compile the code with the newer 7.0 compiler.

Install the 7.0.76 compiler:

Before starting, you will need to download the new compiler. NVIDIA does not make your life easy in finding the link (they would like you to use Jetpack, but I don't like to reformat a working system if not absolutely needed) but you can download the .deb package directly on your Jetson with:

wget http://developer.download.nvidia.com/embedded/L4T/r24_Release_v1.0/CUDA/cuda-repo-l4t-7-0-local_7.0-76_armhf.deb

Now we can install it as usual:

sudo dpkg -i cuda-repo-l4t-7-0-local_7.0-76_armhf.deb

sudo apt-get update

sudo apt-get install cuda-toolkit-7-0

At this point we need to restore the standard 6.5 toolchain as the default one (we just want the 7.0 compiler to generate the object files), since the current driver on the Jetson TK1will only work with the 6.5 runtime. Go to the /usr/local directory and remove the cuda symlink to cuda-7.0 and make a new one for 6.5:

ubuntu@tegra-ubuntu:/usr/local$ sudo rm cuda

ubuntu@tegra-ubuntu:/usr/local$ sudo ln -s cuda-6.5/ cuda

You should see this output:

ubuntu@tegra-ubuntu:~$ nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver

Built on Fri_Dec_12_11:12:07_CST_2014

Cuda compilation tools, release 6.5, V6.5.35

ubuntu@tegra-ubuntu:~$ /usr/local/cuda-7.0/bin/nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver

Built on Mon_Feb_22_15:38:26_CST_2016

Cuda compilation tools, release 7.0, V7.0.74

Install protobuf and Bazel:

For protobuf you can follow the instruction from the previous blog post ( the only change is the location of protobuf-java-3.0.0-beta-x.jar , now in the java/core/target subdirectory).

Also for Bazel the procedure is similar, the only change required is the version, TF0.8 requires Bazel 0.1.4 so after cloning bazel, you will need to use the proper tag:

$ git clone https://github.com/bazelbuild/bazel.git

$ cd bazel

$ git checkout tags/0.1.4

Install TensorFlow 0.8:

The first thing to do it is to check out the source code and select the proper version:

$ git clone --recurse-submodules https://github.com/tensorflow/tensorflow

$ cd tensorflow

$ git checkout r0.8

TensorFlow is expecting a 64bit system, we will need to change all the reference from lib64 to lib. We can find all the files with the strings and apply all the changes with these commands:

$ cd tensorflow

$ grep -Rl "lib64"| xargs sed -i 's/lib64/lib/g'

TensorFlow officially supports Cuda devices with 3.5 and 5.2 compute capabilities. We want to target a gpu with compute capabilities 3.2.

This can be done through TensorFlow unofficial settings with "configure" via the TF_UNOFFICIAL_SETTING variable.

When prompted, specify that you only want a 3.2 compute capability device.

ubuntu@tegra-ubuntu:~/tensorflow$ TF_UNOFFICIAL_SETTING=1 ./configure

Please specify the location of python. [Default is /usr/bin/python]: 

Do you wish to build TensorFlow with GPU support? [y/N] y

GPU support will be enabled for TensorFlow

Please specify which gcc nvcc should use as the host compiler. [Default is /usr/bin/gcc]: 

Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 

Please specify the location where CUDA  toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 

Please specify the Cudnn version you want to use. [Leave empty to use system default]: 

Please specify the location where cuDNN  library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 

Please specify a list of comma-separated Cuda compute capabilities you want to build with.

You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.

Please note that each additional compute capability significantly increases your build time and binary size.

[Default is: "3.5,5.2"]: 3.2

Setting up Cuda include

Setting up Cuda lib

Setting up Cuda bin

Setting up Cuda nvvm

Configuration finished

Now that the initial set up is done, it is time to change the compiler used by Bazel.

ubuntu@tegra-ubuntu:~/tensorflow$ cd third_party/gpus/cuda/

ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ rm -fr bin nvvm

ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ cp -R /usr/local/cuda-7.0/bin/ bin

ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ cp -R /usr/local/cuda-7.0/nvvm/ nvvm

Before starting the build ( that is going to take a very long time), we will need to modify few files.

tensorflow/core/kernels/conv_ops_gpu_2.cu.cc:

To avoid double instantiation, guard the second functor for InflatePadAndShuffle with:

/* On ARMv7 Eigen::DenseIndex is typedefed to int */

#ifndef __arm__

template struct functor::InflatePadAndShufflefloat

, 4,

                                              Eigen::DenseIndex>;

#endif 

tensorflow/core/kernels/conv_ops_gpu_3.cu.cc:

To avoid double instantiation, guard the second functor for ShuffleAndReverse with:

/* On ARMv7 Eigen::DenseIndex is typedefed to int */

#ifndef __arm__

template struct functor::ShuffleAndReversefloat

, 4,

                                           Eigen::DenseIndex>;

#endif

tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:

ARMv7 has no numa_node file. It should return 0 not -1, otherwise TensorFlow will crash at runtime. You can use the modification from the previous post or the following code:

static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) {

#ifdef __arm__

  LOG(INFO) << "ARMV7 does not support NUMA - returning NUMA node zero";

  return 0;

#else

 ........

  return kUnknownNumaNode;

#endif

}

tensorflow/core/common_runtime/gpu/process_state.cc:

this is a new memory allocator, that is going to cause a floating point exception unless you change the following code:

if (kCudaHostMemoryUseBFC) {

      allocator =

#ifdef __arm__

          new BFCAllocator(new CUDAHostAllocator(se), 1LL << 31,

                           true /*allow_growth*/, "cuda_host_bfc" /*name*/);

#else

          new BFCAllocator(new CUDAHostAllocator(se), 1LL << 36 /*64GB max*/,

                           true /*allow_growth*/, "cuda_host_bfc" /*name*/);

#endif

    } else {

We are now ready to build. The only thing left to do is to remove the check to disable the use of variadic templates in Eigen. I have not found a clean way to do it (someone with better Bezel skills may have a better idea). My solution is to start the build and then wait for the first failure:

$bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures -s --config=cuda //tensorflow/cc:tutorials_example_trainer

If on your first compile of tensorflow you get the following error:

ERROR: /home/ubuntu/tensorflow/tensorflow/cc/BUILD:61:1: error loading package 'tensorflow/core': Extension file not found. Unable to load package for '//google/protobuf:protobuf.bzl': BUILD file not found on package path and referenced by '//tensorflow/cc:tutorials_example_trainer'.

You need to init update in the tensorflow repository to get the google/protobuf clone using:

git submodule update --init

At this point, I can edit the file Macros.h in Eigen.

This file is located in the .cache directory:

ubuntu@tegra-ubuntu:~/.cache$ find . -name Macros.h -print

./bazel/_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/external/eigen_archive/eigen-eigen-3f653ace7d28/Eigen/src/Core/util/Macros.h

The nvcc check needs to be eliminated:

-#if !defined(__NVCC__) || !defined(EIGEN_ARCH_ARM_OR_ARM64)

 #define EIGEN_HAS_VARIADIC_TEMPLATES 1

 #endif

-#endif

We can now restart the build and it will go through.

After you are done, you can test it with:

$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

You should see a similar output:

# Lots of output. This tutorial iteratively calculates the major eigenvalue of
# a 2x2 matrix, on GPU. The last few lines look like this.
000009/000005 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000006/000001 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000009/000009 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]

We are now ready to create the pip package and install it:

# To build with GPU support:
$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
# The name of the .whl file will depend on your platform.
$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.8.0-cp27-none-linux_armv7l.whl

Congratulation, TensorFlow is now installed on your system.

Most of the tests are passing, but the image classification example is giving the wrong results. Now that the community can build it and play with it, someone can find the source of the error(s).

I downloaded the python files from TensorFlow-Tutorial and they seem to work:

git clone https://github.com/nlintz/TensorFlow-Tutorials.git

Friday, November 27, 2015

Building TensorFlow for Jetson TK1

Google recently released TensorFlow, an open source software library for numerical computation using data flow graphs.

TensorFlow has a GPU backend built on CUDA, so I wanted to install it on a Jetson TK1. Even if the system did not meet the requirements ( CUDA 7.0 is not available and the GPU is a compute capability 3.2), I decided to give it a try anyway. This blog reports all the steps required to build TensorFlow from source, it is quite challenging but it can be done. Including all the prerequisites, the whole build will take several hours ( if you just want to try Tensorflow, you can download the wheel file I generated and do a pip install. The file is at https://drive.google.com/file/d/0B1uGKNpQ7xNqZ2pvSmc3SlZJS2c/view?usp=sharing ).

TensorFlow is under active development and the coding is using a lot of advanced C++ features that really push the compiler, these instructions worked with the version available on 11/26 but new

The first challenge is to build Bazel, another software developed at Google used as building system for TensorFlow. Bazel requires a protobuf version newer than the one presents in the Ubuntu 14.04 repos, so the first step will be to install protobuf 3 from source, since there are no prebuilt binary for ARM32.

Java 8:

The first step is to install Java8, but this is quite simple since Oracle provides a package:

$ sudo add-apt-repository ppa:webupd8team/java

$ sudo apt-get update

$ sudo apt-get install oracle-java8-installer

Protobuf:

In order to build protobuf and bazel, we will need several other packages. The exact list will depend on the status of your Jetson, but you will need at least these ones:

$ sudo apt-get install git zip unzip autoconf automake libtool curl zlib1g-dev

After downloading the latest source from github:

$ git clone https://github.com/google/protobuf.git

you need to first generate the configuration file and then run make:

$ cd protobuf

$ ./autogen.sh

$ ./configure --prefix=/usr

$ make -j 4

$ sudo make install

Protoc will be installed in /usr/lib and /usr/bin, this will be important when we run bazel since it tries to use a sandbox and will not find the libraries in /usr/local/lib.

You should see this output, if you have followed all the steps:

ubuntu@tegra-ubuntu:~/protobuf$ protoc --version

libprotoc 3.0.0

We also need to build the java interface for protobuf, that will require Maven.

Luckily maven is available from the repos, so we can just issue a:

$ sudo apt-get install maven

Go to the subdirectory java inside protobuf and type:

$ mvn package

Once the build is completes, there will be a protobuf-java-3.0.0-beta-1.jar inside the target subdirectory.

Bazel:

We are now ready to tackle Bazel.

The first step is to download the source code for Bazel ( using the 0.1.0 version, that it is known to work with Tensorflow).

$ git clone https://github.com/bazelbuild/bazel.git

$ cd bazel

$git checkout tags/0.1.0

Before compiling, we need to copy the protoc binary we just built as third_party/protobuf/protoc-linux-arm32.exe.

We also need to copy the jar file from protobuf in the same directory. Bazel is expecting an alpha-3 version, but we have built a beta-1.

There is probably a better way of doing this, but just copying the file and rename it did the trick for me.

$ cp /usr/bin/protoc third_party/protobuf/protoc-linux-arm32.exe

$ cp ~/protobuf/java/target/protobuf-java-3.0.0-beta-1.jar third_party/protobuf/protobuf-java-3.0.0-alpha-3.jar

We are now ready to compile bazel.

$ ./compile.sh

At the end of the compilation, the bazel binary will be in the output directory. You can add this directory

to your path or copy the binary in /usr/local/bin

TensorFlow

We are now ready to tackle the TensorFlow build for GPU. Just be sure to have CUDA 6.5 and CUDNN 6.5 installed on your Jetson TK1.

You will also need some files from the CUDA 7.0 package ( cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb ) that you can download from

the NVIDIA web site ( it is the one for Jetson TX1).

While Jetson TK1 cannot run the 7.0 runtime, since the driver shipped with the system does not support it, it is still possible to run the CUDA 7.0 compiler. We need the 7.0 compiler because some of the TensorFlow source files will generate an internal compiler error with the 6.5 nvcc.

All the libraries and runtime will be the standard 6.5 ones.

On my system I have also enabled some swap space. You can plug a USB memory stick, create a swap file and mount it with

$ sudo mkswap /dev/sda

$ sudo swapon /dev/sda

The first step to build TensorFlow is to clone the github repository:

$ git clone -recurse-submodules https://github.com/tensorflow/tensorflow

and install other dependencies:

$ sudo apt-get install python-numpy swig python-dev

TensorFlow is expecting a 64bit system and has a bunch of library paths and libraries hard-coded in the files.

Before starting the installation, we will need to modify several files. We will need to change all the reference from lib64 to lib and change the 7.0 libraries to 6.5. We can find all the files with the strings and apply all the changes with these commands:

$ cd tensorflow

$ grep -Rl "lib64"| xargs sed -i 's/lib64/lib/g'

$ grep -Rl "so.7.0"| xargs sed -i 's/so\.7\.0/so\.6\.5/g'

TensorFlow officially supports Cuda devices with 3.5 and 5.2 compute capabilities. We want to target a gpu with compute capabilities 3.2.

This can be done through TensorFlow unofficial settings with "configure" via the TF_UNOFFICIAL_SETTING variable.

When prompted, specify that you only want a 3.2 compute capability device.

$ TF_UNOFFICIAL_SETTING=1 ./configure

# Same as the official settings above

WARNING: You are configuring unofficial settings in TensorFlow. Because some

external libraries are not backward compatible, these settings are largely

untested and unsupported.

Please specify a list of comma-separated Cuda compute capabilities you want to

build with. You can find the compute capability of your device at:

https://developer.nvidia.com/cuda-gpus.

Please note that each additional compute capability significantly increases

your build time and binary size. [Default is: "3.5,5.2"]: 3.2

Setting up Cuda include

Setting up Cuda lib

Setting up Cuda bin

Setting up Cuda nvvm

Configuration finished

After the configure, bazel has copied or symlinked all the binaries and libraries needed for the build in the third_party/gpus/cuda subdirectory .

It is now time to replace the cuda compiler with the one from the 7.0 toolchain.

We want to extract (not install) the files from the cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb package with the following commands:

$ dpkg -x cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb /tmp/cuda_repo

$ cd /tmp/cuda_repo/var/cuda-repo-7-0-local

$ dpkg -x cuda-core-7-0_7.0-71_armhf.deb /tmp/cuda7.0

$ rm -fr /tmp/cuda_repo

$ cd ~tensorflow/third_party/gpus/cuda

$ rm -fr bin nvvm

$ cp -R /tmp/cuda7.0/usr/local/cuda-7.0/bin bin

$ cp -R /tmp/cuda7.0/usr/local/cuda-7.0/nvvm nvvm

$ rm -fr /tmp/cuda7.0

At this point, bazel is ready to use the 7.0 toolchain to compile Tensorflow.

We still need to add the ARM target to the build.

This can be done adding the following lines to the third_party/gpus/crosstool/CROSSTOOL file:

default_toolchain {

cpu: "arm"

toolchain_identifier: "local_linux"

}

Before starting the build, we need to edit few files to avoid compiler crashes and avoid double instantiations

(on ARM v7, Eigen::DenseIndex is typedefed to int):

third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorDimensions.h

tensorflow/core/kernels/conv_ops_gpu_2.cu.cc

tensorflow/core/kernels/conv_ops_gpu_3.cu.cc

tensorflow/stream_executor/cuda/cuda_gpu_executor.cc

tensorflow/core/kernels/adjust_contrast_op.h

third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorDimensions.h:

the compiler is crashing when evaluating the code inside the ifdef at line 312. We can just take the alternative path.

Change line 312 to something like:

#ifdef EIGEN_HAS_VARIADIC_TEMPLATES_TK1

tensorflow/core/kernels/conv_ops_gpu_2.cu.cc:

To avoid double instantiation, guard the second functor for InflateAnsShuffle with:

/* On ARMv7 Eigen::DenseIndex is typedefed to int */

#ifndef __arm__

template struct functor::InflatePadAndShuffle

Eigen::DenseIndex>;

#endif

We also need to comment the tensor.h include ( will crash the compiler)

//#include "tensorflow/core/public/tensor.h"

tensorflow/core/kernels/conv_ops_gpu_3.cu.cc:

To avoid double instantiation, guard the second functor for ShuffleAndReverse with:

/* On ARMv7 Eigen::DenseIndex is typedefed to int */

#ifndef __arm__

template struct functor::ShuffleAndReverse

Eigen::DenseIndex>;

#endif

tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:

ARMv7 has no numa_node file. It should return 0 not -1, otherwise TensorFlow will crash at runtime:

FILE *file = fopen(filename.c_str(), "r");

if (file == nullptr) {

LOG(ERROR) << "could not open file to read NUMA node: " << filename;

#ifdef __arm__

// There is no numa_node on Jetson TK1

return 0;

#else

return kUnknownNumaNode;

#endif

tensorflow/core/kernels/adjust_contrast_op.h:

The compiler is crashing on some initializations, we need to rewrite them in a simpler way:

//MF Eigen::array scalar_broadcast{{batch, height, width, channels}};

Eigen::array scalar_broadcast;

scalar_broadcast[0] = batch;

scalar_broadcast[1] = height;

scalar_broadcast[2] = width;

scalar_broadcast[3] = channels;

#if !defined(EIGEN_HAS_INDEX_LIST)

//MF Eigen::array reduction_axis{{1, 2}};

//MF Eigen::array scalar{{1, 1, 1, 1}};

//MF Eigen::array broadcast_dims{{1, height, width, 1}};

//MF Eigen::Tensor::Dimensions reshape_dims{{batch, 1, 1, channels}};

Eigen::array reduction_axis;

reduction_axis[0]=1;

reduction_axis[1]=2;

Eigen::array scalar;

scalar[0]=1;

scalar[1]=1;

scalar[2]=1;

scalar[3]=1;

Eigen::array broadcast_dims;

broadcast_dims[0]=1;

broadcast_dims[1]=height;

broadcast_dims[2]=width;

broadcast_dims[3]=1;

Eigen::Tensor::Dimensions reshape_dims;

reshape_dims[0]=batch;

reshape_dims[1]=1;

reshape_dims[2]=1;

reshape_dims[3]=channels;

#else

The source code is now ready. Jeston TK1 has only 2GB of memory and bazel will try to compile several files at the same time.

We want to avoid this, so we will pass a local_resource flag that will use only 2GB and half core (don't ask, if you specify one it will still try

to compile two files at the same time). This build will take a long time:

$bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/cc:tutorials_example_trainer

If you get some failures during the build, keep trying, bazel scheduling seems to be non-deterministic and the Tensorflow code is really stressing the

compiler.

Once the build is completed, we can test the code:

$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

You should see a similar output:

# Lots of output. This tutorial iteratively calculates the major eigenvalue of

# a 2x2 matrix, on GPU. The last few lines look like this.

000009/000005 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]

000006/000001 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]

000009/000009 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]

We are now ready to create the pip package and install it:

# To build with GPU support:

$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package

$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

# The name of the .whl file will depend on your platform.

$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.5.0-cp27-none-linux_armv7l.whl

Congratulation, TensorFlow is now installed on your system.

We can also try a more interesting example of image classification:

bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/examples/label_image/...

$ wget https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip -O tensorflow/examples/label_image/data/inception5h.zip

$ unzip tensorflow/examples/label_image/data/inception5h.zip -d tensorflow/examples/label_image/data/

$ mv tensorflow/examples/label_image/data/tensorflow_inception_graph.pb tensorflow/examples/label_image/data/googlenet_graph.pb

$ mv tensorflow/examples/label_image/data/imagenet_comp_graph_label_strings.txt tensorflow/examples/label_image/data/googlenet_labels.txt 

And run it with:

$ bazel-bin/tensorflow/examples/label_image/label_image

I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 1

E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:890] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node

I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties:

name: GK20A

major: 3 minor: 2 memoryClockRate (GHz) 0.852

pciBusID 0000:00:00.0

Total memory: 1.85GiB

Free memory: 218.46MiB

I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0

I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0: Y

I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Allocating 18.46MiB bytes.

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:52] GPU 0 memory begins at 0xa45ea000 extends to 0xa585f000

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 1.0KiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 2.0KiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 4.0KiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 8.0KiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 16.0KiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 32.0KiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 64.0KiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 128.0KiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 256.0KiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 512.0KiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 1.00MiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 2.00MiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 4.00MiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 8.00MiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 16.00MiB

I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 32.00MiB

I tensorflow/core/common_runtime/direct_session.cc:60] Direct session inter op parallelism threads: 1

I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)

I tensorflow/examples/label_image/main.cc:221] military uniform (866): 0.902268

I tensorflow/examples/label_image/main.cc:221] bow tie (817): 0.05407

I tensorflow/examples/label_image/main.cc:221] suit (794): 0.0113196

I tensorflow/examples/label_image/main.cc:221] bulletproof vest (833): 0.0100269

I tensorflow/examples/label_image/main.cc:221] bearskin (849): 0.00649747

Saturday, October 12, 2013

CUDA 5.5 and Xcode 5

The latest Xcode 5 update seems to have broken nvcc.

If you try to compile a CUDA program, you will see a similar error:

%nvcc -c qr.cu

clang: error: unsupported option '-dumpspecs'

clang: error: no input files

There is a simple workaround.

%nvcc -ccbin=/usr/bin/clang -c qr.cu

A more convenient way of adding this, it is to define an alias for nvcc.

You can add this line to your .bash_profile

alias nvcc='nvcc -ccbin=/usr/bin/clang'

or just define it in your shell,

alias 'nvcc=nvcc -ccbin=/usr/bin/clang'

Wednesday, September 11, 2013

Calling CUDA Fortran kernels from MATLAB

The latest MATLAB versions, starting from 2010b, have a very cool feature that enables calling CUDA C kernels from MATLAB code.
This is much better and simpler than writing MEX files to call CUDA code ( being the original author of the first CUDA MEX files and of the NVIDIA white-paper, I am speaking from experience) and it is a very powerful tool.

Let's take a very simple CUDA C code, add.cu, that adds a scalar to a vector:

__global__ void add(double * in, double a, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
in[idx] += a;
}
}

The full documentation is available at
http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html
I am just going to summarize the required steps:

Generate a PTX file from the kernel source

nvcc -ptx -arch sm_20 add.cu

Construct the kernel object from the PTX file

k=parallel.gpu.CUDAKernel('add.ptx','add.cu');

Set up the block and grid configuration, for example 28 blocks of 256 threads each:

k.ThreadBlockSize=[256 1 1]
k.GridSize=[28 1 1]

Execute the kernel.

o = feval(k,rand(10,1),2.,10)
The gpu array o contains the output of the kernel

It is possible to do the same with CUDA Fortran.
First of all, we will need to rewrite the code in CUDA Fortran (shameless plug, if you want
to learn more about CUDA Fortran there is a very good book you can pre-order from Amazon,
"CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming"). This is the equivalent code :

attributes(global) subroutine add(a, b, N)
implicit none
double precision, intent(inout) :: a(*)
double precision, value :: b
integer , value :: N
integer :: i

i = threadIdx%x+(blockIdx%x-1)*blockDim%x
if ( i <=N) a(i) = a(i)+b

end subroutine add

For the generation of the PTX file, instead of invoking nvcc, we will call pgf90 with the right
flags to generate the PTX file:

pgf90 -c -Mcuda=keepptx,cc20 addf.cuf

The keepptx flag will generate the PTX file for compute capabilities 2.0, addf.n001.ptx.
If the compute capabilities are missing or if you specify multiple targets, the PGI compiler will generate different PTX files, you will need to inspect the ptx files to check the compute capabilities, the ordering is just an enumeration. We can perform this step from a OS shell or from inside MATLAB.
In order to invoke the compiler from the MATLAB prompt, we need to load the proper bash variables issuing the command:

setenv('BASH_ENV','~/.bash_profile');

and then invoking the pgf90 invocation preceded by an exclamation point. The exclamation point indicates that the rest of the input line is issued as a command to the operating system.

!pgf90 -c -Mcuda=keepptx,cc20 addf.cuf

In order to load the PTX file in MATLAB, we need to slightly change the syntax.

When loading the PTX file generated by CUDA C, we were passing both the PTX file name and

the original CUDA C file. In this way, MATLAB will automatically discover the prototype of the function. There are other ways, in which we explicitly pass the prototype signature to parallel.gpu.CUDAKernel.

This is what we need to load the PTX file generated from CUDA Fortran.

kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ');

Once we have created the kernel object kf, the calling sequence is the same one we used before.
We will set up the block and grid configuration, for example 28 blocks of 256 threads each:

kf.ThreadBlockSize=[256 1 1]
kf.GridSize=[28 1 1]

and execute the kernel.

of = feval(kf,rand(10,1),2.,10)

This is the full sequence of the MATLAB code with a verbose output to check all the intermediate steps:

% Create a 1D array of doubles with 10 elements

i1=gpuArray(rand(10,1))

% Create the kernel object from the PTX file with explicit prototype
kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ')

% Set execution configuration
kf.ThreadBlockSize=[256 1 1]
kf.GridSize=[28 1 1]

% Execute the kernel
of=feval(kf,i1,10.,10)

An important point for the CUDA Fortran kernels is that you cannot use Fortran assumed-shape arguments, which require the compiler to build and pass the descriptor as an extra argument.

Now that we understand all the steps, let's move to something more complex and discuss few more points.

We are going to implement a kernel to compute the sum of an array using a single pass with atomic lock

( the implementation and accuracy of parallel sum are discussed in details in Chapter 5 of the before mentioned book).

The kernel is embedded in a module, since we are using a global variable for the lock.

There is no limitation in the number of elements that the routine can handle, aside from the fact that we are using 32 bit size integers

for the addressing , each thread will process multiple data if needed.

This is the code:

module sumgpu

  implicit none

  integer, parameter :: fp_kind = kind(0.0d0)

  integer, device::  lock=0

contains

  attributes(global) subroutine sum(input,totalsum,N)

    real(fp_kind), intent(in) :: input(N)

    real(fp_kind) :: totalsum(1)

    integer,value :: N

    real(fp_kind), shared, dimension(256) :: psum

    integer :: i,index, inext

    real(fp_kind) :: lsum

    index=threadIdx%x+(BlockIdx%x-1)*BlockDim%x

    lsum = 0._fp_kind

    do i=index,N,BlockDim%x*GridDim%x

       lsum = lsum+ input(i)

    end do

    ! Local reduction per block

    index=threadIdx%x

    psum(index)=lsum

    call syncthreads()

    inext=blockDim%x/2

    do while ( inext >=1 )

       if (index <=inext) psum(index)=psum(index)+psum(index+inext)

       inext = inext /2

       call syncthreads()

    end do

    ! Final reduction among block with atomic lock

    if (index == 1) then

       do while ( atomiccas(lock,0,1) == 1)

       end do

       totalsum(1)=totalsum(1)+psum(1)

       call threadfence()

       lock =0

    end if

  end subroutine sum

end module sumgpu

If we generate and load the module as seen before, we can observe the following:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')

kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]

    MaxThreadsPerBlock: 1024

              GridSize: [1 1 1]

      SharedMemorySize: 0

            EntryPoint: 'sumgpu_sum_'

    MaxNumLHSArguments: 2

       NumRHSArguments: 3

         ArgumentTypes: {'inout double vector'  'inout double vector'  'in int32 scalar'}

The entry point is now sumgpu_sum_, even if the subroutine was named sum. This is a consequence of being embedded in a module.

When the CUDA Fortran compiler generate the PTX file, it renames the subroutine entry as a concatenation of the module name, the subroutine name and a trailing underscore.

While this is not important when the module contains a single subroutine, it is crucial for situations in which multiple entry points are defined. If the module had multiple subroutines, we would have received an error when trying to load the PTX file:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')

Error using handleKernelArgs (line 61)

Found more than one entry point in the PTX code.  Possible names are:

sumgpu_sum_

sumgpu_sum2_

In this case, we would have to modify the command syntax and add an extra argument at the end of the list that specify the entry point.

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int','sumgpu_sum_')

kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]

    MaxThreadsPerBlock: 1024

              GridSize: [1 1 1]

      SharedMemorySize: 0

            EntryPoint: 'sumgpu_sum_'

    MaxNumLHSArguments: 2

       NumRHSArguments: 3

         ArgumentTypes: {'inout double vector'  'inout double vector'  'in int32 scalar'}

The command now completes correctly. However, with the prototype signature we specified, the first array that in the original code was

with intent(in), since it is only an input to the subroutine is now marked as 'inout double vector'. This is not a major problem, but we will

need to remember when using the object in MATLAB to specify two vectors as output on the left hand side.

We can fix the problem, changing the prototype signature to:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int','sumgpu_sum_')

kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]

    MaxThreadsPerBlock: 1024

              GridSize: [1 1 1]

      SharedMemorySize: 0

            EntryPoint: 'sumgpu_sum_'

    MaxNumLHSArguments: 1

       NumRHSArguments: 3

         ArgumentTypes: {'in double vector'  'inout double vector'  'in int32 scalar'}

where we have replaced the 'double *' to 'const double *' to reflect that the array is read-only.

We are now ready to run the code:

%Generate an array of 1024 elements on the CPU

a=rand(1024,1);

% Copy the array to a GPU array ag

ag=gpuArray(a);

%Generate the kernel object and setup the execution configuration

kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');

kf.ThreadBlockSize=[256 1 1];

kf.GridSize=[28 1 1];

% Initialize the sum to zero

sumg=gpuArray.zeros(1,'double');

% Invoke the kernel

disp('CUDA Fortran kernel:')

sumg=feval(kf,ag,sumg,1024)

% Recompute the sum using the intrinsic MATLAB function

disp('Intrinsic MATLAB sum on GPU:')

sum_matlab=sum(ag)

%Check the result

disp('Difference:')

sumg-sum_matlab

obtaining the following output:

CUDA Fortran kernel:

sumg =

509.2181

Intrinsic MATLAB sum on GPU:

sum_matlab =

509.2181

Difference:

ans =

Now that we are confident that the code is running properly and giving the correct results, we can do some performance testing.

We will generate 50 millions random number directly on the GPU and then compute their sum.

%Set up random number generation on the GPU

seed=0;

gpu_stream = parallel.gpu.RandStream('CombRecursive','Seed',seed);

parallel.gpu.RandStream.setGlobalStream(gpu_stream);

N=50000000;

%Generate the random numbers directly on the GPU

ag=gpuArray.randn(N,1);

%Generate the kernel object and setup the execution configuration

kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');

kf.ThreadBlockSize=[256 1 1];

kf.GridSize=[128 1 1];

% Initialize the sum to zero

sumg=gpuArray.zeros(1,'double');

% Invoke the kernel and time the execution

tic;sumg=feval(kf,ag,sumg,N);toc

% Invoke the intrinsic sum and time the execution

tic;sum(ag);toc

The output indicates that this version is slightly faster than the native sum, that is however more convenient to use.

Elapsed time is 0.000357 seconds.

Elapsed time is 0.000393 seconds.

The real goal of using CUDA Fortran kernels is not to reimplement the intrinsic functions but to implement new capabilities or just re-use

standalone code that was already written in a very productive environment such as MATLAB.

CUDA Musing

Friday, June 17, 2016

TensorFlow 0.8 on Jetson TK1

Friday, November 27, 2015

Building TensorFlow for Jetson TK1

Saturday, October 12, 2013

CUDA 5.5 and Xcode 5

Wednesday, September 11, 2013

Calling CUDA Fortran kernels from MATLAB

Search This Blog

Followers

Blog Archive

About Me