Friday, June 17, 2016

TensorFlow 0.8 on Jetson TK1

This post gives updated instructions on how to build TensorFlow 0.8 on Jetson TK1, now that NVIDIA has released a new compiler that can handle variadic templates without internal compiler errors.

If you just want to try installing the whl file, this is a direct link: tensorflow-0.8.0-cp27-none-linux_armv7l.whl

I am going to use the same approach highlighted in the previous post: use the CUDA 6.5 runtime and cuDNN v2, but compile the code with the newer 7.0 compiler.

Install the 7.0.76 compiler:

Before starting, you will need to download the new compiler. NVIDIA does not make it easy to find the link (they would like you to use JetPack, but I don't like to reformat a working system unless absolutely needed), but you can download the .deb package directly on your Jetson with:


Now we can install it as usual:

sudo dpkg -i cuda-repo-l4t-7-0-local_7.0-76_armhf.deb 
sudo apt-get update
sudo apt-get install cuda-toolkit-7-0

At this point we need to restore the standard 6.5 toolchain as the default one (we just want the 7.0 compiler to generate the object files), since the current driver on the Jetson TK1 will only work with the 6.5 runtime. Go to the /usr/local directory, remove the cuda symlink to cuda-7.0, and make a new one for 6.5:

ubuntu@tegra-ubuntu:/usr/local$ sudo rm cuda
ubuntu@tegra-ubuntu:/usr/local$ sudo ln -s cuda-6.5/ cuda

You should see this output:

ubuntu@tegra-ubuntu:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Fri_Dec_12_11:12:07_CST_2014
Cuda compilation tools, release 6.5, V6.5.35

ubuntu@tegra-ubuntu:~$ /usr/local/cuda-7.0/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Mon_Feb_22_15:38:26_CST_2016
Cuda compilation tools, release 7.0, V7.0.74

Install protobuf and Bazel:
For protobuf you can follow the instructions from the previous blog post (the only change is the location of protobuf-java-3.0.0-beta-x.jar, now in the java/core/target subdirectory).
For Bazel the procedure is also similar. The only change required is the version: TF 0.8 requires Bazel 0.1.4, so after cloning bazel you will need to check out the proper tag:

$ git clone
$ cd bazel
$ git checkout tags/0.1.4

Install TensorFlow 0.8:
The first thing to do is to check out the source code and select the proper version:

$ git clone --recurse-submodules
$ cd tensorflow
$ git checkout r0.8

TensorFlow expects a 64-bit system, so we need to change all the references from lib64 to lib. We can find all the files containing the string and apply the changes with these commands:

$ cd tensorflow
$ grep -Rl "lib64"| xargs sed -i 's/lib64/lib/g'

TensorFlow officially supports CUDA devices with compute capabilities 3.5 and 5.2. We want to target a GPU with compute capability 3.2.
This can be done through TensorFlow's unofficial settings by running "configure" with the TF_UNOFFICIAL_SETTING variable set.
When prompted, specify that you only want a 3.2 compute capability device.

ubuntu@tegra-ubuntu:~/tensorflow$ TF_UNOFFICIAL_SETTING=1 ./configure
Please specify the location of python. [Default is /usr/bin/python]: 
Do you wish to build TensorFlow with GPU support? [y/N] y
GPU support will be enabled for TensorFlow
Please specify which gcc nvcc should use as the host compiler. [Default is /usr/bin/gcc]: 
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 
Please specify the location where CUDA  toolkit is installed. Refer to for more details. [Default is /usr/local/cuda]: 
Please specify the Cudnn version you want to use. [Leave empty to use system default]: 
Please specify the location where cuDNN  library is installed. Refer to for more details. [Default is /usr/local/cuda]: 
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at:
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: 3.2
Setting up Cuda include
Setting up Cuda lib
Setting up Cuda bin
Setting up Cuda nvvm
Configuration finished

Now that the initial setup is done, it is time to change the compiler used by Bazel.

ubuntu@tegra-ubuntu:~/tensorflow$ cd third_party/gpus/cuda/
ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ rm -fr bin nvvm
ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ cp -R /usr/local/cuda-7.0/bin/ bin
ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ cp -R /usr/local/cuda-7.0/nvvm/ nvvm

Before starting the build (which is going to take a very long time), we need to modify a few files.

To avoid double instantiation, guard the second functor for InflatePadAndShuffle with:
/* On ARMv7 Eigen::DenseIndex is typedefed to int */
#ifndef __arm__
template struct functor::InflatePadAndShuffle<GPUDevice, float, 4, Eigen::DenseIndex>;
#endif

To avoid double instantiation, guard the second functor for ShuffleAndReverse with:
/* On ARMv7 Eigen::DenseIndex is typedefed to int */
#ifndef __arm__
template struct functor::ShuffleAndReverse<GPUDevice, float, 4, Eigen::DenseIndex>;
#endif
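To see why the guard is needed, here is a minimal sketch with stand-in names (DenseIndex and ShuffleSketch are hypothetical, not the real Eigen or TensorFlow types). It assumes the index type defaults to std::ptrdiff_t, which is the same type as int on a 32-bit ARM target, so explicitly instantiating the template for both int and the index type would instantiate the same specialization twice. Like the guard above, the sketch assumes that every non-ARM target is 64-bit.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for Eigen::DenseIndex (assumed to be std::ptrdiff_t).
using DenseIndex = std::ptrdiff_t;

// Hypothetical stand-in for a functor templated on its index type.
template <typename T, int NDIMS, typename IndexType>
struct ShuffleSketch {
  static constexpr std::size_t index_bytes() { return sizeof(IndexType); }
};

// True when both explicit instantiations below would name the same type.
constexpr bool kIndexTypesCollide = std::is_same<DenseIndex, int>::value;

// First explicit instantiation: always present.
template struct ShuffleSketch<float, 4, int>;

// Second explicit instantiation: only legal when DenseIndex is not int.
// Assumption (same as the guard in the post): non-ARM targets are 64-bit.
#ifndef __arm__
template struct ShuffleSketch<float, 4, DenseIndex>;
#endif
```

On a 64-bit desktop kIndexTypesCollide is false and both instantiations compile; on the TK1 it would be true, which is why the second one has to be skipped.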

ARMv7 has no numa_node file. The function should return 0, not -1, otherwise TensorFlow will crash at runtime. You can use the modification from the previous post or the following code:

static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) {
#ifdef __arm__
  LOG(INFO) << "ARMV7 does not support NUMA - returning NUMA node zero";
  return 0;
#else
  // ... original function body, unchanged ...
  return kUnknownNumaNode;
#endif
}
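As an illustration of the same idea done at runtime instead of with the preprocessor, here is a small sketch. This is not the code from TensorFlow: ReadNumaNodeOrZero is a hypothetical helper, and the caller is assumed to pass the path of a sysfs-style file holding the node id. A missing file, an unreadable file, or a negative value all map to node 0.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: read a NUMA node id from a text file.
// Returns 0 when the file is missing, unreadable, or holds a negative value,
// so callers never see -1.
int ReadNumaNodeOrZero(const std::string& path) {
  std::ifstream file(path);
  int node = 0;
  if (!(file >> node) || node < 0) {
    return 0;
  }
  return node;
}
```

Calling it with a path that does not exist, as would be the case on the TK1, simply yields 0.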

TensorFlow 0.8 also has a new memory allocator that is going to cause a floating point exception unless you change the following code:

if (kCudaHostMemoryUseBFC) {
      allocator =
#ifdef __arm__
          new BFCAllocator(new CUDAHostAllocator(se), 1LL << 31,
                           true /*allow_growth*/, "cuda_host_bfc" /*name*/);
#else
          new BFCAllocator(new CUDAHostAllocator(se), 1LL << 36 /*64GB max*/,
                           true /*allow_growth*/, "cuda_host_bfc" /*name*/);
#endif
    } else {
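The post does not say why the original value fails, so take this as one plausible explanation rather than a confirmed diagnosis: assuming the allocator stores its limit in a size_t, which is 32 bits wide on ARMv7, the 64GB value 1LL << 36 wraps around to zero, while 1LL << 31 still fits. The sketch below models that conversion with a hypothetical helper, TruncateTo32.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: models storing a 64-bit byte count in a 32-bit size_t.
// Conversion to an unsigned type is well defined and keeps the low 32 bits.
constexpr std::uint32_t TruncateTo32(std::int64_t bytes) {
  return static_cast<std::uint32_t>(bytes);
}

// 64GB does not fit in 32 bits: every set bit is above bit 31.
static_assert(TruncateTo32(1LL << 36) == 0u, "64GB wraps to zero");

// 2GB is the largest power of two that still fits.
static_assert(TruncateTo32(1LL << 31) == 2147483648u, "2GB is preserved");
```

A memory limit of zero could then lead to a division by zero somewhere in the allocator, which would show up as a floating point exception.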

We are now ready to build. The only thing left to do is to remove the check that disables the use of variadic templates in Eigen. I have not found a clean way to do it (someone with better Bazel skills may have a better idea). My solution is to start the build and then wait for the first failure:

$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures -s --config=cuda //tensorflow/cc:tutorials_example_trainer

If your first build of TensorFlow fails with the following error:

ERROR: /home/ubuntu/tensorflow/tensorflow/cc/BUILD:61:1: error loading package 'tensorflow/core': Extension file not found. Unable to load package for '//google/protobuf:protobuf.bzl': BUILD file not found on package path and referenced by '//tensorflow/cc:tutorials_example_trainer'.

You need to initialize and update the submodules in the tensorflow repository to get the google/protobuf clone:

git submodule update --init 

At this point, we can edit the file Macros.h in Eigen.
This file is located in the .cache directory:

ubuntu@tegra-ubuntu:~/.cache$ find . -name Macros.h -print

The nvcc check needs to be eliminated. Remove or comment out the following line, together with its matching #endif:
-#if !defined(__NVCC__) || !defined(EIGEN_ARCH_ARM_OR_ARM64)
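To illustrate what a check like this controls, here is a sketch under stated assumptions. The macro SKETCH_HAS_VARIADIC_TEMPLATES and the function CountDims are hypothetical stand-ins, not Eigen's own names: a feature macro is defined only when the compiler is trusted with variadic templates, and the code falls back to a fixed set of overloads otherwise. Removing the nvcc condition from the guard amounts to always taking the first branch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical feature macro, defined only for C++11 or newer compilers.
#if __cplusplus > 199711L
#define SKETCH_HAS_VARIADIC_TEMPLATES 1
#endif

#ifdef SKETCH_HAS_VARIADIC_TEMPLATES
// Variadic version: accepts any number of dimensions.
template <typename... Dims>
constexpr std::size_t CountDims(Dims...) {
  return sizeof...(Dims);
}
#else
// Fallback version: one overload per supported argument count.
inline std::size_t CountDims(int) { return 1; }
inline std::size_t CountDims(int, int) { return 2; }
inline std::size_t CountDims(int, int, int) { return 3; }
#endif
```

Both branches give the same answer for up to three dimensions, but only the variadic one scales without extra overloads.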

We can now restart the build and it will go through. 
After you are done, you can test it with:

$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

You should see output similar to this:

# Lots of output. This tutorial iteratively calculates the major eigenvalue of
# a 2x2 matrix, on GPU. The last few lines look like this.
000009/000005 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000006/000001 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]
000009/000009 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]

We are now ready to create the pip package and install it:
# To build with GPU support:
$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
# The name of the .whl file will depend on your platform.
$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.8.0-cp27-none-linux_armv7l.whl

Congratulations, TensorFlow is now installed on your system.

Most of the tests pass, but the image classification example gives the wrong results. Now that the community can build it and play with it, someone may find the source of the error(s).

I downloaded the Python files from TensorFlow-Tutorial and they seem to work:

git clone