TensorFlow 0.8 on Jetson TK1 (CUDA Musing, 2016-06-17, by Massimiliano)<div style="text-align: justify;">
This post gives updated instructions on how to build TensorFlow 0.8 on a Jetson TK1, now that NVIDIA has released a new compiler that can handle the variadic templates without internal compiler errors.</div>
<br />
If you just want to install the prebuilt wheel, here is a direct link: <a href="https://drive.google.com/file/d/0B1uGKNpQ7xNqa2RRYlMtZXZ6WVk/view?usp=sharing" target="_blank">tensorflow-0.8.0-cp27-none-linux_armv7l.whl</a><br />
<br />
I am going to use the same approach highlighted in the previous post: keep the CUDA 6.5 runtime and cuDNN v2, but compile the code with the newer 7.0 compiler.<br />
<div>
<br /></div>
<br />
<b>Install the 7.0.76 compiler:</b><br />
<br />
Before starting, you will need to download the new compiler. NVIDIA does not make the link easy to find (they would like you to use JetPack, but I don't like to reformat a working system if not absolutely needed), but you can download the .deb package directly on your Jetson with:<br />
<div style="line-height: normal; min-height: 14px;">
<br /></div>
<br />
<br />
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">wget http://developer.download.nvidia.com/embedded/L4T/r24_Release_v1.0/CUDA/cuda-repo-l4t-7-0-local_7.0-76_armhf.deb</span></div>
<br />
Now we can install it as usual:<br />
<br />
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">sudo dpkg -i cuda-repo-l4t-7-0-local_7.0-76_armhf.deb </span></div>
<div style="color: #323333; line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace; letter-spacing: 0.0px;">sudo apt-get update</span></div>
<div style="line-height: normal; min-height: 14px;">
<span style="font-family: "courier new" , "courier" , monospace;"><span style="color: #323333; letter-spacing: 0px;">sudo apt-get install cuda-toolkit-7-0</span><span style="letter-spacing: 0.0px;"></span></span></div>
<div style="line-height: normal; min-height: 14px;">
<span style="color: #323333; letter-spacing: 0px;"><br /></span></div>
<div style="line-height: normal; min-height: 14px;">
At this point we need to restore the standard 6.5 toolchain as the default (we just want the 7.0 compiler to generate the object files), since the current driver on the Jetson TK1 will only work with the 6.5 runtime. Go to the /usr/local directory, remove the cuda symlink (which now points to cuda-7.0), and make a new one for 6.5: </div>
<div style="line-height: normal; min-height: 14px;">
<br /></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">ubuntu@tegra-ubuntu:/usr/local$ sudo rm cuda</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">ubuntu@tegra-ubuntu:/usr/local$ sudo ln -s cuda-6.5/ cuda</span></div>
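The swap only touches the symlink, not the toolkits themselves. As an illustration, the same sequence can be rehearsed in a throwaway scratch directory (the paths below are stand-ins for /usr/local, where the real commands need sudo):

```shell
# Throwaway stand-in for /usr/local: only the symlink changes, the
# cuda-6.5 and cuda-7.0 directories stay untouched.
rm -rf /tmp/cuda-demo
mkdir -p /tmp/cuda-demo/cuda-6.5 /tmp/cuda-demo/cuda-7.0
cd /tmp/cuda-demo
ln -s cuda-7.0 cuda       # state after installing the 7.0 toolkit
rm cuda                   # removes only the link, not the toolkit
ln -s cuda-6.5/ cuda      # restore 6.5 as the default
readlink cuda             # prints cuda-6.5/
```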
<div style="line-height: normal; min-height: 14px;">
<br /></div>
<div style="line-height: normal; min-height: 14px;">
<span style="font-family: inherit; font-size: small;">You should see this output:</span><br />
<span style="font-family: inherit; font-size: small;"><br /></span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">ubuntu@tegra-ubuntu:~$ nvcc -V</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">nvcc: NVIDIA (R) Cuda compiler driver</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">Copyright (c) 2005-2014 NVIDIA Corporation</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">Built on Fri_Dec_12_11:12:07_CST_2014</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">Cuda compilation tools, release 6.5, V6.5.35</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">ubuntu@tegra-ubuntu:~$ /usr/local/cuda-7.0/bin/nvcc -V</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">nvcc: NVIDIA (R) Cuda compiler driver</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">Copyright (c) 2005-2015 NVIDIA Corporation</span></div>
<div style="line-height: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">Built on Mon_Feb_22_15:38:26_CST_2016</span></div>
<div style="line-height: normal; min-height: 14px;">
<span style="font-family: "courier new" , "courier" , monospace; font-size: 10px;">Cuda compilation tools, release 7.0, V7.0.74</span></div>
<div>
<br /></div>
<div>
<b>Install protobuf and Bazel:</b></div>
<div>
For protobuf you can follow the instructions from the previous blog post (the only change is the location of protobuf-java-3.0.0-beta-x.jar, now in the java/core/target subdirectory).</div>
<div>
The procedure for Bazel is also similar; the only change required is the version. TensorFlow 0.8 requires Bazel 0.1.4, so after cloning Bazel you will need to check out the proper tag:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">$ git clone https://github.com/bazelbuild/bazel.git</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">$ cd bazel</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">$ git checkout tags/0.1.4</span></div>
</div>
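Checking out a tag like this leaves the working tree pinned at that exact release. The pattern can be rehearsed offline with a throwaway repository (the repo and commit below are stand-ins, not the real Bazel sources):

```shell
# Stand-in repository: tag a commit, then pin the checkout to that tag,
# mirroring the bazel clone/checkout sequence.
rm -rf /tmp/tag-demo && mkdir -p /tmp/tag-demo && cd /tmp/tag-demo
git init -q repo && cd repo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "release"
git tag 0.1.4
git checkout -q tags/0.1.4   # detached HEAD at the tagged release
git describe --tags          # prints 0.1.4
```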
<div>
<div style="line-height: normal;">
<div style="font-family: monaco;">
<br /></div>
<div style="font-family: monaco;">
<b style="font-family: -webkit-standard;">Install TensorFlow 0.8:</b></div>
The first thing to do is to check out the source code and select the proper version:<br />
<br />
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
$ git clone --recurse-submodules https://github.com/tensorflow/tensorflow</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
$ cd tensorflow</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
$ git checkout r0.8</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<br /></div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<br /></div>
</div>
</div>
<div>
<div style="color: #333333; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">TensorFlow is expecting a 64bit system, w</span><span style="letter-spacing: 0px;">e will need to change all the reference from lib64 to lib. We can find all the files with the strings and apply all the changes with these commands:</span></div>
<div style="color: #333333; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #333333; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0px;">$ cd tensorflow</span></div>
<div style="background-color: #ebebeb; color: #333333; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0px;">$ grep -Rl "lib64"| xargs sed -i 's/lib64/lib/g'</span></div>
<div style="color: #333333; font-family: Georgia, serif; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0px;"><b></b></span><br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">TensorFlow officially supports Cuda devices with 3.5 and 5.2 compute capabilities. We want to target a gpu with compute capabilities 3.2. </span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">This can be done through TensorFlow unofficial settings with "configure" via the TF_UNOFFICIAL_SETTING variable.</span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">When prompted, specify that you only want a 3.2 compute capability device.</span></div>
</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
<div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
ubuntu@tegra-ubuntu:~/tensorflow$ TF_UNOFFICIAL_SETTING=1 ./configure</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify the location of python. [Default is /usr/bin/python]: </div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Do you wish to build TensorFlow with GPU support? [y/N] y</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
GPU support will be enabled for TensorFlow</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify which gcc nvcc should use as the host compiler. [Default is /usr/bin/gcc]: </div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: </div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify the location where CUDA toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: </div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify the Cudnn version you want to use. [Leave empty to use system default]: </div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify the location where cuDNN library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: </div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please specify a list of comma-separated Cuda compute capabilities you want to build with.</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Please note that each additional compute capability significantly increases your build time and binary size.</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
[Default is: "3.5,5.2"]: 3.2</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Setting up Cuda include</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Setting up Cuda lib</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Setting up Cuda bin</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Setting up Cuda nvvm</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
Configuration finished</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
Now that the initial set up is done, it is time to change the compiler used by Bazel.</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
ubuntu@tegra-ubuntu:~/tensorflow$ cd third_party/gpus/cuda/</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ rm -fr bin nvvm</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ cp -R /usr/local/cuda-7.0/bin/ bin</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
ubuntu@tegra-ubuntu:~/tensorflow/third_party/gpus/cuda$ cp -R /usr/local/cuda-7.0/nvvm/ nvvm</div>
</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
Before starting the build (which is going to take a very long time), we will need to modify a few files.</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<div style="color: #323333; font-family: Georgia; line-height: normal;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/core/kernels/conv_ops_gpu_2.cu.cc:</b></span></div>
<div style="color: #323333; font-family: Georgia; line-height: normal;">
<span style="letter-spacing: 0.0px;"> To avoid double instantiation, guard the second functor for InflatePadAndShuffle with:</span></div>
<div style="color: #5330e1; font-family: Monaco; font-size: 10px; line-height: normal;">
/* On ARMv7 Eigen::DenseIndex is typedefed to int */</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#ifndef __arm__</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #34bd26;">template</span> <span style="color: #34bd26;">struct</span> functor::InflatePadAndShuffle<gpudevice span="" style="color: #34bd26;">float</gpudevice></div>
</div>
</div>
, <span style="color: #c33720;">4</span>,
<br />
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
Eigen::DenseIndex>;</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#endif<span style="color: #323333; font-family: "georgia"; font-size: 16px; letter-spacing: 0px; line-height: normal;"> </span></div>
<div style="color: #323333; line-height: normal;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/core/kernels/conv_ops_gpu_3.cu.cc:</b></span></div>
<div style="color: #323333; font-family: Georgia; line-height: normal;">
<span style="letter-spacing: 0.0px;"> To avoid double instantiation, guard the second functor for ShuffleAndReverse with:</span></div>
<div style="color: #5330e1; font-family: Monaco; font-size: 10px; line-height: normal;">
/* On ARMv7 Eigen::DenseIndex is typedefed to int */</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#ifndef __arm__</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #34bd26;">template</span> <span style="color: #34bd26;">struct</span> functor::ShuffleAndReverse<gpudevice span="" style="color: #34bd26;">float</gpudevice></div>
, <span style="color: #c33720;">4</span>,
<br />
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
Eigen::DenseIndex>;</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#endif</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<div style="color: #323333; line-height: normal;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:</b></span></div>
<div style="color: #323333; font-family: Georgia; line-height: normal;">
<span style="letter-spacing: 0.0px;"> ARMv7 has no numa_node file. It should return 0 not -1, otherwise TensorFlow will crash at runtime. You can use the modification from the previous post or the following code:</span></div>
<div style="color: #323333; font-family: Georgia; line-height: normal;">
<span style="letter-spacing: 0.0px;"><br /></span></div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #34bd26;">static</span> <span style="color: #34bd26;">int</span> TryToReadNumaNode(<span style="color: #34bd26;">const</span> string &pci_bus_id, <span style="color: #34bd26;">int</span> device_ordinal) {</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#ifdef __arm__</div>
<div style="color: #c33720; font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: black;"> LOG(INFO) << </span>"ARMV7 does not support NUMA - returning NUMA node zero"<span style="color: black;">;</span></div>
<div style="color: #ce7924; font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: black;"> </span>return<span style="color: black;"> </span><span style="color: #c33720;">0</span><span style="color: black;">;</span></div>
<div style="color: #323333; font-family: Georgia; line-height: normal;">
</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#else</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
........</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #ce7924;">return</span> kUnknownNumaNode;</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#endif</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
}</div>
</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div>
<b style="color: #323333; font-family: Georgia; font-size: 16px;">tensorflow/core/</b><span style="color: #323333; font-family: "georgia";"><b>common_runtime/gpu/process_state.cc:</b></span></div>
<div>
<span style="color: #323333; font-family: "georgia";">this is a new memory allocator, that is going to cause a floating point exception unless you change the following code:</span></div>
<div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #ce7924;"><br /></span></div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #ce7924;">if</span> (kCudaHostMemoryUseBFC) {</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
allocator =</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#ifdef __arm__</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #ce7924;">new</span> BFCAllocator(<span style="color: #ce7924;">new</span> CUDAHostAllocator(se), <span style="color: #c33720;">1LL</span> << <span style="color: #c33720;">31</span>,</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #c33720;">true</span> <span style="color: #5330e1;">/*allow_growth*/</span>, <span style="color: #c33720;">"cuda_host_bfc"</span> <span style="color: #5330e1;">/*name*/</span>);</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#else</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #ce7924;">new</span> BFCAllocator(<span style="color: #ce7924;">new</span> CUDAHostAllocator(se), <span style="color: #c33720;">1LL</span> << <span style="color: #c33720;">36</span> <span style="color: #5330e1;">/*64GB max*/</span>,</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
<span style="color: #c33720;">true</span> <span style="color: #5330e1;">/*allow_growth*/</span>, <span style="color: #c33720;">"cuda_host_bfc"</span> <span style="color: #5330e1;">/*name*/</span>);</div>
<div style="color: #d53bd3; font-family: Monaco; font-size: 10px; line-height: normal;">
#endif</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
} <span style="color: #ce7924;">else</span> {</div>
</div>
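For context, my reading of why the smaller pool is needed (the 64 GB figure comes from the comment in the original code; the 4 GB ceiling is the usual ARMv7 process address space): a 32-bit build cannot represent, let alone reserve, a 1 &lt;&lt; 36 byte pool. The arithmetic, as a quick shell check:

```shell
# 64-bit shell arithmetic; sizes expressed in GB for readability.
echo $(( (1 << 36) / (1 << 30) ))   # default pool: 64 GB
echo $(( (1 << 32) / (1 << 30) ))   # 32-bit address space: 4 GB
echo $(( (1 << 31) / (1 << 30) ))   # patched pool: 2 GB, fits
```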
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
We are now ready to build. The only thing left to do is to remove the check that disables the use of variadic templates in Eigen. I have not found a clean way to do it (someone with better Bazel skills may have a better idea); my solution is to start the build and then wait for the first failure:</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="background-color: #fafafa; color: #36474f; font-family: "helvetica neue";">$bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures -s --config=cuda //tensorflow/cc:tutorials_example_trainer</span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a;">
<span style="background-color: white; color: #333333; font-family: "georgia" , serif; text-align: justify;">If on your first compile of tensorflow you get the following error:</span><br />
<br style="color: #333333; font-family: Georgia, serif; text-align: justify;" />
<span style="font-family: inherit;"><span style="background-color: white; color: #333333; text-align: justify;">ERROR: /home/ubuntu/tensorflow/tensorflow/cc/BUILD:61:1: error loading package 'tensorflow/core': Extension file not found. Unable to load package for '//google/protobuf:protobuf.bzl': BUILD file not found on package path and referenced by '//tensorflow/cc:tutorials_example_trainer'.</span><br style="color: #333333; text-align: justify;" /><br style="color: #333333; text-align: justify;" /><span style="background-color: white; color: #333333; text-align: justify;">You need to init update in the tensorflow repository to get the google/protobuf clone using:</span><br style="color: #333333; text-align: justify;" /><br style="color: #333333; text-align: justify;" /><span style="background-color: white; color: #333333; text-align: justify;">git submodule update --init </span></span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br />
At this point, I can edit the file Macros.h in Eigen.</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
This file is located in the .cache directory:</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
ubuntu@tegra-ubuntu:~/.cache$ find . -name Macros.h -print</div>
<div style="color: #00364a; font-family: Monaco; font-size: 10px; line-height: normal;">
./bazel/_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/external/eigen_archive/eigen-eigen-3f653ace7d28/Eigen/src/Core/util/Macros.h</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
The nvcc check needs to be eliminated:</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
-#if !defined(__NVCC__) || !defined(EIGEN_ARCH_ARM_OR_ARM64)</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
 #define EIGEN_HAS_VARIADIC_TEMPLATES 1</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
 #endif</div>
<div style="font-family: Monaco; font-size: 10px; line-height: normal;">
-#endif</div>
</div>
</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
We can now restart the build and it will go through. </div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
After you are done, you can test it with:</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div>
$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu<br />
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">You should see a similar output:</span></div>
<div style="color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0px;"></span><br /></div>
# Lots of output. This tutorial iteratively calculates the major eigenvalue of<br />
# a 2x2 matrix, on GPU. The last few lines look like this.<br />
000009/000005 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]<br />
000006/000001 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]<br />
000009/000009 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]</div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
<div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">We are now ready to create the pip package and install it:</span></div>
# To build with GPU support:<br />
$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package<br />
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg<br />
# The name of the .whl file will depend on your platform.<br />
$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.8.0-cp27-none-linux_armv7l.whl</div>
<div>
<br />
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;">Congratulation, TensorFlow is now installed on your system.</span></div>
</div>
</div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
Most of the tests pass, but the image classification example gives wrong results. Now that the community can build it and play with it, hopefully someone will find the source of the error(s).</div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<br /></div>
<div>
<span style="color: #36474f; font-family: "helvetica"; font-size: 16px;">I downloaded the python files from TensorFlow-Tutorial and they seem to work:</span></div>
<div>
<br /></div>
<div>
git clone https://github.com/nlintz/TensorFlow-Tutorials.git</div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0px;"><br /></span></div>
Building TensorFlow for Jetson TK1 (2015-11-27)<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">Google recently released TensorFlow, an open source software library for numerical computation using data flow graphs. </span></span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><span style="letter-spacing: 0px;">TensorFlow has a GPU backend built on CUDA, so I wanted to install it on a Jetson TK1. Even if the system did not meet the requirements ( CUDA 7.0 is not available and the GPU is a compute capability 3.2), I decided to give it a try anyway. This blog reports all the steps required to build TensorFlow from source, it is quite challenging but it can be done. Including all the prerequisites, the whole build will take several hours ( if you just want to try Tensorflow, you can download </span>the<span style="letter-spacing: 0px;"> wheel file I generated and do a pip install. The </span>file is at https://drive.google.com/file/d/0B1uGKNpQ7xNqZ2pvSmc3SlZJS2c/view?usp=sharing )<span style="letter-spacing: 0px;">.</span><span style="font-size: 16px; letter-spacing: 0px;"> </span></span><br />
<div style="font-size: 16px; min-height: 19px;">
<span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;"><span style="letter-spacing: 0.0px;"></span><br /></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">TensorFlow is under active development and the code uses a lot of advanced C++ features that really push the compiler. These instructions worked with the version available on 11/26, but newer versions may require additional changes.</span></span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "helvetica neue" , "arial" , "helvetica" , sans-serif;">The first challenge is to build Bazel, another tool developed at Google, used as the build system for TensorFlow. Bazel requires a protobuf version newer than the one present in the Ubuntu 14.04 repos, so the first step will be to build protobuf 3 from source, since there are no prebuilt binaries for ARM32.</span></span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 19px;">
<span style="letter-spacing: 0.0px;"><b>Java 8:</b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">The first step is to install Java 8; this is quite simple since Oracle provides a package:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #f9f9f9; color: #323333; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo add-apt-repository ppa:webupd8team/java</span></div>
<div style="background-color: #f9f9f9; color: #323333; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo apt-get update</span></div>
<div style="background-color: #f9f9f9; color: #323333; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo apt-get install oracle-java8-installer</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 19px;">
<span style="letter-spacing: 0.0px;"><b>Protobuf:</b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">In order to build protobuf and Bazel, we will need several other packages. The exact list will depend on the state of your Jetson, but you will need at least these:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo apt-get install git zip unzip autoconf automake libtool curl zlib1g-dev </span></div>
<div style="font-size: 16px;">
<span style="font-family: "helvetica neue"; letter-spacing: 0px;"><br />
</span><span style="letter-spacing: 0.0px;">After downloading the latest source from github:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ git clone https://github.com/google/protobuf.git</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">you need to first generate the configuration file and then run make:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cd protobuf</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ ./autogen.sh </span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ ./configure </span><span style="color: #323333; letter-spacing: 0.0px;">--prefix=/usr</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ make -j 4</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo make install</span></div>
<div style="font-size: 16px;">
<span style="font-family: "helvetica neue"; letter-spacing: 0px;"><br />
</span><span style="letter-spacing: 0.0px;">The protoc binary will be installed in /usr/bin and the libraries in /usr/lib; this is important because Bazel runs its build steps in a sandbox and would not find the libraries in /usr/local/lib.</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">If you have followed all the steps, you should see this output:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">ubuntu@tegra-ubuntu:~/protobuf$ protoc --version</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">libprotoc 3.0.0</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">We also need to build the Java interface for protobuf, which requires Maven.</span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">Luckily, Maven is available from the repos, so we can simply run:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo apt-get install maven</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">Go to the java subdirectory inside protobuf and run:</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ mvn package</span></div>
<div style="font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">Once the build completes, there will be a protobuf-java-3.0.0-beta-1.jar inside the target subdirectory.</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 19px;">
<span style="letter-spacing: 0.0px;"><b>Bazel:</b></span></div>
<div style="font-size: 19px; min-height: 23px;">
<span style="letter-spacing: 0.0px;"><b></b></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;">We are now ready to tackle Bazel.</span></div>
<div style="background-color: #f9f9f9; color: #323333; font-size: 16px;">
<span style="letter-spacing: 0.0px;">The first step is to download the source code for Bazel (using version 0.1.0, which is known to work with TensorFlow). </span></div>
<div style="background-color: #f9f9f9; color: #323333; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ git clone https://github.com/bazelbuild/bazel.git</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cd bazel</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ git checkout tags/0.1.0</span></div>
<div style="background-color: #f9f9f9; font-family: Courier; font-size: 13px; min-height: 16px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Before compiling, we need to copy the protoc binary we just built as third_party/protobuf/protoc-linux-arm32.exe.</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We also need to copy the jar file from protobuf into the same directory. Bazel expects an alpha-3 version, but we have built a beta-1.</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">There is probably a better way of doing this, but simply copying and renaming the file did the trick for me.</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cp /usr/bin/protoc third_party/protobuf/protoc-linux-arm32.exe</span></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cp ~/protobuf/java/target/protobuf-java-3.0.0-beta-1.jar third_party/protobuf/protobuf-java-3.0.0-alpha-3.jar</span></div>
<div style="background-color: #f9f9f9; color: #323333; font-size: 13px; min-height: 16px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We are now ready to compile bazel. </span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #eeeeee; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ ./compile.sh</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">At the end of the compilation, the bazel binary will be in the output directory. You can add this directory</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">to your PATH or copy the binary to /usr/local/bin.</span></div>
<div style="background-color: #f9f9f9; color: #323333; font-size: 13px; min-height: 16px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 19px;">
<span style="letter-spacing: 0.0px;"><b>TensorFlow</b></span></div>
<div style="font-size: 19px; min-height: 23px;">
<span style="letter-spacing: 0.0px;"><b></b></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We are now ready to tackle the TensorFlow build for GPU. Just be sure to have CUDA 6.5 and CUDNN 6.5 installed on your Jetson TK1. </span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">You will also need some files from the CUDA 7.0 package (cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb) that you can download from</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">the NVIDIA web site (it is the one for the Jetson TX1).</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">While the Jetson TK1 cannot run the 7.0 runtime (the driver shipped with the system does not support it), it is still possible to use the CUDA 7.0 compiler. We need the 7.0 compiler because some of the TensorFlow source files trigger an internal compiler error with the 6.5 nvcc. </span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">All the libraries and runtime will be the standard 6.5 ones. </span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">On my system I have also enabled some swap space. You can plug in a USB memory stick, create a swap area on it, and enable it with:</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo mkswap /dev/sda</span></div>
<div style="background-color: #ebebeb; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo swapon /dev/sda </span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">The first step to build TensorFlow is to clone the github repository:</span></div>
<div style="background-color: #ebebeb; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ git clone --recurse-submodules https://github.com/tensorflow/tensorflow </span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">and install other dependencies:</span></div>
<div style="background-color: #ebebeb; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo apt-get install python-numpy swig python-dev</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">TensorFlow expects a 64-bit system and has a number of library paths and library names hard-coded in its files.</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Before starting the build, we need to modify several files: change all references from lib64 to lib, and change the 7.0 library versions to 6.5. We can find all the files containing these strings and apply the changes with these commands:</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cd tensorflow</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ grep -Rl "lib64"| xargs sed -i 's/lib64/lib/g'</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ grep -Rl "so.7.0"| xargs sed -i 's/so\.7\.0/so\.6\.5/g'</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
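As a quick sanity check before running the two substitutions on the whole tree, you can try them on a sample line (the path below is hypothetical, just for illustration):

```shell
# Illustration only: apply the same two sed rewrites to a made-up sample line.
tmp=$(mktemp)
echo '/usr/local/cuda/lib64/libcudart.so.7.0' > "$tmp"
sed -i 's/lib64/lib/g' "$tmp"          # lib64 -> lib
sed -i 's/so\.7\.0/so\.6\.5/g' "$tmp"  # .so.7.0 -> .so.6.5
cat "$tmp"   # -> /usr/local/cuda/lib/libcudart.so.6.5
rm -f "$tmp"
```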
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"><b></b></span><br /></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">TensorFlow officially supports CUDA devices with 3.5 and 5.2 compute capabilities. We want to target a GPU with compute capability 3.2. </span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">This can be done through TensorFlow's unofficial settings, by running "configure" with the TF_UNOFFICIAL_SETTING variable set.</span></div>
<div style="color: #00364a; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">When prompted, specify that you only want a 3.2 compute capability device.</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ TF_UNOFFICIAL_SETTING=1 ./configure</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"># Same as the official settings above</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">WARNING: You are configuring unofficial settings in TensorFlow. Because some</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">external libraries are not backward compatible, these settings are largely</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">untested and unsupported.</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Please specify a list of comma-separated Cuda compute capabilities you want to</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">build with. You can find the compute capability of your device at:</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">https://developer.nvidia.com/cuda-gpus.</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Please note that each additional compute capability significantly increases</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">your build time and binary size. [Default is: "3.5,5.2"]: 3.2</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Setting up Cuda include</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Setting up Cuda lib</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Setting up Cuda bin</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Setting up Cuda nvvm</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Configuration finished</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">After the configure step, bazel has copied or symlinked all the binaries and libraries needed for the build into the third_party/gpus/cuda subdirectory.</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">It is now time to replace the CUDA compiler with the one from the 7.0 toolchain.</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We want to extract (not install) the files from the cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb package with the following commands:</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ dpkg -x cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb /tmp/cuda_repo</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cd /tmp/cuda_repo/var/cuda-repo-7-0-local</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ dpkg -x cuda-core-7-0_7.0-71_armhf.deb /tmp/cuda7.0</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ rm -fr /tmp/cuda_repo</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cd ~/tensorflow/third_party/gpus/cuda</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ rm -fr bin nvvm</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cp -R /tmp/cuda7.0/usr/local/cuda-7.0/bin bin</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ cp -R /tmp/cuda7.0/usr/local/cuda-7.0/nvvm nvvm</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ rm -fr /tmp/cuda7.0</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">At this point, bazel is ready to use the 7.0 toolchain to compile TensorFlow.</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We still need to add the ARM target to the build. </span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">This can be done by adding the following lines to the </span><span style="letter-spacing: 0.0px; text-decoration: underline;">third_party/gpus/crosstool/CROSSTOOL file:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">default_toolchain {</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> cpu: "arm"</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> toolchain_identifier: "local_linux"</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">} </span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Before starting the build, we need to edit a few files to avoid compiler crashes and double instantiations </span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">(on ARMv7, Eigen::DenseIndex is typedefed to int):</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorDimensions.h</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">tensorflow/core/kernels/conv_ops_gpu_2.cu.cc</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">tensorflow/core/kernels/conv_ops_gpu_3.cu.cc</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">tensorflow/stream_executor/cuda/cuda_gpu_executor.cc</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">tensorflow/core/kernels/adjust_contrast_op.h</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;"><b>third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorDimensions.h: </b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span class="Apple-tab-span" style="white-space: pre;"> </span>The compiler crashes when evaluating the code inside the ifdef at line 312 of this file, so we take the alternative path instead.</span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"> Change line 312 to something like:</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#ifdef EIGEN_HAS_VARIADIC_TEMPLATES_TK1</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/core/kernels/conv_ops_gpu_2.cu.cc:</b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span class="Apple-tab-span" style="white-space: pre;"> </span>To avoid double instantiation, guard the second functor for InflatePadAndShuffle with:</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">/* On ARMv7 Eigen::DenseIndex is typedefed to int */</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#ifndef __arm__</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">template struct functor::InflatePadAndShuffle&lt;GPUDevice, float, 4,</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">                                               Eigen::DenseIndex&gt;;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#endif</span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span class="Apple-tab-span" style="white-space: pre;"> </span>We also need to comment out the tensor.h include (it will crash the compiler):</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">//#include "tensorflow/core/public/tensor.h"</span></div>
<div style="font-size: 16px; min-height: 19px;">
<br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/core/kernels/conv_ops_gpu_3.cu.cc:</b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span class="Apple-tab-span" style="white-space: pre;"> </span>To avoid double instantiation, guard the second functor for ShuffleAndReverse with:</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">/* On ARMv7 Eigen::DenseIndex is typedefed to int */</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#ifndef __arm__</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">template struct functor::ShuffleAndReverse&lt;GPUDevice, float, 4,</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">                                            Eigen::DenseIndex&gt;;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#endif</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:</b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span class="Apple-tab-span" style="white-space: pre;"> </span>ARMv7 has no numa_node file. The function should return 0, not -1, otherwise TensorFlow will crash at runtime:</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">FILE *file = fopen(filename.c_str(), "r");</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> if (file == nullptr) {</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> LOG(ERROR) << "could not open file to read NUMA node: " << filename;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#ifdef __arm__</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> // There is no numa_node on Jetson TK1</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> return 0;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#else</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> return kUnknownNumaNode;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#endif</span></div>
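<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">As a sanity check (my own addition, not part of the original patch), you can probe the sysfs entry the code tries to read; the path below is the one that appears in the runtime log further down, and the snippet simply reports whether it is present:</span></div>

```shell
# Probe the sysfs entry TensorFlow reads to discover the NUMA node.
# On Jetson TK1 it is absent (hence the patch above); on a desktop
# Linux machine it usually contains a small integer such as 0 or -1.
f=/sys/bus/pci/devices/0000:00:00.0/numa_node
if [ -r "$f" ]; then result=$(cat "$f"); else result="no numa_node entry"; fi
echo "$result"
```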
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;"><b>tensorflow/core/kernels/adjust_contrast_op.h:</b></span></div>
<div style="font-size: 16px;">
<span style="letter-spacing: 0.0px;"><span class="Apple-tab-span" style="white-space: pre;"> </span>The compiler crashes on some of the brace initializations, so we need to rewrite them in a simpler way:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">//MF Eigen::array&lt;int, 4&gt; scalar_broadcast{{batch, height, width, channels}};</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> Eigen::array&lt;int, 4&gt; scalar_broadcast;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar_broadcast[0] = batch;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar_broadcast[1] = height;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar_broadcast[2] = width;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar_broadcast[3] = channels;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#if !defined(EIGEN_HAS_INDEX_LIST)</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">//MF Eigen::array&lt;int, 2&gt; reduction_axis{{1, 2}};</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">//MF Eigen::array&lt;int, 4&gt; scalar{{1, 1, 1, 1}};</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">//MF Eigen::array&lt;int, 4&gt; broadcast_dims{{1, height, width, 1}};</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">//MF Eigen::Tensor&lt;int, 4&gt;::Dimensions reshape_dims{{batch, 1, 1, channels}};</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> Eigen::array&lt;int, 2&gt; reduction_axis;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> reduction_axis[0]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> reduction_axis[1]=2;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> Eigen::array&lt;int, 4&gt; scalar;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar[0]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar[1]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar[2]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> scalar[3]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> Eigen::array&lt;int, 4&gt; broadcast_dims;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> broadcast_dims[0]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> broadcast_dims[1]=height;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> broadcast_dims[2]=width;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> broadcast_dims[3]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> Eigen::Tensor&lt;int, 4&gt;::Dimensions reshape_dims;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> reshape_dims[0]=batch;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> reshape_dims[1]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> reshape_dims[2]=1;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"> reshape_dims[3]=channels;</span></div>
<div style="background-color: #ebebeb; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">#else</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">The source code is now ready. The Jetson TK1 has only 2GB of memory, and bazel will try to compile several files at the same time.</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We want to avoid this, so we will pass a --local_resources flag that limits the build to 2GB and half a core (don't ask: even if you specify a full core, it will still try</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">to compile two files at the same time). This build will take a long time:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #fafafa; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/cc:tutorials_example_trainer</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">If you get failures during the build, keep trying: bazel's scheduling seems to be non-deterministic, and the TensorFlow code really stresses the</span></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">compiler.</span></div>
<div style="font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Once the build is completed, we can test the code:</span></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu</span></div>
<div style="color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">You should see a similar output:</span></div>
<div style="color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"># Lots of output. This tutorial iteratively calculates the major eigenvalue of</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"># a 2x2 matrix, on GPU. The last few lines look like this.</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">000009/000005 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">000006/000001 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">000009/000009 lambda = 2.000000 x = [0.894427 -0.447214] y = [1.788854 -0.894427]</span></div>
<div style="color: #36474f; font-size: 14px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-size: 14px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We are now ready to create the pip package and install it:</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"># To build with GPU support:</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;"># The name of the .whl file will depend on your platform.</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.8.0-cp27-none-linux_armv7l.whl</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Congratulations, TensorFlow is now installed on your system.</span></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">We can also try a more interesting example of image classification:</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">bazel build -c opt --local_resources 2048,0.5,1.0 --verbose_failures --config=cuda //tensorflow/examples/label_image/...</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="background-color: #f7f7f7; color: #323333; font-family: Consolas; font-size: 14px;">
<span style="letter-spacing: 0.0px;">$ wget https://storage.googleapis.com/download.tensorflow.org/models/inception5h.zip -O tensorflow/examples/label_image/data/inception5h.zip</span></div>
<div style="background-color: #f7f7f7; color: #323333; font-family: Consolas; font-size: 14px;">
<span style="letter-spacing: 0.0px;">$ unzip tensorflow/examples/label_image/data/inception5h.zip -d tensorflow/examples/label_image/data/</span></div>
<div style="background-color: #f7f7f7; color: #323333; font-family: Consolas; font-size: 14px;">
<span style="letter-spacing: 0.0px;">$ mv tensorflow/examples/label_image/data/tensorflow_inception_graph.pb tensorflow/examples/label_image/data/googlenet_graph.pb</span></div>
<div style="background-color: #f7f7f7; color: #323333; font-family: Consolas; font-size: 14px;">
<span style="letter-spacing: 0.0px;">$ mv tensorflow/examples/label_image/data/imagenet_comp_graph_label_strings.txt tensorflow/examples/label_image/data/googlenet_labels.txt </span></div>
<div style="color: #36474f; font-size: 14px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="color: #36474f; font-family: Helvetica; font-size: 16px;">
<span style="letter-spacing: 0.0px;">And run it with:</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">$ bazel-bin/tensorflow/examples/label_image/label_image</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 1</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:890] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties: </span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">name: GK20A</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">major: 3 minor: 2 memoryClockRate (GHz) 0.852</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">pciBusID 0000:00:00.0</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Total memory: 1.85GiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">Free memory: 218.46MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0 </span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0: Y </span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Allocating 18.46MiB bytes.</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:52] GPU 0 memory begins at 0xa45ea000 extends to 0xa585f000</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 1.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 2.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 4.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 8.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 16.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 32.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 64.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 128.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 256.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 512.0KiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 1.00MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 2.00MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 4.00MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 8.00MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 16.00MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:66] Creating bin of max chunk size 32.00MiB</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/direct_session.cc:60] Direct session inter op parallelism threads: 1</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GK20A, pci bus id: 0000:00:00.0)</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/examples/label_image/main.cc:221] military uniform (866): 0.902268</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/examples/label_image/main.cc:221] bow tie (817): 0.05407</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/examples/label_image/main.cc:221] suit (794): 0.0113196</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/examples/label_image/main.cc:221] bulletproof vest (833): 0.0100269</span></div>
<div style="background-color: #ebebeb; color: #36474f; font-family: 'Helvetica Neue'; font-size: 16px;">
<span style="letter-spacing: 0.0px;">I tensorflow/examples/label_image/main.cc:221] bearskin (849): 0.00649747</span></div>
<div style="color: #36474f; font-size: 14px; min-height: 18px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
<br />
<div style="font-size: 16px; min-height: 19px;">
<span style="letter-spacing: 0.0px;"></span><br /></div>
Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com12tag:blogger.com,1999:blog-5417613544042855153.post-25922001834339875922013-10-12T16:18:00.001-07:002013-10-16T17:09:07.497-07:00CUDA 5.5 and Xcode 5The latest Xcode 5 update seems to have broken nvcc.<br />
<div>
If you try to compile a CUDA program, you will see a similar error:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">%nvcc -c qr.cu</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">clang: error: unsupported option '-dumpspecs'</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">clang: error: no input files</span></div>
</div>
<div>
<br /></div>
<div>
There is a simple workaround.</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">%nvcc -ccbin=/usr/bin/clang</span> <span style="font-family: 'Courier New', Courier, monospace;">-c qr.cu</span></div>
<div>
<br /></div>
</div>
<div>
A more convenient way of adding this is to define an alias for nvcc.</div>
<div>
You can add this line to your .bash_profile</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">alias nvcc='nvcc -ccbin=/usr/bin/clang'</span></div>
<div>
<br /></div>
<div>
or just define it in your shell,</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">alias 'nvcc=nvcc -ccbin=/usr/bin/clang'</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<br /></div>
Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com0tag:blogger.com,1999:blog-5417613544042855153.post-39651845393046034412013-09-11T12:39:00.002-07:002013-09-23T16:27:45.357-07:00Calling CUDA Fortran kernels from MATLABThe latest MATLAB versions, starting from 2010b, have a very cool feature that enables calling CUDA C kernels from MATLAB code.<br />
This is much better and simpler than writing MEX files to call CUDA code (being the original author of the first CUDA MEX files and of the NVIDIA white paper, I am speaking from experience) and it is a very powerful tool.<br />
<br />
Let's take a very simple CUDA C code, add.cu, that adds a scalar to a vector:<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">__global__ void add(double * in, double a, int N) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> int idx = blockIdx.x * blockDim.x + threadIdx.x;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> if (idx &lt; N) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> in[idx] += a;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> }</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<br />
<div>
The full documentation is available at<br />
<a href="http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html" target="_blank">http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html</a><br />
I am just going to summarize the required steps:<br />
<br /></div>
<ul>
<li>Generate a PTX file from the kernel source</li>
<ul>
<li><i>nvcc -ptx -arch sm_20 add.cu</i></li>
</ul>
<li><i><span style="font-style: normal;">Construct the kernel object from the PTX file</span></i></li>
<ul>
<li><i><span style="font-style: normal;"><i>k=parallel.gpu.CUDAKernel('add.ptx','add.cu');</i></span></i></li>
</ul>
<li>Set up the block and grid configuration, for example 28 blocks of 256 threads each:</li>
<ul>
<li><i>k.ThreadBlockSize=[256 1 1]</i></li>
<li><i>k.GridSize=[28 1 1]</i></li>
</ul>
<li>Execute the kernel.</li>
<ul>
<li><i>o = feval(k,rand(10,1),2.,10)</i></li>
<li>The gpuArray o contains the output of the kernel</li>
</ul>
</ul>
<br />
It is possible to do the same with CUDA Fortran.<br />
First of all, we will need to rewrite the code in CUDA Fortran (shameless plug, if you want<br />
to learn more about CUDA Fortran there is a very good book you can pre-order from Amazon,<br />
"<span style="background-color: white; font-family: Times, 'Times New Roman', serif;"><a href="http://www.amazon.com/CUDA-Fortran-Scientists-Engineers-Programming/dp/0124169708" target="_blank">CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming</a>"). This is the equivalent code :</span><br />
<br />
<span style="font-family: Courier New, Courier, monospace;">attributes(global) subroutine add(a, b, N)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> implicit none</span><br />
<span style="font-family: Courier New, Courier, monospace;"> double precision, intent(inout) :: a(*)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> double precision, value :: b</span><br />
<span style="font-family: Courier New, Courier, monospace;"> integer , value :: N</span><br />
<span style="font-family: Courier New, Courier, monospace;"> integer :: i</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> i = threadIdx%x+(blockIdx%x-1)*blockDim%x</span><br />
<span style="font-family: Courier New, Courier, monospace;"> if ( i &lt;= N ) a(i) = a(i)+b</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"> end subroutine add</span><br />
<div>
<br /></div>
For the generation of the PTX file, instead of invoking nvcc, we will call pgf90 with the right<br />
flags to generate the PTX file:<br />
<br />
pgf90 -c -Mcuda=keepptx,cc20 addf.cuf<br />
The keepptx flag will generate the PTX file for compute capability 2.0, addf.n001.ptx.<br />
If the compute capability is omitted, or if you specify multiple targets, the PGI compiler will generate several PTX files; the numbering is just an enumeration, so you will need to inspect each file to check its compute capability. We can perform this step from an OS shell or from inside MATLAB.<br />
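A quick way to inspect them (a sketch, assuming the naming scheme above) is to look at the .target directive that every PTX file carries near its top:<br />

```shell
# Each PTX file records its architecture in a ".target" directive.
# We create a stand-in addf.n001.ptx mimicking the compiler output so
# the command is self-contained; on a real build, just run the grep.
printf '.version 3.2\n.target sm_20\n' > addf.n001.ptx
grep -H '^\.target' addf.*.ptx   # prints: addf.n001.ptx:.target sm_20
```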
In order to invoke the compiler from the MATLAB prompt, we need to load the proper bash variables by issuing the command:<br />
<br />
setenv('BASH_ENV','~/.bash_profile');<br />
<br />
and then invoke the pgf90 command preceded by an exclamation point. The exclamation point indicates that the rest of the input line is issued as a command to the operating system.<br />
<br />
!pgf90 -c -Mcuda=keepptx,cc20 addf.cuf<br />
<br />
<br />
<div style="text-align: left;">
In order to load the PTX file in MATLAB, we need to slightly change the syntax.</div>
<div style="text-align: left;">
When loading the PTX file generated by CUDA C, we were passing both the PTX file name and</div>
<div style="text-align: left;">
the original CUDA C file. In this way, MATLAB will automatically discover the prototype of the function. Alternatively, we can explicitly pass the prototype signature to parallel.gpu.CUDAKernel. </div>
<br />
This is what we need to load the PTX file generated from CUDA Fortran.<br />
<br />
kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ');<br />
<br />
Once we have created the kernel object kf, the calling sequence is the same one we used before.<br />
We will set up the block and grid configuration, for example 28 blocks of 256 threads each:<br />
<br />
<ul>
<li>kf.ThreadBlockSize=[256 1 1]</li>
<li>kf.GridSize=[28 1 1]</li>
</ul>
and execute the kernel.<br />
<ul>
<li>of = feval(kf,rand(10,1),2.,10)</li>
</ul>
<div>
<br /></div>
<div>
This is the full sequence of the MATLAB code with a verbose output to check all the intermediate steps:</div>
<div>
<br /></div>
<div>
% Create a 1D array of doubles with 10 elements</div>
<div>
i1=gpuArray(rand(10,1))</div>
<div>
% Create the kernel object from the PTX file with explicit prototype<br />
kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ')</div>
<div>
% Set execution configuration<br />
kf.ThreadBlockSize=[256 1 1]<br />
kf.GridSize=[28 1 1]</div>
<div>
% Execute the kernel<br />
of=feval(kf,i1,10.,10)</div>
<br />
<br />
An important point for CUDA Fortran kernels is that you cannot use Fortran assumed-shape arguments, since these require the compiler to build and pass an array descriptor as an extra argument.<br />
<br />
<br />
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
Now that we understand all the steps, let's move to something more complex and discuss a few more points.</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
We are going to implement a kernel that computes the sum of an array in a single pass, using an atomic lock</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
(the implementation and accuracy of parallel sums are discussed in detail in Chapter 5 of the aforementioned book).</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
The kernel is embedded in a module, since we are using a global variable for the lock. </div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
There is no limit on the number of elements the routine can handle, aside from the use of 32-bit integers</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
for the addressing: each thread will process multiple elements if needed.</div>
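Before looking at the CUDA Fortran source, the overall strategy (a grid-stride partial sum per thread, a binary tree reduction per block, and a serialized combination of block results, which is the role played by the atomic lock) can be emulated in plain 0-based Python. This is only an illustrative sketch of the algorithm, not the GPU code:

```python
def gpu_style_sum(data, blocks=28, threads=256):
    """Emulate the kernel: grid-stride loops, per-block tree reduction,
    then a serialized accumulation standing in for the atomic lock."""
    total = 0.0
    nthreads = blocks * threads
    for b in range(blocks):
        psum = [0.0] * threads
        for t in range(threads):
            i = t + b * threads
            # grid-stride loop: this "thread" handles elements i, i+nthreads, ...
            for j in range(i, len(data), nthreads):
                psum[t] += data[j]
        # binary tree reduction within the block (threads must be a power of 2)
        inext = threads // 2
        while inext >= 1:
            for t in range(inext):
                psum[t] += psum[t + inext]
            inext //= 2
        total += psum[0]  # on the GPU this update is guarded by the lock
    return total
```

The same power-of-two assumption on the block size applies to the reduction loop in the real kernel.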
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
This is the code:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
module sumgpu</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
implicit none</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
integer, parameter :: fp_kind = kind(0.0d0)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
integer, device:: lock=0</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
contains</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
attributes(global) subroutine sum(input,totalsum,N)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
real(fp_kind), intent(in) :: input(N)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
real(fp_kind) :: totalsum(1)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
integer,value :: N</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
real(fp_kind), shared, dimension(256) :: psum</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
integer :: i,index, inext</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
real(fp_kind) :: lsum</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
index=threadIdx%x+(BlockIdx%x-1)*BlockDim%x</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
lsum = 0._fp_kind</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
do i=index,N,BlockDim%x*GridDim%x</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
lsum = lsum+ input(i)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
end do</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
! Local reduction per block</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
index=threadIdx%x</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
psum(index)=lsum</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
call syncthreads()</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
inext=blockDim%x/2</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
do while ( inext >=1 )</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
if (index <=inext) psum(index)=psum(index)+psum(index+inext)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
inext = inext /2</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
call syncthreads()</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
end do</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
! Final reduction among block with atomic lock</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
if (index == 1) then</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
do while ( atomiccas(lock,0,1) == 1)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
end do</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
totalsum(1)=totalsum(1)+psum(1)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
call threadfence()</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
lock =0</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
end if</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
end subroutine sum</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
end module sumgpu</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
If we generate and load the module as seen before, we can observe the following:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf = </div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
CUDAKernel with properties:</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ThreadBlockSize: [1 1 1]</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
MaxThreadsPerBlock: 1024</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
GridSize: [1 1 1]</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
SharedMemorySize: 0</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
EntryPoint: 'sumgpu_sum_'</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
MaxNumLHSArguments: 2</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
NumRHSArguments: 3</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ArgumentTypes: {'inout double vector' 'inout double vector' 'in int32 scalar'}</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
The entry point is now sumgpu_sum_, even though the subroutine was named sum. This is a consequence of being embedded in a module.</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
When the CUDA Fortran compiler generates the PTX file, it renames the subroutine entry point as the concatenation of the module name, the subroutine name, and a trailing underscore.</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
While this is not important when the module contains a single subroutine, it is crucial for situations in which multiple entry points are defined. If the module had multiple subroutines, we would have received an error when trying to load the PTX file:</div>
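When in doubt, the candidate entry-point names can be read directly from the .entry directives in the PTX file; a minimal sketch (the file name matches the example below):

```python
import re

def ptx_entry_points(ptx_text):
    """Return the kernel entry-point names declared in a PTX module."""
    return re.findall(r"\.entry\s+([A-Za-z_$][\w$]*)", ptx_text)

if __name__ == "__main__":
    with open("sumSingleBlock.n001.ptx") as f:
        print(ptx_entry_points(f.read()))
```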
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')</div>
<div style="color: #b92d5d; font-family: 'Courier New'; font-size: 13px;">
Error using handleKernelArgs (line 61)</div>
<div style="color: #b92d5d; font-family: 'Courier New'; font-size: 13px;">
Found more than one entry point in the PTX code. Possible names are:</div>
<div style="color: #b92d5d; font-family: 'Courier New'; font-size: 13px;">
sumgpu_sum_</div>
<div style="color: #b92d5d; font-family: 'Courier New'; font-size: 13px;">
sumgpu_sum2_</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
In this case, we would have to modify the command syntax, adding an extra argument at the end of the list that specifies the entry point.</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int','<span style="color: #b92d5d;">sumgpu_sum_'</span>)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf = </div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
CUDAKernel with properties:</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ThreadBlockSize: [1 1 1]</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
MaxThreadsPerBlock: 1024</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
GridSize: [1 1 1]</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
SharedMemorySize: 0</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
EntryPoint: 'sumgpu_sum_'</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
MaxNumLHSArguments: 2</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
NumRHSArguments: 3</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ArgumentTypes: {'inout double vector' 'inout double vector' 'in int32 scalar'}</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
The command now completes correctly. However, with the prototype signature we specified, the first array, which in the original code was</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
declared with intent(in) since it is only an input to the subroutine, is now marked as 'inout double vector'. This is not a major problem, but we would</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
need to remember, when using the object in MATLAB, to collect two vectors as output on the left-hand side.</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
We can fix this by changing the prototype signature to:</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int','sumgpu_sum_')</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf = </div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
CUDAKernel with properties:</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ThreadBlockSize: [1 1 1]</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
MaxThreadsPerBlock: 1024</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
GridSize: [1 1 1]</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
SharedMemorySize: 0</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
EntryPoint: 'sumgpu_sum_'</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
MaxNumLHSArguments: 1</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
NumRHSArguments: 3</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ArgumentTypes: {'in double vector' 'inout double vector' 'in int32 scalar'}</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
where we have replaced 'double *' with 'const double *' to reflect that the array is read-only. </div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
We are now ready to run the code:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
%Generate an array of 1024 elements on the CPU</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
a=rand(1024,1);</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Copy the array to a GPU array ag</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ag=gpuArray(a);</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
%Generate the kernel object and setup the execution configuration</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf.ThreadBlockSize=[256 1 1];</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf.GridSize=[28 1 1];</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Initialize the sum to zero</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
sumg=gpuArray.zeros(1,'double');</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Invoke the kernel</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
disp('CUDA Fortran kernel:')</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
sumg=feval(kf,ag,sumg,1024)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Recompute the sum using the intrinsic MATLAB function</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
disp('Intrinsic MATLAB sum on GPU:')</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
sum_matlab=sum(ag)</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
%Check the result</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
disp('Difference:')</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
sumg-sum_matlab</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
obtaining the following output:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
CUDA Fortran kernel:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
sumg =</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
509.2181</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
Intrinsic MATLAB sum on GPU:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
sum_matlab =</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
509.2181</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
Difference:</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
ans =</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
0</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
Now that we are confident that the code is running properly and giving the correct results, we can do some performance testing.</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
We will generate 50 million random numbers directly on the GPU and then compute their sum.</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
%Set up random number generation on the GPU</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
seed=0;</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
gpu_stream = parallel.gpu.RandStream('CombRecursive','Seed',seed);</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
parallel.gpu.RandStream.setGlobalStream(gpu_stream);</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
N=50000000;</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
%Generate the random numbers directly on the GPU</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
ag=gpuArray.randn(N,1);</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
%Generate the kernel object and setup the execution configuration</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf.ThreadBlockSize=[256 1 1];</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
kf.GridSize=[128 1 1];</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Initialize the sum to zero</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
sumg=gpuArray.zeros(1,'double');</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Invoke the kernel and time the execution</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
tic;sumg=feval(kf,ag,sumg,N);toc</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
% Invoke the intrinsic sum and time the execution</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
tic;sum(ag);toc</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
The output indicates that this version is slightly faster than the native sum, which is, however, more convenient to use.</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
Elapsed time is 0.000357 seconds.</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px;">
Elapsed time is 0.000393 seconds.</div>
<div style="color: #323333; font-family: 'Courier New'; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
The real goal of using CUDA Fortran kernels is not to reimplement the intrinsic functions, but to add new capabilities or to reuse</div>
<div style="color: #323333; font-family: Georgia; font-size: 13px;">
existing standalone code while working in a very productive environment such as MATLAB.</div>
<div>
<br /></div>
<br />
<br />
<br />Enabling CUDA Multi Process Service (MPS) with multiple GPUs.<br />(Edited 05/09/2016)<br />
<span style="font-family: Helvetica; orphans: 2; text-align: -webkit-auto; widows: 2;">CUDA 7 introduced MPS support for multi GPU nodes.</span><br />
<span style="font-family: Helvetica; orphans: 2; text-align: -webkit-auto; widows: 2;">CUDA_VISIBLE_DEVICES should not be used to handle GPU affinity when a CUDA-aware MPI is used, because of issues with CUDA IPC.</span><br />
<span style="font-family: Helvetica; orphans: 2; text-align: -webkit-auto; widows: 2;"><br /></span>
(Edited 10/21/13 to use MPS control daemon instead of MPS server)<br />
<br />
CUDA 5.5 has a new interesting feature, called CUDA Multi Process Service (MPS), for GPUs with compute capability 3.5.<br />
<br />
CUDA MPS, formerly known as CUDA Proxy, is a feature that allows multiple CUDA processes to share a single GPU context. NVIDIA officially supports configurations with a single GPU, but it is possible to run it on systems with multiple GPUs creating the MPS servers manually.<br />
This post will show how to enable this feature when multiple GPUs are present in a system.<br />
It is an unsupported but working configuration.<br />
<br />
The first thing to do it is to create a MPS control daemon for each GPU.<br />
We will use CUDA_VISIBLE_DEVICES to select each GPU and create two directories in /tmp for each MPS control daemon: one for the pipe, the other for the log. By default, CUDA MPS will try to create a log directory in /var/log, requiring the control daemon to be executed with root privileges. By selecting a log directory in /tmp (or any other directory of your choice that is accessible to normal users), we don't need root privileges to start the control daemons.<br />
<br />
#!/bin/bash<br />
<br />
<br />
# Number of gpus with compute_capability 3.5 per server<br />
NGPUS=2<br />
<br />
# Start the MPS control daemon for each GPU<br />
for ((i=0; i< $NGPUS; i++))<br />
<br />
do<br />
mkdir /tmp/mps_$i<br />
mkdir /tmp/mps_log_$i<br />
export CUDA_VISIBLE_DEVICES=$i<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i<br />
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_$i<br />
nvidia-cuda-mps-control -d<br />
done<br />
<br />
Once we have set up the control daemons we need to point the CUDA executable we want to run to the right MPS control daemon. This is done in a non-standard way.<br />
Instead of using the CUDA_VISIBLE_DEVICES variable, as normally done with CUDA, we will need to set CUDA_VISIBLE_DEVICES to 0 and select the explicit MPS pipe we want to use by specifying the proper CUDA_MPS_PIPE_DIRECTORY.<br />
<br />
<br />
To start two instances of a.out on GPU 0 using MPS, we will type:<br />
<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0<br />
./a.out<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0<br />
./a.out<br />
<br />
The execution script is a little more complex if we are running a MPI application.<br />
In this case, we will need to find a way to detect how many MPI processes are running on a node.<br />
OpenMPI has a variable that will tell us this info, other MPI implementations offer similar environment<br />
variables.<br />
<br />
This script shows how to run local processes 0 and 2 on GPU 0, and processes 1 and 3 on GPU 1.<br />
<br />
#!/bin/bash<br />
#run script for MPI<br />
export CUDA_VISIBLE_DEVICES=0<br />
lrank=$OMPI_COMM_WORLD_LOCAL_RANK<br />
case ${lrank} in<br />
[0])<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0<br />
./executable<br />
;;<br />
[1])<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_1<br />
./executable<br />
;;<br />
[2])<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0<br />
./executable<br />
;;<br />
[3])<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_1<br />
./executable<br />
;;<br />
esac<br />
<br />
Once the execution is completed, we need to clean up the MPS control daemons if other users are supposed to run on the system.<br />
<br />
#!/bin/bash<br />
<br />
# Stop the MPS control daemon for each GPU and clean up /tmp<br />
NGPUS=2<br />
for ((i=0; i< $NGPUS; i++))<br />
do<br />
echo $i<br />
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i<br />
echo "quit" | nvidia-cuda-mps-control<br />
rm -fr /tmp/mps_$i<br />
rm -fr /tmp/mps_log_$i<br />
done<br />
<br />
The creation and clean-up could be combined in a single script.<br />
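The case statement above simply assigns local ranks to GPUs round-robin, so for larger nodes the same mapping can be computed instead of enumerated. A sketch of that mapping (the function name is hypothetical):

```python
def pipe_dir_for_rank(lrank, ngpus=2):
    """Round-robin mapping from an MPI local rank to the MPS pipe
    directory created for the corresponding GPU."""
    return "/tmp/mps_%d" % (lrank % ngpus)
```

With ngpus=2 this reproduces the case statement: ranks 0 and 2 share /tmp/mps_0, while ranks 1 and 3 share /tmp/mps_1.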
Using Thrust on CARMA<br />
<div style="font-family: Helvetica; font-size: 12px;">
Thrust is an excellent library for CUDA development.</div>
<div style="font-family: Helvetica; font-size: 12px;">
Unfortunately, Thrust is not present in the CARMA Toolkit but it is easy to install.</div>
<div style="font-family: Helvetica; font-size: 12px;">
<br /></div>
<div style="font-family: Helvetica; font-size: 12px;">
On the x86 development system, we are going to pull down the latest Thrust source using git.</div>
<div style="font-family: Helvetica; font-size: 12px;">
If git is not installed, we can easily add it to the system with:</div>
<div style="font-family: Helvetica; font-size: 12px; min-height: 14px;">
<br /></div>
<div style="font-size: 12px;">
<span style="font-family: Courier New, Courier, monospace;"> sudo apt-get install git</span></div>
<div style="font-family: Helvetica; font-size: 12px; min-height: 14px;">
<br /></div>
<div style="font-family: Helvetica; font-size: 12px; min-height: 14px;">
and then clone the git repository</div>
<div style="font-family: Helvetica; font-size: 12px; min-height: 14px;">
<br /></div>
<div style="font-size: 12px;">
<span style="font-family: Courier New, Courier, monospace;"> git clone https://github.com/thrust/thrust.git</span></div>
<div style="font-family: Helvetica; font-size: 12px;">
<br /></div>
<div style="font-family: Helvetica; font-size: 12px;">
<br /></div>
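Because Thrust is header-only, the clone above is the whole installation; a small sketch (the helper function is hypothetical, the path comes from the post) to confirm the headers are usable by reading the version macro:

```shell
# Report the THRUST_VERSION macro from a cloned Thrust tree, or "unknown"
# if the headers are not where we expect them.
thrust_version() {
    awk '/#define THRUST_VERSION /{print $3}' "$1/thrust/version.h" 2>/dev/null \
        || echo "unknown"
}
thrust_version /home/ubuntu/thrust
```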
<div style="font-family: Helvetica; font-size: 12px;">
We are now ready to cross-compile. Remember that Thrust is a template library: everything is built from include files.</div>
<div style="font-family: Helvetica; font-size: 12px;">
Using our standard Makefile, we just need to add the directory containing the Thrust include files (in this case /home/ubuntu/thrust). </div>
<div style="font-family: Helvetica; font-size: 12px;">
We also want to restrict code generation to arch sm_21 (the CARMA kit has a Quadro 1000M GPU with compute capability 2.1) to reduce compilation time.</div>
<div style="font-family: Helvetica; font-size: 12px;">
We are going to use one of the examples shipped with Thrust, monte_carlo.cu.</div>
<div style="font-family: Helvetica; font-size: 12px;">
<br /></div>
<div style="font-size: 12px;">
</div>
<span style="font-family: Courier New, Courier, monospace;">############################</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Makefile for cross-compile #</span><br />
<span style="font-family: Courier New, Courier, monospace;">############################</span><br />
<span style="font-family: Courier New, Courier, monospace;">all : monte_carlo</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">CUDA_HOME=/usr/local/cuda</span><br />
<span style="font-family: Courier New, Courier, monospace;">CC=/usr/bin/arm-linux-gnueabi-gcc</span><br />
<span style="font-family: Courier New, Courier, monospace;">NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32</span><br />
<span style="font-family: Courier New, Courier, monospace;">THRUST_LOC=/home/ubuntu/thrust</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">monte_carlo : monte_carlo.cu</span><br />
<span style="font-family: Courier New, Courier, monospace;"> $(NVCC) -O3 -arch sm_21 -o monte_carlo -I$(THRUST_LOC) monte_carlo.cu</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">clean:</span><br />
<span style="font-family: Courier New, Courier, monospace;"> rm monte_carlo</span><br />
<div style="font-family: Helvetica;">
<br /></div>
<div style="font-family: Helvetica;">
Once we generate the executable, we can copy it to the CARMA</div>
<div style="font-family: Helvetica;">
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> scp monte_carlo ubuntu@carma:~</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Helvetica;">and execute it. We will see the number pi printed with two decimal digits (3.14).</span></div>
<div>
<span style="font-family: Helvetica;">If you want to see more digits, you can change the source code and set the precision to 6 instead of the original 2</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> std::cout << std::setprecision(6);</span></div>
<div style="font-family: Helvetica;">
<br /></div>
<br />
Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com0tag:blogger.com,1999:blog-5417613544042855153.post-26239870516277909852012-10-29T16:51:00.000-07:002012-10-31T00:51:29.403-07:00Setting up a CARMA kitI just received a brand new CARMA kit and I am going to post all the steps I did to get a working set-up.<br />
<br />
Let's start with the x86 development system. I am using a virtual machine on my Mac as my development system.<br />
<br />
I started by installing a fresh Ubuntu 11.04 distro and then proceeded to:<br />
<ul>
<li>Update the packages: </li>
<ul>
<li><i>sudo apt-get update</i></li>
</ul>
<li>Install the basic developer tools: </li>
<ul>
<li><i>sudo apt-get install build-essential</i></li>
</ul>
<li>Install the 32bit development libraries ( CARMA is 32bit ):</li>
<ul>
<li><i>sudo apt-get install ia32-libs</i></li>
</ul>
<li>Install the ARM cross compilers: </li>
<ul>
<li><em style="background-color: #f7f7f7; font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 13px; margin: 0px; padding: 0px; text-align: left;"><span lang="EN-GB" style="font-size: 10pt; margin: 0px; padding: 0px;">sudo apt-get install gcc-4.5-arm-linux-gnueabi g++-4.5-arm-linux-gnueabi</span></em></li>
</ul>
<li><span style="font-family: Times, Times New Roman, serif;">Install Fortran for both x86 and ARM (real developers use Fortran....):</span></li>
<ul>
<li><span style="font-family: Verdana, Arial, Helvetica, sans-serif; font-size: x-small;">sudo apt-get install gfortran-4.5-*</span></li>
</ul>
<li><span style="font-family: Verdana, Arial, Helvetica, sans-serif; font-size: x-small;">I</span><span style="font-family: Times, Times New Roman, serif;">nstall the CUDA Toolkit (available from http://www.seco.com/carmakit under the downloads tab): </span></li>
<ul>
<li><span style="font-family: Verdana, Arial, Helvetica, sans-serif; font-size: x-small;">sudo sh </span>cuda-linux-ARMv7-rel-4.2.10-13489154.run</li>
</ul>
<li>Edit .bashrc to add nvcc to the path. With your favorite editor add a line at the end of the file:</li>
<ul>
<li>export PATH=/usr/local/cuda/bin:$PATH</li>
</ul>
<li>Source the .bashrc to refresh the path (it will be automatically executed the next time you log in or open a terminal):</li>
<ul>
<li>. .bashrc</li>
</ul>
</ul>
We can check that nvcc is now in our path by invoking the compiler with the -V flag to check the version:<br />
<br />
<br />
max@ubuntu:~$ nvcc -V<br />
nvcc: NVIDIA (R) Cuda compiler driver<br />
Copyright (c) 2005-2012 NVIDIA Corporation<br />
Built on Tue_Jul_17_14:48:12_PDT_2012<br />
Cuda compilation tools, release 4.2, V0.2.1221<br />
<br />
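Before typing make, it can also help to confirm that both nvcc and the cross compiler are reachable; a small sketch (the check_tool helper is hypothetical, the tool names come from the installation steps above):

```shell
# Report whether a tool is visible on the PATH (a sketch, not part of the
# post's original workflow).
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (check your PATH or the apt packages above)"
    fi
}
check_tool nvcc
check_tool arm-linux-gnueabi-gcc-4.5
```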
We are now ready to compile our first CUDA code, a comparison of repeated multiplications on the CPU and on the GPU.<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">#include "stdio.h"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">__global__ void kernel(int i, float *d_n)</span><br />
<span style="font-family: Courier New, Courier, monospace;">{</span><br />
<span style="font-family: Courier New, Courier, monospace;">*d_n *= 1.02f;</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">int main(){</span><br />
<span style="font-family: Courier New, Courier, monospace;"> float n = 1.0f, *d_n;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> float n_ref = 1.0f;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> int i;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> cudaMalloc((void **)&d_n, sizeof(float));</span><br />
<span style="font-family: Courier New, Courier, monospace;"> for(i = 1; i <= 10; i++)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> cudaMemcpy(d_n, &n, sizeof(float), cudaMemcpyHostToDevice);</span><br />
<span style="font-family: Courier New, Courier, monospace;"> kernel <<< 1, 1 >>> (i, d_n);</span><br />
<span style="font-family: Courier New, Courier, monospace;"> cudaMemcpy(&n, d_n, sizeof(float), cudaMemcpyDeviceToHost);</span><br />
<span style="font-family: Courier New, Courier, monospace;"> printf("%d\t\t%42.41f\t%42.41f\n", i, n, n_ref*=1.02f);</span><br />
<span style="font-family: Courier New, Courier, monospace;"> }</span><br />
<span style="font-family: Courier New, Courier, monospace;"> cudaFree(d_n);</span><br />
<span style="font-family: Courier New, Courier, monospace;"> return 0;</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span><br />
<br />
<br />
We are going to use a Makefile similar to the one posted in the previous post.<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">max@ubuntu:~$ cat Makefile </span><br />
<span style="font-family: Courier New, Courier, monospace;">############################</span><br />
<span style="font-family: Courier New, Courier, monospace;"># Makefile for cross-compile #</span><br />
<span style="font-family: Courier New, Courier, monospace;">############################</span><br />
<span style="font-family: Courier New, Courier, monospace;">all : gpu_test</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">CUDA_HOME=/usr/local/cuda</span><br />
<span style="font-family: Courier New, Courier, monospace;">CC=/usr/bin/arm-linux-gnueabi-gcc</span><br />
<span style="font-family: Courier New, Courier, monospace;">NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">gpu_test : gpu_test.cu</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>$(NVCC) -o gpu_test gpu_test.cu </span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">clean:</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>rm gpu_test</span><br />
<br />
<br />
<br />
When we type make, we should see output similar to this:<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">max@ubuntu:~$ make</span><br />
<span style="font-family: Courier New, Courier, monospace;">/usr/local/cuda/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32 -o gpu_test gpu_test.cu </span><br />
<span style="font-family: Courier New, Courier, monospace;">/usr/lib/gcc/arm-linux-gnueabi/4.5.2/../../../../arm-linux-gnueabi/bin/ld: warning: libc.so, needed by /usr/arm-linux-gnueabi/lib//libgcc_s.so.1, not found (try using -rpath or -rpath-link)</span><br />
<br />
<br />
<br />
<div style="background-color: white; color: #333333; line-height: 18px; margin-bottom: 1.2em; max-width: 45em; padding: 0px; text-align: left; width: auto;">
<span style="font-family: Times, Times New Roman, serif;">Don't worry about the warning. This is caused by a bogus DT_NEEDED entry in the shared libgcc file /usr/arm-linux-gnueabi/lib/libgcc_s.so.1. "readelf -a" shows:<br /> 0x00000001 (NEEDED) Shared library: [libc.so]</span></div>
Before we can use the machine for any real CUDA development, there is an extra step to perform. The CUDA Toolkit is missing libcuda.so (on x86 it usually comes with the driver; don't ask me why it was not included in the ARM toolkit), so we will not be able to link any CUDA code until we bring this library to the x86 machine. We will do this step once we have the CARMA up and running.<br />
<br />
<br />
<div style="font-size: 16px;">
Unpack the CARMA, plug in a keyboard and mouse, plus the HDMI cable in the middle connector.</div>
<div style="font-size: 16px;">
Plug in the power and ethernet cable and you are ready to go.</div>
<div style="font-size: 16px;">
The first boot may be slow, since the system is building the NVIDIA driver. It is a blind boot: there is no console output until the GUI comes up, so you need a little patience.</div>
<div style="font-size: 16px; min-height: 19px;">
<br /></div>
<div style="font-size: 16px;">
Once the CARMA system boots, it will auto-login and start a terminal. It should also pick up an IP address (use ifconfig to find out which one). The default username/password is ubuntu/ubuntu.</div>
<div style="font-size: 16px; min-height: 19px;">
<br /></div>
<div style="font-size: 16px;">
We are ready to check if our cross-compilation worked. </div>
<div style="font-size: 16px;">
From inside the virtual machine, we will copy the file gpu_test to the CARMA (ifconfig is reporting 172.16.174.185):</div>
<div style="font-size: 16px;">
<br /></div>
<div style="font-size: 16px;">
<i>scp gpu_test ubuntu@172.16.174.185:~</i></div>
<div style="font-size: 16px; min-height: 19px;">
<br /></div>
<div style="font-size: 16px;">
Either from the CARMA terminal or from a remote shell, we can run gpu_test and check that the CPU and GPU results are the same.</div>
<div style="font-size: 16px; min-height: 19px;">
<br /></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">ubuntu@tegra-ubuntu:~$ ./gpu_test </span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">1<span class="Apple-tab-span" style="white-space: pre;"> </span>1.01999998092651367187500000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.01999998092651367187500000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">2<span class="Apple-tab-span" style="white-space: pre;"> </span>1.04039990901947021484375000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.04039990901947021484375000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">3<span class="Apple-tab-span" style="white-space: pre;"> </span>1.06120789051055908203125000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.06120789051055908203125000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">4<span class="Apple-tab-span" style="white-space: pre;"> </span>1.08243203163146972656250000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.08243203163146972656250000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">5<span class="Apple-tab-span" style="white-space: pre;"> </span>1.10408067703247070312500000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.10408067703247070312500000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">6<span class="Apple-tab-span" style="white-space: pre;"> </span>1.12616229057312011718750000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.12616229057312011718750000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">7<span class="Apple-tab-span" style="white-space: pre;"> </span>1.14868545532226562500000000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.14868545532226562500000000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">8<span class="Apple-tab-span" style="white-space: pre;"> </span>1.17165911197662353515625000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.17165911197662353515625000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">9<span class="Apple-tab-span" style="white-space: pre;"> </span>1.19509232044219970703125000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.19509232044219970703125000000000000000000</span></div>
<div style="font-size: 10px;">
<span style="font-family: Courier New, Courier, monospace;">10<span class="Apple-tab-span" style="white-space: pre;"> </span>1.21899414062500000000000000000000000000000<span class="Apple-tab-span" style="white-space: pre;"> </span>1.21899414062500000000000000000000000000000</span></div>
<div style="font-size: 16px; min-height: 19px;">
<br /></div>
<div style="font-size: 16px;">
The CARMA filesystem is quite bare, so let's add a few useful packages:</div>
<ul>
<li style="font-size: 16px; margin: 0px;">Install Fortran:</li>
<ul>
<li style="font-size: 16px; margin: 0px;"><i>sudo apt-get install gfortran</i></li>
</ul>
</ul>
<div style="font-size: 16px;">
We need to install OpenMPI from source; the default packages don't seem to work.</div>
<div style="font-size: 16px;">
The latest source (1.6.2) has support for ARM; the installation is very simple, but it will take a while.</div>
<div style="font-size: 16px; min-height: 19px;">
<br />
Get the latest stable version </div>
<div style="font-size: 16px;">
wget http://www.open-mpi.org/software/ompi/v1.6/downloads/openmpi-1.6.2.tar.gz</div>
<div style="font-size: 16px;">
<br />
unpack it (tar xvfz openmpi-1.6.2.tar.gz) and change into the directory (cd openmpi-1.6.2).<br />
<br />
We are now ready to configure, build and install:<br />
./configure</div>
<div style="font-size: 16px;">
make -j 4<br />
sudo make install<br />
<br />
Add /usr/local/bin to your PATH and /usr/local/lib to your LD_LIBRARY_PATH<br />
<br />
<br />
<br /></div>
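The two environment additions can be written out as the exact lines to append to ~/.bashrc on the CARMA (the paths assume the default /usr/local prefix chosen by ./configure above):

```shell
# Make the freshly installed OpenMPI binaries and libraries visible
export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH:-}
```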
<br />
<br />Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com2tag:blogger.com,1999:blog-5417613544042855153.post-32012779379255222682012-09-30T17:43:00.001-07:002012-10-26T13:53:08.253-07:00Compiling for CARMA<div style="text-align: justify;">
In a few days, <a href="http://www.seco.com/en/item/carma-devkit/" target="_blank">CARMA</a> will finally be available to the general public. If you are not familiar with the CARMA project, it is the first ARM platform supporting CUDA.</div>
<div style="text-align: justify;">
It has a Tegra 3 with 4 cores and 2 GB of memory, ethernet, USB ports and a Quadro 1000M GPU (GF108 with 2 GB of memory, 96 CUDA cores, compute capability 2.1).</div>
<div style="text-align: justify;">
It has full OpenGL and CUDA support but, at the moment, no native CUDA compiler.</div>
<br />
The developer needs to cross-compile from a Linux x86 machine. This post shows how easy it is to cross-compile once we follow some simple instructions. I strongly suggest that you start with an Ubuntu machine; the cross-compilers are easily available on this platform.<br />
<br />
The first thing to do is to install the cross-compilers:<br />
<br />
sudo apt-get install g++-arm-linux-gnueabi gcc-arm-linux-gnueabi<br />
<br />
At this point, we will have the cross-compilers installed under /usr/bin/arm-linux-gnueabi-gcc and /usr/bin/arm-linux-gnueabi-g++.<br />
<br />
The second step is to install the CUDA Toolkit for ARM on the x86. If you choose the default location,<br />
the installer will create a directory /usr/local/cuda.<br />
<br />
If you need to use other libraries for ARM, you will also need to copy the libraries and corresponding header files from CARMA to the x86 machine. You can place them under /usr/local/arm_lib and /usr/local/arm_include or you can just put them under /usr/local/cuda/lib and /usr/local/cuda/include (my preference is the first option, to avoid polluting the CUDA installation).<br />
<br />
We are now ready to compile our code, taking care to use the cross compiler and the special nvcc in the CARMA toolkit. The following Makefile shows how to compile a simple C++ code that calls a CUBLAS function, and a simple CUDA code.<br />
<br />
<br />
############################<br />
# Makefile for cross-compile #<br />
############################<br />
all : dgemm_cublas simple_cuda<br />
<br />
CUDA_HOME=/usr/local/cuda<br />
CC=/usr/bin/arm-linux-gnueabi-gcc<br />
NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32<br />
<br />
<br />
# For a standard c++ code, we use CC and the CUDA ARM libraries<br />
dgemm_cublas : gemm_test.cpp<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>$(CC) gemm_test.cpp -I$(CUDA_HOME)/include -o dgemm_cublas -L$(CUDA_HOME)/lib -lcudart -lcublas<br />
<br />
# For a standard CUDA code, we just invoke nvcc<br />
simple_cuda: file.cu<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>$(NVCC) -o simple_cuda file.cu<br />
<br />
clean :<span class="Apple-tab-span" style="white-space: pre;"> </span><br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>rm -f *.o dgemm_cublas simple_cuda<br />
<br />
<br />
Once we generate the executable, since they are for ARM, we will not be able to execute them until we move them on CARMA.<br />
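A quick way to confirm that cross-compilation really produced ARM binaries before copying them over; a sketch (the executable names come from the Makefile above; reading the ELF header byte directly avoids depending on the wording of the 'file' utility):

```shell
# e_machine is the 2-byte field at offset 18 of the ELF header; a low byte of
# 40 (0x28) means 32-bit ARM, while x86-64 binaries carry 62 (0x3E).
check_arm() {
    m=$(od -An -t u1 -j 18 -N 1 "$1" 2>/dev/null | tr -d ' ')
    if [ "$m" = "40" ]; then
        echo "$1: ARM"
    else
        echo "$1: not ARM"
    fi
}
check_arm dgemm_cublas
check_arm simple_cuda
```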
<div>
<br /></div>
<br />
<br />
<br />Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com11tag:blogger.com,1999:blog-5417613544042855153.post-71062370380583580782011-11-12T09:17:00.000-08:002011-11-12T14:28:18.088-08:00MPI communications from GPU memory<div style="text-align: justify;">There are several groups working on MPI implementations capable of transferring data directly from GPU memory, as a result of the introduction of the Unified Virtual Addressing (UVA) in CUDA 4.0. The MVAPICH group is the first one to officially release a version with <a href="http://mvapich.cse.ohio-state.edu/overview/mvapich2">CUDA support</a>.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Being able to pass directly GPU pointers to MPI functions, greatly simplify the programming on clusters. For example, if the programmer needs to send data from GPU on system A to another GPU on system B, instead of the sequence:</div><div><ul><li>Transfer data from GPU memory to host memory on system A</li><li>Transfer data from host memory on system A to host memory on system B, for example using MPI_Send/Recv</li><li>Transfer data from host memory to GPU memory on system B</li></ul>could just issue the MPI_Send/Recv with the buffers located on GPU memory.</div><div>A GPU-aware MPI stack is also capable of optimizing the transfers under the hood via pipelining ( this could be explicitly programmed too, but having the library taking care of it is much more convenient).</div><div><br /></div><div style="text-align: justify;">In this blog, I am going to explain how to use the CUDA-enabled MVAPICH from CUDA Fortran. </div><div style="text-align: justify;"><br /></div><div>After downloading the <a href="http://mvapich.cse.ohio-state.edu/download/mvapich2/">tar file</a> from the MVAPICH web site, we need to configure the installation. 
Due to compatibility issues between CUDA and the PGI C compiler, we are going to use gcc for the C compiler and PGI Fortran for the Fortran one.</div><div><br /></div><div>We need to specify the location of the CUDA include files and libraries ( in this case, they are located in the standard location /usr/local/cuda ) and the path for MVAPICH ( I am installing on a cluster where all the apps are located in /share/apps).</div><div><div><br /></div></div><div><br /></div><div><code></code></div><div><code> FC=pgfortran F77=pgfortran FCFLAGS=-fast FFLAGS=-fast ./configure </code></div><div><code>--prefix=/share/apps/mvapich2-gpu </code></div><div><code>--enable-cuda </code></div><div><code>--with-cuda-include=/usr/local/cuda/include </code></div><div><code>--with-cuda-libpath=/usr/local/cuda/lib64</code></div><div></div><div><br /></div><div>The next steps are to run "make" and then "make install" ( for this last step, depending on the location of the installed software, you may need to have root privileges). You will also need to add the location of the binaries ( in this case /share/apps/mvapich2-gpu/bin ) to your path.</div><div><br /></div><div> We are now ready to write a CUDA Fortran code that uses MPI to transfer data between two GPUs. Each process initializes two arrays a_d and b_d, fill them with some values depending on the rank. Then, processor 0 sends a_d to processor 1. 
After 1 receives the data in b_d transfer the results back to the host array a and print the values.</div><div><br /></div><div><code></code></div><div><div><code>program mpi_test_gpu</code></div><div><code>use mpi</code></div><div><code>integer, allocatable:: a(:)</code></div><div><code>integer, device,allocatable:: a_d(:),b_d(:)</code></div><div><code>integer:: N, ierr, rank, num_procs, status(MPI_Status_size)</code></div><div><code><br /></code></div><div><code>call MPI_Init (ierr)</code></div><div><code>call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)</code></div><div><code>call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierr)</code></div><div><code><br /></code></div><div><code>N=4</code></div><div><code>allocate (a(N),a_d(N),b_d(N))</code></div><div><code>a_d=(rank+1)*10</code></div><div><code>b_d=(rank-1)*100</code></div><div><code><br /></code></div><div><code>a=-999</code></div><div><code>if ( rank == 0) then</code></div><div><code> call MPI_Send(a_d,N,MPI_INT,1,0,MPI_COMM_WORLD, ierr)</code></div><div><code>else </code></div><div><code> call MPI_Recv(b_d,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)</code></div><div><code>end if</code></div><div><code><br /></code></div><div><code>if (rank == 1) a=b_d</code></div><div><code><br /></code></div><div><code>print *,"Rank=",rank,"A=",a</code></div><div><code><br /></code></div><div><code>deallocate (a,a_d,b_d)</code></div><div><code><br /></code></div><div><code>call MPI_Finalize ( ierr )</code></div><div><code>end program mpi_test_gpu</code></div></div><div></div><div><br /></div><div>If the code is in a file with name mpi_test_gpu.cuf, we can generate an executable with the following command:</div><div><br /></div><div><code></code></div><div><code>mpif90 -O3 -o mpi_test_gpu mpi_test_gpu.cuf</code></div><div></div><div><br /></div><div style="text-align: justify;">We are now ready to run with the command mpirun_rsh. 
We need to pass a special flag, MV2_USE_CUDA=1, to enable the new GPU path ( or you can add </div><div style="text-align: justify;">export MV2_USE_CUDA=1 to your .bashrc to avoid to type it every time).</div><div style="text-align: justify;">We are going to use two nodes, c0-0 and c0-1, connected by Infiniband.</div><div><br /></div><div><code></code></div><div><div><code>mpirun_rsh -np 2 c0-0 c0-1 MV2_USE_CUDA=1 ./mpi_test_gpu</code></div><div><code> Rank= 0 A= -999 -999 -999 -999</code></div><div><code> Rank= 1 A= 10 10 10 10</code></div></div><div></div><div><br /></div><div>As expected, rank 1 contains the values 10, that was the value initially stored in a_d on rank 0.</div><div>MVAPICH also allows to send data from GPU to host memory and vice versa. </div><div>For example we could replace the lines:</div><div><br /></div><div><code></code></div><div><code>! Receive data to GPU array b_d from processor 0</code></div><div><div><code> call MPI_Recv(b_d,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)</code></div><div><code>...</code></div><div><code>! Copy GPU array b_d to CPU array a</code></div><div><code>if (rank == 1) a=b_d</code></div></div><div></div><div><br /></div><div>directly with</div><div><code></code></div><div><code>! 
Receive data to CPU array a from processor 0</code></div><div><div><code>call MPI_Recv(a,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)</code></div></div><div></div><div><br /></div>Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com2tag:blogger.com,1999:blog-5417613544042855153.post-28941636195680117952011-08-16T17:14:00.001-07:002011-08-16T19:03:42.091-07:00CUDA, MPI and Infiniband<p style="text-align: justify;">There is a lot of confusion around MPI codes that are using GPUs and Infiniband and what needs to be done to fix some problems occurring from the interaction of the CUDA runtime and Infiniband software stack ( OFED and MPI).</p><p style="text-align: justify;">Let's start with a simple program using 2 MPI processes that:</p><ul><li>allocate data on the CPU and GPU</li><li>initialize the data on the CPU</li><li>copy the data on the GPU</li><li>transfer the host data from one process to the other</li></ul><p style="text-align: justify;">The code is going to report the bandwidth of the transfer to the GPU and the bandwidth achieved by the network.</p><code>
<br />#include <stdio.h>
<br />#include <stdlib.h>
<br />#include <cuda.h>
<br />#include <cuda_runtime.h>
<br />#include <sys/time.h>
<br />#include <mpi.h>
<br />
<br />#define NREPEAT 10
<br />#define NBYTES 10.e6
<br />
<br />int main (int argc, char *argv[])
<br />{
<br />int rank, size, n, len, numbytes;
<br />void *a_h, *a_d;
<br />struct timeval time[2];
<br />double bandwidth;
<br />char name[MPI_MAX_PROCESSOR_NAME];
<br />MPI_Status status;
<br />
<br />MPI_Init (&argc, &argv);
<br />MPI_Comm_rank (MPI_COMM_WORLD, &rank);
<br />MPI_Comm_size (MPI_COMM_WORLD, &size);
<br />
<br />MPI_Get_processor_name(name, &len);
<br />printf("Process %d is on %s\n", rank, name);
<br />
<br />printf("Using regular memory \n");
<br />a_h = malloc(NBYTES);
<br />
<br />cudaMalloc( (void **) &a_d, NBYTES);
<br />
<br />/* Test host -> device bandwidth. */
<br />MPI_Barrier(MPI_COMM_WORLD);
<br />
<br />gettimeofday(&time[0], NULL);
<br />for (n=0; n<NREPEAT; n++)
<br />{
<br />cudaMemcpy(a_d, a_h, NBYTES, cudaMemcpyHostToDevice);
<br />}
<br />gettimeofday(&time[1], NULL);
<br />
<br />bandwidth = time[1].tv_sec - time[0].tv_sec;
<br />bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);
<br />bandwidth = NBYTES*NREPEAT/1.e6/bandwidth;
<br />
<br />printf("Host->device bandwidth for process %d: %f MB/sec\n",rank,bandwidth);
<br />
<br />/* Test MPI send/recv bandwidth. */
<br />MPI_Barrier(MPI_COMM_WORLD);
<br />
<br />gettimeofday(&time[0], NULL);
<br />for (n=0; n<NREPEAT; n++)
<br />{
<br />if (rank == 0)
<br />MPI_Send(a_h, NBYTES/sizeof(int), MPI_INT, 1, 0, MPI_COMM_WORLD);
<br />else
<br />MPI_Recv(a_h, NBYTES/sizeof(int), MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
<br />}
<br />gettimeofday(&time[1], NULL);
<br />
<br />bandwidth = time[1].tv_sec - time[0].tv_sec;
<br />bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);
<br />bandwidth = NBYTES*NREPEAT/1.e6/bandwidth;
<br />
<br />if (rank == 0)
<br />printf("MPI send/recv bandwidth: %f MB/sec\n", bandwidth);
<br />
<br />cudaFree(a_d);
<br />free(a_h);
<br />
<br />MPI_Finalize();
<br />return 0;
<br />}<span class="Apple-style-span" style="font-family:Georgia, serif;font-size:130%;">
<br /></span></code><p></p><div style="text-align: justify;">
<br /></div><div style="text-align: justify;">Since there are no CUDA kernels, there is no need to use nvcc. We can use mpicc (which for the moment we assume has been compiled with gcc), taking care of indicating the directories for the CUDA include files and CUDA libraries:</div><p></p><pre>mpicc -o mpi_malloc mpi_malloc.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart</pre><p style="text-align: justify;">Running this code on a cluster with nodes connected by QDR Infiniband adapters will generate output similar to this one:</p><pre>#mpirun -np 2 -host c0-0,c0-1 mpi_malloc
<br />Process 0 is on compute-0-0.local
<br />Using regular memory
<br />Process 1 is on compute-0-1.local
<br />Using regular memory
<br />Host->device bandwidth for process 0: 4699.248120 MB/sec
<br />Host->device bandwidth for process 1: 4323.950361 MB/sec
<br />MPI send/recv bandwidth: 2467.369044 MB/sec
<br /></pre>
<br /><p div="" style="text-align: justify;">Up to now, everything worked as expected: we were using a standard malloc to allocate the host memory. In order to improve the bandwidth over the PCI-e bus and, more importantly, to allow overlap of transfers to/from the GPU with kernel executions, we would like to use pinned memory. The modifications to the previous code are minimal: the malloc calls need to be replaced by cudaMallocHost and the free calls by cudaFreeHost. After changing the code and recompiling, we observe a problem when we try to run it. The code starts, and from the initial prints we can see that the pinned memory is giving us an improvement in bandwidth, but the run never completes.</p>
<br /><pre>#mpirun -np 2 -host c0-0,c0-1 mpi_pinned
<br />Process 1 is on compute-0-1.local
<br />Using pinned memory
<br />Process 0 is on compute-0-0.local
<br />Using pinned memory
<br />Host->device bandwidth for process 0: 5927.330923 MB/sec
<br />Host->device bandwidth for process 1: 5909.117769 MB/sec</pre>
<br /><p> If we attach a debugger to the process running on node c0-0, we will see that the code is stuck in MPI.
<br /></p>
<br /><pre><span class="Apple-style-span" style="font-family:Georgia, serif;font-size:130%;"><span class="Apple-style-span" style="white-space: normal;">
<br /></span></span>0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
<br />3175 btl_openib_component.c: No such file or directory.
<br /> in btl_openib_component.c
<br />(gdb) where
<br />#0 0x00002b517595fcc8 in btl_openib_component_progress () at btl_openib_component.c:3175
<br />#1 0x00002b5172536394 in opal_progress () at runtime/opal_progress.c:207
<br />#2 0x00002b51751335ce in mca_pml_ob1_send (buf=0x13365420, count=46912503140448, datatype=0x0, dst=1, tag=16000000,
<br />sendmode=MCA_PML_BASE_SEND_SYNCHRONOUS, comm=0x6544a0) at pml_ob1_isend.c:125
<br />#3 0x00002b51720520b3 in PMPI_Send (buf=0x13365420, count=-1424633760, type=0x0, dest=1, tag=16000000, comm=0x0) at psend.c:72
<br />#4 0x0000000000404d1d in main () at ./mpi_pinned.c:69
<br /></pre>
<br /><p>
<br /></p><div style="text-align: justify;">Without going into details, the problem is caused by the way the CUDA runtime marks pages allocated as pinned memory and the way in which the Infiniband driver handles RDMA. At this point we have two solutions:</div><p></p><ol><li>Disable RDMA in MPI</li><li>Make the Infiniband driver and CUDA runtime compatible</li></ol>The first solution is very simple with OpenMPI: we just need to pass an additional flag (-mca btl_openib_flags 1) to mpirun, at the cost of lower bandwidth over Infiniband. Other MPI implementations will require a different switch or a recompilation with RDMA disabled.<div>
<br /><pre>mpirun -np 2 -host c0-0,c0-1 -mca btl_openib_flags 1 mpi_pinned
<br />Process 1 is on compute-0-1.local
<br />Using pinned memory
<br />Process 0 is on compute-0-0.local
<br />Using pinned memory
<br />Host->device bandwidth for process 0: 5907.023451 MB/sec
<br />Host->device bandwidth for process 1: 5877.858109 MB/sec
<br />MPI send/recv bandwidth: 2713.041591 MB/sec
<br /></pre><p>
<br /></p><div style="text-align: justify;">Before CUDA 4.0, the second solution was to install GPU Direct, a patch for the Linux kernel plus special NVIDIA and Mellanox drivers that eliminated the incompatibility. </div><div style="text-align: justify;">With CUDA 4.0 we have a new option: an environment variable that, when set to 1, changes the internal behavior of pinned memory allocation in the CUDA driver, removing the source of incompatibility with the Infiniband driver. If we set CUDA_NIC_INTEROP to 1 (for example, by adding the line "export CUDA_NIC_INTEROP=1" to our .bashrc file) and run the pinned version again, we will see that the code completes, and we also get better bandwidth since RDMA is now working.<p></p>
<br /><pre>mpirun -np 2 -host c0-0,c0-1 mpi_pinned
<br />Process 0 is on compute-0-0.local
<br />Using pinned memory
<br />Process 1 is on compute-0-1.local
<br />Using pinned memory
<br />Host->device bandwidth for process 0: 5904.930617 MB/sec
<br />Host->device bandwidth for process 1: 5901.445854 MB/sec
<br />MPI send/recv bandwidth: 3150.300854 MB/sec
<br /></pre>
<br /><p> This solution works with all the MPI implementations out there and it is very simple to use. So, forget about GPU Direct 1.0 and use this new method!!!</p></div></div>Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com4tag:blogger.com,1999:blog-5417613544042855153.post-85857803125225561192011-06-02T11:59:00.001-07:002011-06-07T11:25:13.763-07:00Calling Thrust from CUDA Fortran<div style="text-align: justify;">CUDA 4.0 ships with the <a href="http://code.google.com/p/thrust/">Thrust</a> library, a standard template library for GPU that offers several useful algorithms ( sorting, prefix sum, reduction). In the previous post I explained how to configure CUDA Fortran to use the 4.0 toolkit. Now I am going to show how to call Thrust from CUDA Fortran, in particular how to sort an array.</div><div><br /></div><div style="text-align: justify;">On the <a href="http://code.google.com/p/thrust/">Thrust</a> web page, there are a lot of examples and documentation. 
The basic idea of Thrust is to have <i>containers</i>, which manage host and device memory and simplify data transfers, <i>iterators</i>, which act like pointers and keep track of memory spaces, and <i>algorithms</i>, which are applied to containers.</div><div><br /></div><div>This is a simple Thrust code to sort an array of random data.</div><div><br /></div><div><code></code></div><code><div><pre>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void) {
  // define a vector of 16M int on the host
  thrust::host_vector<int> h_vec(1 << 24);

  // generate 16M random numbers on the host
  thrust::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer data to the device
  thrust::device_vector<int> d_vec = h_vec;

  // sort data on the device
  thrust::sort(d_vec.begin(), d_vec.end());

  // transfer data back to host
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

  return 0;
}
</pre></div></code><div></div><div><br /></div><div style="text-align: justify;">An important feature, which we 
will use to call Thrust from CUDA Fortran, is the conversion of Thrust objects to raw pointers and vice versa. This Thrust code snippet converts a device container to a standard C pointer that we could use to call a CUDA C kernel:</div><div><br /></div><div><code></code></div><code><div><pre>
// allocate device vector
thrust::device_vector<int> d_vec(4);

// obtain raw pointer to device vector's memory
int * ptr = thrust::raw_pointer_cast(&d_vec[0]);
</pre></div></code><div></div><div><br /></div><div style="text-align: justify;">The basic idea is to write a wrapper to the Thrust algorithms that handles standard C pointers, and then use the ISO C binding to call the wrapper. 
Since we want to sort an array, let's write a wrapper for the sort algorithm in Thrust.</div><div><br /></div><div><code></code></div><code><div><pre>
// Filename: csort.cu
// nvcc -c -arch sm_13 csort.cu
#include <thrust/device_vector.h>
#include <thrust/sort.h>

extern "C" {

//Sort for integer arrays
void sort_int_wrapper( int *data, int N)
{
  // Wrap raw pointer with a device_ptr
  thrust::device_ptr<int> dev_ptr(data);
  // Use device_ptr in Thrust sort algorithm
  thrust::sort(dev_ptr, dev_ptr+N);
}

//Sort for float arrays
void sort_float_wrapper( float *data, int N)
{
  thrust::device_ptr<float> dev_ptr(data);
  thrust::sort(dev_ptr, dev_ptr+N);
}

//Sort for double arrays
void sort_double_wrapper( double *data, int N)
{
  thrust::device_ptr<double> dev_ptr(data);
  thrust::sort(dev_ptr, dev_ptr+N);
}

}
</pre></div></code><div></div><div><br /></div><div>We can compile the code using</div><div><code></code></div><code><div> nvcc -c -arch sm_13 csort.cu </div></code><div></div><div>This will generate an object file, csort.o, that we will use later on in the linking stage of the CUDA Fortran code.</div><div><br /></div><div>The other missing piece is the interface to these C functions.</div><div>We will define a generic interface thrustsort that, depending on the kind of data (integer, single precision or double precision), will call the correct sort function:</div><div><br
/></div><div><div><code></code></div><code><div>module thrust</div><div><br /></div><div>interface thrustsort</div><div> subroutine sort_int( input,N) bind(C,name="sort_int_wrapper")</div><div> use iso_c_binding</div><div> integer(c_int),device:: input(*)</div><div> integer(c_int),value:: N</div><div> end subroutine</div><div><br /></div><div> subroutine sort_float( input,N) bind(C,name="sort_float_wrapper")</div><div> use iso_c_binding</div><div> real(c_float),device:: input(*)</div><div> integer(c_int),value:: N</div><div> end subroutine</div><div><br /></div><div> subroutine sort_double( input,N) bind(C,name="sort_double_wrapper")</div><div> use iso_c_binding</div><div> real(c_double),device:: input(*)</div><div> integer(c_int),value:: N</div><div> end subroutine</div><div><br /></div><div>end interface</div><div><br /></div><div>end module thrust</div></code><div></div></div><div><br /></div><div>At this point we have all we need to write the CUDA Fortran code:</div><div><br /></div><div><code></code></div><code><div><div>program testsort</div><div>use thrust</div><div>real, allocatable :: cpuData(:)</div><div>real, allocatable, device :: gpuData(:)</div><div>integer:: N=10</div><div>allocate(cpuData(N))</div><div>allocate(gpuData(N))</div><div><br /></div><div>do i=1,N</div><div> cpuData(i)=random(i)</div><div>end do</div><div>cpuData(5)=100.</div><div><br /></div><div>print *,"Before sorting", cpuData</div><div><br /></div><div>gpuData=cpuData</div><div><br /></div><div>call thrustsort(gpuData,size(gpuData))</div><div><br /></div><div>cpuData=gpuData</div><div><br /></div><div>print *,"After sorting", cpuData</div><div>end program</div></div></code><div></div><div><br /></div><div>If we save the module in a file thrust_module.cuf and the program in sample_sort.cuf, we are ready to compile and execute:</div><div><br /></div><div><code></code></div><code><div><div>$ pgf90 -rc=rc4.0 -Mcuda=cc20 -O3 thrust_module.cuf sample_sort.cuf 
csort.o</div><div>thrust_module.cuf:</div><div>sample_sort.cuf:</div><div><br /></div><div>$ ./a.out </div><div> Before sorting 4.1630346E-02 0.9124327 0.7832350 0.6540373 </div><div> 100.0000 0.3956419 0.2664442 0.1372465 </div><div> 8.0488138E-03 0.8788511 </div><div><br /></div><div> After sorting 8.0488138E-03 4.1630346E-02 0.1372465 0.2664442 </div><div> 0.3956419 0.6540373 0.7832350 0.8788511 </div><div> 0.9124327 100.0000 </div></div></code><div></div><div><br /></div><div>The code is very simple:</div><div><ul><li>declare two arrays, cpuData and gpuData.</li><li>allocate them using the standard <i>allocate</i></li><li>copy cpuData from the host to gpuData on the GPU with a simple assignment</li><li>call the Thrust sort routine</li><li>copy sorted array back to the host</li><li>print the sorted array</li></ul></div><div>Now that we have verified that everything is working as expected, we can modify the code to do some timing using cudaEvents.</div><div><br /></div><div><code></code></div><code><div><div>program timesort</div><div>use cudafor</div><div>use thrust</div><div>implicit none</div><div>real, allocatable :: cpuData(:)</div><div>real, allocatable, device :: gpuData(:)</div><div>integer:: i,N=100000000</div><div><br /></div><div>! cuda events for elapsing</div><div>type ( cudaEvent ) :: startEvent , stopEvent</div><div>real :: time, random</div><div>integer :: istat</div><div><br /></div><div>! Create events</div><div>istat = cudaEventCreate ( startEvent )</div><div>istat = cudaEventCreate ( stopEvent )</div><div><br /></div><div>! 
Allocate arrays</div><div>allocate(cpuData(N))</div><div>allocate(gpuData(N))</div><div><br /></div><div>do i=1,N</div><div> cpuData(i)=random(i)</div><div>end do</div><div><br /></div><div>print *,"Sorting array of ",N, " single precision"</div><div><br /></div><div>gpuData=cpuData</div><div><br /></div><div>istat = cudaEventRecord ( startEvent , 0)</div><div>call thrustsort(gpuData,size(gpuData))</div><div><br /></div><div>istat = cudaEventRecord ( stopEvent , 0)</div><div>istat = cudaEventSynchronize ( stopEvent )</div><div>istat = cudaEventElapsedTime ( time , startEvent , stopEvent )</div><div><br /></div><div>cpuData=gpuData</div><div><br /></div><div>print *," Sorted array in:",time," (ms)"</div><div><br /></div><div>!Print the first five elements and the last five.</div><div>print *,"After sorting", cpuData(1:5),cpuData(N-4:N)</div><div>end program</div></div></code><div></div><div><br /></div><div>We can sort a vector of 100M elements in 0.222 seconds on a Tesla M2050 with ECC on, when the data are resident in GPU memory.</div><div><br /></div><div><code></code></div><code><div><div>pgf90 -O3 -rc=rc4.0 -Mcuda=cc20 thrust_module.cuf time_sort.cuf csort.o -o time_sort</div><div>thrust_module.cuf:</div><div>time_sort.cuf:</div><div><br /></div><div>$ ./time_sort</div><div> Sorting array of 100000000 single precision</div><div> Sorted array in: 222.1711 (ms)</div><div> After sorting 7.0585919E-09 1.0318221E-08 1.9398616E-08 3.1738640E-08 </div><div> 4.4078664E-08 0.9999999 0.9999999 1.000000 </div><div> 1.000000 1.000000 </div></div><div><div>./a.out </div><div> Sorting array of 100000000 single precision</div><div> Sorted array in: 225.0452 (ms)</div><div> After sorting 7.0585919E-09 1.0318221E-08 1.9398616E-08 3.1738640E-08 </div><div> 4.4078664E-08 0.9999999 0.9999999 0.9999999 </div><div> 1.000000 1.000000 
</div></div></code><div></div>Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com2tag:blogger.com,1999:blog-5417613544042855153.post-29301399502563719682011-05-10T12:57:00.000-07:002011-05-10T13:06:07.757-07:00Using CUDA 4.0 from CUDA FortranIt is possible to use CUDA 4.0 RC2 with CUDA Fortran.<br /><br />Assuming that the CUDA 4.0 toolkit is installed in the location /usr/local/cuda, you will need to create a file rc4.0 containing the following lines:<br /><blockquote><br />set CUDAROOT=/usr/local/cuda;<br />set CUDAVERSION=4.0;<br /></blockquote><br /><br />When you compile your .cuf files, you will need to pass this rc file with the -rc flag, and add the -L flag if you are using libraries from the 4.0 toolkit:<br /><br /><blockquote> pgf90 -rc=rc4.0 -Mcuda=cc20,nofma myfile.cuf -L/usr/local/cuda/lib64 -lcufft -lcurand</blockquote><br /><br />You can check whether the compiler is picking up the new toolkit by running ldd on the executable.Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com2tag:blogger.com,1999:blog-5417613544042855153.post-8188472489432925102010-07-19T22:11:00.000-07:002010-08-31T17:20:01.528-07:00Using zero copy from Fortran<div style="text-align: justify;">This time I am going to show how to use the zero copy feature in CUDA C from a generic Fortran 90 compiler. Since I am not going to use CUDA Fortran, we will need to use the ISO C binding feature available in pretty much all the Fortran 90 compilers (PGI, Intel, g95, gfortran, just to cite a few).</div><br /><div style="text-align: justify;">The basic idea is to use the original CUDA C functions to allocate host arrays that are page-locked (aka pinned) and with the right attributes to be used by the zero copy feature of CUDA. 
If you are not familiar with the zero-copy feature in CUDA C, it allows compute kernels to share host system memory and provides zero-copy support for direct access to host system memory when running on many newer CUDA-enabled graphics processors. There is no need to do cudaMemcpy.</div><br />To declare the mapped array, we will need to perform the following steps:<br /><ol><li><i>Set the device flag for mapping host memory:</i> this is achieved with a call to the cudaSetDeviceFlags with the flag cudaDeviceMapHost.</li><li><i>Allocate the host mapped arrays</i>: this is achieved with cudaHostAlloc with the flag cudaHostAllocMapped.</li><li><i>Get the device pointers to the mapped memory</i>. These are the pointers that we will pass to the CUDA kernels. This is achieved with calls to cudaHostGetDevicePointer.</li></ol><br />Since we are using a standard Fortran 90 compiler, we can't use the built in allocator ( it has no knowledge of pinned memory). We need to do a couple of extra steps: call the CUDA allocator in C, and then pass the C pointer to Fortran using the function C_F_Pointer provided by the iso C bindings.<br /><div><br /></div><div>Let's start with a module that declares the interfaces to the CUDA runtime functions that we will need: cudaHostAlloc, cudaFree and cudaSetDeviceFlag</div><br /><code><br /><pre><br />!<br />! Module to interface the CUDA runtime functions<br />!<br /><br />module cuda_runtime<br /><br /> integer,parameter:: cudaHostAllocPortable=1, &<br /> cudaHostAllocMapped= 2, &<br /> cudaDeviceMapHost=8<br /><br /> interface<br />!<br />! cudaHostAlloc<br />!<br /> integer function cudaHostAlloc(buffer, size ,flag) bind(C,name="cudaHostAlloc") <br /> use iso_c_binding<br /> implicit none <br /> type (C_PTR) :: buffer <br /> integer (C_SIZE_T), value :: size<br /> integer (C_INT), value :: flag<br /> end function cudaHostAlloc<br />!<br />! 
cudaFreeHost<br />!<br /> integer function cudaFreeHost(buffer) bind(C,name="cudaFreeHost") <br /> use iso_c_binding <br /> implicit none <br /> type (C_PTR), value :: buffer<br /> end function cudaFreeHost<br />!<br />! cudaSetDeviceFlag<br />!<br /> integer function cudaSetDeviceFlags(flag) bind(C,name="cudaSetDeviceFlags")<br /> use iso_c_binding<br /> implicit none <br /> integer (C_INT), value :: flag<br /> end function cudaSetDeviceFlags<br /> <br /> end interface<br />end module cuda_runtime<br /></pre><br /></code><br /><br /><div style="text-align: justify;">Now that we have a working interface to the CUDA runtime, let's write a simple Fortran program that compute the exponential of each element of a double precision array, both on the CPU and GPU. </div><div style="text-align: justify;">A is the input array, C is the output array from the GPU computation. Since we want to use the zero copy features on these two, we will allocate them with cudaHostAlloc. B is an array that we will use to compute a reference solution on the CPU. We will use the standard Fortran allocator for this one.</div><br /><code><br /><pre><br />!<br />! main.f90<br />!<br />program main<br /><br /> use iso_c_binding<br /> use cuda_runtime<br /> implicit none<br /><br /> integer, parameter :: fp_kind = kind(0.0d0) ! Double precision<br /> <br /> real(fp_kind) ,pointer, dimension (:) :: A,C<br /> real(fp_kind) ,allocatable, dimension (:) :: B<br /> type(C_PTR)::cptr_A,cptr_C<br /><br /> integer i, N, seed<br /> integer err<br /><br />! Number of elements in the arrays<br /> N=10<br /><br />! Initialize the random number generator<br /> seed=1<br /> call random_seed(seed)<br /><br />! 
Allocate A and C using cudaHostAlloc and then map the C pointer to Fortran arrays<br /><br /> write(*,*)'Allocate host memory'<br /> err=cudaSetDeviceFlags(cudaDeviceMapHost)<br /> if (err > 0) print *,"Error in setting cudaSetDeviceFlags=",err<br /><br /> err = cudaHostAlloc(cptr_A,N*sizeof(1.0_fp_kind),cudaHostAllocMapped)<br /> if (err > 0) print *,"Error in allocating A with cudaHostAlloc =",err<br /> call c_f_pointer(cptr_A,A,(/N/))<br /><br /> err = cudaHostAlloc(cptr_C,N*sizeof(1.0_fp_kind),cudaHostAllocMapped)<br /> if (err > 0) print *,"Error in allocating C with cudaHostAlloc =",err<br /> call c_f_pointer(cptr_C,C,(/N/))<br /><br />! From this point on, we can use A and C as normal Fortran arrays<br /><br /><br />! Allocate B using the standard allocate call<br /> allocate(B(N))<br /><br />! Initialize A with random numbers<br /> call random_number(A)<br /><br /><br />! compute the reference solution on the CPU<br /> write(*,*)'computation on CPU'<br /> do i = 1, N<br /> B(i) = dexp(A(i))<br /> enddo<br /><br />! same computation on the GPU<br /> write(*,*)'computation on GPU'<br /> call gexp(A,C,N)<br /><br />! Print the computed quantities<br /> do i = 1, N<br /> write (*,'(i2,1x,4(g14.8))'),i,A(i),B(i),C(i),C(i)-B(i)<br /> enddo<br /><br />! Release memory<br /> deallocate(B)<br /> err = cudaFreeHost (cptr_A)<br /> err = cudaFreeHost (cptr_C)<br /><br />end program Main<br /></pre><br /></code><br />Since we are using standard Fortran, we will need to write the computation on the GPU using CUDA C. 
When interfacing C and Fortran, it is important to remember that while arguments in C are passed by values, in Fortran they are passed by reference.<br /><code><br /><pre><br />/*<br /> kernel_code.cu<br />*/<br />#include <stdio.h><br /><br />// Device code<br />__global__ void CUDAexp(double* b, double* c, int N) {<br /> int index = threadIdx.x+blockDim.x*blockIdx.x;<br /> if( index < N) c[index] = exp(b[index]);<br />}<br /><br /><br />extern "C" void gexp_(double *a, double *d, int* N1)<br />{<br /> double *b,*c;<br /> int N=*N1;<br /> cudaError_t statusb,statusc,err;<br /><br /><br /> statusb=cudaHostGetDevicePointer((void **)&b, (void *) a, 0);<br /> statusc=cudaHostGetDevicePointer((void **)&c, (void *) d, 0);<br /><br /> if (statusb != 0 || statusc !=0) {<br /> printf("Error when locating memory to arrays on device!\n");<br /> printf("%s\n",cudaGetErrorString(statusb));<br /> printf("%s\n",cudaGetErrorString(statusc));<br /> }<br /><br /> // Call the CUDA kernel, just one block for this simple example.<br /> CUDAexp<<<1,N>>>(b,c, N);<br /><br /> err=cudaGetLastError();<br /> if(err != 0) printf("Error in kernel execution\n");<br /> <br /> // This is very important to retrieve the correct values<br /> cudaThreadSynchronize();<br />}<br /></pre><br /></code><br />Now that we have all the files, let's write a simple makefile:<br /><code><br /><pre><br />all: TestZeroCopy<br /><br />TestZeroCopy: main.f90 kernel_code.o<br /> ifort -o TestZeroCopy main.f90 kernel_code.o -L/usr/local/cuda/lib64 -lcudart -lstdc++<br /><br />kernel_code.o: kernel_code.cu<br /> nvcc -c -O3 -arch sm_13 kernel_code.cu<br /><br />clean:<br /> rm kernel_code.o TestZeroCopy cuda_runtime.mod <br /></pre><br /></code><br /><br />Compiling and running the code will show the following output:<br /><pre><br />$./TestZeroCopy <br /><br /> Allocate host memory<br /> computation on CPU<br /> computation on GPU<br /> 1 0.39208682E-06 1.0000004 1.0000004 0.0000000 <br /> 2 0.25480443E-01 1.0258078 
1.0258078 0.0000000 <br /> 3 0.35251616 1.4226426 1.4226426 0.0000000 <br /> 4 0.66691448 1.9482168 1.9482168 0.0000000 <br /> 5 0.96305553 2.6196888 2.6196888 0.44408921E-15<br /> 6 0.83828820 2.3124052 2.3124052 -.44408921E-15<br /> 7 0.33535504 1.3984368 1.3984368 -.22204460E-15<br /> 8 0.91532720 2.4975923 2.4975923 0.0000000 <br /> 9 0.79586368 2.2163544 2.2163544 -.44408921E-15<br />10 0.83269314 2.2995033 2.2995033 0.44408921E-15<br /></pre>Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com1tag:blogger.com,1999:blog-5417613544042855153.post-47061613807706308802010-05-24T10:09:00.001-07:002010-05-24T10:35:15.316-07:00Calling CUFFT from Cuda FortranThis example shows how to call CUFFT from CUDA Fortran.<br />We are still going to use iso_c_binding to wrap the CUFFT functions, like we did for CUBLAS.<br /><br />There are a few points to outline in the wrapper:<br /><br />CUFFT uses plans (opaque objects) to store information about the transforms and their auxiliary arrays. We will treat a plan as an integer in Fortran. The calls that create and destroy a plan will generate all the proper information; the integer is just a handle to the opaque object.<br /><br />CUFFT uses several constants (CUFFT_C2C, CUFFT_FORWARD, just to name a few). Some of them are defined as hex numbers.<br />Remember that to express a hex number in Fortran, you need to remove the 0x prefix and use the Z notation: <br />CUFFT_R2C=0x2a will be defined as CUFFT_R2C=Z'2a' in Fortran.<br /><br />To keep the code simple, we just show the wrappers for the creation and destruction of a plan (cufftPlan1d and cufftDestroy) and for the execution of complex-to-complex transforms in both single (cufftExecC2C) and double (cufftExecZ2Z) precision. Adding wrappers for the other plan creation and execution functions is very simple.<br /><br /><code><br />!<br />!
Define the INTERFACE to the NVIDIA CUFFT routines<br />!<br /><br />module cufft<br /><br /> integer, public :: CUFFT_FORWARD = -1<br /> integer, public :: CUFFT_INVERSE = 1<br /> integer, public :: CUFFT_R2C = Z'2a' ! Real to Complex (interleaved)<br /> integer, public :: CUFFT_C2R = Z'2c' ! Complex (interleaved) to Real<br /> integer, public :: CUFFT_C2C = Z'29' ! Complex to Complex, interleaved<br /> integer, public :: CUFFT_D2Z = Z'6a' ! Double to Double-Complex<br /> integer, public :: CUFFT_Z2D = Z'6c' ! Double-Complex to Double<br /> integer, public :: CUFFT_Z2Z = Z'69' ! Double-Complex to Double-Complex<br /><br /><br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br />!<br />! cufftPlan1d(cufftHandle *plan, int nx,cufftType type,int batch)<br />!<br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br /><br /><br /> interface cufftPlan1d<br /> subroutine cufftPlan1d(plan, nx, type, batch) bind(C,name='cufftPlan1d') <br /> use iso_c_binding<br /> integer(c_int):: plan<br /> integer(c_int),value:: nx, batch,type<br /> end subroutine cufftPlan1d<br /> end interface cufftPlan1d<br /> <br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br />!<br />! cufftDestroy(cufftHandle plan)<br />!<br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br /><br /> interface cufftDestroy<br /> subroutine cufftDestroy(plan) bind(C,name='cufftDestroy') <br /> use iso_c_binding<br /> integer(c_int),value:: plan<br /> end subroutine cufftDestroy<br /> end interface cufftDestroy<br /><br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br />!<br />! cufftExecC2C(cufftHandle plan,<br />! cufftComplex *idata,<br />! cufftComplex *odata,<br />! 
int direction)<br />!<br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br /> <br /> interface cufftExecC2C<br /> subroutine cufftExecC2C(plan, idata, odata, direction) &<br /> & bind(C,name='cufftExecC2C') <br /> use iso_c_binding<br /> use precision<br /> integer(c_int),value:: direction<br /> integer(c_int),value:: plan<br /> complex(fp_kind),device:: idata(*),odata(*)<br /> end subroutine cufftExecC2C<br /> end interface cufftExecC2C<br /><br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br />!<br />! cufftExecZ2Z(cufftHandle plan,<br />! cufftDoubleComplex *idata,<br />! cufftDoubleComplex *odata,<br />! int direction);<br />!<br />!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br /> interface cufftExecZ2Z<br /> subroutine cufftExecZ2Z(plan, idata, odata, direction) &<br /> & bind(C,name='cufftExecZ2Z') <br /> use iso_c_binding<br /> use precision<br /> integer(c_int),value:: direction<br /> integer(c_int),value:: plan<br /> complex(fp_kind),device:: idata(*),odata(*)<br /> end subroutine cufftExecZ2Z<br /> end interface cufftExecZ2Z<br /><br />end module cufft<br /><br /></code><br /><br />With the cufft wrapper and the precision module, we have all we need to write a simple program that performs a forward transform out of place, followed by an inverse transform in place. Since the output of CUFFT is not normalized, we should<br />see the final array equal to the initial one scaled by the length of the transform.<br /><br /><code><br />program fft_test<br />use precision<br />use cufft<br />complex(fp_kind) ,allocatable:: a(:),b(:)<br />complex(fp_kind),device,allocatable:: a_d(:),b_d(:)<br /><br />integer:: n<br />integer:: plan<br /><br />n=8<br /><br />! allocate arrays on the host<br />allocate (a(n),b(n))<br /><br />! allocate arrays on the device<br />allocate (a_d(n))<br />allocate (b_d(n))<br /><br />!initialize arrays on host<br />a=1<br /><br />!copy arrays to device<br />a_d=a<br /><br /><br />!
Print initial array<br />print *, "Array A:"<br />print *, a<br /><br /><br /><br />! Initialize the plan<br /> call cufftPlan1D(plan,n,CUFFT_Z2Z,1)<br /><br />! Execute FFTs<br /> call cufftExecZ2Z(plan,a_d,b_d,CUFFT_FORWARD)<br /><br /> call cufftExecZ2Z(plan,b_d,b_d,CUFFT_INVERSE)<br /><br /><br />! Copy results back to host<br /> b=b_d<br /><br />! Print result array<br />print *, "Array B"<br />print *, b<br /><br />!release memory on the host<br />deallocate (a,b)<br /><br />!release memory on the device<br />deallocate (a_d,b_d)<br /><br />! Destroy the plan<br /> call cufftDestroy(plan)<br /><br />end program fft_test<br /><br /></code><br /><br />To compile the new example, we will repeat what we did for the CUBLAS example. This time, instead of linking CUBLAS, we will link CUFFT.<br /><br /><code><br />pgf90 -Mcuda -o test_fft test_fft.cuf -L/usr/local/cuda/lib64 -lcufft<br /></code><br /><br />If we execute the code, we should see this output:<br /><verbatim><br />./test_fft<br /> <br />Array A:<br /> (1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000) <br /> (1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000) <br /> (1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000) <br /> (1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000)<br /><br /> Array B<br /> (8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000) <br /> (8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000) <br /> (8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000) <br /> (8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000)<br /><br /></verbatim><br /><br />As expected, the output is the input multiplied by the length of the transform (8 in this
case).Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com6tag:blogger.com,1999:blog-5417613544042855153.post-28948802236584733332010-05-18T14:57:00.000-07:002010-05-18T15:31:11.013-07:00Calling CUBLAS from CUDA FortranThis is a simple example that shows how to call a CUBLAS function (SGEMM or DGEMM) from CUDA Fortran.<br /><br /><br />Let's start by defining a couple of modules that we will use in the example.<br />The first one defines the precision we are going to use:<br /><br /><code><br />module precision<br />! Precision control<br /><br /> integer, parameter, public :: Single = kind(0.0) ! Single precision<br /> integer, parameter, public :: Double = kind(0.0d0) ! Double precision<br /><br /> integer, parameter, public :: fp_kind = Double<br /> !integer, parameter, public :: fp_kind = Single<br /><br />end module precision<br /></code><br /><br /><br />Selecting Single or Double for fp_kind will allow us to use the same code for single and double precision.<br /><br />CUBLAS, a BLAS library for CUDA, has a C interface. We are going to use iso_c_binding and the interface construct to be able to call the functions in this library directly from Fortran.<br /><br /><code><br />module cublas<br />!<br />! Define the INTERFACE to the NVIDIA C code cublasSgemm and cublasDgemm<br />!<br /> interface cuda_gemm<br />!<br />! void cublasSgemm (char transa, char transb, int m, int n,<br />! int k, float alpha, const float *A, int lda,<br />!
const float *B, int ldb, float beta, float *C, int ldc)<br />!<br /> subroutine cuda_sgemm(cta, ctb, m, n, k,&<br /> alpha, A, lda, B, ldb, beta, c, ldc) bind(C,name='cublasSgemm')<br /> use iso_c_binding<br /> character(1,c_char),value :: cta, ctb<br /> integer(c_int),value :: m,n,k,lda,ldb,ldc<br /> real(c_float),value :: alpha,beta<br /> real(c_float), device, dimension(lda,*) :: A<br /> real(c_float), device, dimension(ldb,*) :: B<br /> real(c_float), device, dimension(ldc,*) :: C<br /> end subroutine cuda_sgemm<br /><br />!<br />! void cublasDgemm (char transa, char transb, int m, int n,<br />! int k, double alpha, const double *A, int lda,<br />! const double *B, int ldb, double beta, double *C, int ldc)<br />!<br /> subroutine cuda_dgemm(cta, ctb, m, n, k,&<br /> alpha, A, lda, B, ldb, beta, c, ldc) bind(C,name='cublasDgemm')<br /> use iso_c_binding<br /> character(1,c_char),value :: cta, ctb<br /> integer(c_int),value :: m,n,k,lda,ldb,ldc<br /> real(c_double),value :: alpha,beta<br /> real(c_double), device, dimension(lda,*) :: A<br /> real(c_double), device, dimension(ldb,*) :: B<br /> real(c_double), device, dimension(ldc,*) :: C<br /> end subroutine cuda_dgemm<br /><br /> end interface<br /><br />end module cublas<br /><br /></code><br /><br />At this point we have all we need to write a simple example that will allocate the matrices A, B and C on the CPU and GPU, initialize them on the CPU, copy them to the GPU, call the appropriate GEMM (depending on the precision selected), and transfer the result back to the CPU.<br /><br /><code><br />program gemm_test<br />use precision<br />use cublas<br />real(fp_kind) ,allocatable:: a(:,:),b(:,:),c(:,:)<br />real(fp_kind),device,allocatable:: a_d(:,:),b_d(:,:),c_d(:,:)<br />real(fp_kind):: alpha,beta<br />integer:: n,m,k<br /><br />n=4<br />m=4<br />k=4<br />alpha=1._fp_kind<br />beta=2._fp_kind<br /><br />!
allocate arrays on the host<br />allocate (a(m,k))<br />allocate (b(k,n))<br />allocate (c(m,n))<br /><br />! allocate arrays on the device<br />allocate (a_d(m,k))<br />allocate (b_d(k,n))<br />allocate (c_d(m,n))<br /><br />!initialize arrays on host<br />a=1<br />b=2<br />c=3<br /><br />!copy arrays to device<br />a_d=a<br />b_d=b<br />c_d=c<br /><br /><br />print *, "Matrix A:"<br />print *, a<br /><br />print *, "Matrix B:"<br />print *, b<br />print *, "Matrix C:"<br />print *, c<br /><br />call cuda_gemm ('N','N',m,n,k,alpha,a_d,m,b_d,k,beta,c_d,m)<br /><br />c=c_d <br />print *, "Matrix C = alpha A*B+ beta C"<br />print *, c<br /><br />!release memory on the host<br />deallocate (a,b,c)<br /><br />!release memory on the device<br />deallocate (a_d,b_d,c_d)<br /><br />end program gemm_test<br /><br /></code><br /><br />We will need to compile this code with the CUDA Fortran compiler from Portland Group.<br /><br />You should copy the code into a file test_gemm.cuf. It is important to use the right suffix, since we are using the device qualifier that is specific to CUDA Fortran. You can choose any name you want, but you need to remember to use the .cuf suffix. <br /><br />We are now ready to compile. We could create a Makefile, but for this simple example we can just invoke the compiler from the command line. 
We need to use the -Mcuda flag and then give the location and the name of the library (cublas) we want to link against.<br /><br /><code><br /> pgf90 -Mcuda -o test_gemm test_gemm.cuf -L/usr/local/cuda/lib64 -lcublas<br /></code><br /><br />When you run the generated executable (test_gemm), you should see an output similar to this one:<br /><br /><code><br />Matrix A:<br /> 1.000000000000000 1.000000000000000 1.000000000000000 <br /> 1.000000000000000 1.000000000000000 1.000000000000000 <br /> 1.000000000000000 1.000000000000000 1.000000000000000 <br /> 1.000000000000000 1.000000000000000 1.000000000000000 <br /> 1.000000000000000 1.000000000000000 1.000000000000000 <br /> 1.000000000000000 <br /> Matrix B:<br /> 2.000000000000000 2.000000000000000 2.000000000000000 <br /> 2.000000000000000 2.000000000000000 2.000000000000000 <br /> 2.000000000000000 2.000000000000000 2.000000000000000 <br /> 2.000000000000000 2.000000000000000 2.000000000000000 <br /> 2.000000000000000 2.000000000000000 2.000000000000000 <br /> 2.000000000000000 <br /> Matrix C:<br /> 3.000000000000000 3.000000000000000 3.000000000000000 <br /> 3.000000000000000 3.000000000000000 3.000000000000000 <br /> 3.000000000000000 3.000000000000000 3.000000000000000 <br /> 3.000000000000000 3.000000000000000 3.000000000000000 <br /> 3.000000000000000 3.000000000000000 3.000000000000000 <br /> 3.000000000000000 <br /> Matrix C = alpha A*B+ beta C<br /> 14.00000000000000 14.00000000000000 14.00000000000000 <br /> 14.00000000000000 14.00000000000000 14.00000000000000 <br /> 14.00000000000000 14.00000000000000 14.00000000000000 <br /> 14.00000000000000 14.00000000000000 14.00000000000000 <br /> 14.00000000000000 14.00000000000000 14.00000000000000 <br /></code><br /><br />If we want to rerun the code in single precision, we only need to select fp_kind=Single in the module precision and recompile.<br />The code has been written in such a way that all the definitions are precision-agnostic. 
Yes, Fortran 90 is quite powerful and elegant.Massimilianohttp://www.blogger.com/profile/06026735414532627847noreply@blogger.com1