Saturday, October 12, 2013

CUDA 5.5 and Xcode 5

The latest Xcode 5 update seems to have broken nvcc.
If you try to compile a  CUDA program, you will see a similar error:

%nvcc -c
clang: error: unsupported option '-dumpspecs'
clang: error: no input files

There is a simple workaround.

%nvcc -ccbin=/usr/bin/clang -c

A more convenient way of adding this, it is to define an alias for nvcc.
You can add this line to your .bash_profile

alias nvcc='nvcc -ccbin=/usr/bin/clang'

or just define it in your shell,

alias 'nvcc=nvcc -ccbin=/usr/bin/clang'

Wednesday, September 11, 2013

Calling CUDA Fortran kernels from MATLAB

The latest MATLAB versions, starting from 2010b,  have a very cool feature that enables calling CUDA C kernels from MATLAB code.
This is much better and simpler than writing MEX files to call CUDA code ( being the original author of the first CUDA MEX files and of the NVIDIA white-paper, I am speaking from experience) and it is a very powerful tool.

Let's take a very simple CUDA C code,,  that adds a scalar to a vector:

__global__ void add(double * in, double a, int N)  {
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < N) {
       in[idx] += a;

The full documentation is available at
I am just going to summarize the required steps:

  • Generate a PTX  file from the kernel source
    • nvcc -ptx -arch sm_20
  • Construct the kernel object from the PTX file
    • k=parallel.gpu.CUDAKernel('add.ptx','');
  • Set up the block and grid configuration, for example 28 blocks of 256 threads each:
    • k.ThreadBlockSize=[256 1 1]
    • k.GridSize=[28 1 1]
  • Execute the kernel.
    • o = feval(k,rand(10,1),2.,10)
    • The gpu array o contains the output of the kernel

It is possible to do the same with CUDA Fortran.
First of all, we will need to rewrite the code in CUDA Fortran (shameless plug, if you want
to learn more about  CUDA Fortran there is a very good book you can pre-order from Amazon,
"CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming"). This is the equivalent code :

attributes(global) subroutine add(a, b, N)
    implicit none
    double precision, intent(inout) :: a(*)
    double precision, value :: b
    integer , value :: N
    integer :: i

    i = threadIdx%x+(blockIdx%x-1)*blockDim%x
    if ( i <=N) a(i) = a(i)+b

 end subroutine add

For the generation of the PTX file, instead of invoking nvcc, we will call pgf90 with the right
flags to generate the PTX file:

               pgf90 -c -Mcuda=keepptx,cc20  addf.cuf
The keepptx flag will generate the PTX file for compute capabilities 2.0, addf.n001.ptx.
If the compute capabilities are missing or if you specify multiple targets, the PGI compiler will generate different PTX files,  you will need to inspect the ptx files to check the compute capabilities, the ordering is just an enumeration. We can perform this step from a OS shell or from inside MATLAB.
In order to invoke the compiler from the MATLAB prompt, we need to load the proper bash variables issuing the command:


and then invoking the pgf90 invocation preceded by an exclamation point. The exclamation point indicates that the rest of the input line is issued as a command to the operating system.

               !pgf90 -c -Mcuda=keepptx,cc20  addf.cuf

In order to load the PTX file in MATLAB, we need to slightly change the syntax.
When loading the PTX file generated by CUDA C, we were passing both the PTX file name and
the original CUDA C file. In this way, MATLAB will automatically discover the prototype of the function. There are other ways, in which we explicitly pass the prototype signature to parallel.gpu.CUDAKernel. 

This is what we need to load the PTX file generated from CUDA Fortran.

       kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ');

Once we have created the kernel object kf, the calling sequence is the same one we used before.
We will set up the block and grid configuration, for example 28 blocks of 256 threads each:

    • kf.ThreadBlockSize=[256 1 1]
    • kf.GridSize=[28 1 1]
and execute the kernel.
    • of = feval(kf,rand(10,1),2.,10)

This is the full sequence of the MATLAB code with a verbose output to check all the intermediate steps:

% Create a 1D array of doubles with 10 elements
% Create the kernel object from the PTX file with explicit prototype
kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ')
% Set execution configuration
kf.ThreadBlockSize=[256 1 1]
kf.GridSize=[28 1 1]
% Execute the kernel

An important point for the CUDA Fortran kernels is that you cannot use Fortran assumed-shape arguments, which require the compiler to build and pass the descriptor as an extra argument.

Now that we understand all the steps, let's move to something more complex and discuss few more points.
We are going to implement a kernel to compute the sum of an array using a single pass with atomic lock
( the implementation and accuracy of parallel sum are discussed in details in Chapter 5 of the before mentioned book).
The kernel is embedded in a module, since we are using a global variable for the lock. 
There is no limitation in the number of elements that the routine can handle, aside from the fact that we are using 32 bit size integers 
for the addressing , each thread will process multiple data if needed.

This is the code:

module sumgpu

  implicit none
  integer, parameter :: fp_kind = kind(0.0d0)
  integer, device::  lock=0


  attributes(global) subroutine sum(input,totalsum,N)
    real(fp_kind), intent(in) :: input(N)
    real(fp_kind) :: totalsum(1)
    integer,value :: N
    real(fp_kind), shared, dimension(256) :: psum
    integer :: i,index, inext
    real(fp_kind) :: lsum


    lsum = 0._fp_kind
    do i=index,N,BlockDim%x*GridDim%x
       lsum = lsum+ input(i)
    end do

    ! Local reduction per block

    call syncthreads()

    do while ( inext >=1 )
       if (index <=inext) psum(index)=psum(index)+psum(index+inext)
       inext = inext /2
       call syncthreads()
    end do

    ! Final reduction among block with atomic lock
    if (index == 1) then
       do while ( atomiccas(lock,0,1) == 1)
       end do
       call threadfence()
       lock =0
    end if

  end subroutine sum

end module sumgpu

If we generate and load the module as seen before, we can observe the following:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')

kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 2
       NumRHSArguments: 3
         ArgumentTypes: {'inout double vector'  'inout double vector'  'in int32 scalar'}

The entry point is now sumgpu_sum_, even if the subroutine was named sum. This is a consequence of being embedded in a module.
When  the CUDA Fortran compiler generate the PTX file, it renames the subroutine entry as a concatenation of the module name, the subroutine name and a trailing underscore.
While this is not important when the module contains a single subroutine, it is  crucial for situations in which multiple entry points are defined. If the module had  multiple subroutines,  we would have received an error when trying to load the PTX file:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')
Error using handleKernelArgs (line 61)
Found more than one entry point in the PTX code.  Possible names are:

In this case, we would have to modify the command syntax and add an extra argument at the end of the list that specify the entry point.

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int','sumgpu_sum_')
kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 2
       NumRHSArguments: 3
         ArgumentTypes: {'inout double vector'  'inout double vector'  'in int32 scalar'}

The command now completes correctly.  However, with the  prototype signature we specified, the first array that in the original code was
with intent(in), since it is only an input to the subroutine is now marked as 'inout double vector'.  This is not a major problem, but we will
need to remember when using the object in MATLAB to specify two vectors as output on the left hand side.
We can fix the problem, changing the prototype signature to:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int','sumgpu_sum_')

kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 1
       NumRHSArguments: 3
         ArgumentTypes: {'in double vector'  'inout double vector'  'in int32 scalar'}

where we have replaced the 'double *' to 'const double *' to reflect that the array is read-only. 
We are now ready to run the code:

%Generate an array of 1024 elements on the CPU
% Copy the array to a GPU array ag
%Generate the kernel object and setup the execution configuration
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');
kf.ThreadBlockSize=[256 1 1];
kf.GridSize=[28 1 1];
% Initialize the sum to zero
% Invoke the kernel
disp('CUDA Fortran kernel:')
% Recompute the sum using the intrinsic MATLAB function
disp('Intrinsic MATLAB sum on GPU:')
%Check the result

obtaining the following output:
CUDA Fortran kernel:

sumg =


Intrinsic MATLAB sum on GPU:

sum_matlab =



ans =


Now that we are confident that the code is running properly and giving the correct results, we can do some performance testing.
We will generate 50 millions random number directly on the GPU and then compute their sum.

%Set up random number generation on the GPU
gpu_stream = parallel.gpu.RandStream('CombRecursive','Seed',seed);
%Generate the random numbers directly on the GPU
%Generate the kernel object and setup the execution configuration
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');
kf.ThreadBlockSize=[256 1 1];
kf.GridSize=[128 1 1];
% Initialize the sum to zero
% Invoke the kernel and time the execution
% Invoke the intrinsic sum and time the execution

The output indicates that this version is slightly faster than the native sum, that is however more convenient to use.
Elapsed time is 0.000357 seconds.
Elapsed time is 0.000393 seconds.

The real goal of using CUDA Fortran kernels is not to reimplement the intrinsic functions but to implement new capabilities or just re-use
standalone code that was already written in a very productive environment such as MATLAB.

Monday, July 15, 2013

Enabling CUDA Multi Process Service (MPS) with multiple GPUs.

(Edited 10/21/13 to use MPS control daemon instead of MPS server)

CUDA 5.5 has a new interesting feature, called CUDA Multi Process Service (MPS), for GPUs with compute capability 3.5.

CUDA MPS, formerly known as CUDA Proxy,  is a feature that allows multiple CUDA processes to share a single GPU context. NVIDIA officially supports configurations with a single GPU, but it is possible to run it  on systems with multiple GPUs creating the MPS servers manually.
This post will show how to enable this feature when multiple GPUs are present in a system.
It is an unsupported but working configuration.

The first thing to do it is to create a MPS control daemon for each GPU.
We will use CUDA_VISIBLE_DEVICES to select each GPU and create two directories in /tmp for each MPS control daemon.  one for the pipe, the other for the log. By default, CUDA MPS will try to create a log directory in /var/log, requiring the control daemon to be executed with root privileges. By selecting a log directory in /tmp ( or any other directory of your choice that is accessible from normal users), we don't need root privileges to start the control daemons.


# Number of gpus with compute_capability 3.5  per server

# Start the MPS server for each GPU
for ((i=0; i< $NGPUS; i++))

 mkdir /tmp/mps_$i
 mkdir /tmp/mps_log_$i
 export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i
 export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_$i
 nvidia-cuda-mps-control -d
end do

Once we have set up the control daemons  we need to point the CUDA executable we want to run to the right MPS control daemon.  This is done in a non-standard way.
Instead of using the CUDA_VISIBLE_DEVICES variable, as normally done with CUDA, we will need to set CUDA_VISIBLE_DEVICES to 0 and select the explicit MPS pipe we want to use by specifying the proper CUDA_MPS_PIPE_DIRECTORY.

To start two instances of a.out on GPU 0 using proxy, we will type:

export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0

The execution script is a little more complex if we are running a MPI application.
In this case, we will need to find a way to detect how many MPI processes are running on a node.
OpenMPI has a variable that will tell us this info, other MPI implementations offer similar environment

This script shows how to run local process 0 and 2 on GPU 0 and 1 and 3 on GPU 1.

#run script for MPI
case ${lrank} in
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_1

    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_1

Once the execution is completed, we need to clean up the MPS control daemons  if other users are supposed to run on the system.


# Stop the MPS control daemon for each GPU and clean up /tmp

for ((i=0; i< $NGPUS; i++))
echo $i
 export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i
 echo "quit" | nvidia-cuda-mps-control
 rm -fr /tmp/mps_$i
 rm -fr /tmp/mps_log_$i

The creation and clean-up could be combined in a single script.

Wednesday, November 21, 2012

Using Thrust on CARMA

Thurst is an excellent library for CUDA development.
Unfortunately, Thrust is not present in the CARMA Toolkit but it is easy to install.

On the x86 development system, we are going to pull down the latest source from Thrust using git.
If git is not installed, we can easily add to the system with:

  sudo apt-get install git

and then clone the git repository

  git clone

We are now ready to cross-compile. Remember that Thrust is a template library, everything is build from include files.
Using our standard Makefile, we just need to add the directory in which the Thrust include files are ( in this case /home/ubuntu/thrust). 
We also want to restrict the code generation to arch sm_21 ( the CARMA kit has a Q1000m GPU with 2.1 compute capabilities) to reduce the compilation time.
We are going to use one of the examples shipping with Thrust,

#  Makefile for cross-compile #
all : monte_carlo

NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32

monte_carlo :
        $(NVCC)  -O3 -arch sm_21 -o monte_carlo -I$(THRUST_LOC)

        rm monte_carlo

Once we generate the executable, we can copy it on the CARMA 

  scp monte_carlo ubuntu@carma:~

and execute it. We will see the number pi printed with 2 digits ( 3.14).
If you want to see more digits, you can change the source code and set the precision to 6 instead of the original 2

  std::cout << std::setprecision(6);

Monday, October 29, 2012

Setting up a CARMA kit

I just received a brand new CARMA kit and I am going to post all the steps I did to get a working set-up.

Let's start with the x86 development system. I am using a virtual machine on my Mac as my development system.

I started by installing a fresh Ubuntu 11.04 distro and then proceed to :
  • Update the packages: 
    • sudo apt-get update
  • Install the basic developer tools: 
    • sudo apt-get install build-essential
  • Install the 32bit development libraries ( CARMA is 32bit ):
    • sudo apt-get install ia32-libs
  • Install the ARM cross compilers: 
    • sudo apt-get install gcc-4.5-arm-linux-gnueabi g++-4.5-arm-linux-gnueabi
  • Install Fortran for both x86 and ARM (real developers use Fortran....):
    • sudo apt-get install gfortran-4.5-*
  • Install the CUDA Toolkit (available from under the downloads tab): 
    • sudo sh
  • Edit .bashrc to add nvcc to the path. With your favorite editor add a line at the end of the file:
    • export PATH=/usr/local/cuda/bin:$PATH
  • Source the .bashrc to refresh the path ( it will be automatically executed the next time you login or open a terminal):
    • . .bashrc
We can check that nvcc is now in our path, invoking the compiler with the -V flag to check the version

max@ubuntu:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Tue_Jul_17_14:48:12_PDT_2012
Cuda compilation tools, release 4.2, V0.2.1221

We are now ready to compile our first CUDA code, a comparison between multiplications on CPU and GPU.

#include "stdio.h"

__global__ void kernel(int i, float *d_n)
*d_n *= 1.02f;

void main(){
 float n = 1.0f, *d_n;
 float n_ref = 1.0f;
 int i;
 cudaMalloc((void **)&d_n, sizeof(float));
 for(i = 1; i <= 10; i++)
  cudaMemcpy(d_n, &n, sizeof(float), cudaMemcpyHostToDevice);
  kernel <<< 1, 1 >>> (i, d_n);
  cudaMemcpy(&n, d_n, sizeof(float), cudaMemcpyDeviceToHost);
  printf("%d\t\t%42.41f\t%42.41f\n", i, n,n_ref*=1.02f);

We are going to use a Makefile similar to the one posted in the previous blog.

max@ubuntu:~$ cat Makefile 
#  Makefile for cross-compile #
all : gpu_test

NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32

gpu_test :
$(NVCC)  -o gpu_test 

rm gpu_test

When we type make, we should see a similar output

max@ubuntu:~$ make
/usr/local/cuda/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32  -o gpu_test 
/usr/lib/gcc/arm-linux-gnueabi/4.5.2/../../../../arm-linux-gnueabi/bin/ld: warning:, needed by /usr/arm-linux-gnueabi/lib//, not found (try using -rpath or -rpath-link)

Don't worry about the warning. This is caused by a bogus DT_NEEDED entry in the shared libgcc file /usr/arm-linux-gnueabi/lib/ "readelf -a" shows:
 0x00000001 (NEEDED) Shared library: []
Before we could use the machine for any real CUDA development, there is an extra step that we will need to perform.  The CUDA Toolkit is missing the ( it usually comes with the driver on the x86 platform, don't ask me why it was not included in the ARM toolkit), we will not be able to link any CUDA code before we bring this library to the x86. We will do this step once we have the CARMA up and running.

Unpack the CARMA, plugin keyboard and mouse, plus the HDMI cable in the middle connector.
Plug in the power and ethernet cable and you are ready to go.
The first boot may be slow, the system is building the NVIDIA driver. It is a blind boot, there is no console output until the GUI comes up, so you need to have a little bit of patience.

Once the CARMA system boots, it will auto-login and start a terminal. It should also pick up an IP address ( use ifconfig to find out the IP). The default username/password is ubuntu/ubuntu.

We are ready  to check if our cross-compilation worked. 
From inside the virtual machine, we will copy the file gpu_test to the CARMA ( ipconfig is reporting ):

   scp gpu_test ubuntu@ :~

Either from the CARMA terminal or from a remote shell, we can run gpu_test and check that the CPU and GPU results are the same.

ubuntu@tegra-ubuntu:~$ ./gpu_test 
1 1.01999998092651367187500000000000000000000 1.01999998092651367187500000000000000000000
2 1.04039990901947021484375000000000000000000 1.04039990901947021484375000000000000000000
3 1.06120789051055908203125000000000000000000 1.06120789051055908203125000000000000000000
4 1.08243203163146972656250000000000000000000 1.08243203163146972656250000000000000000000
5 1.10408067703247070312500000000000000000000 1.10408067703247070312500000000000000000000
6 1.12616229057312011718750000000000000000000 1.12616229057312011718750000000000000000000
7 1.14868545532226562500000000000000000000000 1.14868545532226562500000000000000000000000
8 1.17165911197662353515625000000000000000000 1.17165911197662353515625000000000000000000
9 1.19509232044219970703125000000000000000000 1.19509232044219970703125000000000000000000
10 1.21899414062500000000000000000000000000000 1.21899414062500000000000000000000000000000

The CARMA filesystem is quite bare, let's add few useful packages:
  • Install Fortran:
    • sudo apt-get install gfortran
We need to install OpenMPI from source, the default packages don't seem to work.
The latest source (1.6.2) has support for ARM, the installation is very simple but it will take a while.

Get the latest stable version 

unpack it ( tar xvfz openmpi-1.6.2.tar.gz) and change the directory ( cd openmpi-1.6.2  )

We are now ready to build and install
sudo make -j 4 install

Add /usr/local/bin to your PATH and /usr/local/lib to your LD_LIBRARY_PATH

Sunday, September 30, 2012

Compiling for CARMA

In few days, CARMA will be finally available to the general public. If you are not familiar with the CARMA project, it is the first ARM platform supporting CUDA.
It has a Tegra 3 with 4 cores and 2 GB of memory, ethernet, USB ports and a Quadro 1000M GPU (GF108 with 2 GB of memory, 96 CUDA cores, compute capability 2.1).
It has full OpenGL and CUDA support, but at the moment, no CUDA compiler.

The developer needs to cross-compile from a Linux x86 machine. This blog shows how easy it is to cross-compile once we follow some simple instructions. I strongly suggest that you start with an Ubuntu machine, the cross-compiler are easily available under this platform.

The first thing to do, it is to install the cross-compilers:

sudo apt-get install g++-arm-linux-gnueabi gcc-arm-linux-gnueabi

At this point, we will have the cross-compilers installed under  /usr/bin/arm-linux-gnueabi-gcc and  /usr/bin/arm-linux-gnueabi-g++.

The second step is to install the CUDA Toolkit for ARM on the x86. If you choose the default location,
the installer will create a directory /usr/local/cuda.

If you need to use other libraries for ARM, you will also need to copy the libraries and corresponding header files from CARMA to the x86 machine.  You can place them under /usr/local/arm_lib and /usr/local/arm_include or you can just put them under /usr/local/cuda/lib and /usr/local/cuda/include (my preference will be for the first option to not pollute the CUDA installation).

We are now ready to compile our code, taking care of using the cross compiler and the special nvcc in the CARMA toolkit.  The following makefile will show how to compile a simple c++ code that calls a CUBLAS function and a simple CUDA code.

#  Makefile for cross-compile #
all : dgemm_cublas simple_cuda

NVCC=$(CUDA_HOME)/bin/nvcc -target-cpu-arch ARM --compiler-bindir /usr/bin/arm-linux-gnueabi-gcc-4.5 -m32

# For a standard c++ code, we use CC and the CUDA ARM libraries
dgemm_cublas : gemm_test.cpp
$(CC)   gemm_test.cpp -I$(CUDA_HOME)/include -o dgemm_cublas -L/$(CUDA_HOME)/lib -lcudart -lcublas

# For a standard CUDA code, we just invoke nvcc
$(NVCC) -o simple_cuda

clean :
rm -f *.o dgemm_cublas simple_cuda

Once we generate the executable, since they are for ARM, we will not be able to execute them until we move them on CARMA.

Saturday, November 12, 2011

MPI communications from GPU memory

There are several groups working on MPI implementations capable of transferring data directly from GPU memory, as a result of the introduction of the Unified Virtual Addressing (UVA) in CUDA 4.0. The MVAPICH group is the first one to officially release a version with CUDA support.

Being able to pass directly GPU pointers to MPI functions, greatly simplify the programming on clusters. For example, if the programmer needs to send data from GPU on system A to another GPU on system B, instead of the sequence:
  • Transfer data from GPU memory to host memory on system A
  • Transfer data from host memory on system A to host memory on system B, for example using MPI_Send/Recv
  • Transfer data from host memory to GPU memory on system B
could just issue the MPI_Send/Recv with the buffers located on GPU memory.
A GPU-aware MPI stack is also capable of optimizing the transfers under the hood via pipelining ( this could be explicitly programmed too, but having the library taking care of it is much more convenient).

In this blog, I am going to explain how to use the CUDA-enabled MVAPICH from CUDA Fortran.

After downloading the tar file from the MVAPICH web site, we need to configure the installation. Due to compatibility issues between CUDA and the PGI C compiler, we are going to use gcc for the C compiler and PGI Fortran for the Fortran one.

We need to specify the location of the CUDA include files and libraries ( in this case, they are located in the standard location /usr/local/cuda ) and the path for MVAPICH ( I am installing on a cluster where all the apps are located in /share/apps).

FC=pgfortran F77=pgfortran FCFLAGS=-fast FFLAGS=-fast ./configure

The next steps are to run "make" and then "make install" ( for this last step, depending on the location of the installed software, you may need to have root privileges). You will also need to add the location of the binaries ( in this case /share/apps/mvapich2-gpu/bin ) to your path.

We are now ready to write a CUDA Fortran code that uses MPI to transfer data between two GPUs. Each process initializes two arrays a_d and b_d, fill them with some values depending on the rank. Then, processor 0 sends a_d to processor 1. After 1 receives the data in b_d transfer the results back to the host array a and print the values.

program mpi_test_gpu
use mpi
integer, allocatable:: a(:)
integer, device,allocatable:: a_d(:),b_d(:)
integer:: N, ierr, rank, num_procs, status(MPI_Status_size)

call MPI_Init (ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, num_procs, ierr)

allocate (a(N),a_d(N),b_d(N))

if ( rank == 0) then
call MPI_Send(a_d,N,MPI_INT,1,0,MPI_COMM_WORLD, ierr)
call MPI_Recv(b_d,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)
end if

if (rank == 1) a=b_d

print *,"Rank=",rank,"A=",a

deallocate (a,a_d,b_d)

call MPI_Finalize ( ierr )
end program mpi_test_gpu

If the code is in a file with name mpi_test_gpu.cuf, we can generate an executable with the following command:

mpif90 -O3 -o mpi_test_gpu mpi_test_gpu.cuf

We are now ready to run with the command mpirun_rsh. We need to pass a special flag, MV2_USE_CUDA=1, to enable the new GPU path ( or you can add
export MV2_USE_CUDA=1 to your .bashrc to avoid to type it every time).
We are going to use two nodes, c0-0 and c0-1, connected by Infiniband.

mpirun_rsh -np 2 c0-0 c0-1 MV2_USE_CUDA=1 ./mpi_test_gpu
Rank= 0 A= -999 -999 -999 -999
Rank= 1 A= 10 10 10 10

As expected, rank 1 contains the values 10, that was the value initially stored in a_d on rank 0.
MVAPICH also allows to send data from GPU to host memory and vice versa.
For example we could replace the lines:

! Receive data to GPU array b_d from processor 0
call MPI_Recv(b_d,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)
! Copy GPU array b_d to CPU array a
if (rank == 1) a=b_d

directly with
! Receive data to CPU array a from processor 0
call MPI_Recv(a,N,MPI_INT,0,0,MPI_COMM_WORLD,status, ierr)