This is much better and simpler than writing MEX files to call CUDA code (as the original author of the first CUDA MEX files and of the NVIDIA white paper, I am speaking from experience), and it is a very powerful tool.

Let's take a very simple CUDA C code, add.cu, that adds a scalar to a vector:

__global__ void add(double *in, double a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        in[idx] += a;
    }
}

The full documentation is available at

http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html

I am just going to summarize the required steps:


- Generate a PTX file from the kernel source:
  nvcc -ptx -arch sm_20 add.cu
- Construct the kernel object from the PTX file:
  k=parallel.gpu.CUDAKernel('add.ptx','add.cu');
- Set up the block and grid configuration, for example 28 blocks of 256 threads each:
  k.ThreadBlockSize=[256 1 1]
  k.GridSize=[28 1 1]
- Execute the kernel:
  o = feval(k,rand(10,1),2.,10)
- The gpuArray o contains the output of the kernel.

It is possible to do the same with CUDA Fortran.

First of all, we need to rewrite the code in CUDA Fortran (shameless plug: if you want to learn more about CUDA Fortran, there is a very good book you can pre-order from Amazon, "CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming"). This is the equivalent code:

attributes(global) subroutine add(a, b, N)
  implicit none
  double precision, intent(inout) :: a(*)
  double precision, value :: b
  integer, value :: N
  integer :: i
  i = threadIdx%x + (blockIdx%x-1)*blockDim%x
  if (i <= N) a(i) = a(i) + b
end subroutine add

We use the PGI compiler with the following flags to generate the PTX file:

pgf90 -c -Mcuda=keepptx,cc20 addf.cuf

If the compute capability is missing, or if you specify multiple targets, the PGI compiler will generate several PTX files; you will need to inspect them to check the compute capability of each, since the numbering in the file names is just an enumeration. We can perform this step from an OS shell or from inside MATLAB.
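One quick way to check is to grep for the .target directive inside each PTX file. The sketch below operates on a hand-written stub PTX fragment, since the real files (named like addf.n001.ptx) are produced by the compiler:

```shell
# Create a stub PTX fragment standing in for a compiler-generated file
# (real files would be named like addf.n001.ptx).
cat > stub.ptx <<'EOF'
.version 3.0
.target sm_20
.address_size 64
EOF
# The .target line reveals the compute capability of each PTX file.
grep '.target' stub.ptx
```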

In order to invoke the compiler from the MATLAB prompt, we need to load the proper bash variables by issuing the command:

setenv('BASH_ENV','~/.bash_profile');

and then invoke the pgf90 command preceded by an exclamation point. The exclamation point indicates that the rest of the input line is issued as a command to the operating system.

!pgf90 -c -Mcuda=keepptx,cc20 addf.cuf

In order to load the PTX file in MATLAB, we need to slightly change the syntax.

When loading the PTX file generated from CUDA C, we passed both the PTX file name and the original CUDA C file, and MATLAB automatically discovered the prototype of the function. Alternatively, we can pass the prototype signature explicitly to parallel.gpu.CUDAKernel, and this is what we need to do to load the PTX file generated from CUDA Fortran:

kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ');

Once we have created the kernel object kf, the calling sequence is the same one we used before.

We set up the block and grid configuration, for example 28 blocks of 256 threads each, and then execute the kernel:

kf.ThreadBlockSize=[256 1 1]
kf.GridSize=[28 1 1]
of = feval(kf,rand(10,1),2.,10)

This is the full sequence of the MATLAB code with a verbose output to check all the intermediate steps:

% Create a 1D array of doubles with 10 elements
i1=gpuArray(rand(10,1))
% Create the kernel object from the PTX file with explicit prototype
kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ')
% Set execution configuration
kf.ThreadBlockSize=[256 1 1]
kf.GridSize=[28 1 1]
% Execute the kernel
of=feval(kf,i1,10.,10)

An important point for CUDA Fortran kernels is that you cannot use Fortran assumed-shape arguments, since these require the compiler to build and pass a descriptor as a hidden extra argument.

Now that we understand all the steps, let's move to something more complex and discuss a few more points.

We are going to implement a kernel that computes the sum of an array in a single pass, using an atomic lock (the implementation and accuracy of parallel sums are discussed in detail in Chapter 5 of the aforementioned book). The kernel is embedded in a module, since we are using a global variable for the lock. There is no limit on the number of elements the routine can handle, aside from the use of 32-bit integers for the addressing; each thread processes multiple elements if needed.

This is the code:

module sumgpu
  implicit none
  integer, parameter :: fp_kind = kind(0.0d0)
  integer, device :: lock = 0

contains

  attributes(global) subroutine sum(input, totalsum, N)
    real(fp_kind), intent(in) :: input(N)
    real(fp_kind) :: totalsum(1)
    integer, value :: N
    real(fp_kind), shared, dimension(256) :: psum
    integer :: i, index, inext
    real(fp_kind) :: lsum

    index = threadIdx%x + (blockIdx%x-1)*blockDim%x
    lsum = 0._fp_kind
    do i = index, N, blockDim%x*gridDim%x
      lsum = lsum + input(i)
    end do

    ! Local reduction per block
    index = threadIdx%x
    psum(index) = lsum
    call syncthreads()
    inext = blockDim%x/2
    do while (inext >= 1)
      if (index <= inext) psum(index) = psum(index) + psum(index+inext)
      inext = inext/2
      call syncthreads()
    end do

    ! Final reduction among blocks with atomic lock
    if (index == 1) then
      do while (atomiccas(lock,0,1) == 1)
      end do
      totalsum(1) = totalsum(1) + psum(1)
      call threadfence()
      lock = 0
    end if
  end subroutine sum

end module sumgpu

If we generate and load the module as seen before, we can observe the following:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')

kf =

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 2
       NumRHSArguments: 3
         ArgumentTypes: {'inout double vector' 'inout double vector' 'in int32 scalar'}

The entry point is now sumgpu_sum_, even though the subroutine was named sum. This is a consequence of it being embedded in a module: when the CUDA Fortran compiler generates the PTX file, it renames the subroutine entry as a concatenation of the module name, the subroutine name, and a trailing underscore. While this does not matter when the module contains a single subroutine, it is crucial when multiple entry points are defined. If the module had contained multiple subroutines, we would have received an error when trying to load the PTX file:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')
Error using handleKernelArgs (line 61)
Found more than one entry point in the PTX code. Possible names are:
sumgpu_sum_
sumgpu_sum2_
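The mangled names can also be listed directly from the shell, without loading the file in MATLAB, by grepping for the .entry directives in the PTX. The sketch below runs on a hand-written stub PTX fragment (the real file is produced by pgf90):

```shell
# Stub PTX fragment mimicking a module with two kernels, sum and sum2
# (a real file would come from pgf90 -Mcuda=keepptx).
cat > stub.ptx <<'EOF'
.visible .entry sumgpu_sum_ ( .param .u64 p0 )
.visible .entry sumgpu_sum2_ ( .param .u64 p0 )
EOF
# List the mangled entry points: module name + subroutine name + underscore
grep '.entry' stub.ptx
```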

In this case, we have to modify the command syntax and add an extra argument at the end of the list that specifies the entry point:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int','sumgpu_sum_')

kf =

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 2
       NumRHSArguments: 3
         ArgumentTypes: {'inout double vector' 'inout double vector' 'in int32 scalar'}

The command now completes correctly. However, with the prototype signature we specified, the first array, which in the original code was declared with intent(in) since it is only an input to the subroutine, is now marked as 'inout double vector'. This is not a major problem, but we would need to remember to specify two output vectors on the left-hand side when using the object in MATLAB.

We can fix the problem by changing the prototype signature to:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int','sumgpu_sum_')

kf =

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 1
       NumRHSArguments: 3
         ArgumentTypes: {'in double vector' 'inout double vector' 'in int32 scalar'}

where we have replaced 'double *' with 'const double *' to reflect that the array is read-only.

We are now ready to run the code:

% Generate an array of 1024 elements on the CPU
a=rand(1024,1);
% Copy the array to a GPU array ag
ag=gpuArray(a);
% Generate the kernel object and set up the execution configuration
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');
kf.ThreadBlockSize=[256 1 1];
kf.GridSize=[28 1 1];
% Initialize the sum to zero
sumg=gpuArray.zeros(1,'double');
% Invoke the kernel
disp('CUDA Fortran kernel:')
sumg=feval(kf,ag,sumg,1024)
% Recompute the sum using the intrinsic MATLAB function
disp('Intrinsic MATLAB sum on GPU:')
sum_matlab=sum(ag)
% Check the result
disp('Difference:')
sumg-sum_matlab

obtaining the following output:

CUDA Fortran kernel:
sumg =
  509.2181
Intrinsic MATLAB sum on GPU:
sum_matlab =
  509.2181
Difference:
ans =
   0

Now that we are confident that the code is running properly and giving the correct results, we can do some performance testing.

We will generate 50 million random numbers directly on the GPU and then compute their sum.

% Set up random number generation on the GPU
seed=0;
gpu_stream = parallel.gpu.RandStream('CombRecursive','Seed',seed);
parallel.gpu.RandStream.setGlobalStream(gpu_stream);
N=50000000;
% Generate the random numbers directly on the GPU
ag=gpuArray.randn(N,1);
% Generate the kernel object and set up the execution configuration
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');
kf.ThreadBlockSize=[256 1 1];
kf.GridSize=[128 1 1];
% Initialize the sum to zero
sumg=gpuArray.zeros(1,'double');
% Invoke the kernel and time the execution
tic;sumg=feval(kf,ag,sumg,N);toc
% Invoke the intrinsic sum and time the execution
tic;sum(ag);toc

The output indicates that this version is slightly faster than the native sum, which is, however, more convenient to use.

Elapsed time is 0.000357 seconds.
Elapsed time is 0.000393 seconds.

The real goal of using CUDA Fortran kernels is not to reimplement the intrinsic functions, but to add new capabilities or reuse standalone code that was already written, from a very productive environment such as MATLAB.