Wednesday, September 11, 2013

Calling CUDA Fortran kernels from MATLAB

Recent MATLAB versions, starting from R2010b, have a very cool feature that enables calling CUDA C kernels from MATLAB code.
This is much better and simpler than writing MEX files to call CUDA code (being the original author of the first CUDA MEX files and of the NVIDIA white paper, I am speaking from experience) and it is a very powerful tool.

Let's take a very simple CUDA C kernel, add.cu, that adds a scalar to each element of a vector:


__global__ void add(double * in, double a, int N)  {
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < N) {
       in[idx] += a;
   }
}

The full documentation is available at
http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html
I am just going to summarize the required steps:

  • Generate a PTX  file from the kernel source
    • nvcc -ptx -arch sm_20 add.cu
  • Construct the kernel object from the PTX file
    • k=parallel.gpu.CUDAKernel('add.ptx','add.cu');
  • Set up the block and grid configuration, for example 28 blocks of 256 threads each:
    • k.ThreadBlockSize=[256 1 1]
    • k.GridSize=[28 1 1]
  • Execute the kernel.
    • o = feval(k,rand(10,1),2.,10)
    • The gpuArray o contains the output of the kernel
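Putting the steps together, a minimal MATLAB session for the CUDA C kernel might look like this (a sketch, assuming add.ptx has been generated in the current directory):

```matlab
% Build the kernel object from the PTX file and the CUDA C source
k = parallel.gpu.CUDAKernel('add.ptx','add.cu');
% 28 blocks of 256 threads each
k.ThreadBlockSize = [256 1 1];
k.GridSize = [28 1 1];
% Add the scalar 2 to a 10-element random vector
o = feval(k, rand(10,1), 2., 10);
% Copy the result back to the CPU
result = gather(o);
```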

It is possible to do the same with CUDA Fortran.
First of all, we will need to rewrite the code in CUDA Fortran (shameless plug: if you want
to learn more about CUDA Fortran, there is a very good book you can pre-order from Amazon,
"CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming"). This is the equivalent code:

attributes(global) subroutine add(a, b, N)
    implicit none
    double precision, intent(inout) :: a(*)
    double precision, value :: b
    integer, value :: N
    integer :: i

    i = threadIdx%x + (blockIdx%x-1)*blockDim%x
    if (i <= N) a(i) = a(i) + b

end subroutine add

To generate the PTX file we call pgf90, instead of nvcc, with the appropriate
flags:

               pgf90 -c -Mcuda=keepptx,cc20  addf.cuf

The keepptx flag keeps the intermediate PTX file, here generated for compute capability 2.0 as addf.n001.ptx.
If the compute capability is omitted, or if you specify multiple targets, the PGI compiler will generate several PTX files; you will need to inspect them to check the compute capability of each, since the numbering is just an enumeration. We can perform this step from an OS shell or from inside MATLAB.
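For example, compiling for two targets might look like the following sketch (the n-numbered file names follow PGI's enumeration; which file corresponds to which target must be checked by hand):

```shell
pgf90 -c -Mcuda=keepptx,cc20,cc35 addf.cuf
# Two PTX files are produced, e.g. addf.n001.ptx and addf.n002.ptx.
# The ".target" directive inside each file identifies the compute
# capability it was generated for:
grep -m1 ".target" addf.n001.ptx    # e.g. ".target sm_20"
```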
In order to invoke the compiler from the MATLAB prompt, we need to load the proper bash environment by issuing the command:

               setenv('BASH_ENV','~/.bash_profile');

and then issuing the pgf90 command preceded by an exclamation point. The exclamation point indicates that the rest of the input line is passed to the operating system as a command.

               !pgf90 -c -Mcuda=keepptx,cc20  addf.cuf


In order to load the PTX file in MATLAB, we need to change the syntax slightly.
When loading the PTX file generated from CUDA C, we passed both the PTX file name and
the original CUDA C file; in this way, MATLAB automatically discovers the prototype of the function. Alternatively, we can pass the prototype signature explicitly to parallel.gpu.CUDAKernel.

This is what we need to load the PTX file generated from CUDA Fortran.

       kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ');

Once we have created the kernel object kf, the calling sequence is the same one we used before.
We will set up the block and grid configuration, for example 28 blocks of 256 threads each:

    • kf.ThreadBlockSize=[256 1 1]
    • kf.GridSize=[28 1 1]
and execute the kernel.
    • of = feval(kf,rand(10,1),2.,10)

This is the full sequence of the MATLAB code with a verbose output to check all the intermediate steps:

% Create a 1D array of doubles with 10 elements
i1=gpuArray(rand(10,1))
% Create the kernel object from the PTX file with explicit prototype
kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ')
% Set execution configuration
kf.ThreadBlockSize=[256 1 1]
kf.GridSize=[28 1 1]
% Execute the kernel
of=feval(kf,i1,10.,10)


An important point for CUDA Fortran kernels is that you cannot use assumed-shape array arguments, since they require the compiler to build and pass an array descriptor as a hidden extra argument.
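The difference is only in the dummy-argument declaration; for illustration, compare the two forms:

```fortran
! This will NOT work from MATLAB: an assumed-shape dummy argument
! makes the compiler pass a hidden array descriptor along with the data.
double precision, intent(inout) :: a(:)

! This works: an assumed-size dummy argument receives only the raw
! device pointer, which is what MATLAB passes for 'double *'.
double precision, intent(inout) :: a(*)
```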


Now that we understand all the steps, let's move on to something more complex and discuss a few more points.
We are going to implement a kernel that computes the sum of an array in a single pass using an atomic lock
(the implementation and accuracy of parallel sums are discussed in detail in Chapter 5 of the aforementioned book).
The kernel is embedded in a module, since we are using a global variable for the lock.
There is no limit on the number of elements the routine can handle, aside from the fact that we are using 32-bit integers
for addressing: each thread will process multiple elements if needed.

This is the code:

module sumgpu

  implicit none
  integer, parameter :: fp_kind = kind(0.0d0)
  integer, device::  lock=0

contains

  attributes(global) subroutine sum(input,totalsum,N)
    real(fp_kind), intent(in) :: input(N)
    real(fp_kind) :: totalsum(1)
    integer,value :: N
    real(fp_kind), shared, dimension(256) :: psum
    integer :: i,index, inext
    real(fp_kind) :: lsum

    index=threadIdx%x+(BlockIdx%x-1)*BlockDim%x

    lsum = 0._fp_kind
    do i=index,N,BlockDim%x*GridDim%x
       lsum = lsum+ input(i)
    end do

    ! Local reduction per block
    index=threadIdx%x

    psum(index)=lsum
    call syncthreads()

    inext=blockDim%x/2
    do while ( inext >=1 )
       if (index <=inext) psum(index)=psum(index)+psum(index+inext)
       inext = inext /2
       call syncthreads()
    end do

    ! Final reduction among block with atomic lock
    if (index == 1) then
       do while ( atomiccas(lock,0,1) == 1)
       end do
       totalsum(1)=totalsum(1)+psum(1)
       call threadfence()
       lock =0
    end if

  end subroutine sum

end module sumgpu

If we generate and load the module as seen before, we can observe the following:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')

kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 2
       NumRHSArguments: 3
         ArgumentTypes: {'inout double vector'  'inout double vector'  'in int32 scalar'}


The entry point is now sumgpu_sum_, even though the subroutine was named sum. This is a consequence of it being embedded in a module:
when the CUDA Fortran compiler generates the PTX file, it renames the subroutine entry as a concatenation of the module name, the subroutine name, and a trailing underscore.
While this does not matter when the module contains a single subroutine, it is crucial when multiple entry points are defined. If the module had contained multiple subroutines, we would have received an error when trying to load the PTX file:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')
Error using handleKernelArgs (line 61)
Found more than one entry point in the PTX code.  Possible names are:
sumgpu_sum_
sumgpu_sum2_


In this case, we would have to modify the command and add an extra argument at the end of the list that specifies the entry point.

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int','sumgpu_sum_')
kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 2
       NumRHSArguments: 3
         ArgumentTypes: {'inout double vector'  'inout double vector'  'in int32 scalar'}

The command now completes correctly. However, with the prototype signature we specified, the first array, which in the original code was
declared with intent(in) since it is only an input to the subroutine, is marked as 'inout double vector'. This is not a major problem, but we would
need to remember to specify two vectors as outputs on the left-hand side when using the object in MATLAB.
We can fix this by changing the prototype signature to:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int','sumgpu_sum_')

kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 1
       NumRHSArguments: 3
         ArgumentTypes: {'in double vector'  'inout double vector'  'in int32 scalar'}

where we have replaced 'double *' with 'const double *' to reflect that the array is read-only.
We are now ready to run the code:

%Generate an array of 1024 elements on the CPU
a=rand(1024,1);
% Copy the array to a GPU array ag
ag=gpuArray(a);
%Generate the kernel object and setup the execution configuration
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');
kf.ThreadBlockSize=[256 1 1];
kf.GridSize=[28 1 1];
% Initialize the sum to zero
sumg=gpuArray.zeros(1,'double');
% Invoke the kernel
disp('CUDA Fortran kernel:')
sumg=feval(kf,ag,sumg,1024)
% Recompute the sum using the intrinsic MATLAB function
disp('Intrinsic MATLAB sum on GPU:')
sum_matlab=sum(ag)
%Check the result
disp('Difference:')
sumg-sum_matlab

obtaining the following output:
CUDA Fortran kernel:

sumg =

  509.2181

Intrinsic MATLAB sum on GPU:

sum_matlab =

  509.2181

Difference:

ans =

     0

Now that we are confident that the code runs properly and gives correct results, we can do some performance testing.
We will generate 50 million random numbers directly on the GPU and then compute their sum.

%Set up random number generation on the GPU
seed=0;
gpu_stream = parallel.gpu.RandStream('CombRecursive','Seed',seed);
parallel.gpu.RandStream.setGlobalStream(gpu_stream);
N=50000000;
%Generate the random numbers directly on the GPU
ag=gpuArray.randn(N,1);
%Generate the kernel object and setup the execution configuration
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');
kf.ThreadBlockSize=[256 1 1];
kf.GridSize=[128 1 1];
% Initialize the sum to zero
sumg=gpuArray.zeros(1,'double');
% Invoke the kernel and time the execution
tic;sumg=feval(kf,ag,sumg,N);toc
% Invoke the intrinsic sum and time the execution
tic;sum(ag);toc

The output indicates that this version is slightly faster than the intrinsic sum, which is, however, more convenient to use:
Elapsed time is 0.000357 seconds.
Elapsed time is 0.000393 seconds.
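Note that MATLAB JIT-compiles the PTX to native GPU code (SASS) the first time a kernel is used, so the very first invocation carries this extra cost; a warm-up call before timing gives more representative numbers. A sketch:

```matlab
% Warm-up call: the first invocation triggers the PTX -> SASS JIT
sumg = feval(kf, ag, sumg, N);
% Re-initialize the accumulator, since the kernel adds into it
sumg = gpuArray.zeros(1,'double');
% Time a subsequent call; gather forces the result back to the host,
% so the GPU work is guaranteed to have completed inside the timed region
tic; sumg = feval(kf, ag, sumg, N); s = gather(sumg); toc
```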

The real goal of using CUDA Fortran kernels is not to reimplement the intrinsic functions, but to add new capabilities, or simply to reuse
standalone code that has already been written, from a very productive environment such as MATLAB.




3 comments:

  1. Excellent.

    Can you also extract a pointer (to data resident on the device) from a mxGPUArray object in a MEX file and pass to a function written in CUDA Fortran? Do the MEX and pgf90 toolchains work nicely together for CUDA codes?

    The only reason I ask is that, while I agree that the cudaKernel interface as you describe is a lot more convenient than MEX files, I have noticed a significant performance penalty on the cudaKernel (with the *.ptx file) vs calling the same kernel on GPU data in a MEX file. (...and, of course, the resulting *.mexa64).

    a. I don't really know why the performance is better with the *.mexa64...I just know that it is.

    b. I wouldn't be surprised if it's just that I'm doing something wrong with the *.ptx file and the cudaKernel interface.

    c. Do you notice any performance change in the *.ptx version compared to other codes that incorporate the same CUDA Fortran kernel?

    Thanks again. Nice blog.

  2. I think Matlab needs to JIT the PTX file to generate SASS assembler (the native code that is going to execute on the GPU).
    Have you tried to call a warm-up kernel? I have noticed that the following invocations are much faster.
    The computer I was using had an old GPU, but I just replaced it with a faster one and will do more performance testing.

    The PGI toolchain and MEX don't play nice together, if I get something to work I will post my findings.

  3. Hi. That might have some effect, but the performance difference (~2x) doesn't seem to change for codes that call the kernel even hundreds of thousands of times over the course of execution. I've always assumed that there is some extra amount of processing that is done to the code to incorporate it into a *.mexa64 as opposed to a *.ptx (and, as you say, the corresponding SASS assembler...i didn't know that; thank you for the insight...) but I don't pretend to know. Whatever the case, for the same kernel with the same thread grid configuration, I consistently see the ~2x performance hit which is enough for me to go through the extra effort to use the MEX interface once the kernel is thoroughly tested.
