Wednesday, September 11, 2013

Calling CUDA Fortran kernels from MATLAB

Recent MATLAB versions, starting from R2010b, have a very cool feature that enables calling CUDA C kernels from MATLAB code.
This is much better and simpler than writing MEX files to call CUDA code (being the original author of the first CUDA MEX files and of the NVIDIA white paper, I am speaking from experience) and it is a very powerful tool.

Let's take a very simple CUDA C kernel, add.cu, that adds a scalar to each element of a vector:


__global__ void add(double * in, double a, int N)  {
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < N) {
       in[idx] += a;
   }
}

The full documentation is available at
http://www.mathworks.com/help/distcomp/executing-cuda-or-ptx-code-on-the-gpu.html
I am just going to summarize the required steps:

  • Generate a PTX  file from the kernel source
    • nvcc -ptx -arch sm_20 add.cu
  • Construct the kernel object from the PTX file
    • k=parallel.gpu.CUDAKernel('add.ptx','add.cu');
  • Set up the block and grid configuration, for example 28 blocks of 256 threads each:
    • k.ThreadBlockSize=[256 1 1]
    • k.GridSize=[28 1 1]
  • Execute the kernel.
    • o = feval(k,rand(10,1),2.,10)
    • The gpuArray o contains the output of the kernel
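Putting the steps together, a minimal MATLAB session for the CUDA C kernel might look like this (a sketch, assuming add.ptx has been generated in the current directory):

```matlab
% Build the kernel object from the PTX file and the CUDA C source
k = parallel.gpu.CUDAKernel('add.ptx','add.cu');
% 28 blocks of 256 threads each
k.ThreadBlockSize = [256 1 1];
k.GridSize = [28 1 1];
% Add the scalar 2 to a 10-element random vector
o = feval(k, rand(10,1), 2., 10);
% Copy the result back to the CPU
result = gather(o);
```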

It is possible to do the same with CUDA Fortran.
First of all, we will need to rewrite the code in CUDA Fortran (shameless plug: if you want
to learn more about CUDA Fortran, there is a very good book you can pre-order from Amazon,
"CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming"). This is the equivalent code:

attributes(global) subroutine add(a, b, N)
    implicit none
    double precision, intent(inout) :: a(*)
    double precision, value :: b
    integer, value :: N
    integer :: i

    i = threadIdx%x + (blockIdx%x-1)*blockDim%x
    if (i <= N) a(i) = a(i) + b

end subroutine add

To generate the PTX file we call pgf90, instead of nvcc, with the appropriate
flags:

               pgf90 -c -Mcuda=keepptx,cc20  addf.cuf

The keepptx flag keeps the intermediate PTX file, here generated for compute capability 2.0 as addf.n001.ptx.
If the compute capability is omitted, or if you specify multiple targets, the PGI compiler will generate several PTX files; you will need to inspect them to check the compute capability of each, since the numbering is just an enumeration. We can perform this step from an OS shell or from inside MATLAB.
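For example, compiling for two targets might look like the following sketch (the n-numbered file names follow PGI's enumeration; which file corresponds to which target must be checked by hand):

```shell
pgf90 -c -Mcuda=keepptx,cc20,cc35 addf.cuf
# Two PTX files are produced, e.g. addf.n001.ptx and addf.n002.ptx.
# The ".target" directive inside each file identifies the compute
# capability it was generated for:
grep -m1 ".target" addf.n001.ptx    # e.g. ".target sm_20"
```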
In order to invoke the compiler from the MATLAB prompt, we need to load the proper bash environment by issuing the command:

               setenv('BASH_ENV','~/.bash_profile');

and then issuing the pgf90 command preceded by an exclamation point. The exclamation point indicates that the rest of the input line is passed to the operating system as a command.

               !pgf90 -c -Mcuda=keepptx,cc20  addf.cuf


In order to load the PTX file in MATLAB, we need to change the syntax slightly.
When loading the PTX file generated from CUDA C, we passed both the PTX file name and
the original CUDA C file; in this way, MATLAB automatically discovers the prototype of the function. Alternatively, we can pass the prototype signature explicitly to parallel.gpu.CUDAKernel.

This is what we need to load the PTX file generated from CUDA Fortran.

       kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ');

Once we have created the kernel object kf, the calling sequence is the same one we used before.
We will set up the block and grid configuration, for example 28 blocks of 256 threads each:

    • kf.ThreadBlockSize=[256 1 1]
    • kf.GridSize=[28 1 1]
and execute the kernel.
    • of = feval(kf,rand(10,1),2.,10)

This is the full sequence of the MATLAB code with a verbose output to check all the intermediate steps:

% Create a 1D array of doubles with 10 elements
i1=gpuArray(rand(10,1))
% Create the kernel object from the PTX file with explicit prototype
kf=parallel.gpu.CUDAKernel('addf.n001.ptx',' double *, double, int ')
% Set execution configuration
kf.ThreadBlockSize=[256 1 1]
kf.GridSize=[28 1 1]
% Execute the kernel
of=feval(kf,i1,10.,10)


An important point for CUDA Fortran kernels is that you cannot use assumed-shape array arguments, since they require the compiler to build and pass an array descriptor as a hidden extra argument.
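The difference is only in the dummy-argument declaration; for illustration, compare the two forms:

```fortran
! This will NOT work from MATLAB: an assumed-shape dummy argument
! makes the compiler pass a hidden array descriptor along with the data.
double precision, intent(inout) :: a(:)

! This works: an assumed-size dummy argument receives only the raw
! device pointer, which is what MATLAB passes for 'double *'.
double precision, intent(inout) :: a(*)
```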


Now that we understand all the steps, let's move on to something more complex and discuss a few more points.
We are going to implement a kernel that computes the sum of an array in a single pass using an atomic lock
(the implementation and accuracy of parallel sums are discussed in detail in Chapter 5 of the aforementioned book).
The kernel is embedded in a module, since we are using a global variable for the lock.
There is no limit on the number of elements the routine can handle, aside from the fact that we are using 32-bit integers
for addressing: each thread will process multiple elements if needed.

This is the code:

module sumgpu

  implicit none
  integer, parameter :: fp_kind = kind(0.0d0)
  integer, device::  lock=0

contains

  attributes(global) subroutine sum(input,totalsum,N)
    real(fp_kind), intent(in) :: input(N)
    real(fp_kind) :: totalsum(1)
    integer,value :: N
    real(fp_kind), shared, dimension(256) :: psum
    integer :: i,index, inext
    real(fp_kind) :: lsum

    index=threadIdx%x+(BlockIdx%x-1)*BlockDim%x

    lsum = 0._fp_kind
    do i=index,N,BlockDim%x*GridDim%x
       lsum = lsum+ input(i)
    end do

    ! Local reduction per block
    index=threadIdx%x

    psum(index)=lsum
    call syncthreads()

    inext=blockDim%x/2
    do while ( inext >=1 )
       if (index <=inext) psum(index)=psum(index)+psum(index+inext)
       inext = inext /2
       call syncthreads()
    end do

    ! Final reduction among block with atomic lock
    if (index == 1) then
       do while ( atomiccas(lock,0,1) == 1)
       end do
       totalsum(1)=totalsum(1)+psum(1)
       call threadfence()
       lock =0
    end if

  end subroutine sum

end module sumgpu

If we generate and load the module as seen before, we can observe the following:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')

kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 2
       NumRHSArguments: 3
         ArgumentTypes: {'inout double vector'  'inout double vector'  'in int32 scalar'}


The entry point is now sumgpu_sum_, even though the subroutine was named sum. This is a consequence of it being embedded in a module:
when the CUDA Fortran compiler generates the PTX file, it renames the subroutine entry as a concatenation of the module name, the subroutine name, and a trailing underscore.
While this does not matter when the module contains a single subroutine, it is crucial when multiple entry points are defined. If the module had contained multiple subroutines, we would have received an error when trying to load the PTX file:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int')
Error using handleKernelArgs (line 61)
Found more than one entry point in the PTX code.  Possible names are:
sumgpu_sum_
sumgpu_sum2_


In this case, we would have to modify the command and add an extra argument at the end of the list that specifies the entry point.

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','double *, double *, int','sumgpu_sum_')
kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 2
       NumRHSArguments: 3
         ArgumentTypes: {'inout double vector'  'inout double vector'  'in int32 scalar'}

The command now completes correctly. However, with the prototype signature we specified, the first array, which in the original code was
declared with intent(in) since it is only an input to the subroutine, is marked as 'inout double vector'. This is not a major problem, but we would
need to remember to specify two vectors as outputs on the left-hand side when using the object in MATLAB.
We can fix this by changing the prototype signature to:

>> kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int','sumgpu_sum_')

kf = 

  CUDAKernel with properties:

       ThreadBlockSize: [1 1 1]
    MaxThreadsPerBlock: 1024
              GridSize: [1 1 1]
      SharedMemorySize: 0
            EntryPoint: 'sumgpu_sum_'
    MaxNumLHSArguments: 1
       NumRHSArguments: 3
         ArgumentTypes: {'in double vector'  'inout double vector'  'in int32 scalar'}

where we have replaced 'double *' with 'const double *' to reflect that the array is read-only.
We are now ready to run the code:

%Generate an array of 1024 elements on the CPU
a=rand(1024,1);
% Copy the array to a GPU array ag
ag=gpuArray(a);
%Generate the kernel object and setup the execution configuration
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');
kf.ThreadBlockSize=[256 1 1];
kf.GridSize=[28 1 1];
% Initialize the sum to zero
sumg=gpuArray.zeros(1,'double');
% Invoke the kernel
disp('CUDA Fortran kernel:')
sumg=feval(kf,ag,sumg,1024)
% Recompute the sum using the intrinsic MATLAB function
disp('Intrinsic MATLAB sum on GPU:')
sum_matlab=sum(ag)
%Check the result
disp('Difference:')
sumg-sum_matlab

obtaining the following output:
CUDA Fortran kernel:

sumg =

  509.2181

Intrinsic MATLAB sum on GPU:

sum_matlab =

  509.2181

Difference:

ans =

     0

Now that we are confident that the code runs properly and gives correct results, we can do some performance testing.
We will generate 50 million random numbers directly on the GPU and then compute their sum.

%Set up random number generation on the GPU
seed=0;
gpu_stream = parallel.gpu.RandStream('CombRecursive','Seed',seed);
parallel.gpu.RandStream.setGlobalStream(gpu_stream);
N=50000000;
%Generate the random numbers directly on the GPU
ag=gpuArray.randn(N,1);
%Generate the kernel object and setup the execution configuration
kf=parallel.gpu.CUDAKernel('sumSingleBlock.n001.ptx','const double *, double *, int');
kf.ThreadBlockSize=[256 1 1];
kf.GridSize=[128 1 1];
% Initialize the sum to zero
sumg=gpuArray.zeros(1,'double');
% Invoke the kernel and time the execution
tic;sumg=feval(kf,ag,sumg,N);toc
% Invoke the intrinsic sum and time the execution
tic;sum(ag);toc

The output indicates that this version is slightly faster than the intrinsic sum, which is, however, more convenient to use:
Elapsed time is 0.000357 seconds.
Elapsed time is 0.000393 seconds.
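Note that MATLAB JIT-compiles the PTX to native GPU code (SASS) the first time a kernel is used, so the very first invocation carries this extra cost; a warm-up call before timing gives more representative numbers. A sketch:

```matlab
% Warm-up call: the first invocation triggers the PTX -> SASS JIT
sumg = feval(kf, ag, sumg, N);
% Re-initialize the accumulator, since the kernel adds into it
sumg = gpuArray.zeros(1,'double');
% Time a subsequent call; gather forces the result back to the host,
% so the GPU work is guaranteed to have completed inside the timed region
tic; sumg = feval(kf, ag, sumg, N); s = gather(sumg); toc
```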

The real goal of using CUDA Fortran kernels is not to reimplement the intrinsic functions, but to add new capabilities, or simply to reuse
standalone code that has already been written, from a very productive environment such as MATLAB.




3 comments:

  1. Excellent.

    Can you also extract a pointer (to data resident on the device) from a mxGPUArray object in a MEX file and pass to a function written in CUDA Fortran? Do the MEX and pgf90 toolchains work nicely together for CUDA codes?

    The only reason I ask is that, while I agree that the cudaKernel interface as you describe is a lot more convenient than MEX files, I have noticed a significant performance penalty on the cudaKernel (with the *.ptx file) vs calling the same kernel on GPU data in a MEX file. (...and, of course, the resulting *.mexa64).

    a. I don't really know why the performance is better with the *.mexa64...I just know that it is.

    b. I wouldn't be surprised if it's just that I'm doing something wrong with the *.ptx file and the cudaKernel interface.

    c. Do you notice any performance change in the *.ptx version compared to other codes that incorporate the same CUDA Fortran kernel?

    Thanks again. Nice blog.

  2. I think Matlab needs to JIT the PTX file to generate SASS assembler (the native code that is going to execute on the GPU).
    Have you tried to call a warm-up kernel? I have noticed that the following invocations are much faster.
    The computer I was using had an old GPU, but I just replaced it with a faster one and will do more performance testing.

    The PGI toolchain and MEX don't play nice together, if I get something to work I will post my findings.

  3. Hi. That might have some effect, but the performance difference (~2x) doesn't seem to change for codes that call the kernel even hundreds of thousands of times over the course of execution. I've always assumed that there is some extra amount of processing that is done to the code to incorporate it into a *.mexa64 as opposed to a *.ptx (and, as you say, the corresponding SASS assembler...i didn't know that; thank you for the insight...) but I don't pretend to know. Whatever the case, for the same kernel with the same thread grid configuration, I consistently see the ~2x performance hit which is enough for me to go through the extra effort to use the MEX interface once the kernel is thoroughly tested.
