Monday, July 19, 2010

Using zero copy from Fortran

This time I am going to show how to use the zero-copy feature of CUDA C from a generic Fortran 90 compiler. Since I am not going to use CUDA Fortran, we will need the ISO C binding feature available in pretty much all Fortran 90 compilers (PGI, Intel, g95, gfortran, just to cite a few).

The basic idea is to use the original CUDA C functions to allocate host arrays that are page-locked (aka pinned) and carry the right attributes for the zero-copy feature of CUDA. If you are not familiar with zero copy in CUDA C: on many newer CUDA-enabled GPUs it lets compute kernels access host system memory directly, so there is no need to call cudaMemcpy.

To declare the mapped array, we will need to perform the following steps:
  1. Set the device flag for mapping host memory: this is achieved with a call to cudaSetDeviceFlags with the flag cudaDeviceMapHost.
  2. Allocate the host mapped arrays: this is achieved with cudaHostAlloc with the flag cudaHostAllocMapped.
  3. Get the device pointers to the mapped memory. These are the pointers that we will pass to the CUDA kernels. This is achieved with calls to cudaHostGetDevicePointer.
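
In this example, step 3 will be performed on the C side, inside the kernel launcher, but cudaHostGetDevicePointer could equally be wrapped in Fortran like the other runtime functions below. A minimal sketch of such an interface (an assumption on my part, not used in the code that follows):

interface
  integer function cudaHostGetDevicePointer(dptr, hptr, flags) &
      bind(C,name="cudaHostGetDevicePointer")
    use iso_c_binding
    implicit none
    type (C_PTR) :: dptr              ! receives the device pointer
    type (C_PTR), value :: hptr       ! mapped host pointer from cudaHostAlloc
    integer (C_INT), value :: flags   ! must be 0 for now
  end function cudaHostGetDevicePointer
end interface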

Since we are using a standard Fortran 90 compiler, we can't use the built-in allocator (it has no knowledge of pinned memory). We need a couple of extra steps: call the CUDA allocator in C, and then associate the resulting C pointer with a Fortran array using the subroutine c_f_pointer provided by the ISO C binding module.
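
If c_f_pointer is new to you, here is a minimal self-contained sketch of the same two-step pattern, using the standard C malloc in place of the CUDA allocator (the names and the hard-coded size of 8 bytes per double are just for illustration):

program c_f_pointer_demo
  use iso_c_binding
  implicit none
  interface
     type (C_PTR) function c_malloc(size) bind(C,name="malloc")
       use iso_c_binding
       integer (C_SIZE_T), value :: size
     end function c_malloc
     subroutine c_free(ptr) bind(C,name="free")
       use iso_c_binding
       type (C_PTR), value :: ptr
     end subroutine c_free
  end interface
  real(C_DOUBLE), pointer :: x(:)
  type (C_PTR) :: cptr
  integer, parameter :: n = 4

  ! Allocate raw memory in C, then view it as a Fortran array of n doubles
  cptr = c_malloc(int(n,C_SIZE_T)*8_C_SIZE_T)
  call c_f_pointer(cptr, x, (/ n /))

  x = 1.0d0          ! from here on, x behaves like a normal Fortran array
  print *, x
  call c_free(cptr)
end program c_f_pointer_demo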

Let's start with a module that declares the interfaces to the CUDA runtime functions we will need: cudaHostAlloc, cudaFreeHost and cudaSetDeviceFlags.



!
! Module to interface the CUDA runtime functions
!

module cuda_runtime

  integer, parameter :: cudaHostAllocPortable = 1, &
                        cudaHostAllocMapped   = 2, &
                        cudaDeviceMapHost     = 8

  interface
     !
     ! cudaHostAlloc
     !
     integer function cudaHostAlloc(buffer, size, flag) bind(C,name="cudaHostAlloc")
       use iso_c_binding
       implicit none
       type (C_PTR) :: buffer
       integer (C_SIZE_T), value :: size
       integer (C_INT), value :: flag
     end function cudaHostAlloc
     !
     ! cudaFreeHost
     !
     integer function cudaFreeHost(buffer) bind(C,name="cudaFreeHost")
       use iso_c_binding
       implicit none
       type (C_PTR), value :: buffer
     end function cudaFreeHost
     !
     ! cudaSetDeviceFlags
     !
     integer function cudaSetDeviceFlags(flag) bind(C,name="cudaSetDeviceFlags")
       use iso_c_binding
       implicit none
       integer (C_INT), value :: flag
     end function cudaSetDeviceFlags

  end interface
end module cuda_runtime



Now that we have a working interface to the CUDA runtime, let's write a simple Fortran program that computes the exponential of each element of a double precision array, both on the CPU and on the GPU.
A is the input array and C is the output array from the GPU computation. Since we want to use the zero-copy feature on these two, we will allocate them with cudaHostAlloc. B is the array we will use to compute a reference solution on the CPU, allocated with the standard Fortran allocator.



!
! main.f90
!
program main

  use iso_c_binding
  use cuda_runtime
  implicit none

  integer, parameter :: fp_kind = kind(0.0d0) ! Double precision

  real(fp_kind), pointer, dimension (:) :: A, C
  real(fp_kind), allocatable, dimension (:) :: B
  type(C_PTR) :: cptr_A, cptr_C

  integer i, N
  integer err

  ! Number of elements in the arrays
  N=10

  ! Initialize the random number generator
  call random_seed()

  ! Allocate A and C using cudaHostAlloc, then map the C pointers to Fortran arrays

  write(*,*) 'Allocate host memory'
  err = cudaSetDeviceFlags(cudaDeviceMapHost)
  if (err > 0) print *,"Error in cudaSetDeviceFlags =",err

  ! sizeof is a common vendor extension; we size the buffer from an actual
  ! double precision value, not from the kind parameter
  err = cudaHostAlloc(cptr_A, N*sizeof(1.0_fp_kind), cudaHostAllocMapped)
  if (err > 0) print *,"Error in allocating A with cudaHostAlloc =",err
  call c_f_pointer(cptr_A, A, (/N/))

  err = cudaHostAlloc(cptr_C, N*sizeof(1.0_fp_kind), cudaHostAllocMapped)
  if (err > 0) print *,"Error in allocating C with cudaHostAlloc =",err
  call c_f_pointer(cptr_C, C, (/N/))

  ! From this point on, we can use A and C as normal Fortran arrays

  ! Allocate B using a standard allocate call
  allocate(B(N))

  ! Initialize A with random numbers
  call random_number(A)

  ! Compute the reference solution on the CPU
  write(*,*) 'computation on CPU'
  do i = 1, N
     B(i) = exp(A(i))
  end do

  ! Same computation on the GPU
  write(*,*) 'computation on GPU'
  call gexp(A, C, N)

  ! Print the computed quantities
  do i = 1, N
     write (*,'(i2,1x,4(g14.8))') i, A(i), B(i), C(i), C(i)-B(i)
  end do

  ! Release memory
  deallocate(B)
  err = cudaFreeHost(cptr_A)
  err = cudaFreeHost(cptr_C)

end program main


Since we are using standard Fortran, we will need to write the GPU computation in CUDA C. When interfacing C and Fortran, it is important to remember that while arguments in C are passed by value, in Fortran they are passed by reference: this is why the C routine below receives N as an int* and dereferences it, and why it is named gexp_ with a trailing underscore to match the Fortran compiler's name mangling.
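
As an aside, the trailing-underscore convention is compiler dependent. An alternative (a sketch, not used in this example) is to declare an explicit bind(C) interface on the Fortran side, which pins down the exact C name and makes the by-reference passing explicit:

interface
   subroutine gexp(a, c, n) bind(C,name="gexp_")
      use iso_c_binding
      real (c_double) :: a(*), c(*)  ! arrays are passed by reference
      integer (c_int) :: n           ! no 'value' attribute: passed by
                                     ! reference, matching int* on the C side
   end subroutine gexp
end interface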


/*
  kernel_code.cu
*/
#include <stdio.h>

// Device code
__global__ void CUDAexp(double* b, double* c, int N) {
  int index = threadIdx.x + blockDim.x*blockIdx.x;
  if ( index < N ) c[index] = exp(b[index]);
}

extern "C" void gexp_(double *a, double *d, int* N1)
{
  double *b, *c;
  int N = *N1;
  cudaError_t statusb, statusc, err;

  // Retrieve the device pointers corresponding to the mapped host arrays
  statusb = cudaHostGetDevicePointer((void **)&b, (void *) a, 0);
  statusc = cudaHostGetDevicePointer((void **)&c, (void *) d, 0);

  if (statusb != cudaSuccess || statusc != cudaSuccess) {
    printf("Error when mapping host arrays to device pointers!\n");
    printf("%s\n", cudaGetErrorString(statusb));
    printf("%s\n", cudaGetErrorString(statusc));
  }

  // Call the CUDA kernel: a single block for this simple example,
  // so N must not exceed the maximum number of threads per block
  CUDAexp<<<1,N>>>(b, c, N);

  err = cudaGetLastError();
  if (err != cudaSuccess) printf("Error in kernel execution\n");

  // This synchronization is essential to retrieve the correct values,
  // since the kernel writes directly to host memory
  cudaThreadSynchronize();
}


Now that we have all the files, let's write a simple Makefile.


all: TestZeroCopy

TestZeroCopy: main.f90 kernel_code.o
	ifort -o TestZeroCopy main.f90 kernel_code.o -L/usr/local/cuda/lib64 -lcudart -lstdc++

kernel_code.o: kernel_code.cu
	nvcc -c -O3 -arch sm_13 kernel_code.cu

clean:
	rm kernel_code.o TestZeroCopy cuda_runtime.mod



Compiling and running the code will show the following output:

$./TestZeroCopy

Allocate host memory
computation on CPU
computation on GPU
1 0.39208682E-06 1.0000004 1.0000004 0.0000000
2 0.25480443E-01 1.0258078 1.0258078 0.0000000
3 0.35251616 1.4226426 1.4226426 0.0000000
4 0.66691448 1.9482168 1.9482168 0.0000000
5 0.96305553 2.6196888 2.6196888 0.44408921E-15
6 0.83828820 2.3124052 2.3124052 -.44408921E-15
7 0.33535504 1.3984368 1.3984368 -.22204460E-15
8 0.91532720 2.4975923 2.4975923 0.0000000
9 0.79586368 2.2163544 2.2163544 -.44408921E-15
10 0.83269314 2.2995033 2.2995033 0.44408921E-15

Monday, May 24, 2010

Calling CUFFT from Cuda Fortran

This example shows how to call CUFFT from CUDA Fortran.
We are still going to use iso_c_binding to wrap the CUFFT functions, as we did for CUBLAS.

There are a few points to outline in the wrapper:

CUFFT uses plans (opaque objects) to store information on the transforms and auxiliary arrays. We will treat a plan as an integer in Fortran: the calls that create and destroy a plan generate all the proper information, and the integer is just a handle to the opaque object.

CUFFT uses several constants (CUFFT_C2C, CUFFT_FORWARD, just to name a few), some of which are defined as hex numbers.
Remember that to express a hex number in Fortran, you need to remove the 0x prefix and use the Z prefix instead:
CUFFT_R2C=0x2a becomes CUFFT_R2C=Z'2a' in Fortran.

To keep the code simple, we just show the wrappers for the creation and destruction of the plan (cufftPlan1d and cufftDestroy) and for the execution of complex-to-complex transforms in both single (cufftExecC2C) and double (cufftExecZ2Z) precision. Adding further plan creation and execution routines is very simple.
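
For example, a wrapper for a 2D plan would follow exactly the same pattern; a sketch (cufftPlan2d takes the two sizes and the transform type, with no batch argument):

interface cufftPlan2d
   subroutine cufftPlan2d(plan, nx, ny, type) bind(C,name='cufftPlan2d')
      use iso_c_binding
      integer(c_int) :: plan
      integer(c_int), value :: nx, ny, type
   end subroutine cufftPlan2d
end interface cufftPlan2d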


!
! Define the INTERFACE to the NVIDIA CUFFT routines
!

module cufft

  integer, public, parameter :: CUFFT_FORWARD = -1
  integer, public, parameter :: CUFFT_INVERSE =  1
  integer, public, parameter :: CUFFT_R2C = Z'2a' ! Real to Complex (interleaved)
  integer, public, parameter :: CUFFT_C2R = Z'2c' ! Complex (interleaved) to Real
  integer, public, parameter :: CUFFT_C2C = Z'29' ! Complex to Complex, interleaved
  integer, public, parameter :: CUFFT_D2Z = Z'6a' ! Double to Double-Complex
  integer, public, parameter :: CUFFT_Z2D = Z'6c' ! Double-Complex to Double
  integer, public, parameter :: CUFFT_Z2Z = Z'69' ! Double-Complex to Double-Complex

  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  !
  ! cufftPlan1d(cufftHandle *plan, int nx, cufftType type, int batch)
  !
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

  interface cufftPlan1d
     subroutine cufftPlan1d(plan, nx, type, batch) bind(C,name='cufftPlan1d')
       use iso_c_binding
       integer(c_int) :: plan
       integer(c_int), value :: nx, batch, type
     end subroutine cufftPlan1d
  end interface cufftPlan1d

  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  !
  ! cufftDestroy(cufftHandle plan)
  !
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

  interface cufftDestroy
     subroutine cufftDestroy(plan) bind(C,name='cufftDestroy')
       use iso_c_binding
       integer(c_int), value :: plan
     end subroutine cufftDestroy
  end interface cufftDestroy

  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  !
  ! cufftExecC2C(cufftHandle plan,
  !              cufftComplex *idata,
  !              cufftComplex *odata,
  !              int direction)
  !
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

  interface cufftExecC2C
     subroutine cufftExecC2C(plan, idata, odata, direction) &
          & bind(C,name='cufftExecC2C')
       use iso_c_binding
       use precision
       integer(c_int), value :: direction
       integer(c_int), value :: plan
       complex(fp_kind), device :: idata(*), odata(*)
     end subroutine cufftExecC2C
  end interface cufftExecC2C

  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  !
  ! cufftExecZ2Z(cufftHandle plan,
  !              cufftDoubleComplex *idata,
  !              cufftDoubleComplex *odata,
  !              int direction)
  !
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

  interface cufftExecZ2Z
     subroutine cufftExecZ2Z(plan, idata, odata, direction) &
          & bind(C,name='cufftExecZ2Z')
       use iso_c_binding
       use precision
       integer(c_int), value :: direction
       integer(c_int), value :: plan
       complex(fp_kind), device :: idata(*), odata(*)
     end subroutine cufftExecZ2Z
  end interface cufftExecZ2Z

end module cufft



With the cufft wrapper and the precision module, we have all we need to write a simple program that performs a forward transform out of place, followed by an inverse transform in place. Since the output of CUFFT is not normalized, we should see the final array equal to the initial one scaled by the length of the transform.
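
If we wanted the round trip to return exactly the input, we could rescale by hand after copying the result back to the host, for example:

! CUFFT transforms are unnormalized: IFFT(FFT(a)) = n * a,
! so dividing by the transform length recovers the original data
b = b / n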


program fft_test
  use precision
  use cufft
  complex(fp_kind), allocatable :: a(:), b(:)
  complex(fp_kind), device, allocatable :: a_d(:), b_d(:)

  integer :: n
  integer :: plan

  n=8

  ! allocate arrays on the host
  allocate (a(n), b(n))

  ! allocate arrays on the device
  allocate (a_d(n))
  allocate (b_d(n))

  ! initialize array on host
  a=1

  ! copy array to device
  a_d=a

  ! Print initial array
  print *, "Array A:"
  print *, a

  ! Initialize the plan
  call cufftPlan1d(plan,n,CUFFT_Z2Z,1)

  ! Execute FFTs: forward out of place, then inverse in place
  call cufftExecZ2Z(plan,a_d,b_d,CUFFT_FORWARD)
  call cufftExecZ2Z(plan,b_d,b_d,CUFFT_INVERSE)

  ! Copy results back to host
  b=b_d

  ! Print result array
  print *, "Array B"
  print *, b

  ! release memory on the host
  deallocate (a,b)

  ! release memory on the device
  deallocate (a_d,b_d)

  ! Destroy the plan
  call cufftDestroy(plan)

end program fft_test



To compile the new example, we will repeat what we did for the CUBLAS example. This time, instead of linking CUBLAS, we will link CUFFT.


pgf90 -Mcuda -o test_fft test_fft.cuf -L/usr/local/cuda/lib64 -lcufft


If we execute the code, we should see this output:

./test_fft

Array A:
(1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000)
(1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000)
(1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000)
(1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000)

Array B
(8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000)
(8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000)
(8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000)
(8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000)



As expected, the output is the input multiplied by the length of the transform (8 in this case).

Tuesday, May 18, 2010

Calling CUBLAS from CUDA Fortran

This is a simple example that shows how to call a CUBLAS function (SGEMM or DGEMM) from CUDA Fortran.


Let's start by defining a couple of modules that we will use in the example.
The first one defines the precision we are going to use:


module precision
  ! Precision control

  integer, parameter, public :: Single = kind(0.0)   ! Single precision
  integer, parameter, public :: Double = kind(0.0d0) ! Double precision

  integer, parameter, public :: fp_kind = Double
  !integer, parameter, public :: fp_kind = Single

end module precision



Setting fp_kind to Single or Double will allow us to use the same code for single and double precision.

CUBLAS, the BLAS library for CUDA, has a C interface. We are going to use iso_c_binding and the interface construct to call the functions in this library directly from Fortran.


module cublas
  !
  ! Define the INTERFACE to the NVIDIA C functions cublasSgemm and cublasDgemm
  !
  interface cuda_gemm
     !
     ! void cublasSgemm (char transa, char transb, int m, int n,
     !                   int k, float alpha, const float *A, int lda,
     !                   const float *B, int ldb, float beta, float *C, int ldc)
     !
     subroutine cuda_sgemm(cta, ctb, m, n, k, &
          alpha, A, lda, B, ldb, beta, C, ldc) bind(C,name='cublasSgemm')
       use iso_c_binding
       character(1,c_char), value :: cta, ctb
       integer(c_int), value :: m, n, k, lda, ldb, ldc
       real(c_float), value :: alpha, beta
       real(c_float), device, dimension(lda,*) :: A
       real(c_float), device, dimension(ldb,*) :: B
       real(c_float), device, dimension(ldc,*) :: C
     end subroutine cuda_sgemm

     !
     ! void cublasDgemm (char transa, char transb, int m, int n,
     !                   int k, double alpha, const double *A, int lda,
     !                   const double *B, int ldb, double beta, double *C, int ldc)
     !
     subroutine cuda_dgemm(cta, ctb, m, n, k, &
          alpha, A, lda, B, ldb, beta, C, ldc) bind(C,name='cublasDgemm')
       use iso_c_binding
       character(1,c_char), value :: cta, ctb
       integer(c_int), value :: m, n, k, lda, ldb, ldc
       real(c_double), value :: alpha, beta
       real(c_double), device, dimension(lda,*) :: A
       real(c_double), device, dimension(ldb,*) :: B
       real(c_double), device, dimension(ldc,*) :: C
     end subroutine cuda_dgemm

  end interface

end module cublas



At this point we have all we need to write a simple example that allocates the matrices A, B and C on the CPU and GPU, initializes them on the CPU, copies the content to the GPU, calls the appropriate GEMM (depending on the precision selected), and transfers the result back to the CPU.


program gemm_test
  use precision
  use cublas
  real(fp_kind), allocatable :: a(:,:), b(:,:), c(:,:)
  real(fp_kind), device, allocatable :: a_d(:,:), b_d(:,:), c_d(:,:)
  real(fp_kind) :: alpha, beta
  integer :: n, m, k

  n=4
  m=4
  k=4
  alpha=1._fp_kind
  beta=2._fp_kind

  ! allocate arrays on the host
  allocate (a(m,k))
  allocate (b(k,n))
  allocate (c(m,n))

  ! allocate arrays on the device
  allocate (a_d(m,k))
  allocate (b_d(k,n))
  allocate (c_d(m,n))

  ! initialize arrays on host
  a=1
  b=2
  c=3

  ! copy arrays to device
  a_d=a
  b_d=b
  c_d=c

  print *, "Matrix A:"
  print *, a

  print *, "Matrix B:"
  print *, b

  print *, "Matrix C:"
  print *, c

  call cuda_gemm('N','N',m,n,k,alpha,a_d,m,b_d,k,beta,c_d,m)

  c=c_d
  print *, "Matrix C = alpha A*B + beta C"
  print *, c

  ! release memory on the host
  deallocate (a,b,c)

  ! release memory on the device
  deallocate (a_d,b_d,c_d)

end program gemm_test



We will need to compile this code with the CUDA Fortran compiler from The Portland Group.

You should copy the code into a file test_gemm.cuf. It is important to use the right suffix, since we are using the device qualifier, which is specific to CUDA Fortran. You can choose any name you want, but you need to remember to use the .cuf suffix.

We are now ready to compile. We could create a Makefile, but for this simple example we can just invoke the compiler from the command line. We need to use the -Mcuda flag and then give the location and name of the library (cublas) we want to link against.


pgf90 -Mcuda -o test_gemm test_gemm.cuf -L/usr/local/cuda/lib64 -lcublas


When you run the generated executable (test_gemm), you should see an output similar to this one:


Matrix A:
1.000000000000000 1.000000000000000 1.000000000000000
1.000000000000000 1.000000000000000 1.000000000000000
1.000000000000000 1.000000000000000 1.000000000000000
1.000000000000000 1.000000000000000 1.000000000000000
1.000000000000000 1.000000000000000 1.000000000000000
1.000000000000000
Matrix B:
2.000000000000000 2.000000000000000 2.000000000000000
2.000000000000000 2.000000000000000 2.000000000000000
2.000000000000000 2.000000000000000 2.000000000000000
2.000000000000000 2.000000000000000 2.000000000000000
2.000000000000000 2.000000000000000 2.000000000000000
2.000000000000000
Matrix C:
3.000000000000000 3.000000000000000 3.000000000000000
3.000000000000000 3.000000000000000 3.000000000000000
3.000000000000000 3.000000000000000 3.000000000000000
3.000000000000000 3.000000000000000 3.000000000000000
3.000000000000000 3.000000000000000 3.000000000000000
3.000000000000000
Matrix C = alpha A*B+ beta C
14.00000000000000 14.00000000000000 14.00000000000000
14.00000000000000 14.00000000000000 14.00000000000000
14.00000000000000 14.00000000000000 14.00000000000000
14.00000000000000 14.00000000000000 14.00000000000000
14.00000000000000 14.00000000000000 14.00000000000000


If we want to rerun the code in single precision, we only need to select fp_kind=Single in the precision module and recompile.
The code has been written in such a way that all the definitions are precision agnostic. Yes, Fortran 90 is quite powerful and elegant.