Lets' start by defining a couple of modules that we will use in the example.
The first one defines the precision we are going to use
module precision
! Precision control
integer, parameter, public :: Single = kind(0.0) ! Single precision
integer, parameter, public :: Double = kind(0.0d0) ! Double precision
integer, parameter, public :: fp_kind = Double
!integer, parameter, public :: fp_kind = Single
end module precision
Selecting fp_kind Single or Double will allow us to use the same code for single and double precision.
CUBLAS, a BLAS library for CUDA, has a C interface. We are going to use iso_c_binding and the interface construct to be able to call the functions in this library directly from Fortran.
module cublas
!
! Define the INTERFACE to the NVIDIA C code cublasSgemm and cublasDgemm
!
interface cuda_gemm
!
! void cublasSgemm (char transa, char transb, int m, int n,
! int k, float alpha, const float *A, int lda,
! const float *B, int ldb, float beta, float *C, int ldc)
!
subroutine cuda_sgemm(cta, ctb, m, n, k,&
alpha, A, lda, B, ldb, beta, c, ldc) bind(C,name='cublasSgemm')
use iso_c_binding
character(1,c_char),value :: cta, ctb
integer(c_int),value :: m,n,k,lda,ldb,ldc
real(c_float),value :: alpha,beta
real(c_float), device, dimension(lda,*) :: A
real(c_float), device, dimension(ldb,*) :: B
real(c_float), device, dimension(ldc,*) :: C
end subroutine cuda_sgemm
!
! void cublasDgemm (char transa, char transb, int m, int n,
! int k, double alpha, const double *A, int lda,
! const double *B, int ldb, double beta, double *C, int ldc)
!
subroutine cuda_dgemm(cta, ctb, m, n, k,&
alpha, A, lda, B, ldb, beta, c, ldc) bind(C,name='cublasDgemm')
use iso_c_binding
character(1,c_char),value :: cta, ctb
integer(c_int),value :: m,n,k,lda,ldb,ldc
real(c_double),value :: alpha,beta
real(c_double), device, dimension(lda,*) :: A
real(c_double), device, dimension(ldb,*) :: B
real(c_double), device, dimension(ldc,*) :: C
end subroutine cuda_dgemm
end interface
end module cublas
At this point we have all we need to write a simple example that will allocate the matrices A, B and C on the CPU and GPU, initialize them on the CPU, copy the content to the GPU, where we will perform a call to the appropriate GEMM ( depending on the precision selected) and transfer the result back to the CPU.
program gemm_test
use precision
use cublas
real(fp_kind) ,allocatable:: a(:,:),b(:,:),c(:,:)
real(fp_kind),device,allocatable:: a_d(:,:),b_d(:,:),c_d(:,:)
real(fp_kind):: alpha,beta
integer:: n,m,k
n=4
m=4
k=4
alpha=1._fp_kind
beta=2._fp_kind
! allocate arrays on the host
allocate (a(m,k))
allocate (b(k,n))
allocate (c(m,n))
! allocate arrays on the device
allocate (a_d(m,k))
allocate (b_d(k,n))
allocate (c_d(m,n))
!initialize arrays on host
a=1
b=2
c=3
!copy arrays to device
a_d=a
b_d=b
c_d=c
print *, "Matrix A:"
print *, a
print *, "Matrix B:"
print *, b
print *, "Matrix C:"
print *, c
call cuda_gemm ('N','N',m,n,k,alpha,a_d,m,b_d,k,beta,c_d,m)
c=c_d
print *, "Matrix C = alpha A*B+ beta C"
print *, c
!release memory on the host
deallocate (a,b,c)
!release memory on the device
deallocate (a_d,b_d,c_d)
end program gemm_test
We will need to compile this code with the CUDA Fortran compiler from Portland Group.
You should copy the code in a file test_gemm.cuf. It is important to use the right suffix, since we are using the device qualifier that is specific to CUDA Fortran. You can choose any name you want but you need to remember to use the .cuf suffix.
We are now ready to compile. We could create a Makefile, but for this simple example we can just invoke the compiler from the command line. We need to use the -Mcuda flag and then give the location and the name of the library (cublas) we want to link against.
pgf90 -Mcuda -o test_gemm test_gemm.cuf -L/usr/local/cuda/lib64 -lcublas
When you run the executable generated ( test_gemm), you should see an output similar to this one:
Matrix A:
1.000000000000000 1.000000000000000 1.000000000000000
1.000000000000000 1.000000000000000 1.000000000000000
1.000000000000000 1.000000000000000 1.000000000000000
1.000000000000000 1.000000000000000 1.000000000000000
1.000000000000000 1.000000000000000 1.000000000000000
1.000000000000000
Matrix B:
2.000000000000000 2.000000000000000 2.000000000000000
2.000000000000000 2.000000000000000 2.000000000000000
2.000000000000000 2.000000000000000 2.000000000000000
2.000000000000000 2.000000000000000 2.000000000000000
2.000000000000000 2.000000000000000 2.000000000000000
2.000000000000000
Matrix C:
3.000000000000000 3.000000000000000 3.000000000000000
3.000000000000000 3.000000000000000 3.000000000000000
3.000000000000000 3.000000000000000 3.000000000000000
3.000000000000000 3.000000000000000 3.000000000000000
3.000000000000000 3.000000000000000 3.000000000000000
3.000000000000000
Matrix C = alpha A*B+ beta C
14.00000000000000 14.00000000000000 14.00000000000000
14.00000000000000 14.00000000000000 14.00000000000000
14.00000000000000 14.00000000000000 14.00000000000000
14.00000000000000 14.00000000000000 14.00000000000000
14.00000000000000 14.00000000000000 14.00000000000000
If we want to rerun the code in single precision, we only need to select fp_kind=Single in the module precision and recompile.
The code has been written in such a way, that all the definitions are precision agnostic. Yes, Fortran 90 is quite powerful and elegant.
Just starting to get into CUDA fortran and wondering what sort of speed-ups you're getting by using CUBLAS?
ReplyDelete