Monday, May 24, 2010

Calling CUFFT from Cuda Fortran

This example shows how to call CUFFT from CUDA Fortran.
We are still going to use iso_c_binding to wrap the CUFFT functions, like we did for CUBLAS.

There are few points to outline in the wrapper:

CUFFT is using plans ( opaque object) to store information on the transforms and auxiliary array. We will treat a plan as integer in Fortran. The calls to create a plan and destroy a plan will generate all the proper information, the integer is just a pointer to the opaque object.

CUFFT uses several constants ( CUFFT_C2C, CUFFT_FORWARD, just to name a few). Some of them are defined as hex numbers.
Remember that to express an hex number in Fortran, you need to remove the 0x prefix and use Z.
CUFFT_R2C=0x2a will be defined as CUFFT_R2C=Z'2a' in Fortran.

To keep the code simple, we just show the wrapper for the creation and destruction of the plan ( cufftPlan1d and cufftDestroy) and for the execution of complex to complex transform both in single (cufftExecC2C) and double (cufftExecZ2Z) precision. Adding additional plan creations and execution is very simple.


!
! Define the INTERFACE to the NVIDIA CUFFT routines
!

module cufft

integer, public :: CUFFT_FORWARD = -1
integer, public :: CUFFT_INVERSE = 1
integer, public :: CUFFT_R2C = Z'2a' ! Real to Complex (interleaved)
integer, public :: CUFFT_C2R = Z'2c' ! Complex (interleaved) to Real
integer, public :: CUFFT_C2C = Z'29' ! Complex to Complex, interleaved
integer, public :: CUFFT_D2Z = Z'6a' ! Double to Double-Complex
integer, public :: CUFFT_Z2D = Z'6c' ! Double-Complex to Double
integer, public :: CUFFT_Z2Z = Z'69' ! Double-Complex to Double-Complex


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!
! cufftPlan1d(cufftHandle *plan, int nx,cufftType type,int batch)
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


interface cufftPlan1d
subroutine cufftPlan1d(plan, nx, type, batch) bind(C,name='cufftPlan1d')
use iso_c_binding
integer(c_int):: plan
integer(c_int),value:: nx, batch,type
end subroutine cufftPlan1d
end interface cufftPlan1d

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!
! cufftDestroy(cufftHandle plan)
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

interface cufftDestroy
subroutine cufftDestroy(plan) bind(C,name='cufftDestroy')
use iso_c_binding
integer(c_int),value:: plan
end subroutine cufftDestroy
end interface cufftDestroy

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!
! cufftExecC2C(cufftHandle plan,
! cufftComplex *idata,
! cufftComplex *odata,
! int direction)
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

interface cufftExecC2C
subroutine cufftExecC2C(plan, idata, odata, direction) &
& bind(C,name='cufftExecC2C')
use iso_c_binding
use precision
integer(c_int),value:: direction
integer(c_int),value:: plan
complex(fp_kind),device:: idata(*),odata(*)
end subroutine cufftExecC2C
end interface cufftExecC2C

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!
! cufftExecZ2Z(cufftHandle plan,
! cufftDoubleComplex *idata,
! cufftDoubleComplex *odata,
! int direction);
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
interface cufftExecZ2Z
subroutine cufftExecZ2Z(plan, idata, odata, direction) &
& bind(C,name='cufftExecZ2Z')
use iso_c_binding
use precision
integer(c_int),value:: direction
integer(c_int),value:: plan
complex(fp_kind),device:: idata(*),odata(*)
end subroutine cufftExecZ2Z
end interface cufftExecZ2Z

end module cufft



With the cufft wrapper and the precision module, we have all we need to write a simple program that perform a forward transform out of place, followed by an inverse transform in place. Since the output of CUFFT is not normilized, we should
see the final array equal to the initial one scaled by the lenght of the transform.


program fft_test
use precision
use cufft
complex(fp_kind) ,allocatable:: a(:),b(:)
complex(fp_kind),device,allocatable:: a_d(:),b_d(:)

integer:: n
integer:: plan

n=8

! allocate arrays on the host
allocate (a(n),b(n))

! allocate arrays on the device
allocate (a_d(n))
allocate (b_d(n))

!initialize arrays on host
a=1

!copy arrays to device
a_d=a


! Print initial array
print *, "Array A:"
print *, a



! Initialize the plan
call cufftPlan1D(plan,n,CUFFT_Z2Z,1)

! Execute FFTs
call cufftExecZ2Z(plan,a_d,b_d,CUFFT_FORWARD)

call cufftExecZ2Z(plan,b_d,b_d,CUFFT_INVERSE)


! Copy results back to host
b=b_d

! Print initial array
print *, "Array B"
print *, b

!release memory on the host
deallocate (a,b)

!release memory on the device
deallocate (a_d,b_d)

! Destroy the plan
call cufftDestroy(plan)

end program fft_test



To compile the new example, we will repeat what we did for the CUBLAS example. This time, instead of linking CUBLAS, we will link CUFFT.


pgf90 -Mcuda -o test_fft test_fft.cuf -L/usr/local/cuda/lib64 -lcufft


If we execute the code, we should see this output:

./test_fft

Array A:
(1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000)
(1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000)
(1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000)
(1.000000000000000,0.000000000000000) (1.000000000000000,0.000000000000000)

Array B
(8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000)
(8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000)
(8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000)
(8.000000000000000,0.000000000000000) (8.000000000000000,0.000000000000000)



As expected, the output is the input multiplied by the length of the transform ( 8 in this case).

6 comments:

  1. this is all great stuff, massimiliano! thx, will try it out.

    ReplyDelete
  2. Thank you! This works and saved my day indeed. Bow to Massimiliano :)

    ReplyDelete
  3. Thanks. This is very helpful.

    ReplyDelete
  4. Why did you take n=8 and not any other value?What all other values can be used?

    ReplyDelete
  5. If you're going to use Z2Z, you may need to declare your A and B arrays as double complex

    complex(dp_kind), allocatable :: a(:), b(:)

    and the arguments in the cufftExecuteZ2Z interface as well

    complex(dp_kind),device:: idata(*),odata(*)

    Great post!

    ReplyDelete
  6. I am getting following error

    terminate called without an active exception

    ReplyDelete