Monday, July 19, 2010

Using zero copy from Fortran

This time I am going to show how to use the zero-copy feature of CUDA C from a generic Fortran 90 compiler. Since I am not going to use CUDA Fortran, we will rely on the ISO C binding feature available in pretty much all Fortran 90 compilers (PGI, Intel, g95, gfortran, just to cite a few).

The basic idea is to use the original CUDA C functions to allocate host arrays that are page-locked (aka pinned) and carry the right attributes for the zero-copy feature of CUDA. If you are not familiar with zero copy, it allows compute kernels running on newer CUDA-enabled GPUs to access host system memory directly, so there is no need to call cudaMemcpy.

To declare the mapped arrays, we will need to perform the following steps (a minimal C sketch follows the list):
  1. Set the device flag for mapping host memory: this is achieved with a call to cudaSetDeviceFlags with the flag cudaDeviceMapHost.
  2. Allocate the host mapped arrays: this is achieved with cudaHostAlloc with the flag cudaHostAllocMapped.
  3. Get the device pointers to the mapped memory: these are the pointers that we will pass to the CUDA kernels. This is achieved with calls to cudaHostGetDevicePointer.
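
Before moving to Fortran, here is a minimal sketch of these three steps in plain CUDA C (host_a, dev_a and setup_mapped are illustrative names of mine, not part of the code later in this post):

/*
  Minimal CUDA C sketch of the three zero-copy steps above.
*/
#include <cuda_runtime.h>

double *host_a;  /* host pointer returned by cudaHostAlloc  */
double *dev_a;   /* device alias of the same pinned memory  */

void setup_mapped(size_t n)
{
    /* 1. Enable mapping of host memory (before any CUDA allocation) */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* 2. Allocate page-locked, mapped host memory */
    cudaHostAlloc((void **)&host_a, n * sizeof(double), cudaHostAllocMapped);

    /* 3. Get the device pointer that kernels will use */
    cudaHostGetDevicePointer((void **)&dev_a, (void *)host_a, 0);
}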

Since we are using a standard Fortran 90 compiler, we can't use the built-in allocator (it has no knowledge of pinned memory). We need a couple of extra steps: call the CUDA allocator in C, and then map the C pointer to a Fortran array with the c_f_pointer subroutine provided by the ISO C binding module.
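
In code, the pattern boils down to one function call and one subroutine call. Here is a minimal sketch (it uses the cuda_runtime interface module defined just below; cptr, fptr and n are placeholder names):

! Sketch of the allocate-and-map pattern (placeholder names)
program sketch
  use iso_c_binding
  use cuda_runtime   ! the interface module defined below
  implicit none
  type(C_PTR) :: cptr
  real(C_DOUBLE), pointer :: fptr(:)
  integer :: istat, n
  n = 100
  istat = cudaSetDeviceFlags(cudaDeviceMapHost)
  ! Allocate pinned, mapped memory through the CUDA runtime...
  istat = cudaHostAlloc(cptr, n*sizeof(1.0_C_DOUBLE), cudaHostAllocMapped)
  ! ...and attach an ordinary Fortran array pointer to it
  call c_f_pointer(cptr, fptr, (/ n /))
end program sketch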

Let's start with a module that declares the interfaces to the CUDA runtime functions that we will need: cudaHostAlloc, cudaFreeHost, and cudaSetDeviceFlags.



!
! Module to interface the CUDA runtime functions
!

module cuda_runtime

  ! Flag values taken from the CUDA runtime headers
  integer, parameter :: cudaHostAllocPortable = 1, &
                        cudaHostAllocMapped   = 2, &
                        cudaDeviceMapHost     = 8

  interface
    !
    ! cudaHostAlloc
    !
    integer function cudaHostAlloc(buffer, size, flag) bind(C, name="cudaHostAlloc")
      use iso_c_binding
      implicit none
      type (C_PTR) :: buffer
      integer (C_SIZE_T), value :: size
      integer (C_INT), value :: flag
    end function cudaHostAlloc
    !
    ! cudaFreeHost
    !
    integer function cudaFreeHost(buffer) bind(C, name="cudaFreeHost")
      use iso_c_binding
      implicit none
      type (C_PTR), value :: buffer
    end function cudaFreeHost
    !
    ! cudaSetDeviceFlags
    !
    integer function cudaSetDeviceFlags(flag) bind(C, name="cudaSetDeviceFlags")
      use iso_c_binding
      implicit none
      integer (C_INT), value :: flag
    end function cudaSetDeviceFlags

  end interface

end module cuda_runtime



Now that we have a working interface to the CUDA runtime, let's write a simple Fortran program that computes the exponential of each element of a double-precision array, both on the CPU and on the GPU.
A is the input array and C is the output array of the GPU computation. Since we want to use zero copy on these two, we will allocate them with cudaHostAlloc. B is the array we will use to compute a reference solution on the CPU; for this one we will use the standard Fortran allocator.



!
! main.f90
!
program main

  use iso_c_binding
  use cuda_runtime
  implicit none

  integer, parameter :: fp_kind = kind(0.0d0) ! Double precision

  real(fp_kind), pointer,     dimension (:) :: A, C
  real(fp_kind), allocatable, dimension (:) :: B
  type(C_PTR) :: cptr_A, cptr_C

  integer, allocatable :: seed_array(:)
  integer i, N, nseed
  integer err

  ! Number of elements in the arrays
  N = 10

  ! Initialize the random number generator with a fixed seed
  call random_seed(size=nseed)
  allocate(seed_array(nseed))
  seed_array = 1
  call random_seed(put=seed_array)

  ! Allocate A and C with cudaHostAlloc, then map the C pointers to Fortran arrays

  write(*,*) 'Allocate host memory'
  err = cudaSetDeviceFlags(cudaDeviceMapHost)
  if (err > 0) print *, "Error in cudaSetDeviceFlags =", err

  ! Note: the size is N times the size of one double-precision element
  err = cudaHostAlloc(cptr_A, N*sizeof(1.0_fp_kind), cudaHostAllocMapped)
  if (err > 0) print *, "Error in allocating A with cudaHostAlloc =", err
  call c_f_pointer(cptr_A, A, (/ N /))

  err = cudaHostAlloc(cptr_C, N*sizeof(1.0_fp_kind), cudaHostAllocMapped)
  if (err > 0) print *, "Error in allocating C with cudaHostAlloc =", err
  call c_f_pointer(cptr_C, C, (/ N /))

  ! From this point on, we can use A and C as normal Fortran arrays

  ! Allocate B using a standard allocate call
  allocate(B(N))

  ! Initialize A with random numbers
  call random_number(A)

  ! Compute the reference solution on the CPU
  write(*,*) 'computation on CPU'
  do i = 1, N
    B(i) = exp(A(i))
  end do

  ! Same computation on the GPU
  write(*,*) 'computation on GPU'
  call gexp(A, C, N)

  ! Print the computed quantities
  do i = 1, N
    write (*,'(i2,1x,4(g14.8))') i, A(i), B(i), C(i), C(i)-B(i)
  end do

  ! Release memory
  deallocate(B)
  err = cudaFreeHost(cptr_A)
  err = cudaFreeHost(cptr_C)

end program main


Since we are using standard Fortran, we will need to write the GPU computation in CUDA C. When interfacing C and Fortran, it is important to remember that while arguments in C are passed by value, in Fortran they are passed by reference: this is why gexp_ receives N as a pointer. Note also the trailing underscore in gexp_, which matches the name decoration most Fortran compilers apply to external symbols.
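
As an aside, you can avoid relying on the underscore naming convention by giving the Fortran side an explicit bind(C) interface. A minimal sketch of what that could look like (not used in this post; it assumes the C function is then named plain gexp):

! Alternative (not used below): an explicit bind(C) interface lets the
! Fortran compiler call a C function named exactly "gexp", with no
! trailing underscore needed on the C side.
interface
  subroutine gexp(a, c, n) bind(C, name="gexp")
    use iso_c_binding
    implicit none
    real(C_DOUBLE) :: a(*), c(*)  ! passed by reference, as in plain Fortran
    integer(C_INT) :: n           ! also by reference (no VALUE attribute)
  end subroutine gexp
end interface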


/*
  kernel_code.cu
*/
#include <stdio.h>

// Device code
__global__ void CUDAexp(double *b, double *c, int N)
{
    int index = threadIdx.x + blockDim.x * blockIdx.x;
    if (index < N) c[index] = exp(b[index]);
}

extern "C" void gexp_(double *a, double *d, int *N1)
{
    double *b, *c;
    int N = *N1;
    cudaError_t statusb, statusc, err;

    // Get the device pointers corresponding to the mapped host arrays
    statusb = cudaHostGetDevicePointer((void **)&b, (void *)a, 0);
    statusc = cudaHostGetDevicePointer((void **)&c, (void *)d, 0);

    if (statusb != cudaSuccess || statusc != cudaSuccess) {
        printf("Error when mapping host arrays to device pointers!\n");
        printf("%s\n", cudaGetErrorString(statusb));
        printf("%s\n", cudaGetErrorString(statusc));
    }

    // Call the CUDA kernel, just one block for this simple example
    CUDAexp<<<1, N>>>(b, c, N);

    err = cudaGetLastError();
    if (err != cudaSuccess) printf("Error in kernel execution\n");

    // Synchronize before returning: with zero copy there is no cudaMemcpy
    // to act as an implicit synchronization point, so this is required
    // to make sure the results are visible on the host.
    cudaThreadSynchronize();
}


Now that we have all the files, let's write a simple makefile:


all: TestZeroCopy

TestZeroCopy: main.f90 kernel_code.o
	ifort -o TestZeroCopy main.f90 kernel_code.o -L/usr/local/cuda/lib64 -lcudart -lstdc++

kernel_code.o: kernel_code.cu
	nvcc -c -O3 -arch sm_13 kernel_code.cu

clean:
	rm kernel_code.o TestZeroCopy cuda_runtime.mod
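
If you are using gfortran instead of ifort, only the link line should need to change; something along these lines ought to work (an untested sketch, adjust the CUDA path to your installation):

gfortran -o TestZeroCopy main.f90 kernel_code.o -L/usr/local/cuda/lib64 -lcudart -lstdc++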



Compiling and running the code will show the following output:

$./TestZeroCopy

Allocate host memory
computation on CPU
computation on GPU
1 0.39208682E-06 1.0000004 1.0000004 0.0000000
2 0.25480443E-01 1.0258078 1.0258078 0.0000000
3 0.35251616 1.4226426 1.4226426 0.0000000
4 0.66691448 1.9482168 1.9482168 0.0000000
5 0.96305553 2.6196888 2.6196888 0.44408921E-15
6 0.83828820 2.3124052 2.3124052 -.44408921E-15
7 0.33535504 1.3984368 1.3984368 -.22204460E-15
8 0.91532720 2.4975923 2.4975923 0.0000000
9 0.79586368 2.2163544 2.2163544 -.44408921E-15
10 0.83269314 2.2995033 2.2995033 0.44408921E-15