Comments on CUDA Musing: Calling CUDA Fortran kernels from MATLAB

Hi, since TF v1.0 released yesterday, I'm look...

2017-02-16T14:35:09.194-08:00

Hi, since TF v1.0 was released yesterday, I'm looking forward to the new tutorial.
Thanks for this; you're helping a lot of people!

Hi. That might have some effect, but the performa...

2013-09-14T06:58:40.661-07:00

Hi. That might have some effect, but the performance difference (~2x) doesn't seem to change even for codes that call the kernel hundreds of thousands of times over the course of execution. I've always assumed that some extra processing is done to the code when it is built into a *.mexa64 as opposed to a *.ptx (and, as you say, the corresponding SASS assembler... I didn't know that; thank you for the insight), but I don't pretend to know. Whatever the case, for the same kernel with the same thread grid configuration, I consistently see the ~2x performance hit with the *.ptx version, which is enough for me to go through the extra effort of using the MEX interface once the kernel is thoroughly tested.

I think Matlab needs to JIT the PTX file to genera...

2013-09-13T22:58:21.252-07:00

I think MATLAB needs to JIT the PTX file to generate SASS assembler (the native code that is going to execute on the GPU).
Have you tried calling a warm-up kernel first? I have noticed that subsequent invocations are much faster.
The computer I was using had an old GPU, but I just replaced it with a faster one and will do more performance testing.

The PGI toolchain and MEX don't play nicely together. If I get something to work, I will post my findings.
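
One way to check whether a one-time cost such as JIT compilation explains a slowdown is to time the first call separately from the calls that follow. The sketch below is only an illustration in Python: `FakeKernel` is a hypothetical stand-in that sleeps once to model a one-time setup cost, and `time_calls` is a helper name invented here; neither is part of MATLAB, MEX, or the PGI toolchain.

```python
import time


def time_calls(fn, repeats=10):
    """Return (first_call_seconds, mean_seconds_of_later_calls)."""
    start = time.perf_counter()
    fn()  # first call: includes any one-time setup cost
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        later.append(time.perf_counter() - start)
    return first, sum(later) / len(later)


class FakeKernel:
    """Hypothetical stand-in for a kernel launch (assumption, not a real API)."""

    def __init__(self, setup_seconds=0.05):
        self.setup_seconds = setup_seconds
        self.ready = False
        self.calls = 0

    def __call__(self):
        if not self.ready:
            time.sleep(self.setup_seconds)  # models a one-time JIT/load cost
            self.ready = True
        self.calls += 1


kernel = FakeKernel()
first, steady = time_calls(kernel, repeats=10)
print(f"first call: {first:.4f} s, later calls (mean): {steady:.6f} s")
```

If the later calls are still slow, the gap is not a warm-up effect. On a real GPU, kernel launches are asynchronous, so the device has to be synchronized before the timer is read; otherwise the measurement covers only the launch, not the execution.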

Excellent. Can you also extract a pointer (to dat...

2013-09-13T11:57:17.283-07:00

Excellent.

Can you also extract a pointer (to data resident on the device) from a mxGPUArray object in a MEX file and pass to a function written in CUDA Fortran? Do the MEX and pgf90 toolchains work nicely together for CUDA codes?

The only reason I ask is that, while I agree that the cudaKernel interface as you describe it is a lot more convenient than MEX files, I have noticed a significant performance penalty with cudaKernel (and the *.ptx file) vs. calling the same kernel on GPU data in a MEX file (and, of course, the resulting *.mexa64).

a. I don't really know why the performance is better with the *.mexa64...I just know that it is.

b. I wouldn't be surprised if it's just that I'm doing something wrong with the *.ptx file and the cudaKernel interface.

c. Do you notice any performance change in the *.ptx version compared to other codes that incorporate the same CUDA Fortran kernel?

Thanks again. Nice blog.