Totally new to CUDA


Hi everyone,

I’m totally new to CUDA, not to Java though.

I’m already doing multi-processor programming in Java using FastMPJ. How steep would the learning curve be, and how much effort would it take, to port my multi-processor program to run on a CUDA-enabled GPU? Is it worth the effort? How much speedup can I expect for matrix-vector operations, if any?

Any feedback on your own experiences would be greatly appreciated.

Thank you.




Whether you can achieve a speedup for a particular operation depends on several factors. A while ago I wrote some general notes about this on Stack Overflow, including a word of “warning”: parallel programming on the GPU is rather “low-level” compared to other forms of parallel programming in Java. However, I’ve never actively used a message-passing system, so I can’t say much about how it compares to FastMPJ…

Matrix-vector operations are, in general, very well suited to the GPU. There are still some guidelines to follow in order to achieve good performance. Most importantly, make sure that memory is not copied back and forth unnecessarily between the device (GPU memory) and the host (main memory).

But fortunately, you don’t have to write your own kernels for things like this (which makes using it much easier): CUBLAS (and, for Java, JCublas) already offers the full set of BLAS Level 1, 2, and 3 routines. You might want to have a look at the “JCublas2Sample” (the second one on the page), which performs a matrix-matrix multiplication with CUBLAS. For matrix-vector operations the speedup will not be as large as in that case, but depending on the sizes of the matrices and vectors, it may still be considerable.
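If it helps to see what the BLAS Level 2 matrix-vector routine (`sgemv`) actually computes before calling it through JCublas, here is a plain-Java reference sketch of y = alpha·A·x + beta·y, with A stored column-major as CUBLAS expects. This runs on the CPU with no GPU required; the class and method names are just illustrative, not part of JCublas:

```java
/**
 * Plain-Java reference for the BLAS Level 2 routine sgemv:
 * y = alpha * A * x + beta * y, where A is an m-by-n matrix
 * stored column-major (the CUBLAS/Fortran convention).
 */
public class SgemvReference {

    static void sgemv(int m, int n, float alpha, float[] a,
                      float[] x, float beta, float[] y) {
        for (int i = 0; i < m; i++) {
            float sum = 0f;
            for (int j = 0; j < n; j++) {
                // Column-major layout: element A(i,j) lives at a[i + j*m]
                sum += a[i + j * m] * x[j];
            }
            y[i] = alpha * sum + beta * y[i];
        }
    }

    public static void main(String[] args) {
        // 2x2 identity matrix (column-major) times the vector (3, 4)
        float[] a = {1f, 0f, 0f, 1f};
        float[] x = {3f, 4f};
        float[] y = new float[2];
        sgemv(2, 2, 1f, a, x, 0f, y);
        System.out.println(y[0] + " " + y[1]); // prints "3.0 4.0"
    }
}
```

The column-major indexing is the part that most often trips up Java programmers coming to CUBLAS, since Java code usually lays out 2D data row-major.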