Rewriting a Java program to JCuda

Hi, I am completely new to JCuda.
I have a Java program that does computation (e.g., data analysis), and I wonder how hard it would be to rewrite/convert it to the JCuda framework.
Does anyone have similar experience?
Thanks!

Hello

Although I don’t really have „experience“ with rewriting a whole application to JCuda: It’s hard to say how hard it is :wink:

It depends on several factors. JCuda is intended to mimic CUDA, and in general, there are stylistic differences between „well-designed, object oriented Java“ and CUDA. So in any case, before you set out to rewrite even parts of your application, you should be clear about

  • which part you want to accelerate through data-parallel GPU computing
  • how you intend to achieve a speedup there
  • how large the speedup might be and
  • how the GPU computing part may be integrated into the remaining application

In any case, I’d recommend really trying to draw a clear and strict line of separation between the GPU-related part and the remaining part. (This applies to all pieces of software, but there are several reasons to emphasize this for JCuda/CUDA).
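
As a rough sketch of such a separation (plain Java, no JCuda calls; the names ColumnSummer and CpuColumnSummer are made up for illustration): the rest of the application talks only to a small interface that works on flat arrays, and a GPU-backed implementation could later be swapped in behind it.

```java
import java.util.Arrays;

// Small demo: the calling code only knows the ColumnSummer interface.
class SeparationDemo {
    public static void main(String[] args) {
        ColumnSummer summer = new CpuColumnSummer();
        float[] data = {1, 2, 3, 4, 5, 6}; // 2 rows x 3 columns, row-major
        float[] sums = summer.columnSums(data, 2, 3);
        System.out.println(Arrays.toString(sums)); // prints [5.0, 7.0, 9.0]
    }
}

// Hypothetical interface: this is the "separation line".
// It uses only primitive arrays, so a GPU version needs no conversion step.
interface ColumnSummer {
    /** Sums each column of a row-major matrix stored in one flat array. */
    float[] columnSums(float[] data, int rows, int cols);
}

// Plain CPU implementation. A JCuda-based class could implement the same
// interface later without touching the calling code (assumption, not shown).
final class CpuColumnSummer implements ColumnSummer {
    @Override
    public float[] columnSums(float[] data, int rows, int cols) {
        if (data.length != rows * cols) {
            throw new IllegalArgumentException("data length must be rows * cols");
        }
        float[] sums = new float[cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                sums[c] += data[r * cols + c];
            }
        }
        return sums;
    }
}
```

Starting with the CPU version also gives you a reference result to check a later GPU implementation against.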

Can you elaborate which kinds of data analysis you intend to run on the GPU?

bye

Thanks for the answer.

I plan to implement a series of statistical functions that might involve dealing with large amounts of high-dimensional data. For example, when implementing Bayes’ theorem, the program needs to do some ‘counting’ across all the dimensions of the data and across all the data points. And the goal is to make such computation real time, similar to how an Excel cell responds to a function typed in it.

OK, I’ve never been very familiar with statistics and probabilities (it scares me - too many paradoxes are based on that! :eek: :wink: ).

I wonder which parts of the application have been identified as bottlenecks and should be accelerated with data-parallel GPU processing, and what the intended „interface“ for these functions looks like.

For example, when the basis for the computation is some object-oriented structure of classes (for Vectors or so), it may be hard to port this to CUDA. If you rely mainly on interfaces, it might be easier (that depends). And if you are already working on simple float[] arrays, it should be the easiest.

Similarly for the computation and the „data processing workflow“ itself: Depending on how compute-intensive the analysis actually is, and how the GPU-accelerated step should be integrated into the remaining application, there may be more or fewer hurdles to clear.

Although I’m certainly NOT a CUDA expert: If you can give an example of such a computation, and maybe some info about the data structures that are involved, I might be able to give a more precise answer.

(BTW: I assume that this should only be an example, but if you intend to define the function that is to be executed at runtime, you might consider using OpenCL/JOCL. It’s conceptually similar to CUDA, and comes with a built-in compiler, so the actual kernel function can easily be defined at runtime).

So you mean the more primitive (i.e., less OO) the data structure, the easier it is?

Actually I haven’t started yet. I just wonder if I can start with a non-GPU version and modify it later if I find that using the GPU is necessary.

I will come back with more questions once I get started. Thanks!

In any case, in order to copy memory from the host to the CUDA device, you will need the data in a “flat” form - that is, essentially, a primitive array or buffer. This means you cannot create something like

import java.util.List;

class Entry {
    float conditionalProbability, unconditionalProbability; // ... (further fields)
    // getters, setters, other methods...
}
class Matrix {
    List<List<Entry>> entries; // nested objects: not one contiguous block of floats
    // getters, setters, other methods...
}

and then copy this data directly to the device. To copy it, you eventually need the data as, for example, float array[] = new float[matrixRows*matrixColumns*3]. But of course, for the rest of the program, you certainly do not want to mess around with one large array. Instead, you’d like to keep the object-oriented view on the data. The most appropriate way to resolve these conflicting goals depends on many factors, and there certainly is no silver bullet. But one conceptual pattern that might be applied here is to use appropriate interfaces and implementations.

I think I already started a short “tutorial”/example showing how this might be implemented; maybe I can finish it later today. For now, you might want to have a look at http://code.google.com/p/aparapi/wiki/AparapiPatterns , where this idea is sketched in the section “How can I use Aparapi and still maintain an object-oriented view of my data?”. You basically create “flat”, primitive arrays that may efficiently be copied to the GPU, and put an object-oriented view on top of this array.
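
A minimal sketch of that last idea in plain Java, not taken from the Aparapi page and not using any JCuda API: the names FlatMatrix and FIELDS and the Entry interface are hypothetical. The sketch stores only the two fields that are named in the class above, so it uses 2 floats per entry rather than 3. The flat array returned by data() is the part that could be handed to a host-to-device copy.

```java
// Small demo: object-style access on top of one flat float[].
class FlatMatrixDemo {
    public static void main(String[] args) {
        FlatMatrix m = new FlatMatrix(2, 3);
        Entry e = m.entry(1, 2);
        e.setConditionalProbability(0.25f);
        e.setUnconditionalProbability(0.5f);
        System.out.println(m.entry(1, 2).getConditionalProbability()); // prints 0.25
        System.out.println(m.data().length);                          // prints 12
    }
}

// Object-oriented view of one matrix entry (hypothetical interface).
interface Entry {
    float getConditionalProbability();
    void setConditionalProbability(float value);
    float getUnconditionalProbability();
    void setUnconditionalProbability(float value);
}

// Keeps all values in one primitive array, row-major, FIELDS floats per entry.
final class FlatMatrix {
    static final int FIELDS = 2; // assumption: only the two named fields
    private final float[] data;
    private final int rows;
    private final int cols;

    FlatMatrix(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new float[rows * cols * FIELDS];
    }

    /** The flat array that would be copied to the device. */
    float[] data() {
        return data;
    }

    /** Returns a lightweight view; reads and writes go straight to the array. */
    Entry entry(int row, int col) {
        if (row < 0 || row >= rows || col < 0 || col >= cols) {
            throw new IndexOutOfBoundsException("row=" + row + ", col=" + col);
        }
        final int base = (row * cols + col) * FIELDS;
        return new Entry() {
            @Override public float getConditionalProbability() { return data[base]; }
            @Override public void setConditionalProbability(float v) { data[base] = v; }
            @Override public float getUnconditionalProbability() { return data[base + 1]; }
            @Override public void setUnconditionalProbability(float v) { data[base + 1] = v; }
        };
    }
}
```

The view objects hold no data of their own, so nothing has to be converted before or after a GPU step. Whether an interleaved layout like this or one array per field works better on the device is a separate question that depends on how the kernel accesses the data.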