Accessing Java objects with JCUDA

Hello, I’m new to CUDA and JCUDA, so my question may be out of place; I’m sorry if that is the case.

I’m trying to optimize a process using GPUs. The process is pretty simple but has to run many times. I found that it can be parallelized, but it depends on a Java object. My idea is to create a copy of the object for each parallel process. Now my question is whether it is possible to pass a Java object directly to CUDA, and whether CUDA can create copies of such an object.

Thanks in advance. Best regards.

Sorry for the delayed response.

I wrote some general remarks about the questions that should be taken into account when considering whether to parallelize a computation with CUDA (or on the GPU in general), at Using Java with Nvidia GPU’s (cuda) - Stack Overflow

If your problem is indeed data-parallel and compute bound, then there usually are several options for “mapping” a data structure from Java to CUDA. Some more specific hints could be possible if you described what kind of Java objects are involved in the computation, because this will heavily influence how exactly the mapping may take place. If your classes are more complicated, it may quickly become tricky. For example, if you have something like a

List<Node> nodes;

// With
class Node {
    List<Node> neighbors;
    // or
    Map<String, Node> otherNodes;
}

then you’ll first of all have to think about a representation of the data structure that is suitable for the GPU (and Lists of differently-sized lists, or even Map objects, are distressingly hard to map to the GPU…)

In the simplest case, you might have some “POD”, some Plain Old Data Structure, like a class

class Vector {
    float x, y, z;

    float getX() { return x; }
    float getY() { return y; }
    float getZ() { return z; }
}

Then you may apply a pattern that is basically similar to the one described at

But here, it also depends on whether you have to use the same structures on the Java side and the GPU side.
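As a rough illustration of the simplest case, the fields of such plain data objects can be copied into one flat float array before being sent to the GPU, and copied back afterwards. The following is only a minimal sketch: the class name VectorPacking, the methods pack and unpack, and the interleaved x, y, z layout are assumptions made for this example, not something prescribed by JCuda.

```java
import java.util.Arrays;

class VectorPacking {
    // Plain data class, mirroring the Vector class shown above
    static class Vector {
        float x, y, z;

        Vector(float x, float y, float z) {
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    // Copies the fields of all objects into one flat array,
    // laid out as [x0, y0, z0, x1, y1, z1, ...] (assumed layout)
    static float[] pack(Vector[] vectors) {
        float[] data = new float[vectors.length * 3];
        for (int i = 0; i < vectors.length; i++) {
            data[i * 3 + 0] = vectors[i].x;
            data[i * 3 + 1] = vectors[i].y;
            data[i * 3 + 2] = vectors[i].z;
        }
        return data;
    }

    // Writes the values of the flat array back into the objects,
    // for example after results have been copied back from the GPU
    static void unpack(float[] data, Vector[] vectors) {
        for (int i = 0; i < vectors.length; i++) {
            vectors[i].x = data[i * 3 + 0];
            vectors[i].y = data[i * 3 + 1];
            vectors[i].z = data[i * 3 + 2];
        }
    }

    public static void main(String[] args) {
        Vector[] vectors = { new Vector(1, 2, 3), new Vector(4, 5, 6) };
        float[] data = pack(vectors);
        // prints [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
        System.out.println(Arrays.toString(data));
    }
}
```

A primitive array like this is the kind of data that can be copied to device memory, whereas the Java objects themselves cannot be passed to a kernel directly.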

Can you describe which classes you are going to use (and how)? (A sketch, showing the basic structure and usage patterns might be sufficient)

Thanks for your reply. The structure that I intend to pass to the GPU is a decision tree; it is basically a graph. I was already thinking about how to convert such a structure into a C structure. I think it needs the following fields:

struct Arc; // forward declaration, because Node refers to Arc

struct Node {
    int attribute;
    int numberOfArcs;
    int classValue; // only meaningful if the node is a leaf
    Arc *arcs;
};

struct Arc {
    int evalType; // to know which evaluation to apply during classification
    float val;    // the value to compare against
    Node *node;   // the node that the arc points to
};

In this way, with a reference to the root node I can access the rest of the nodes. I only intend to use the structure for classification. My computation will consist of doing parallel classifications on a big dataset (I don’t know if I will need to copy the structure as I have parallel process or if the access to the structure could be shared without blocking). I’m positive that I can map the original Java object (which is somewhat more complicated) to this structure. The only tricky part, I think, could be the fact that the number of arcs per node is variable; I tried to compensate for this with a field that indicates the number of arcs.

Once again many thanks.

Yes, such a “variable number of anything” usually is a bit tricky. In fact, the easiest case, if (!) the number does not vary too much (and is not too large, of course), is to allocate a fixed-size array, corresponding to the maximum number of elements.
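To make the fixed-size idea a bit more concrete, here is a minimal sketch that pads the variable-length arc lists of all nodes to a common maximum length and stores them in a single flat array. The class name ArcPadding, the method names, and the use of -1 as a marker for unused slots are assumptions chosen for this example.

```java
import java.util.Arrays;

class ArcPadding {
    // Marker for slots that do not refer to any arc (assumed convention)
    static final int UNUSED = -1;

    // Returns the length of the longest row, which is the
    // fixed size that every node will be padded to
    static int maxLength(int[][] rows) {
        int max = 0;
        for (int[] row : rows) {
            max = Math.max(max, row.length);
        }
        return max;
    }

    // Stores the arc indices of node i in the range
    // [i * maxNumArcs, (i + 1) * maxNumArcs) of the result
    static int[] pad(int[][] arcsPerNode, int maxNumArcs) {
        int[] padded = new int[arcsPerNode.length * maxNumArcs];
        Arrays.fill(padded, UNUSED);
        for (int node = 0; node < arcsPerNode.length; node++) {
            System.arraycopy(arcsPerNode[node], 0,
                padded, node * maxNumArcs, arcsPerNode[node].length);
        }
        return padded;
    }

    public static void main(String[] args) {
        // Three nodes with 3, 1 and 0 outgoing arcs
        int[][] arcsPerNode = { { 0, 1, 2 }, { 3 }, {} };
        int maxNumArcs = maxLength(arcsPerNode);
        int[] padded = pad(arcsPerNode, maxNumArcs);
        // prints [0, 1, 2, 3, -1, -1, -1, -1, -1]
        System.out.println(Arrays.toString(padded));
    }
}
```

The padding wastes memory for nodes with few arcs, which is why this only pays off when the number of arcs does not vary too much.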

In general, such a graph can describe a very irregular structure, and … GPUs prefer the “simple, regular” ones. This also refers to what you are going to do with these references/links/arcs: Traversing them might cause memory accesses that are scattered, which is bad for the caches. The best case is to read a large block of memory, from the beginning to the end (memory accesses should be coalesced).

The question of how these links are represented at all might also be relevant. Instead of using pointers to structs, one should consider using plain indices. For example, the nodes could be stored as

struct Node {
    int attribute;
    int numberOfArcs;
    int classValue; 
    int arcIndices[MAX_NUM_ARCS];
};

so that, instead of following a pointer, you can access the arc of a node like in

arcs[node.arcIndices[i]].evalType = 42;

But another general recommendation when programming for the GPU is to use a Structure-Of-Arrays, instead of an Array-Of-Structures. So in fact, instead of having a

Node array[] = new Node[n];

the representation that could be best suited for the GPU could be something like

int n = 100;
int attributes[] = new int[n];
int classValues[] = new int[n];
int numbersOfArcs[] = new int[n];
int arcIndices[] = new int[n * MAX_NUM_ARCS];

(Yes, this is horribly inconvenient - but it allows coalesced memory accesses, and additionally avoids any hassle that may be implied by structure alignment issues…)
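As a sketch of how a classification could walk such a flattened tree using only array indices, consider the following plain Java example. It mirrors what a single GPU thread might do for one sample, but it is not CUDA code. The arc arrays (arcEvalTypes, arcValues, arcTargets), the two evalType codes, the tiny hand-built tree and the -1 markers are all assumptions made for this illustration.

```java
class FlatTreeClassifier {
    static final int MAX_NUM_ARCS = 2;   // assumed upper bound on arcs per node
    static final int LESS_OR_EQUAL = 0;  // hypothetical evalType codes
    static final int GREATER = 1;

    // Node data, one array per field. Node 0 is the root, nodes 1 and 2 are leaves
    static final int[] attributes    = { 0, -1, -1 };
    static final int[] classValues   = { -1, 10, 20 };
    static final int[] numbersOfArcs = { 2, 0, 0 };
    static final int[] arcIndices    = { 0, 1,  -1, -1,  -1, -1 };

    // Arc data, one array per field
    static final int[]   arcEvalTypes = { LESS_OR_EQUAL, GREATER };
    static final float[] arcValues    = { 5.0f, 5.0f };
    static final int[]   arcTargets   = { 1, 2 };

    // Follows matching arcs from the root until a leaf is reached.
    // Returns -1 if no arc of an inner node matches the sample
    static int classify(float[] sample) {
        int node = 0;
        while (numbersOfArcs[node] > 0) {
            float value = sample[attributes[node]];
            int next = -1;
            for (int i = 0; i < numbersOfArcs[node]; i++) {
                int arc = arcIndices[node * MAX_NUM_ARCS + i];
                boolean matches = (arcEvalTypes[arc] == LESS_OR_EQUAL)
                    ? value <= arcValues[arc]
                    : value > arcValues[arc];
                if (matches) {
                    next = arcTargets[arc];
                    break;
                }
            }
            if (next < 0) {
                return -1;
            }
            node = next;
        }
        return classValues[node];
    }

    public static void main(String[] args) {
        System.out.println(classify(new float[] { 3.0f })); // prints 10
        System.out.println(classify(new float[] { 7.0f })); // prints 20
    }
}
```

Because the tree is only read during classification, all samples can be classified against the same set of arrays.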

I’m not a CUDA expert. These hints are the result of only a very basic understanding of how GPUs work, together with a glimpse at things like the Best Practices Guide :: CUDA Toolkit Documentation, and information obtained from other resources. You’ll also have to do some research of your own here. For example, I think that the constraints for coalesced memory accesses on newer GPUs are not as strict as they were when I started reading more about CUDA.

You mentioned

I don’t know if I will need to copy the structure as I have parallel process or if the access to the structure could be shared without blocking

Which structure and parallelism does this refer to? Multiple Java Threads? The structure itself has to be copied to the GPU memory anyhow, so the only concurrent accesses there would be between the GPU threads (which might also be an issue, of course)

Many thanks for the response. Yes, I also found that the representation that you suggest is better suited for the GPU, and it doesn’t even require declaring a structure, which could be a little troublesome from Java.

Regarding my question about the memory, it was a misunderstanding on my part. I was referring to the global memory, but I now know that access to this memory is not synchronized, which was essentially my question.

I’m going to continue with my implementation. Thanks again, best regards.