Adjacency lista in jCUDA

Luis · 29. Juli 2010 um 14:48

Hello,

I have a problem in creating a struct using jcuda, someone could give me a hand?

Thank you in advance.

Luis · 29. Juli 2010 um 14:55

After making the post I noticed a similar post in http://forum.byte-welt.de/showthread.php?t=2915 but relevant
to comments posted by Marco13, but have been reading some articles that the authors cite the creation of these
structure within the CUDA. If you are interested I can even send them the paper.

And we can think together in order to define these structures, what do you think?

Marco13 · 29. Juli 2010 um 17:15

Hello Luis,

As menitioned in the other thread, structs are not trivial, considering the possible differences between host and device and the resulting alignment issues for 32/64 bit systems. But it’s probably time to get the discussion started and focus on possible solutions. I’ll collect my thoughts, and tomorrow I’ll open a dedicated thread for this topic. I’d be happy to discuss about your ideas then

bye
Marco

Luis · 29. Juli 2010 um 17:49

Hello Marco,

Glad to be an interesting topic, I want to address an adjacency list, need for such a Struct to make notes for the vertices that has a connection with the vertex.

As here, is in Portuguese has more structures may serve as a thought: http://www.lcad.icmc.usp.br/~nonato/ED/Grafos/node76.html.

Based on Structs we will discuss, with your help I can start thinking in that structure, today my graph is in java so it is needed, passes it to the GPU via jCUDA.

Thanks again for the help.

Luis · 29. Juli 2010 um 17:55

Hi Marco,

I’ve been doing some consideration, I’ve been thinking about it, the mapping of C struct to Java using JNI, i want to
ask if there is any possibility how to map C struct to Java using JNI. I need to move some data provided in struct
application to Java using JNI. Is there some JNI predefined type which i could use mapping for that?

Marco13 · 30. Juli 2010 um 07:59

Hello Luis

The most difficult aspect about Adjacency Lists is, that each struct contains pointers. Supporting structs in general may be challenging. But if you have a structure on Java side like

class Node
{
    public int index;
    public Node next;
}

and want to use such a struct on the device, you have to take care that the “next” pointer must always be a device pointer. That means that creating the Graph (which is represented by this adjacency list) may not be done by simply copying or mapping structs from the host to the device. Instead, you have to build the whole graph on the device (otherwise the pointers will not really point to the “next” Nodes).

In fact, I think that specifically for adjacency lists, you might even consider replacing the pointers by indices, and storing all nodes in arrays. (You should also consider that traversing a Graph using adjacency lists with a GPU might have disadvantages, since you are “randomly” accessing elements of the global memory, so there is practically no memory coalescing).

However, I created a thread for the discussion about struct support, maybe we can find a general solution that works for arbitrary structs, and, if possible, even for structs creating pointers to other structs.

bye
Marco

Luis · 30. Juli 2010 um 11:42

Yeah I was thinking something similar as it had once mounted a structure like this for an adjacency list for java.

I’ll put the diagram representing my graph to give you an idea.

Luis · 30. Juli 2010 um 11:47

Also I’m reading a paper in which the author has built a structure using the native CUDA C, but may help us think, I have been relying on it to be well optimized.

I’ll take a look at the new topic as well and see if something new emerges.

Marco13 · 30. Juli 2010 um 12:25

Hi

Concerning the first diagram: You’ll have to think about an appropriate data structure for your problem anyhow. You cannot port something like a “Vector” to CUDA. Most likely it will be possible to boil it down to a “minimalistic” data structure that is appropriate for the algorithm that you want to run on the graph. I think that the most straighforward representation of a graph (as an adjacency list) on the GPU could use indexes. And I assume that this was also meant with the second image: When that you know the maximum number of edges that any vertex has, you could probably represent the adjacency list as something like

struct Node
{
    int index;

    // An array containing the indices of the adjacent edges. 
    // Each Node has such a pointer, and each has the size
    // of "maximum number of edges that any node has"
    int *edgeIndices;
    int numberOfEdges;

    // All the fields from GraphNode and NodeRecord that 
    // are required for the specific algorithm    
    int posX;
    ...
}

struct Edge
{
    int index;
    int vertexIndex0;
    int vertexIndex1;
    float cost;
}

But you should carefully think about that, because there are most likely some representations of graphs that are better suited for massively parallel programming on the GPU. For example: If you have a complete graph (every vertex connected to every other vertex) then the “edgeIndices” of each Node togeher simply form an adjacency matrix, and I assume that for many algorithms (on dense graphs) an adjacency matrix may be much faster than an adjacency list - at least, on the GPU.

bye

Luis · 30. Juli 2010 um 17:03

Dear Marco13,

Trying to answer their questions, my number is at least 50 vertices and edges in number 6 for each vertex. I think I would rather a complete graph (Every Every Other vertex connected to vertex), this means that each vertex has a single path (edge) that connects to another vertex. This will facilitate me in that point after the processing to be performed on the GPU can only return the vertices of the path calculated.

The initial idea that I would represent the shape of the second diagram, by two vectors, two vectors of struct’s.

typedef struct no;
typedef struct heads;

struct nos{
    int VERTEX;
    no *NEXT;
};

struct heads{
    int FLAG;
    no *ADJ;
};

/* Make an allocation in this way, maybe! */

G = (heads *) malloc(NUMVERT*sizeof(head));

cheers

Marco13 · 31. Juli 2010 um 05:01

Hello Luis,

It is important to note the difference if there is a path between each pair of vertices, or if there is an edge between each pair of vertices. (Also see the definitions on this site, „Definições“ number 5 and 7).

…
my number is at least 50 vertices and edges in number 6 for each vertex.

This graph is very small. It will be difficult to achieve a good speedup with CUDA on such a small graph, especially when you are using an adjecency list. But maybe you will use larger graphs later.

The initial idea that I would represent the shape of the second diagram, by two vectors, two vectors of struct’s.

As I mentioned above: It is difficult to create such an adjacency list on the GPU, because the structs are containing pointers. Even if you already have built the graph on the CPU, you can NOT copy it to the GPU, but you have to build the same graph on the GPU again:

// Create the Graph on the CPU
G = (heads *) malloc(NUMVERT*sizeof(head));

// Create the "edge" from Vertex v1 to vertex v2
G[0].adj = (nos*) malloc(sizeof(nos));
G[0].adj[0].vertex = 2;

// Create the "edge" from Vertex v1 to vertex v3
G[0].adj[0].adj = (nos*) malloc(sizeof(nos));
G[0].adj[0].adj[0] = 3;
....


// Now, do the same on the GPU, using cudaMalloc instead of malloc
G = (heads *) cudaMalloc(sizeOfHead);

// Create the "edge" from Vertex v1 to vertex v2
G[0].adj = (nos*) cudaMalloc(sizeOfNos));
G[0].adj[0].vertex = 2;

// Create the "edge" from Vertex v1 to vertex v3
G[0].adj[0].adj = (nos*) cudaMalloc(sizeOfNos));
G[0].adj[0].adj[0] = 3;
...

And doing the same in Java, with Structs that only contain an index and a pointer, would really be no fun -_-

Maybe you could use a representation of the adjacency list that does not require pointers. As described above: Instead of pointers you could use indices, this would be much simpler. Apart from that, there probably are representations of graphs that are better for computations on the GPU. For example, you might consider using a Matriz de Adjacência, which would be much easier to handle on the GPU, and since the graph only consists of 50 vertices, it does not require too much memory. But if you still want to use an adjacency list, maybe we can find a solution for that.

bye
Marco

Luis · 31. Juli 2010 um 07:51

Hi Marco13,

Yes, the correct definition would be my graph is connected. The idea is to adapt myself later larger graphs.

My graph it exists just in the CPU through a diagram that I sent that part is implemented and running. The problem that I found from start to think about using an array of adjacency, is that it uses a lot of useless space (we use only the upper diagonal matrix), it might be a better Incidence Matrix, here Matriz de incidência – Wikipédia, a enciclopédia livre.

Maybe you could use a representation of the adjacency list that does not require pointers. As described above: Instead of pointers you could use indices, this would be much simpler.

This is a nice idea, as you see it in jCUDA?

One question I have is this going to do a lot of processing on top of the adjacency matrix for each thread to calculate the best possible way, using an algorithm called D * (a dynamic version of A *), I will run all threads at the same time on our matrix, but is in doubt following the adjacency matrix would be simpler for being a matrix between vertices, since the incidence matrix is the relationship between vertices and edges, seeing the side of the result to be returned should be the vertices of the path then calculated, and this is what I still can not see, do not know if I should spend a label for the vertices in the matrix, or how.

Thank you very much for the help.

Marco13 · 31. Juli 2010 um 13:40

Hello Luis,

There are many design choices for implementing a Graph that should be used for GPU computations. And I do not have enough experience to say which implementation might be the best for a specific algorithm. I just thought that creating an adjacency list with structs and pointers on the GPU will probably be difficult and complicated, and that it might be hard to optimize it, because there can be no memory coalescing and it’s hard to use shared memory in this case.

Additionally, a massively parallel version of an A* (or D*) algorithm is certainly not trivial. You should take care that the representation of your graph on the GPU matches the algorithm AND the specific requirements for GPU computations. (I assume that you know that simply taking standard CPU algorithms and data structures, and replacing a “for(…)” loop with a kernel that is executed by 1000 threads most often does not work so well…)

In any case, please do not overestimate my comments, since I’m NOT a CUDA expert :o

Concerning the “adjacency list without pointers”: I just thought that one possible representation of the graph could be as a matrix where each row represents a vertex, and each column represents the adjacencies of this vertex. That is, a matrix that is simply filled with the data that may be obtained by traversing the adjacency list. For example, for the image in the link that you posted, you might have a matrix like


V | Matrix
--+------------
1 |  2  3  .  .   
2 |  1  4  5  .
3 |  1  6  7  .
4 |  2  8  .  .
5 |  2  8  .  .
6 |  3  8  .  .
7 |  3  8  .  .
8 |  4  5  6  7

With an adjacency list, you can traverse all vertices that are adjacent to a given vertex like this:


Adjacencies current = vertex->adj;
while (current != NULL)
{
    printf("Vertex is adjacent to %d
", current->vertexIndex);
    current = current->next;
}

Similarly, with such a matrix, you could traverse them like this:


int column = 0;
int current = matrix[v*n+column];
while (current != 0)
{
    printf("Vertex is adjacent to %d
", current);
    column++;
    current = matrix[v*n+column];
}

(only a rough sketch)

The weights (costs) of the edges could be stored similarly


V | Matrix
--+------------
1 | 0.1  0.2   .    .   
2 | 0.2  0.3   .
...

indicating that the edge between vertex 1 and vertex 2 has weight 0.1

But again: I do not know if this is a “good” representation of the graph, or if it is good for the GPU or the specific algorithm that you want to implement. I only know that quite often algorithms or data structures which are “standard” for one thread on the CPU are not well suited for 1000 threads on the GPU.

Maybe I’ll find the time to try out some possible implementations of graphs in CUDA. Additionally, I’m sure there already are some implementations, maybe a look at their source code will bring some insights.

bye
Marco

Luis · 2. August 2010 um 09:51

Hi Marco13,

What you meant by „coalescing memory“?

Additionally, the massively parallel version of an A* (or D*) algorithm is Certainly not trivial.

The D* algorithm, by analizes I made an algorithm is very close to the massively parallel, having an independent execution for each instance and an interesting observation is that it performs the same changes to the paths in a matrix and at the same time for all threads. So I think he has a great chance to run on GPU without any problems.

As much as you do not conscidere an expert on CUDA, has been very useful comments, beyond which was the first person to give me a real help! This has been a great big help for me I’ve been crawling in learning.

Maybe I’ll find the time to try out some Possible implementations of graphs in CUDA.

The representation of the graph is exactly this, but one of my big problems is that the CPU graph is implemented in Java and therefore the questions in previous replys, I’m trying to make a representation in Java to send it and the GPU instead of putting costs will represent with one’s edges between the vertices and then have to represent the other threads and objects in my field (not sure if you know this from me before, my graph is on the ground, where characters controlled by the GPU should move.) and having an update, due to a possible collision where the D* will recalculate the trajectory, so I think running multiple instances will not have problems update rules and collision are local.

Thanks again for the help.

Marco13 · 3. August 2010 um 03:16

Hello Luis,

Concerning “memory coalescing”: There is a lot of information in the CUDA Best Practices Guide (this PDF, section 3.2.1). It means that global memory should be accessed in a special pattern. In the simplest case, the threads should access memory locations which form a contiguous block with a certain alignment. This is important because in many cases the data transfer between the host and the device may be the bottleneck for the application.

Remember that in CUDA there is not really an “independent execution”. CUDA is intended for data-parallel programming, and not for task-parallel programming. The threads are not independent of each other (like Java Threads). All threads have the same instruction pointer, and “all threads are doing the same, on different positions of the data”. One could say that you have to implement your D* algorithm in a way that the only difference between all threads is that they use different “threadIdx.x/y/z” and “blockIdx.x/y/z” values.

I’m afraid I did not really understand what you meant here…
**I’m trying to make a representation in Java to send it and the GPU instead of putting costs will represent with one’s edges between the vertices and then have to represent the other threads and objects in my field **
Is your question…

still how to create and use an adjacency list in CUDA? (This may be difficult, as I already explained)
How to represent a graph so that it may be processed efficiently with CUDA? (I assume that a Matrix would be good here, but I’m not sure)
Or how to obtain the result of the computation (namely, the shortest paths), and how to copy this data from CUDA back to Java?

bye
Marco

Luis · 3. August 2010 um 07:57

Hi Marco13,

It means that global memory should be accessed in a special pattern. In the simplest case, the threads should access memory locations which form a contiguous block with a certain alignment. This is important because in many cases the data transfer between the host and the device may be the bottleneck for the application.

About this I had already been alerted and decided to test to see better, in my case at times I will have input and output, but it can grow to the extent that the various threads have several due to demand adjustment in the trajectories of the pathways.

All threads have the same instruction pointer, and „all threads are doing the same, on different positions of the data“. One could say that you have to implement your D* algorithm in a way that the only difference between all threads is that they use different „threadIdx.x/y/z“ and „blockIdx.x/y/z“ values.

In the problem that makes much sense for the code which I apply on each thread is the same running on the same matrix only differ in the way that each thread will compute as the starting point is different and the same goal.

I’m afraid I did not really understand what you meant here…
I’m trying to make a representation in Java to send it and the GPU instead of putting costs will represent with one’s edges between the vertices and then have to represent the other threads and objects in my field
Is your question…

still how to create and use an adjacency list in CUDA? (This may be difficult, as I already explained)

How to represent a graph so that it may be processed efficiently with CUDA? (I assume that a Matrix would be good here, but I’m not sure)

Or how to obtain the result of the computation (namely, the shortest paths), and how to copy this data from CUDA back to Java?

Actually I already have the graph it is in java, at one point, now I have to take it to the GPU with jCUDA, this was a great view of the topic, because he understood how to do it and you helped me in this.

Subsequently the algorithm that I’ve been working calculate on the adjacency matrix, as follows where he solve transpose the array it will add its id to the cost of the position which prevents it viso put id’s high, then no other thread will see as a possible way I must also create a mechanism to represent the other static and dynamic objects, things to do later on, within the matrix as the characters so they can divert them, these sites the cost is too high and prevents the drive.

Put a picture of the dynamics to give you an idea.

I thank you again for help.

Marco13 · 3. August 2010 um 09:14

Hello,

So you have ONE representation of the graph itself, and all threads are working with this graph? But each thread has a “distance matrix”, similar to those in the image? And each thread may modify the distance matrix of each other thread?

BTW: How are you going to represent the trajectories/pathways? You have to allocate the memory for the result before you start the kernel, and you do not know how long the path will be…

bye

Luis · 4. August 2010 um 06:04

Hi Marco13,

So you have ONE representation of the graph itself, and
all threads are working with this graph?

Yes, just that.

But each thread has a „distance matrix“, similar to those in the image? And each thread may modify the distance matrix of each other thread?

In fact every calculation that will only exist for a thread.

How are you going to represent the trajectories/pathways?

The application id in the cost of each vertex, will do that. If you can check it out here: Incremental heuristic search methods use heuristics to focus Their search and reuse information from previous searches to find solutions to series of similar search tasks much faster than is Possible by solving each search task from scratch. In this paper, we apply Lifelong Planning A * to robot navigation in unknown terrain, including goal-directed navigation in unknown terrain and mapping of unknown terrain. The resulting D * Lite algorithm is easy to Understand and analyze. It implements the behavior Stentz The Same 'focussed Dynamic A * but is algorithmically Different. We prove properties about D * Lite and experimentally Demonstrate the Advantages of combining incremental and heuristic search for the applications studied.

Retrieved from here: http://idm-lab.org/bib/abstracts/papers/aaai02b.pdf

Marco13 · 4. August 2010 um 09:37

Hello Luis,

I’ll probably not have the chance to study the paper in detail. In any case, I think that it is an ambituous goal to port such an algorithm to CUDA: It uses many conditional branches, and “sophisticated” data structures (like a priority queue), and porting such a structure itself to CUDA is probably not trivial.

Do you intend to reconstruct the path on the GPU as well? I could imagine that this is a task that could also be done on the CPU (but maybe this is not possible because the actual distance values only reside in GPU memory?).

bye
Marco

Luis · 6. August 2010 um 11:32

Do you intend to reconstruct the path on the GPU as well?

The idea is only in truth, I spend the vertices of the path.

I could imagine that this is a task that could also be done on the CPU (but maybe this is not possible because the actual distance values only reside in GPU memory?)

Yes, that’s right…