Simulation with JCuda

Greetings, this is my first post on this forum. After reading a few other threads, I’m sure that this is the best place to find help and maybe share some ideas.

I have written a simulation in Java of an amorphous computer (for those who are unfamiliar: in this case it is extremely similar to a cellular automaton / Game of Life simulation, but the environment is continuous rather than a grid).

The simulation, once run, executes one ‘time step’, during which a single ‘cell’ is selected at random and a whole bunch of calculations are made which decide how the state of that cell will change/not change. During this calculation, the cell’s neighbours’ states may also change. Since the environment is continuous, neighbourhoods are determined by distance calculations (i.e., any other cell within a set radius from the cell in question is said to be within the neighbourhood of said cell). A cell’s position is given simply by a set of coordinates (x, y). Before beginning the simulation, a number of time steps is selected (say, 10,000,000), and the execution of code for one time step is simply repeated that number of times. Nothing hugely complicated takes place.

In my current simulation, the entire sequence of one time step is executed in one thread, and there is no parallelism. However, I believe that the simulation would benefit greatly from the implementation of JCuda (although that is not why I want to implement JCuda - I want to do it just to see if I can, out of academic interest).

I have managed to get a few small programs running with JCuda, such as examples from jcuda.org, and I have been reading CUDA by Example (but have only understood some of it - I have no experience with C); however, I still require some help. I am pretty sure that as soon as I get one or two simple implementations of JCuda running (which I have written and understood, rather than downloaded as an example), I will become much more self-sufficient with it.

At first, I’m not looking to use JCuda en-masse throughout my simulation. What I’d like to do is this:

1: Identify a place within one 'time step' where the most benefit would come from a simple calculation being made in parallel (there are many places where JCuda could be implemented, but none seems to be a more obvious choice than any other - perhaps they are all good candidates, but for now I'm just looking for one simple implementation…). There are plenty of instances within one time step where many calculations are made, in order, for all cells; these could be executed in parallel with much greater efficiency. I'll give a few ideas about this, and then I'm hoping people will feel free to give their own thoughts on it.

2: Once a suitable calculation has been selected to be implemented in parallel rather than in order, I’d very much appreciate it if someone could help me understand what I need to do to be able to get it to work. The Java code is written, and it works. Now I would just like to make another version, the same, but with a simple inclusion of JCuda. I’m not looking for someone to do the work for me - but I’m really stuck since I have no experience with C and I don’t really understand what actually needs to be done (in code) in order to call upon the GPU to perform a simple calculation (but I’m learning fast).

So, for a better description of exactly where I think JCuda could fit in…

  1. There are places in the simulation where the distance between a selected cell and all other cells is calculated in order. I believe that, instead, the distance between the selected cell and all other cells could be calculated in parallel (there is a changeable number of cells, but I typically use 10,000 - so it tends to be slow when executed in order).

The code for the calculation is simple (probably not optimal - but there are reasons for the way it is):

    {
        while(elementcount <= numberofelements) {
            Double dist;
            Element f = elements.get(index);
            elements.remove(f);
            ArrayList<Element> local = f.getLocal();
            for(Element a : elements) {
                double x1;
                double y1;
                double x2;
                double y2;
                double r;
                r = f.getRange();
                x1 = f.getXposition();
                y1 = f.getYposition();
                x2 = a.getXposition();
                y2 = a.getYposition();
                dist = Math.sqrt(Math.pow(x1 - x2, 2) + Math.pow(y1 - y2, 2));
                if(dist <= r){
                    local.add(a);
                }
            }
            elements.add(f);
            elementcount++;
        }
        elementcount = 0;
    }

2. At each time step, the selected element (which is chosen at random) is moved a random short distance in a random direction. This involves repeatedly generating random numbers, so perhaps there could be some parallel implementation of random number generation (I haven't spent much time thinking about this option, but I'm sure it could be done). In particular, at the start of the simulation, when the cells are generated, each cell is given random location coordinates. Rather than doing this in order, one by one for each cell, this could perhaps be done once in parallel (just a thought…). I know I just said that the selected cell is the one that moves at each time step, but if there were a way to update the position of all cells, in parallel, for each time step, that would be awesome (although that sounds quite ambitious for a beginner implementation - that's something I'd definitely like to implement in the future once I've learned a little more…).


As you can see, in '1.', the code cycles through an arraylist of cells, calculating the distance for each as it does so, and then adds those in range of the cell in question to another arraylist (this arraylist is actually just a list of the cells that are in the neighbourhood of the cell in question). Rather than cycling through an arraylist of all cells, in order, I'd like to just make all of the distance calculations in parallel.

Thanks for taking the time to read my post. I would appreciate any feedback that you might have that would help me get started with JCuda.

:)

I suppose what I'm really asking is how I could turn the linear calculation shown in the code above into a parallel process.

How would I go about this? Would I have to pass all of the values of the coordinates of every single cell (which I have called ‘elements’ in the code above) to the GPU before I make a calculation? How would I be able to group the values of the coordinates so that they could be associated with the cell to which they belong?

I’m not really sure what I need to pass to the GPU before I make the calculation.

Hello

I'll post a few thoughts about this later today; I'm currently at work.

bye
Marco

Before you ask anyone for parallel algorithm tips: there are plenty of code examples around, and I myself read tons of papers and the book "Programming Massively Parallel Processors: A Hands-on Approach".

“Process of thinking parallel is similar to process of thinking oop, at one point it just clicks and you know how it works!”

Like someone said, if you have not almost reached that point, you have to keep reading and practicing; otherwise even some hints won't work. No one will give you an algorithm for free.

  1. How is the input data structured, and how can it fit into 1-, 2- or 3-dimensional arrays of primitive datatypes (preferably 1-dimensional)?
  2. Write down some iterations on paper or in slides, and you will mostly see if and how it can be parallelized; most important are independent calculations in your inner loop.
  3. Data dependencies: you have two loops - is the inner loop dependent on results of the outer loop? <- maybe indicates a CPU loop with a kernel invocation inside.

Thoughts about your algorithm (copy the data to the GPU; a rough sketch follows below):
  • The outer-loop element is identified by a block.
  • In your outer loop there is an array of neighbors <-- use shared memory!
  • Threads of the block represent the inner loop and write their neighbor result index to shared memory.
  • Output the shared memory per block to global device memory; for easiness use a 2D array: row = element block, column = neighbor index.
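A rough CUDA sketch of this scheme (all names here are hypothetical; it deviates slightly in that it keeps only a neighbor counter in shared memory and writes the indices directly to global memory, since the number of neighbors per element is not known in advance):

// One block per 'outer loop' element; the threads of the block stride
// over the 'inner loop' candidates. Launched e.g. as:
//   findNeighbors<<<n, 256>>>(n, positions, ranges, neighbors, counts);
__global__ void findNeighbors(int n, const float2 *positions,
                              const float *ranges, int *neighbors, int *counts)
{
    __shared__ int count;              // per-block neighbor counter
    int element = blockIdx.x;          // the outer-loop element
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    float2 p = positions[element];
    float r = ranges[element];
    for (int other = threadIdx.x; other < n; other += blockDim.x)
    {
        if (other == element) continue;
        float dx = positions[other].x - p.x;
        float dy = positions[other].y - p.y;
        if (dx*dx + dy*dy <= r*r)
        {
            int slot = atomicAdd(&count, 1);
            // 2D output: row = element (block), column = neighbor index
            neighbors[element * n + slot] = other;
        }
    }
    __syncthreads();
    if (threadIdx.x == 0) counts[element] = count;
}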

Is the number of neighbors predictable? If not, it can get tricky.

The number of neighbours is unpredictable.

The entire calculation that takes place is shown in the code above - it isn’t pseudo code, it’s the actual code. I’ll clarify it a little:

Each cell is an object of class 'Element', and each possesses an 'x' coordinate and a 'y' coordinate, which are double values. Since they are private fields of each object, I use methods called 'getXposition' and 'getYposition' to return those values (as you can see in the code).

What I’d like to be able to do is take all of the cells and, say, place the coordinates of each into a separate GPU thread, and then perform the calculation (so I have many threads performing the calculation in parallel rather than everything being done in a linear order).

Would that be possible?

I'm not actually asking anyone to do the work for me - I simply don't know where to begin. I have already had code examples running, and I am reading the book 'CUDA by Example'.

Hello again,

First of all a short disclaimer: I’m not a CUDA expert. I wrote the bindings and some of the example programs, and even those have mostly been ported from existing examples. I might have been able to grasp some basic concepts, however.

Some of these basic concepts of parallel programming with CUDA or OpenCL are straightforward: Whenever you have a 'for (i=0; i<n; i++)' loop, you can consider letting the contents of the loop be executed by 'n' threads in parallel. If this is possible, the problem is called 'embarrassingly parallel' - and of course, it's not always that easy. One good example is the Parallel Reduction from the NVIDIA SDK. The whitepaper contains a step-by-step description of the CUDA implementations and some very sophisticated improvements of the "naive" approach - and if anyone says that they could have written the final version of the kernel without weeks of diligent studies, I don't believe it ;)

BTW: Although at first glance one might think that the 'reduction' problem is inherently sequential and cannot be parallelized at all, it is in fact a very important building block for many algorithms. (Guy Blelloch wrote a lot about using scan/reduction primitives for various algorithms; see his "Vector Models for Data-Parallel Computing".) Many seemingly sequential algorithms can be executed in parallel by trickily transforming them into reduction problems.
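Just for illustration, the first 'naive' version of a summation kernel, roughly along the lines of the one in that whitepaper (a sketch from memory, not the tuned final version):

__global__ void reduce0(float *g_idata, float *g_odata, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0; // load one element per thread
    __syncthreads();
    // Tree-based summation in shared memory
    for (unsigned int s = 1; s < blockDim.x; s *= 2)
    {
        if (tid % (2*s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // The first thread of each block writes the partial sum
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}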

But fortunately, the function you posted looks like there might be a chance to gain some speedup even without too complicated kernels.

I hope you don't mind that I'll first say a few words about the Java implementation:

  • You are declaring 'dist' as a 'Double'. It might be preferable to declare it as a 'double', because at the moment there are many unnecessary boxing/unboxing conversions.
  • The computation of the distance uses the 'Math.pow' method two times. This should not be required, since for '2' as an exponent a simple multiplication will probably be faster.
  • Square root computations are expensive, and should be avoided at … any cost? Well, at least when they can be avoided as easily as in this case ;)
  • In the outer loop you are removing element 'f' and adding it back after the inner loop. I assume that the intention of this was to exclude 'f' from the computation in the inner loop. However, removing an element from a list may be an expensive operation. Assuming that 'elements' is an ArrayList, removing one element will cause ALL subsequent elements to be moved one step to the left, to fill the gap created by removing 'f'. Assuming that 'elements' is a LinkedList, this operation itself might be faster, but then the indexed access might be expensive.

It is not clear from the source code where the 'index' comes from, and how 'elementcount' is used exactly. But if I understood everything correctly, you might consider using the following method, which may already be faster:

public void findNeighborhoods()
{
    while(elementcount <= numberofelements) 
    {
        Element f = elements.get(index);
        ArrayList<Element> local = f.getLocal();
        double r = f.getRange();
        double rSquared = r*r; // maybe add a f.getRangeSquared() method
        double x1 = f.getXposition();
        double y1 = f.getYposition();
        for(int i=0; i<elements.size(); i++) 
        {
            if (i==index) 
            {
                continue;
            }
            Element a = elements.get(i);
            double x2 = a.getXposition();
            double y2 = a.getYposition();
            double dx = x2-x1;
            double dy = y2-y1;
            double distSquared = dx*dx+dy*dy;
            if(distSquared <= rSquared){
                local.add(a);
            }
        }
        elementcount++;
    }
    elementcount = 0;
}

Concerning the transformation to CUDA, there are some challenges:

  • The majority of GPUs can only compute with 'float' values. 'double' values are only supported by the newest GPU versions. This could probably be changed easily on the Java side. It might not be necessary, I'm just mentioning it…
  • When writing an algorithm in Java, you are easily tempted to use sophisticated data structures, like ArrayLists or even HashMaps. Reproducing the behavior of a dynamic data structure like an ArrayList on the GPU may be difficult.

For the algorithm that you posted, it would specifically be interesting to know the exact semantics of the 'Local' ArrayList. As djmj already pointed out: if you cannot predict the number of elements that will have to be added to this list (or at least specify an upper bound - for example, that there may not be more than 100 elements in this list), then it might indeed become difficult. The simplest solution in this case would be to create an n*n matrix that stores the neighborhood relations, but then the memory consumption would be prohibitively large. A more feasible solution might be to compute this neighborhood for each 'Element' separately and re-use the neighborhood array on the GPU.

So concerning your main question (1) of how to bring the data to the GPU: Yes, you have to pass the coordinates of ALL cells to the GPU in order to be able to compute the distances in parallel. This data should preferably be stored as a 'float' array. Of course, since you currently store all the positions in the 'Element' instances, this would involve copying the data into an array and then copying the array to the GPU, which might involve an overhead that eats up all performance gains. I'm not sure whether it is possible to replace the usage of 'Element#getXposition/getYposition', or specifically the whole 'elements' list, by a data structure that better suits the GPU. Or to be precise: instead of using an "Array of Structures" (namely, a List of 'Element' objects) it's usually beneficial to have a "Structure of Arrays", something like

private float elementPositions[];
private float elementRanges[];

where the positions could be treated as an array of 'float2' values on the CUDA side. (If 'Element' is an interface, this could even be an implementation detail that is hidden from the rest of the program.)
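For a first test, filling such arrays from the existing list might look roughly like this (a sketch that just uses the getters from the code posted above; the casts assume the switch to 'float'):

int n = elements.size();
float positions[] = new float[n * 2]; // x,y interleaved, matches 'float2'
float ranges[] = new float[n];
for (int i = 0; i < n; i++)
{
    Element e = elements.get(i);
    positions[i * 2 + 0] = (float) e.getXposition();
    positions[i * 2 + 1] = (float) e.getYposition();
    ranges[i] = (float) e.getRange();
}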

As mentioned above, the neighborhood relations could possibly(!) be stored in a 'byte' array (a 'char*' on the CUDA side!), just stating which of the elements are in the 'Local' area of the currently treated element. Of course, this data would have to be copied back and translated into the Java structures after the kernel was called.

I’ll try to summarize this in a few lines of Pseudocode:

// Obtain the element positions and ranges. Maybe, for the first 
// tests, by copying the data from the 'elements' List into an array
float positions[] = create();
float ranges[] = create(); 

Pointer devicePositions = copyToDevice(positions);
Pointer deviceRanges = copyToDevice(ranges);
Pointer local = allocate(numElements * Sizeof.BYTE);

for (int i=0; i<numElements; i++)
{
    invokeKernel(numElements, devicePositions, deviceRanges, i, local);

    // Assuming that 'local' now contains an array that has
    // a '1' at indices that are neighbors of 'i', and '0' at all
    // other indices, update the 'Local' ArrayList of all 
    // elements
    copyDataBack(local);
}

Where the kernel would roughly look like this


__global__ void kernel(int n, float2 *positions, float *ranges, int index, char *local)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < n) 
    {
        float dx = positions[tid].x - positions[index].x;
        float dy = positions[tid].y - positions[index].y;
        float r = ranges[index]; // the range of the selected element
        float distSquared = dx*dx+dy*dy;
        // Each thread writes its own flag; the selected element
        // itself is not its own neighbor
        if (tid != index && distSquared <= r*r) local[tid] = 1;
        else local[tid] = 0;
    }
}

(This does not take into account memory coalescing or shared memory)

There are many possible optimizations. And it may be particularly difficult to port ONE part of a larger algorithm to CUDA, while maintaining some very Java-specific data structures for the other parts. Converting and copying memory between the host and the device can easily become a bottleneck, and can take more time than you can save by parallelizing some small computation, as in this case.
Additionally, I know that there are even specific scientific papers about problems like "finding the k nearest neighbors of a given point in a set of points on the GPU". While this is not exactly what you intend to do, there might be some inspiration in these papers.
But regardless of that, this might be at least a starting point for first CUDA experiments.

Concerning the question (2) about the random positions: You might have seen that there is CURAND (and JCurand) for creating random numbers, but these libraries are not intended for quickly creating random numbers, but for quickly creating high-quality random numbers. That means that using "simple" random numbers from java.util.Random might be faster, but that depends on how the numbers are going to be used. Maybe you can explain what you mean by "update the position of all cells, in parallel" - does it mean that you need random directions for ALL cells? Then CURAND might be an option, but one would have to make more detailed benchmarks for this.
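In case CURAND turns out to be an option: a minimal JCurand sketch, following the pattern of the JCurand sample (the class name and parameters are made up for illustration):

import static jcuda.jcurand.JCurand.*;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.*;
import jcuda.*;
import jcuda.jcurand.curandGenerator;
import jcuda.jcurand.curandRngType;

public class RandomCoordinates
{
    // Fills a host array with n uniform random floats, generated on the GPU
    public static float[] generate(int n, long seed)
    {
        curandGenerator generator = new curandGenerator();
        curandCreateGenerator(generator, curandRngType.CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(generator, seed);

        Pointer deviceData = new Pointer();
        cudaMalloc(deviceData, n * Sizeof.FLOAT);
        curandGenerateUniform(generator, deviceData, n);

        float hostData[] = new float[n];
        cudaMemcpy(Pointer.to(hostData), deviceData,
            n * Sizeof.FLOAT, cudaMemcpyDeviceToHost);

        curandDestroyGenerator(generator);
        cudaFree(deviceData);
        return hostData;
    }
}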

bye
Marco

EDIT: A small update of the Java function

I would definitely look at using multiple threads in Java before using CUDA, unless CUDA is your main objective.

Use a fixed thread pool, and if you call this many times, reuse that thread pool. There are many examples of how to do this - it is quite simple. As a template, look at my matrix multiply code below.

public static double[][] threadedFutureMatrixMultiply(final double[][] a,
        final double[][] b, final double[][] c, final int numTasks,
        final CompletionService<String> pool) {
    final int m = a.length;
    final int n = b.length;
    final int p = b[0].length;
    for (int interval = numTasks, end = p, size = (int) Math.ceil(p * 1.0
            / numTasks); interval > 0; interval--, end -= size) {
        final int to = end;
        final int from = Math.max(0, end - size);
        final Runnable runnable = new Runnable() {
            public void run() {
                final double[] Bcolj = new double[n];
                for (int j = from; j < to; j++) {
                    for (int k = 0; k < n; k++) {
                        Bcolj[k] = b[k][j];
                    }
                    for (int i = 0; i < m; i++) {
                        final double[] Arowi = a[i];
                        double s = 0;
                        for (int k = 0; k < n; k++) {
                            s += Arowi[k] * Bcolj[k];
                        }
                        c[i][j] = s;
                    }
                }
            }
        };
        String result = null;
        pool.submit(runnable, result);
    }
    try {
        for (int i = 0; i < numTasks; i++) {
            String result = pool.take().get();
        }
    } catch (final InterruptedException e) {
        System.out.println("Interrupted.");
    } catch (final ExecutionException e) {
        System.out.println("Interrupted.");
    }
    return c;
}
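For completeness, a small sketch of how the 'pool' parameter for this method might be created once and reused (the surrounding names are just illustrative):

import java.util.concurrent.*;

public class MatrixMultiplyDemo
{
    public static void main(String[] args)
    {
        double[][] a = { { 1, 2 }, { 3, 4 } };
        double[][] b = { { 5, 6 }, { 7, 8 } };
        double[][] c = new double[2][2];

        // Create the fixed thread pool once, and reuse it for many calls
        int numTasks = Runtime.getRuntime().availableProcessors();
        ExecutorService executor = Executors.newFixedThreadPool(numTasks);
        CompletionService<String> pool =
            new ExecutorCompletionService<String>(executor);

        threadedFutureMatrixMultiply(a, b, c, numTasks, pool); // method above
        executor.shutdown();
    }
}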


Also, having made what you think is the bottleneck multithreaded, make sure you put it through a profiler to see what the actual bottleneck is.

BTW I agree with Marco's comments re. Java.

Another trick is to use

(a - b)·(a - b) = a·a + b·b - 2·a·b

which means you can store a·a for each node OUTSIDE the loop; the inner loop just has to calculate a·b.
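As a sketch (assuming hypothetical coordinate arrays x and y, and a precomputed array sq with sq[i] = x[i]*x[i] + y[i]*y[i]):

// Squared distance via the expansion above: only the dot product
// remains in the inner loop; sq[] is computed once outside.
static double distSquared(double[] x, double[] y, double[] sq, int i, int j)
{
    double dot = x[i] * x[j] + y[i] * y[j];
    return sq[i] + sq[j] - 2.0 * dot;
}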

I have a feeling there is a BLAS or LAPACK routine to do exactly what you want? And that has probably already been cuda-fied?



Hi,

I'm also doing a simulation in Java with CUDA.
In my case it is computational fluid dynamics.

Up to this point I have only ported the solvers for linear equation systems to CUDA.
Everything else that is needed by the simulation involves really a lot of data, therefore the amount of memory on the GPU will be a limiting factor.
Furthermore, I used CUSPARSE and CUBLAS (ports of BLAS and sparse BLAS for CUDA) to do all operations needed by the solving algorithms.

The main problem in giving you advice is currently that I'm not sure what your simulation is going to do.

Maybe http://download.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html is interesting for you, because it simplifies the usage of multiple threads in Java.
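A tiny sketch of the idea with Java 7's fork/join framework (all names here are made up): a task recursively splits a range of element indices and processes small chunks sequentially.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class DistanceTask extends RecursiveAction
{
    static final int THRESHOLD = 1000;
    final float[] positions;
    final int from, to;

    DistanceTask(float[] positions, int from, int to)
    {
        this.positions = positions;
        this.from = from;
        this.to = to;
    }

    @Override
    protected void compute()
    {
        if (to - from <= THRESHOLD)
        {
            for (int i = from; i < to; i++)
            {
                // ... distance calculation for element i ...
            }
        }
        else
        {
            int mid = (from + to) / 2;
            invokeAll(new DistanceTask(positions, from, mid),
                      new DistanceTask(positions, mid, to));
        }
    }
}

// Usage: new ForkJoinPool().invoke(new DistanceTask(positions, 0, n));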

Greetings Felix

Thanks for all of the feedback, it is much appreciated.

At first, I'm not too interested in seeing a performance gain. I would just like to be able to successfully implement something. I can look into optimisation once I understand what I am doing a little more. Perhaps when I have learned more, I will simply re-write the simulation completely. For now, I just wish to make this one implementation.

I think I understand that I can use a separate array of float values for the coordinates, instead of using an arraylist of elements and then returning each of those values from each element, in a loop. That makes sense. I think I will need to learn much more about CUDA C syntax before I’m able to do any of that. One further simplification is that all elements have the same range (which I probably should have pointed out from the start).

I’m not sure if it makes any difference, but the CUDA device that I will be using in this case is a Geforce GTX 580.

Here is what I intend to do now:

  1. Try to learn how to add all of the elements’ coordinates to arrays of float values.

  2. Learn how to pass all of that data to the GPU. (I have no idea)

  3. Write a suitable kernel to perform the calculation.

  4. Learn how to pass all of the data back to the host.

  5. Translate the results into a successful implementation.

I’d better get started…

Concerning the second and fourth points specifically: these are already covered in the samples, and copying data to the device and back always follows the same pattern. For example, the 'JCudaVectorAdd' sample might be a good starting point for your first own tests. The most "difficult" thing there will probably be to split the initialization part from the part where the actual kernel is called - because you intend to call it repeatedly, while the sample does startup+execution+shutdown in one method.
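Such a split might roughly look like this (a sketch only - the class name and the PTX file name are made up; the kernel is the one sketched earlier in this thread):

import static jcuda.driver.JCudaDriver.*;
import jcuda.*;
import jcuda.driver.*;

public class NeighborhoodGpu
{
    private CUfunction function;
    private CUdeviceptr devicePositions = new CUdeviceptr();
    private CUdeviceptr deviceRanges = new CUdeviceptr();
    private CUdeviceptr deviceLocal = new CUdeviceptr();
    private int n;

    // One-time initialization: context, module, device memory, upload
    public void init(float positions[], float ranges[])
    {
        n = ranges.length;
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "neighborhood.ptx"); // hypothetical file name
        function = new CUfunction();
        cuModuleGetFunction(function, module, "kernel");
        cuMemAlloc(devicePositions, n * 2 * Sizeof.FLOAT);
        cuMemcpyHtoD(devicePositions, Pointer.to(positions), n * 2 * Sizeof.FLOAT);
        cuMemAlloc(deviceRanges, n * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceRanges, Pointer.to(ranges), n * Sizeof.FLOAT);
        cuMemAlloc(deviceLocal, n * Sizeof.BYTE);
    }

    // Called once per time step, for the selected element 'index'
    public byte[] findLocal(int index)
    {
        Pointer kernelParams = Pointer.to(
            Pointer.to(new int[]{n}),
            Pointer.to(devicePositions),
            Pointer.to(deviceRanges),
            Pointer.to(new int[]{index}),
            Pointer.to(deviceLocal));
        int blockSize = 256;
        int gridSize = (n + blockSize - 1) / blockSize;
        cuLaunchKernel(function,
            gridSize, 1, 1,    // grid dimensions
            blockSize, 1, 1,   // block dimensions
            0, null,           // shared memory size, stream
            kernelParams, null);
        cuCtxSynchronize();
        byte local[] = new byte[n];
        cuMemcpyDtoH(Pointer.to(local), deviceLocal, n * Sizeof.BYTE);
        return local;
    }
}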

I know this is probably unrelated, but since you mentioned the VectorAdd example, I thought I’d begin there.

The examples I have working so far are:

  1. A small bit of code that displays all of the CUDA device information. Pretty much just a ‘hello world’ example.

  2. A matrix inversion example from jcuda.org.

  3. The small example at the start of the tutorial from jcuda.org.

I'm now trying to get the vector addition example to work, but I can't. I managed to compile the kernel using the Visual Studio command prompt, but it took a while because I am using Windows 7 x64, and I had to work out that I needed to use the -m32 option. After I compiled the kernel to a .ptx file, I now get the following error when I try to run the vector addition example:

jcuda.CudaException: CUDA_ERROR_UNKNOWN
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:249)
at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:1578)
at JCudaVectorAdd.main(JCudaVectorAdd.java:96)

And I have no idea what that means. After a brief internet search, I believe that it could relate to the version of JCuda/CUDA that I am using, but I don’t know…

The "unknown error" after launching the kernel is most likely due to an invalid PTX file. The sample tries to compile the PTX file at runtime (if it does not yet exist), but it only calls NVCC in the background. Did you try compiling the PTX file with
nvcc -ptx JCudaVectorAddKernel.cu -o JCudaVectorAddKernel.ptx
(or maybe using the additional --machine flag for specifying the architecture)?

If the problems persist, I’ll check this again on a Win64 machine and try to see where the problem might come from.

Yes, I used the above command to compile the .cu file (but with the additional -m32 in order to get it to work on x64). It created the .ptx file successfully, but now I get that error.

Before I did that, I used to get a different error, which was 'cl.exe missing from path', so I installed Visual C++, and then used that command prompt to compile the kernel, and the error changed to the 'unknown error' message.

Sorry, the "unknown error" is not very specific, so this is some guesswork right now: What happens if you explicitly compile it with -m64? I'll try to run some tests on a Win64 machine tomorrow.

The error I get when I use the -m64 option is as follows:

nvcc fatal : Visual Studio configuration file '' could not be found for installation at '… installation directory…'

If the information helps, I get this error message when I try to run the kernel launcher example from jcuda.org:

jcuda.CudaException: Could not create .cubin file: nvcc fatal : Cannot find compiler ‘cl.exe’ in PATH

For some reason, even after installing MSVC, cl.exe hasn't been added to PATH, and I have no idea why…

Concerning the attempt to compile it with -m64: It might be necessary to install 64-bit support in Visual Studio. This can be chosen during the installation, but IIRC it is disabled by default. You may want to have a look at http://msdn.microsoft.com/en-en/library/ms241066.aspx

The second error message was once reported in this thread: http://forum.byte-welt.de/showthread.php?t=3260 - maybe the solution there also works for you.

Ok, some good news:

I reinstalled a trial version of MSVC Pro, and now when I compile the JCudaVectorAddKernel.cu file with NVCC on the Visual Studio command prompt, with the -m64 option, the file compiles successfully, and when I run the program I get:

"test PASSED"

I assume that this is the desired outcome of the program? (It looks like it, looking at the code)

However, if I do not compile the .ptx file using the command prompt, then I get the same error as before:

'cl.exe missing from path'

Will I have to manually change this? A reinstallation seemed to have no effect.

I have read the solution presented in the thread that you linked, but I don't understand what the other poster actually did to solve the problem. I assume that in '1.' he set PATH to include cl.exe (although I'm not that sure how to do that myself). I have no idea what part '2.' refers to.

On another note, I have spent the last few days concentrating only on this, and have been reading quite a lot of the book I have, and as soon as I get this issue with cl.exe cleared up, I feel confident that I will be able to make a start at coding my own implementation of JCuda. I actually understand quite a bit more than I did previously.

Thanks for all the help! :)

Ok, I have added cl.exe to PATH. (I just searched for a tutorial, but they are very vague when it comes to Windows 7.)

I still don’t really know what this means:

"2. The second issue is related to the nvcc.profile, which required the following line:

INCLUDES += “-I$(TOP)/include” “-I$(TOP)/include/cudart” “-IC:/Program Files (x86)/Microsoft Visual Studio 9.0/VC/include” $(SPACE)"