Simulation with JCuda

Ok, sorry to add another post.

I retried the JCudaVectorAdd example - and this now works even without a .ptx file. It seems that adding to the PATH manually cleared the problem.

So, I tried the KernelLauncher example, and this ran longer than it previously did, but then it stopped and gave this error:

jcuda.CudaException: CUDA_ERROR_INVALID_SOURCE
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:249)
at jcuda.driver.JCudaDriver.cuModuleLoadDataEx(JCudaDriver.java:1857)
at jcuda.utils.KernelLauncher.initModule(KernelLauncher.java:688)
at jcuda.utils.KernelLauncher.create(KernelLauncher.java:395)
at jcuda.utils.KernelLauncher.create(KernelLauncher.java:321)
at jcuda.utils.KernelLauncher.compile(KernelLauncher.java:270)
at KernelLauncherSample.main(KernelLauncherSample.java:36)

I’m not convinced that it isn’t because there is more I need to add to the PATH, but I don’t know.

EDIT: Ok, so I tried a few more things, and nothing helped.

So, the JCudaVectorAdd problem is gone - that works fine now.

The only problem I have left so far is the error shown above - I have no idea what is causing that during the KernelLauncher example.

Ok, I have solved the problem:

I found this thread: http://forum.byte-welt.net/archive/index.php/t-2958.html

And I used the solution posted there. I don’t understand why the addition of “-sm_20” worked, but it does!
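For reference, a minimal sketch of what the fix looks like (assuming, as the linked thread suggests, that the flag is simply passed on to the NVCC as an additional argument of the compile call - ‘sourceCode’ and ‘functionName’ are placeholders here):

// Hypothetical call: pass the architecture flag from the linked
// thread on to the NVCC when compiling the kernel source
KernelLauncher launcher =
    KernelLauncher.compile(sourceCode, functionName, "-sm_20");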

Thanks! :slight_smile:

Now I’ll try writing my own implementation…

That’s good news. I think that the initial setup, including the installation and the compilation of the first PTX file, may be the greatest hurdle before getting started with real programming. Once the vector addition example works as expected (indicated by the output ‘PASSED’), one has the basis for writing one’s own first kernels.

Concerning the KernelLauncher and the -sm_20: The KernelLauncher sample still uses CUBIN files, which are specific to the device version (i.e. the “Compute Capability”). As explained in the tutorial, PTX files are more flexible and versatile, and the KernelLauncher will soon be updated to prefer PTX files as well.

Originally, the KernelLauncher was mainly intended to simplify the tedious invocation of kernels in pre-4.0 versions of CUDA. You may still find examples of how this had to be done, setting individual parameters and obeying alignment requirements… In CUDA 4.0, this is much simpler. However, since the KernelLauncher still offers some convenient functions for compiling and loading modules, it will also be maintained in the future.

I found the code I was thinking of. Go to jblas.org and take a look at the Geometry class. This may give hints on how to use JCublas as the underlying engine.

I did manage to integrate JCuda into my simulation and write my own kernel (which works). However, since it was a rushed job, just done to see if it would work, it is much slower than my original simulation.

Now that I have something working, I intend to re-write the entire simulation with parallelisation (with JCuda) in mind - which will hopefully make the whole thing run much better.

Thanks for the link - I’ll take a look now!

:slight_smile:

Hello

As I mentioned above: Maybe you do not have to rewrite everything from scratch. The most “severe” change would probably be to switch everything from ‘double’ to ‘float’ (assuming that you are using float values in your kernel?), but this is not a real “structural” modification - in the best case it can be done by replacing text and adding some casts.

And concerning the “structural” aspect: In order to keep as much as possible of your original pipeline, you might consider writing the “Element” differently. This is just a rough idea - NOT a recommendation - but you might consider it and see if it suits your needs. Until now, you presumably have something like

class Element
{
    float x;
    float y;

    float getXposition()
    {
        return x;
    }
    ...
}

class Sim
{
    private List<Element> elements = new ArrayList<Element>();

    private void createElements(int n)
    {
        for (int i=0; i<n; i++) 
        {
            elements.add(new Element());
        }
    }

    private void computeSomething()
    {
        for (int i=0; i<elements.size(); i++)
        {
            Element e = elements.get(i);
            float x = e.getXposition();
            float y = e.getYposition();
            // Modify x and y somehow...
            ...
            e.setXposition(x);
            e.setYposition(y);
        }
    }
}

And most likely, there are several methods like ‘computeSomething’ representing individual steps of the simulation. If you replaced this by something like

class Element
{
    private float array[];
    private int index;

    Element(float array[], int index)
    {
        this.array = array;
        this.index = index;
    }

    float getXposition()
    {
        return array[index*2+0];
    }
    float getYposition()
    {
        return array[index*2+1];
    }
}

class Sim
{
    private float elementPositions[];
    private List<Element> elements = new ArrayList<Element>();

    private void createElements(int n)
    {
        elementPositions = new float[n*2];
        for (int i=0; i<n; i++) 
        {
            Element element = new Element(elementPositions, i);
            elements.add(element);
        }
    }

...

Then you could replace individual methods of the simulation with CUDA functions that directly operate on the ‘elementPositions’. This could be done step by step, still keeping the other methods, which may operate on the Elements just like they did before (using the getXposition/getYposition methods).

Of course, there are some assumptions (e.g. that the order of the elements in the list does not change), but it might be a feasible approach, compared to a “big bang” rewrite of everything from scratch…
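To make this a bit more concrete, here is a rough sketch of how one simulation step could then hand the shared position array over to CUDA (only the memory handling is shown - the kernel itself is hypothetical, and the allocation could of course be pulled out of the per-step code):

// Copy the shared 'elementPositions' array to the device
CUdeviceptr devicePositions = new CUdeviceptr();
cuMemAlloc(devicePositions, elementPositions.length * Sizeof.FLOAT);
cuMemcpyHtoD(devicePositions, Pointer.to(elementPositions),
    elementPositions.length * Sizeof.FLOAT);

// ... launch a kernel here that modifies the positions in place ...

// Copy the result back. All Elements immediately see the new
// values, because they read from the same 'elementPositions' array
cuMemcpyDtoH(Pointer.to(elementPositions), devicePositions,
    elementPositions.length * Sizeof.FLOAT);
cuMemFree(devicePositions);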

bye
Marco

Thanks for that suggestion.

I have made a few modifications to the altered program, and it now runs faster than my original. However, I still think there is plenty of room for even better performance.

Currently in my program, I have block size set to 256:

int blockSizeX = 256;

How is this value chosen? If I have a list of my GPU capabilities, is there a way to select a more optimal block size?

Choosing the optimal block size is a science :wink: As a rule of thumb, you may get a higher occupancy with a larger block size (that roughly means that you keep all the cores of the GPU busy, and thus achieve a high performance), but there are other things that may influence the performance. For example, when you are using many local variables, the GPU may run out of registers when the block size is too large, which may cause it to slow down again. The shared memory also plays a role, and all that also depends on the Compute Capability…
NVIDIA created the Occupancy Calculator (Forum Thread with link) that helps to optimize the block size, but admittedly, I did not yet systematically use it for own purposes and optimizations, so I can’t help you so much with that.
For the beginning: if you have a maximum block size of 512, then for “simple” kernels, block sizes of 256 or 512 should be good candidates to start with.
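If you want to query such limits at runtime, the driver API offers device attributes - a minimal sketch, assuming an already initialized CUdevice:

// Query the maximum number of threads per block for this device
int maxThreadsPerBlock[] = { 0 };
cuDeviceGetAttribute(maxThreadsPerBlock,
    CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, device);
System.out.println("Max threads per block: " + maxThreadsPerBlock[0]);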

The NVIDIA programming guide also has relevant information about block sizes and shared memory array sizes.

If you are sure that your kernels do not exceed the maximum number of registers per SM when using 8 blocks, a general first approach is:

/**
 * Number of Threads in a Block
 *
 * Maximum number of resident blocks per multiprocessor : 8
 *
 * ///////////////////
 * Compute capability:
 * ///////////////////
 *
 * Cuda [1.0 - 1.1] =	
 *	Maximum number of resident threads per multiprocessor 768
 *	Optimal Usage: 768 / 8 = 96
 * Cuda [1.2 - 1.3] =
 *	Maximum number of resident threads per multiprocessor 1024
 *	Optimal Usage: 1024 / 8 = 128
 * Cuda [2.x] =
 *	Maximum number of resident threads per multiprocessor 1536
 *	Optimal Usage: 1536 / 8 = 192
 */
public static int BLOCK_SIZE_DEF = 96;

With this you make sure the architecture is completely busy!

This information is from the following book and is easy to follow:
“Programming Massively Parallel Processors, A Hands-on Approach”

Since I have not yet checked the performance of all my kernels regarding their register counts, block sizes, and shared memory array sizes, I still take this approach with caution. But for a start it’s certainly a good choice!
And all the data can be retrieved programmatically.
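For example, the compute capability can be queried via the driver API, and the default from the table above chosen accordingly (a sketch, assuming an already initialized CUdevice):

// Query the compute capability and pick the matching default
int major[] = { 0 };
int minor[] = { 0 };
cuDeviceComputeCapability(major, minor, device);
int blockSizeDef = 96;                                  // Cuda [1.0 - 1.1]
if (major[0] == 1 && minor[0] >= 2) blockSizeDef = 128; // Cuda [1.2 - 1.3]
if (major[0] >= 2) blockSizeDef = 192;                  // Cuda [2.x]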

Reduction Kernel

/**
 * Number of Threads in a Block, mainly for reduction kernels,
 * must be a power of two variable
 *
 * ///////////////////
 * Compute capability:
 * ///////////////////
 *
 * Cuda [1.0 - 1.1] =	
 *	BLOCK_SIZE_DEF = 96 --> 128 --> results in usage of 6 of 8 possible resident blocks per multiprocessor
 * Cuda [1.2 - 1.3] =
 *	BLOCK_SIZE_DEF = 128 --> 128
 * Cuda [2.x] =
 *	BLOCK_SIZE_DEF = 192 --> 256 --> results in usage of 6 of 8 possible resident blocks per multiprocessor
 */
public static int BLOCK_SIZE_POW2 = 128;

If you use a reduction kernel, you usually need a block size that is a power of two. In the above case, only Cuda [1.2 - 1.3] fulfills this by default, so I round the others up to the next greater power of two.
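(Rounding up to the next power of two can also be done programmatically, with a small helper like this:)

// Round a block size up to the next power of two,
// e.g. 96 -> 128, 192 -> 256, and 128 stays 128
static int nextPowerOfTwo(int n)
{
    int p = 1;
    while (p < n) p <<= 1;
    return p;
}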

For my Bachelor thesis I don’t have enough time to do a complete performance and occupancy analysis for each kernel, so I will stick with those default values.

Ok,

So what I am finding is that even though my simulation works with JCuda, the version without it is very slightly faster.

I have spent some time looking it over, but I can’t figure out how to make it quicker. It may simply be that I’m not making efficient use of CUDA in this case. I’ll post the code I’m using for the CPU-only simulation, and then I’ll post what I did using CUDA. If anyone can help in identifying what might be particularly slowing down the simulation, I would be very grateful.

CPU only simulation:



public void Event()
    {
        Random selectelement = new Random();
        theselection = (selectelement.nextInt(concern.size()));
        Element e = concern.get(theselection);
        ArrayList<Element> local = e.getLocal();
        elements.remove(e);
        local.clear();
        for(Element a : elements) {
            double x1;
            double y1;
            double x2;
            double y2;
            double r;
            r = e.getRange();
            x1 = e.getXposition();
            y1 = e.getYposition();
            x2 = a.getXposition();
            y2 = a.getYposition();
            dist = Math.sqrt(Math.pow(x1 - x2, 2) + Math.pow(y1 - y2, 2));
            if(dist <= r){
                local.add(a);
            }
        }
        elements.add(e);

... the rest of the code after is the same in both simulations ...


With CUDA:



private CUfunction function;
private CUcontext context;

...


public void Event(int numsteps) throws IOException
    {
        // Enable exceptions and omit all subsequent error checks
        JCudaDriver.setExceptionsEnabled(true);
        
        // Create the PTX file by calling the NVCC
        String ptxFileName = preparePtxFile("CalculateNeighbourhood.cu");
        
        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);
        
        // Load the ptx file.
        CUmodule module = new CUmodule();
        cuModuleLoad(module, ptxFileName);
        
        // Obtain a function pointer to the "distance" function.
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "distance");
        
        // Allocate and fill the host input data
        float hostInputX[] = new float[numElements];
        float hostInputY[] = new float[numElements];
        
        int numberofcycles = 0;
        while(numberofcycles < numsteps) {
            
            Random selectelement = new Random();
            theselection = (selectelement.nextInt(concern.size()));
            Element e = concern.get(theselection);
            ArrayList<Element> local = e.getLocal();
            elements.remove(e);
            local.clear();
        
            double eXX = e.getXposition();
            double eYY = e.getYposition();
        
            eX = (float) eXX;
            eY = (float) eYY;
        
            // Fill the float arrays with the correct positional data
            while(elementcount <= (numberofelements - 2)) {
                Element a = elements.get(elementcount);
                float x1;
                float y1;
                double x2;
                double y2;
                x2 = a.getXposition();
                y2 = a.getYposition();
                x1 = (float) x2;
                y1 = (float) y2;
                hostInputX[elementcount] = x1;
                hostInputY[elementcount] = y1;
                elementcount++;
            }
            elementcount = 0;
        
            // Allocate the device input data, and copy the
            // host input data to the device
            CUdeviceptr deviceInputX = new CUdeviceptr();
            cuMemAlloc(deviceInputX, numElements * Sizeof.FLOAT);
            cuMemcpyHtoD(deviceInputX, Pointer.to(hostInputX), numElements * Sizeof.FLOAT);
            CUdeviceptr deviceInputY = new CUdeviceptr();
            cuMemAlloc(deviceInputY, numElements * Sizeof.FLOAT);
            cuMemcpyHtoD(deviceInputY, Pointer.to(hostInputY), numElements * Sizeof.FLOAT);
        
            // Allocate device output memory
            CUdeviceptr deviceOutput = new CUdeviceptr();
            cuMemAlloc(deviceOutput, numElements * Sizeof.FLOAT);
        
            // Set up the kernel parameters: A pointer to an array
            // of pointers which point to the actual values.
            Pointer kernelParameters = Pointer.to(
                Pointer.to(new int[]{numElements}),
                Pointer.to(deviceInputX),
                Pointer.to(deviceInputY),
                Pointer.to(new float[]{eX}),
                Pointer.to(new float[]{eY}),
                Pointer.to(deviceOutput)
                );
        
            // Call the kernel function.
            int blockSizeX = 512;
            int gridSizeX = (int)Math.ceil((double)numElements / blockSizeX);
            cuLaunchKernel(function,
                gridSizeX,  1, 1,      // Grid dimension
                blockSizeX, 1, 1,      // Block dimension
                0, null,               // Shared memory size and stream
                kernelParameters, null // Kernel- and extra parameters
                );
            cuCtxSynchronize();

            // Allocate host output memory and copy the device output
            // to the host.
            float hostOutput[] = new float[numElements];
            cuMemcpyDtoH(Pointer.to(hostOutput), deviceOutput, numElements * Sizeof.FLOAT);
        
            for(int i = 0; i < numElements; i++) {   
                if(Math.abs(hostOutput[i]) > 0) {
                    Element b = elements.get(i);
                    local.add(b);
                }
            }
        
            // Clean up.
            cuMemFree(deviceInputX);
            cuMemFree(deviceInputY);
            cuMemFree(deviceOutput);

... More code the same in both...

... Then at the end of the Method:

cuCtxDestroy(context);


The Kernel:



extern "C"
__global__ void distance(int n, float *a, float *b, float c, float d, float *local)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i<n)
    {
	float e = a[i] - c;
	float f = b[i] - d;
	float dist = e*e + f*f;
	float range = 28;
	if (dist <= range*range) local[i] = 1;
	else local[i] = 0; 
    }

}


These are two sets of code that take the element in question and then calculate which elements in the simulation are in range of that element.

I know it’s almost rude to throw out a pile of code and ask if anyone can clean it up, but I am really stuck with this. It may simply be that I’m not doing enough with the kernel in order for this simulation to be improved with CUDA, but I don’t know.

Thanks

Sam

Hello

Alluding to your signature: Some aspects have already been pointed out in the previous responses :wink: This also refers to the plain Java simulation - but I see (or assume) that you first wanted to get it running, and now we can see what can be optimized - and how.

There are some possible glitches in the code, but I only quickly looked over it, so I’ll go into details when I have more time and after the first questions have been clarified. (Still, it might be the case that the computation is too “simple” (requires relatively few arithmetic operations) and too memory bound to achieve a good speedup, but we’ll see…)
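One small example on the plain Java side: the CPU loop computes Math.sqrt(Math.pow(x1 - x2, 2) + Math.pow(y1 - y2, 2)) for every pair. Comparing squared distances, as your kernel already does, avoids both of these relatively expensive calls:

// Compare squared distances instead of taking the square root
double dx = x1 - x2;
double dy = y1 - y2;
if (dx*dx + dy*dy <= r*r) {
    local.add(a);
}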

First of all: It is not completely clear when and how the main “Event” method will be called. Although the number of steps is passed as a parameter, it looks like it might be the case that this method is called frequently - assuming that the usage is roughly like this:

Simulation sim = new Simulation();
for (int i=0; i<runs; i++)
{
    sim.Event(10);
}

IF this is the case, and IF the Event method is called several times, you might first try to pull out everything that is related to the initialization. The straightforward approach would be to pull out the first lines (up to ‘cuModuleGetFunction’) into a method - and possibly the corresponding code for shutting down:

class Simulation
{
    private CUfunction function;
    private CUmodule module;

    public void startup()
    {
        // initialize function and module here...
    }

    public void shutdown()
    {
        // Shutdown and release module etc.
    }

    public void Event(int numSteps)
    {
        // Allocate and fill the host input data
        float hostInputX[] = new float[numElements];
        ...
    }
}

This would turn the usage to

Simulation sim = new Simulation();
sim.startup();
for (int i=0; i<runs; i++)
{
    sim.Event(10);
}
sim.shutdown();

Even if this is NOT the case (and the method is only called once), the same first hint would apply to the core loop:

while(numberofcycles < numsteps) {
    // Allocate the device input data
...        
    // Allocate device output memory
...        
    // Set up the kernel parameters
...
    // Call the kernel function.
...
    // Allocate host output memory...
...
    // Clean up.
...
}

Since ‘numElements’ is known before the loop starts, and does not seem to change between the runs, the first step would be to pull the memory allocations and cleanups out of this core loop:

// Allocate the device input data
...        
// Allocate device output memory
...        
while(numberofcycles < numsteps) {

    // Copy the input data into the previously allocated device memory
...        
    // Set up the kernel parameters
...
    // Call the kernel function.
...
    // Copy the output data from the output memory back to Java
...
}
// Clean up.
...
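Spelled out with the calls from your code, that would look roughly like this (a sketch, reusing the same buffers across all steps):

// Allocate the device buffers and the host output array ONCE
CUdeviceptr deviceInputX = new CUdeviceptr();
cuMemAlloc(deviceInputX, numElements * Sizeof.FLOAT);
CUdeviceptr deviceInputY = new CUdeviceptr();
cuMemAlloc(deviceInputY, numElements * Sizeof.FLOAT);
CUdeviceptr deviceOutput = new CUdeviceptr();
cuMemAlloc(deviceOutput, numElements * Sizeof.FLOAT);
float hostOutput[] = new float[numElements];

while (numberofcycles < numsteps) {
    // ... fill hostInputX and hostInputY as before ...
    cuMemcpyHtoD(deviceInputX, Pointer.to(hostInputX),
        numElements * Sizeof.FLOAT);
    cuMemcpyHtoD(deviceInputY, Pointer.to(hostInputY),
        numElements * Sizeof.FLOAT);
    // ... set up the parameters, launch the kernel, synchronize ...
    cuMemcpyDtoH(Pointer.to(hostOutput), deviceOutput,
        numElements * Sizeof.FLOAT);
    // ... evaluate hostOutput as before ...
}

cuMemFree(deviceInputX);
cuMemFree(deviceInputY);
cuMemFree(deviceOutput);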

But it might even be the case that there is a tricky way to avoid copying the memory back and forth in each step. This could be the most promising optimization, since in many CUDA applications, the memory transfer is the bottleneck. But in order to give more specific hints here, I’ll have to review in detail what is actually done there (with the ‘local’ list and so on…).

So far,
Marco

In the standard CPU simulation, the Event method is called repeatedly; once for each time step.

In the CUDA simulation, the Event method is called only once, but the internal loop is passed many thousands of times, once for each time step.

OK, then pulling out the memory allocations/freeing might already bring a speedup, but probably not sooo much. It would be more important to try to avoid the memory copies. Do you think it’s possible to modify/adapt the algorithm in a way so that the memory copies can be avoided? This might depend on what is happening with the ‘local’ array afterwards. I cannot promise anything, but if you could provide more code (possibly even something that can be compiled and tested) I could try to have a closer look at this during next week.

Ok, I’ll look at the code and try to copy what is relevant to get it working. I’m not certain whether or not the mem copies can be avoided…

I’ll copy exactly what happens with the local array.

Ok, so the first thing that happens to the local array is that its values are used to show which elements from the elements arraylist are within range. In local[i], if the integer is 0, the element from elements is not in range and it will move on to the next index, and so on… Those with a ‘1’ will be added to the local arraylist (owned by the originally selected element at the beginning of Event).



for(int i = 0; i < numElements; i++) {
    if(Math.abs(hostOutput[i]) > 0) {
        Element b = elements.get(i);
        local.add(b);
    }
}


All that happens after this is that values possessed by the selected element are tested, and then it may interact with elements in the ‘local’ arraylist. For example:

for(Element g : local) {
    State state2 = g.getState();
    Vector vector2 = state2.getVector();
    Value value2 = (Value) vector2.get(4);
    value2.changeValue(2);
    Counter gcounter = g.getStatecounter();
    gcounter.increment();
    //g.displayState();
}

So elements from the ‘local’ arraylist have their state changed (perhaps), but that is all. After a few similar events, nothing else happens and the arraylist ‘local’ is left untouched until the next time the element has been selected to act (according to the random selection at the start of the method).

When the element is selected to act again, its local arraylist is cleared and ready to be filled again after distance calculations.

Therefore, for every time step (every pass through the internal loop of Event), CUDA is used to find all of the elements that are in range.

This is done by making an output array from the kernel which is just an array of integers (called ‘local’ in the kernel) that is the same size as the arraylist of all elements in the simulation at the time. It simply makes each integer at each index position either ‘0’ or ‘1’ according to whether the element at that index position in the elements arraylist is in range. That way, each element at an index position where there is a ‘1’ in the local integer array can be added to the element’s local arraylist.

Nothing happens to the ‘local’ array after it has been used. It just has the values copied to the host for use.

A complete code for the Event method, just in case it is helpful (but probably not…):



 public void Event(int numsteps) throws IOException
    {
        long Bench = System.currentTimeMillis();
        System.out.println(Bench);
        // Enable exceptions and omit all subsequent error checks
        JCudaDriver.setExceptionsEnabled(true);
        
        // Create the PTX file by calling the NVCC
        String ptxFileName = preparePtxFile("CalculateNeighbourhood.cu");
        
        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);
        
        // Load the ptx file.
        CUmodule module = new CUmodule();
        cuModuleLoad(module, ptxFileName);
        
        // Obtain a function pointer to the "distance" function.
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "distance");
        
        // Allocate and fill the host input data
        float hostInputX[] = new float[numElements];
        float hostInputY[] = new float[numElements];
        
        int numberofcycles = 0;
        while(numberofcycles < numsteps) {
            global.increment();
            T = global.getCount();
            JLabel timecount = Controls.getTime();
            timecount.setText(" Time step: " + T);
            Movement();
            outputtimer.increment();
            otT = outputtimer.getCount();
            if(otT == 100000) {
                Output output = Output.getOutput();
                output.Output();
                outputtimer.reset();
            }
            
            Random selectelement = new Random();
            theselection = (selectelement.nextInt(concern.size()));
            Element e = concern.get(theselection);
            ArrayList<Element> local = e.getLocal();
            elements.remove(e);
            local.clear();
        
            double eXX = e.getXposition();
            double eYY = e.getYposition();
        
            eX = (float) eXX;
            eY = (float) eYY;
        
            // Fill the float arrays with the correct positional data
            while(elementcount <= (numberofelements - 2)) {
                Element a = elements.get(elementcount);
                float x1;
                float y1;
                double x2;
                double y2;
                x2 = a.getXposition();
                y2 = a.getYposition();
                x1 = (float) x2;
                y1 = (float) y2;
                hostInputX[elementcount] = x1;
                hostInputY[elementcount] = y1;
                elementcount++;
            }
            elementcount = 0;
        
            // Allocate the device input data, and copy the
            // host input data to the device
            CUdeviceptr deviceInputX = new CUdeviceptr();
            cuMemAlloc(deviceInputX, numElements * Sizeof.FLOAT);
            cuMemcpyHtoD(deviceInputX, Pointer.to(hostInputX), numElements * Sizeof.FLOAT);
            CUdeviceptr deviceInputY = new CUdeviceptr();
            cuMemAlloc(deviceInputY, numElements * Sizeof.FLOAT);
            cuMemcpyHtoD(deviceInputY, Pointer.to(hostInputY), numElements * Sizeof.FLOAT);
        
            // Allocate device output memory
            CUdeviceptr deviceOutput = new CUdeviceptr();
            cuMemAlloc(deviceOutput, numElements * Sizeof.FLOAT);
        
            // Set up the kernel parameters: A pointer to an array
            // of pointers which point to the actual values.
            Pointer kernelParameters = Pointer.to(
                Pointer.to(new int[]{numElements}),
                Pointer.to(deviceInputX),
                Pointer.to(deviceInputY),
                Pointer.to(new float[]{eX}),
                Pointer.to(new float[]{eY}),
                Pointer.to(deviceOutput)
                );
        
            // Call the kernel function.
            int blockSizeX = 512;
            int gridSizeX = (int)Math.ceil((double)numElements / blockSizeX);
            cuLaunchKernel(function,
                gridSizeX,  1, 1,      // Grid dimension
                blockSizeX, 1, 1,      // Block dimension
                0, null,               // Shared memory size and stream
                kernelParameters, null // Kernel- and extra parameters
                );
            cuCtxSynchronize();

            // Allocate host output memory and copy the device output
            // to the host.
            float hostOutput[] = new float[numElements];
            cuMemcpyDtoH(Pointer.to(hostOutput), deviceOutput, numElements * Sizeof.FLOAT);
        
            for(int i = 0; i < numElements; i++) {   
                if(Math.abs(hostOutput[i]) > 0) {
                    Element b = elements.get(i);
                    local.add(b);
                }
            }
        
            // Clean up.
            cuMemFree(deviceInputX);
            cuMemFree(deviceInputY);
            cuMemFree(deviceOutput);
        
            elements.add(e);
        
            State state = e.getState();
            Vector vector = state.getVector();
            Value valueBgp = (Value) vector.get(3);
            Value valueBp = (Value) vector.get(5);
            Counter ecounter1 = e.getStatecounter();
            Counter epcounter = e.getepCounter();
            Counter elementinternal = e.getInternalclock();
            Counter omcounter = e.getomCounter();
            elementinternal.increment();
            epcounter.increment();
            omcounter.increment();
            int checkcountepsilon;
            int checkcountomega;
            checkcountepsilon = epcounter.getCount();
            checkcountomega = omcounter.getCount();
        
        
            if(checkcountepsilon == epsilon) {
                Value allvalues = (Value) vector.get(2);
                allvalues.changeValue(0);
                allvalues = (Value) vector.get(3);
                allvalues.changeValue(0);
                allvalues = (Value) vector.get(4);
                allvalues.changeValue(0);
                allvalues = (Value) vector.get(5);
                allvalues.changeValue(0);
                ecounter1.increment();
                epcounter.reset();
            }
        
            if(checkcountomega == omega) {
                Value valueA = (Value) vector.get(0);
                Value valueB = (Value) vector.get(1);
                int A;
                int B;
                A = valueA.getValue();
                B = valueB.getValue();
                Value valueAgp = (Value) vector.get(2);
                valueBgp = (Value) vector.get(3);
                if(A != 0) {
                    valueAgp.changeValue(1);
                    ecounter1.increment();
                }
                if(B != 0) {
                    valueBgp.changeValue(1);
                    ecounter1.increment();
                }
                omcounter.reset();
            }
        
        
            int growingpointb = valueBgp.getValue();
        
            if(growingpointb != 0) {
                int Bp = valueBp.getValue();
                if(Bp == 0) {
                    valueBp.changeValue(35);
                    ecounter1.increment();
                }
            }
        
            int bpheromone = valueBp.getValue();
            if(bpheromone != 1) {
                Diffuse(e);
            }
        
            Value value = (Value) vector.get(2);
            int growingpointa = value.getValue();
            if(growingpointa != 0) {
                Value quantity = (Value) vector.get(4);
                quantity.changeValue(2);
                ecounter1.increment();
                Counter ecounter2 = e.getBroadcastcounter();
                ecounter2.increment();
                //e.displayState();
                for(Element g : local) {
                    State state2 = g.getState();
                    Vector vector2 = state2.getVector();
                    Value value2 = (Value) vector2.get(4);
                    value2.changeValue(2);
                    Counter gcounter = g.getStatecounter();
                    gcounter.increment();
                    //g.displayState();
                }
                Value value3 = (Value) vector.get(3);
                growingpointb = value3.getValue();
                if(growingpointb != 0) {
                    value3.changeValue(0);
                    value.changeValue(0);
                    move = true;
                    ecounter1.increment();
                    //run an output
                    Output output = Output.getOutput();
                    output.Output();
                }
            }
        
            value = (Value) vector.get(2);
            growingpointa = value.getValue();
            if(growingpointa != 0) {
                ArrayList<Element> temp = new ArrayList<Element>();
                for(Element l : local) {
                    temp.add(l);
                }
                for(Element t : temp) {
                    ArrayList<Element> localt = t.getLocal();
                    localt.clear();
                    for(Element a : elements) {
                        double x1;
                        double y1;
                        double x2;
                        double y2;
                        double r;
                        r = t.getRange();
                        x1 = t.getXposition();
                        y1 = t.getYposition();
                        x2 = a.getXposition();
                        y2 = a.getYposition();
                        dist = Math.sqrt(Math.pow(x1 - x2, 2) + Math.pow(y1 - y2, 2));
                        if(dist <= r){
                            localt.add(a);
                        }
                    }
                    for(Element s : localt) {
                        State stateS = s.getState();
                        Vector vectorS = stateS.getVector();
                        Value valueS = (Value) vectorS.get(2);
                        int growingpointaS = valueS.getValue();
                        if(growingpointaS != 0) {
                            State stateT = t.getState();
                            Vector vectorT = stateT.getVector();
                            Value valueT = (Value) vectorT.get(2);
                            int growingpointaT = valueT.getValue();
                            if(growingpointaT == 0) {
                                Value valueT2 = (Value) vectorT.get(5);
                                Value valueS2 = (Value) vectorS.get(5);
                                int bpheromoneT = valueT2.getValue();
                                int bpheromoneS = valueS2.getValue();
                                if(bpheromoneT > bpheromoneS) {
                                    valueS.changeValue(0);
                                    valueT.changeValue(1);
                                    concern.add(t);
                                }
                            }
                        }
                    }
                }
            }
            numberofcycles++;
        }
        
        numberofcycles = 0;
        Bench = System.currentTimeMillis();
        System.out.println(Bench);
        cuCtxDestroy(context);
    }


OK, there’s still a lot to be done apart from the computation of the distances. I can try to have a closer look at this next week, but although I only had a short look at it, I assume that it’s hardly possible to understand without broader knowledge about the context. Can you say whether there are other parts (that are really data-parallel) which may benefit from CUDA?

No, I don’t believe there are any other parts that would benefit from CUDA.

I think it may turn out that the use of CUDA, in this case, is not worth it due to the simplicity of the distance calculation (even though it might have to be done for tens of thousands of elements).

Although I’m not really sure.

Thank you for your help. If you need any more information from me, then I can try to explain more.

Well, again, it might be the case that the algorithm cannot easily and efficiently be mapped to CUDA. But I mentioned a few possible optimizations, and assume that the Java part is also still far from optimal. But of course I can’t be sure as long as I don’t have the appropriate overview.

Since the ‘local’ array inside the kernel is used repeatedly, I think it might be true that the output memory may not necessarily need to be allocated each time the kernel is used. Does that sound right? So, rather than allocating and freeing the memory each and every time, the array could simply be updated at each pass? I’m not sure if that makes sense.

Also, looking at the code, the part where the ‘x’ and ‘y’ coordinates are taken from each element and stored in float arrays, for input, could be made data-parallel, I suppose. I’m not sure if that would mean that two distinct kernels would need to be called to perform the task, though. I’m also not sure if it would be possible to pass the coordinates as an input array without having them stored in an array separately from the list of elements (rather than each pair of coordinates being inside each element).