JOCL running slow?


I’m trying some examples from the JOCL site. The first sample, for example, seems to take much longer to run on the GPU than the equivalent CPU code, for varying values of n. Can anyone help me with this problem?



That’s not really a problem. The JOCLSample is only a basic demonstration of the usage. The “Vector Addition” example is frequently used for samples in OpenCL, CUDA and other parallel APIs, so people will recognize it easily.

The reason why it is slower is that there is hardly any computation involved. Most of the time is used for copying the memory to the device, and afterwards copying the result memory back to the host. The only “computation” that is done is a simple multiplication, which does not outweigh the memory transfer times. The computation is “memory bound”, which roughly means that the ratio between “amount of data that is transferred” and “amount of computation that is performed on the data” is very high.
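To put a rough number on that ratio, here is a small sketch in plain Java (no OpenCL involved). The class and method names are made up for illustration, and it assumes that the sample moves three float arrays (two inputs, one output) and performs a single multiplication per element:

```java
final class TransferRatio {

    /** Bytes moved between host and device per arithmetic operation. */
    static double bytesPerOperation(int n, int arraysTransferred, int opsPerElement) {
        long bytes = (long) n * arraysTransferred * Float.BYTES;
        long ops = (long) n * opsPerElement;
        return (double) bytes / ops;
    }

    public static void main(String[] args) {
        // Two input arrays and one output array, one multiplication per element
        System.out.println(bytesPerOperation(100000, 3, 1)); // prints 12.0
    }
}
```

Twelve bytes of traffic for one multiplication is a lot: the transfer dominates, no matter how fast the device can multiply.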

Objectively comparing OpenCL and Java is hard, for several reasons. Did you include the time for the OpenCL setup in your measurements? And/or the time for the memory allocation? And/or the time for the memory copies? What’s the influence of Java’s JIT?
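As a minimal sketch of how the JIT influence can be reduced on the Java side: run the code a few times untimed before measuring, and only then average over several timed runs. The class name, the helper method and the run counts below are arbitrary choices for illustration, not part of JOCL:

```java
final class TimingSketch {

    /**
     * Runs the task warmupRuns times without timing it, then returns the
     * average duration in milliseconds over timedRuns further runs.
     */
    static double averageMillis(Runnable task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // gives the JIT a chance to compile the hot loop
        }
        long before = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            task.run();
        }
        long after = System.nanoTime();
        return (after - before) * 1e-6 / timedRuns;
    }

    public static void main(String[] args) {
        float[] input = new float[100000];
        float[] result = new float[input.length];
        for (int i = 0; i < input.length; i++) {
            input[i] = i;
        }
        Runnable javaVersion = () -> {
            for (int i = 0; i < input.length; i++) {
                result[i] = (float) Math.sin(input[i]);
            }
        };
        System.out.println("Java " + averageMillis(javaVersion, 20, 10) + " ms per run");
    }
}
```

The same helper could wrap the OpenCL call. Whether setup, allocation and copies are inside or outside the timed task is then an explicit decision instead of an accident.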

In general, you can tweak most benchmarks in order to show what you want them to show. The following example should not be taken seriously, but shows what happens when you add some arbitrary (but expensive) trigonometric computations, and omit the time for OpenCL setup and memory allocations in the measurements:

package org.jocl.samples;

import static org.jocl.CL.*;

import org.jocl.*;

public class JOCLPseudoBenchmark
{
    /**
     * The source code of the OpenCL program to execute
     */
    private static String programSource =
        "__kernel void "+
        "sampleKernel(__global const float *a,"+
        "             __global const float *b,"+
        "             __global float *c)"+
        "{"+
        "    int gid = get_global_id(0);"+
        "    c[gid] = cos(sin(cos(sin(a[gid] * b[gid]))));"+
        "}";

    /**
     * The plain Java version of the same computation
     */
    private static void computeJava(int n, float srcArrayA[], float srcArrayB[], float dstArray[])
    {
        for (int i=0; i<n; i++)
        {
            dstArray[i] = (float)Math.cos(Math.sin(Math.cos(Math.sin(srcArrayA[i] * srcArrayB[i]))));
        }
    }

    private static cl_context context;
    private static cl_command_queue commandQueue;
    private static cl_program program;
    private static cl_kernel kernel;
    private static cl_mem srcMemA;
    private static cl_mem srcMemB;
    private static cl_mem dstMem;

    public static void main(String args[])
    {
        int n = 100000;
        float srcArrayA[] = new float[n];
        float srcArrayB[] = new float[n];
        float dstArray[] = new float[n];
        for (int i=0; i<n; i++)
        {
            srcArrayA[i] = i;
            srcArrayB[i] = i;
        }

        // The OpenCL setup and the memory allocation are
        // intentionally done outside of the timed sections
        coreSetup();
        memorySetup(n);

        long before = 0;
        long after = 0;
        int runs = 10;

        before = System.nanoTime();
        for (int i=0; i<runs; i++)
        {
            computeJava(n, srcArrayA, srcArrayB, dstArray);
        }
        after = System.nanoTime();
        System.out.println("Java "+(after-before)*1e-6+" ms");

        before = System.nanoTime();
        for (int i=0; i<runs; i++)
        {
            computeJOCL(n, srcArrayA, srcArrayB, dstArray);
        }
        after = System.nanoTime();
        System.out.println("JOCL "+(after-before)*1e-6+" ms");

        memoryShutdown();
        coreShutdown();
    }

    private static void coreSetup()
    {
        // The platform, device type and device number
        // that will be used
        final int platformIndex = 0;
        final long deviceType = CL_DEVICE_TYPE_ALL;
        final int deviceIndex = 0;

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Obtain the number of platforms
        int numPlatformsArray[] = new int[1];
        clGetPlatformIDs(0, null, numPlatformsArray);
        int numPlatforms = numPlatformsArray[0];

        // Obtain a platform ID
        cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_platform_id platform = platforms[platformIndex];

        // Initialize the context properties
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);

        // Obtain the number of devices for the platform
        int numDevicesArray[] = new int[1];
        clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
        int numDevices = numDevicesArray[0];

        // Obtain a device ID
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
        cl_device_id device = devices[deviceIndex];

        // Create a context for the selected device
        context = clCreateContext(
            contextProperties, 1, new cl_device_id[]{device},
            null, null, null);

        // Create a command-queue for the selected device
        commandQueue =
            clCreateCommandQueue(context, device, 0, null);

        // Create the program from the source code
        program = clCreateProgramWithSource(context,
            1, new String[]{ programSource }, null, null);

        // Build the program
        clBuildProgram(program, 0, null, null, null, null);

        // Create the kernel
        kernel = clCreateKernel(program, "sampleKernel", null);
    }

    private static void coreShutdown()
    {
        // Release kernel, program, command queue and context
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);
    }

    private static void memorySetup(int size)
    {
        // Allocate the device buffers for the input and output data
        srcMemA = clCreateBuffer(context, CL_MEM_READ_ONLY,
            Sizeof.cl_float * size, null, null);
        srcMemB = clCreateBuffer(context, CL_MEM_READ_ONLY,
            Sizeof.cl_float * size, null, null);
        dstMem = clCreateBuffer(context, CL_MEM_READ_WRITE,
            Sizeof.cl_float * size, null, null);
    }

    private static void memoryShutdown()
    {
        // Release the device buffers
        clReleaseMemObject(srcMemA);
        clReleaseMemObject(srcMemB);
        clReleaseMemObject(dstMem);
    }

    static void computeJOCL(int n, float srcArrayA[], float srcArrayB[], float dstArray[])
    {
        Pointer srcA = Pointer.to(srcArrayA);
        Pointer srcB = Pointer.to(srcArrayB);
        Pointer dst = Pointer.to(dstArray);

        // Copy the input data to the device
        clEnqueueWriteBuffer(commandQueue, srcMemA, CL_TRUE, 0,
            n * Sizeof.cl_float, srcA, 0, null, null);
        clEnqueueWriteBuffer(commandQueue, srcMemB, CL_TRUE, 0,
            n * Sizeof.cl_float, srcB, 0, null, null);

        // Set the arguments for the kernel
        clSetKernelArg(kernel, 0,
            Sizeof.cl_mem, Pointer.to(srcMemA));
        clSetKernelArg(kernel, 1,
            Sizeof.cl_mem, Pointer.to(srcMemB));
        clSetKernelArg(kernel, 2,
            Sizeof.cl_mem, Pointer.to(dstMem));

        // Set the work-item dimensions
        long global_work_size[] = new long[]{n};

        // Execute the kernel
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
            global_work_size, null, 0, null, null);

        // Read the output data
        clEnqueueReadBuffer(commandQueue, dstMem, CL_TRUE, 0,
            n * Sizeof.cl_float, dst, 0, null, null);
    }
}


Thanks a lot for that help and especially the sample code. Is there any way to know whether a program would benefit from using OpenCL? At my school (research) they want me to try to migrate large amounts of Java code to OpenCL, since it’s working on large datasets and they believe it will help distribute the workload. But if it won’t speed things up, it’s not worth it.


Of course that heavily depends on the nature of the application.

In general, there are two conditions that should be fulfilled in order to really benefit from porting a computation to the GPU:

First of all, the tasks have to be data-parallel. That roughly means: Every thread is doing the same thing, but on a different part or area of a large data block. An OpenCL kernel is conceptually executed by thousands of threads, and all of these threads run the same code. They only access different elements of the input data, and these elements are identified by the ‘get_global_id’ of the thread.
In the best case, each element of a larger data block can be treated independently of each other.
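A small plain-Java sketch of the difference (the class and method names are invented for illustration, and the parallel stream only stands in for OpenCL work-items): the first method handles each element on its own, the second one needs the previous result and is therefore not data-parallel as written.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class DataParallelSketch {

    /** Data-parallel: element gid depends only on in[gid], so any order works. */
    static void squareElement(int gid, float[] in, float[] out) {
        out[gid] = in[gid] * in[gid];
    }

    /** Not data-parallel as written: element i needs the result for element i - 1. */
    static void runningSum(float[] in, float[] out) {
        out[0] = in[0];
        for (int i = 1; i < in.length; i++) {
            out[i] = out[i - 1] + in[i];
        }
    }

    public static void main(String[] args) {
        float[] in = {1, 2, 3, 4};

        // Each index plays the role of one work-item; the execution order is arbitrary
        float[] squares = new float[in.length];
        IntStream.range(0, in.length).parallel()
                 .forEach(gid -> squareElement(gid, in, squares));

        float[] sums = new float[in.length];
        runningSum(in, sums);

        System.out.println(Arrays.toString(squares)); // prints [1.0, 4.0, 9.0, 16.0]
        System.out.println(Arrays.toString(sums));    // prints [1.0, 3.0, 6.0, 10.0]
    }
}
```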

The second condition is that the tasks are compute-bound. This is the case when the computation that is required for every element is arithmetically expensive (in contrast to the “memory bound” nature of the original sample, as explained above).

These statements are put simply and sound very strict in this form. Of course, there are no clear borders, so these conditions may only serve as a guideline. There are many tasks which are “not perfectly data parallel” or “not very compute intensive” but may nevertheless benefit from being ported to the GPU.

There are also problems which seem to be inherently sequential at first glance, but may be ported to the GPU anyway, and there are many tricks and sophisticated optimization techniques. The Reduction example from the CUDA SDK contains a PDF that shows several of them. Although it targets CUDA, most of these techniques are also applicable in OpenCL. But it quickly gets complicated, and for first experiments, you are best off when you don’t have to cope with these details too much.
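To give an idea of why a seemingly sequential task like summing an array can be parallelized at all, here is a sketch of a pairwise (tree) reduction, emulated in plain Java. It is not taken from the CUDA document, and the names are made up. It only shows the basic pattern: within one pass, every iteration of the inner loop is independent and could be handled by its own GPU thread.

```java
final class ReductionSketch {

    /** Sums the array in log2(n) passes instead of one long dependent chain. */
    static float treeSum(float[] input) {
        if (input.length == 0) {
            return 0f;
        }
        float[] data = input.clone();
        int active = data.length;
        while (active > 1) {
            int half = (active + 1) / 2;
            // The iterations of this loop read from [half, active) and write
            // to [0, active / 2), so they do not depend on each other
            for (int i = 0; i < active / 2; i++) {
                data[i] += data[i + half];
            }
            active = half;
        }
        return data[0];
    }

    public static void main(String[] args) {
        float[] values = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(treeSum(values)); // prints 36.0
    }
}
```

A real GPU version additionally has to deal with work-group sizes, local memory and synchronization between the passes, which is where it gets complicated.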

You should first of all identify the bottleneck of your application. Then you could carefully examine whether these parts are really data-parallel and compute intensive. And if this is the case, you could try to create an OpenCL implementation for the really time-consuming core of the computation (and this core should preferably be small and well defined, and not consist of the whole application…)

I personally found it helpful to first write a Java version of the core methods in the “style” of OpenCL kernels: Static methods that receive all their input in the form of plain arrays. These may help to identify the core computation, and can easily be ported to OpenCL kernels. And… since there is no such thing as an “ArrayIndexOutOfBoundsException” in OpenCL, such methods may also turn out to be very helpful for debugging :wink:
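A minimal sketch of what such a kernel-style Java method might look like, reusing the computation from the pseudo-benchmark. The class name and the ‘launch’ helper are invented for illustration. The loop merely stands in for what clEnqueueNDRangeKernel does on the device:

```java
final class KernelStyleSketch {

    /** Java counterpart of the kernel body: plain arrays in, one element per call. */
    static void sampleKernel(int gid, float[] a, float[] b, float[] c) {
        c[gid] = (float) Math.cos(Math.sin(Math.cos(Math.sin(a[gid] * b[gid]))));
    }

    /** Calls the kernel method once for every global ID. */
    static void launch(int globalWorkSize, float[] a, float[] b, float[] c) {
        for (int gid = 0; gid < globalWorkSize; gid++) {
            sampleKernel(gid, a, b, c);
        }
    }

    public static void main(String[] args) {
        float[] a = {0, 1, 2, 3};
        float[] b = {0, 1, 2, 3};
        float[] c = new float[a.length];

        launch(a.length, a, b, c);

        try {
            // A work size that is too large is reported immediately in Java,
            // whereas the OpenCL kernel would access invalid memory
            launch(a.length + 1, a, b, c);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Caught out-of-bounds access: " + e.getMessage());
        }
    }
}
```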