Speed of JOCLSimpleConvolution

Hi all, I’ve just started working with JOCL and I’m trying to get my head around the processing speed. I’ve looked at the JOCLSimpleConvolution sample and I’m trying to figure out exactly what is happening, and how the Java and JOCL versions compare in speed.
When I run the edge detection kernel on a 1024x512 image, the timings are as follows:
Java: 18.75 ms
JOCL: 40.59 ms

I know that there is an overhead involved in the OpenCL calls, so I wanted to see exactly where the time in the JOCL version is being spent. To do this I looked inside the JOCLConvolveOp class. The OpenCL work happens within the

filter(BufferedImage src, BufferedImage dst) 

function, so I instrumented it with a simple event timer tool that I have (you call mark(id) to start a timer and tick(id) to stop it).

The code I used is:


        EventTimer t=  new EventTimer();
        t.mark("Total Time");

        t.mark("Validity checks");
        // Validity checks for the given images
        if (src.getType() != BufferedImage.TYPE_INT_RGB)
        {
            throw new IllegalArgumentException(
                    "Source image is not TYPE_INT_RGB");
        }
        if (dst == null)
        {
            dst = createCompatibleDestImage(src, null);
        } else if (dst.getType() != BufferedImage.TYPE_INT_RGB)
        {
            throw new IllegalArgumentException(
                    "Destination image is not TYPE_INT_RGB");
        }
        if (src.getWidth() != dst.getWidth()
                || src.getHeight() != dst.getHeight())
        {
            throw new IllegalArgumentException(
                    "Images do not have the same size");
        }
        t.tick("Validity checks");
        
        int imageSizeX = src.getWidth();
        int imageSizeY = src.getHeight();



        t.mark("Grabbing Src Buffer");
        // Create the memory object for the input-
        // and output image
        DataBufferInt dataBufferSrc = (DataBufferInt) src.getRaster()
                .getDataBuffer();
        int dataSrc[] = dataBufferSrc.getData();
        t.tick("Grabbing Src Buffer");
        


        t.mark("Create Input CL Memory");
        inputImageMem = clCreateBuffer(context, CL_MEM_READ_ONLY
                | CL_MEM_USE_HOST_PTR, dataSrc.length * Sizeof.cl_uint, Pointer.to(dataSrc), null);
        t.tick("Create Input CL Memory");
        

        t.mark("Create OUT CL  Memory");
        outputImageMem = clCreateBuffer(context, CL_MEM_WRITE_ONLY, imageSizeX
                * imageSizeY * Sizeof.cl_uint, null, null);
        t.tick("Create OUT CL  Memory");
        
        t.mark("Get Kernal Parms");
        // Set work sizes and arguments, and
        // execute the kernel
        int kernelSizeX = kernel.getWidth();
        int kernelSizeY = kernel.getHeight();
        int kernelOriginX = kernel.getXOrigin();
        int kernelOriginY = kernel.getYOrigin();

        long localWorkSize[] = new long[2];
        localWorkSize[0] = kernelSizeX;
        localWorkSize[1] = kernelSizeY;

        long globalWorkSize[] = new long[2];
        globalWorkSize[0] = round(localWorkSize[0], imageSizeX);
        globalWorkSize[1] = round(localWorkSize[1], imageSizeY);

        int imageSize[] = new int[] { imageSizeX, imageSizeY };
        int kernelSize[] = new int[] { kernelSizeX, kernelSizeY };
        int kernelOrigin[] = new int[] { kernelOriginX, kernelOriginY };
        t.tick("Get Kernal Parms");
        

        t.mark("Pass Kernal Args");
        clSetKernelArg(clKernel, 0, Sizeof.cl_mem, Pointer.to(inputImageMem));
        clSetKernelArg(clKernel, 1, Sizeof.cl_mem, Pointer.to(kernelMem));
        clSetKernelArg(clKernel, 2, Sizeof.cl_mem, Pointer.to(outputImageMem));
        clSetKernelArg(clKernel, 3, Sizeof.cl_int2, Pointer.to(imageSize));
        clSetKernelArg(clKernel, 4, Sizeof.cl_int2, Pointer.to(kernelSize));
        clSetKernelArg(clKernel, 5, Sizeof.cl_int2, Pointer.to(kernelOrigin));
        t.tick("Pass Kernal Args");


        t.mark("Enque Kernal");
        clEnqueueNDRangeKernel(commandQueue, clKernel, 2, null, globalWorkSize, localWorkSize, 0, null, null);
        clEnqueueBarrier(commandQueue);
        t.tick("Enque Kernal");
        
        
        t.mark("Get Output Holder");
        // Read the pixel data into the
        // BufferedImage
        DataBufferInt dataBufferDst = (DataBufferInt) dst.getRaster()
                .getDataBuffer();
        int dataDst[] = dataBufferDst.getData();
        t.tick("Get Output Holder");
        

        t.mark("Read Output");
        clEnqueueReadBuffer(commandQueue, outputImageMem, CL_TRUE, 0, dataDst.length
                * Sizeof.cl_uint, Pointer.to(dataDst), 0, null, null);
        t.tick("Read Output");
        

        t.mark("Cleanup");
        // Clean up
        clReleaseMemObject(inputImageMem);
        clReleaseMemObject(outputImageMem);
        t.tick("Cleanup");


        t.tick("Total Time");
        
        t.printData(); // Prints timing data


When I ran the program I got the following times:
Total Time: 39.424 ms

Validity checks: 0.0 ms
Grabbing Src Buffer: 0.0 ms
Create Input CL Memory: 0.0 ms
Create OUT CL Memory: 0.0 ms
Get Kernel Params: 0.0 ms
Pass Kernel Args: 0.0 ms
Enqueue Kernel: 0.0 ms
Get Output Holder: 0.0 ms
Read Output: 38.400 ms
Cleanup: 0.512 ms

What I don’t understand is why the kernel is apparently only executed when I call clEnqueueReadBuffer, since that call takes up almost all of the processing time.

Even though I call clEnqueueBarrier(commandQueue) after the clEnqueueNDRangeKernel() command, which I would have assumed makes the program wait for the kernel to be run, is it then really the case that the program takes 38 ms just to stream the data back from the GPU? It’s just confusing me.

Any help on this would be much appreciated.
Regards
Joey

Hello,

I know that the OpenCL version is not always faster than the Java implementation. There are several aspects to consider:

I wanted the JOCL convolution to be a BufferedImageOp, mainly to test (or demonstrate) that such tasks can be replaced by their OpenCL equivalents without modifying much of the remaining code. Once you design around the processes and workflow of OpenCL, you may save some time. For example, it should be possible to pull much of the setup (createBuffer, setKernelArg and releaseMemObject) out of the „filter“ method, so that it only contains the kernel call, or maybe the writeBuffer-enqueueKernel-readBuffer sequence.
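Just to sketch the idea (this is not code from the sample, and the names are only placeholders): the one-time setup could be done outside of „filter“, and the method itself would only contain the per-image work, roughly like this:

    // Hypothetical sketch: create the memory objects and set the constant
    // kernel arguments once, e.g. when the ConvolveOp is created
    private void setupOnce(int imageSizeX, int imageSizeY)
    {
        inputImageMem = clCreateBuffer(context, CL_MEM_READ_ONLY,
                imageSizeX * imageSizeY * Sizeof.cl_uint, null, null);
        outputImageMem = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                imageSizeX * imageSizeY * Sizeof.cl_uint, null, null);

        clSetKernelArg(clKernel, 0, Sizeof.cl_mem, Pointer.to(inputImageMem));
        clSetKernelArg(clKernel, 1, Sizeof.cl_mem, Pointer.to(kernelMem));
        clSetKernelArg(clKernel, 2, Sizeof.cl_mem, Pointer.to(outputImageMem));
        // ... the remaining arguments and the work sizes would also be
        // computed once here and stored in fields
    }

    // The filter method would then only contain the
    // writeBuffer-enqueueKernel-readBuffer sequence
    public BufferedImage filter(BufferedImage src, BufferedImage dst)
    {
        int dataSrc[] = ((DataBufferInt) src.getRaster().getDataBuffer()).getData();
        int dataDst[] = ((DataBufferInt) dst.getRaster().getDataBuffer()).getData();

        clEnqueueWriteBuffer(commandQueue, inputImageMem, CL_TRUE, 0,
                dataSrc.length * Sizeof.cl_uint, Pointer.to(dataSrc), 0, null, null);
        clEnqueueNDRangeKernel(commandQueue, clKernel, 2, null,
                globalWorkSize, localWorkSize, 0, null, null);
        clEnqueueReadBuffer(commandQueue, outputImageMem, CL_TRUE, 0,
                dataDst.length * Sizeof.cl_uint, Pointer.to(dataDst), 0, null, null);
        return dst;
    }

Of course, this assumes that the image size does not change between calls, and the validity checks are omitted here.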

Another important point is that the actual convolution kernel is trivial, and thus far (!) from optimal. There are sophisticated mechanisms for optimizing this. One would be to use images instead of buffers: they are cached, and could dramatically speed up the whole thing. But for quite a while, images had not been supported by all OpenCL implementations. Recently, image support became part of the most important implementations, and I have already considered updating the convolution example to use images, but this has not yet reached the top of my TODO list :wink: Other strategies, like using local memory, could also be worth a try, not to mention the really sophisticated ones like a separable convolution or even an FFT-based approach built on the AMD OpenCL FFT library, which would be much faster for large kernels.
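Just to give an idea of the direction (untested, off the top of my head, and not the kernel from the sample): the kernel source for an image-based version could look roughly like this, e.g. stored as a String on the Java side and passed to clCreateProgramWithSource. The input and output images would have to be created with a float-compatible channel format (e.g. CL_RGBA with CL_UNORM_INT8) so that read_imagef/write_imagef apply:

    // Untested sketch of an image2d_t based convolution kernel (not the sample code)
    private static final String imageKernelSource =
        "__constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |\n" +
        "    CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;\n" +
        "__kernel void convolveImage(\n" +
        "    __read_only image2d_t input, __constant float *filterKernel,\n" +
        "    __write_only image2d_t output, int2 kernelSize, int2 kernelOrigin)\n" +
        "{\n" +
        "    int2 pos = (int2)(get_global_id(0), get_global_id(1));\n" +
        "    float4 sum = (float4)(0.0f);\n" +
        "    for (int ky = 0; ky < kernelSize.y; ky++)\n" +
        "    {\n" +
        "        for (int kx = 0; kx < kernelSize.x; kx++)\n" +
        "        {\n" +
        "            int2 p = pos + (int2)(kx, ky) - kernelOrigin;\n" +
        "            sum += read_imagef(input, sampler, p) *\n" +
        "                filterKernel[ky * kernelSize.x + kx];\n" +
        "        }\n" +
        "    }\n" +
        "    write_imagef(output, pos, sum);\n" +
        "}\n";

The point is that image reads are cached, so neighbouring work-items that read the same pixels again become much cheaper than with a plain __global buffer.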

By the way, the timer you are using may not be very precise. The duration of most operations may be small, maybe only a few microseconds, but not 0.0. Are you using System.nanoTime() internally?
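I don’t know how your EventTimer works internally, but a minimal nanoTime-based version with the mark/tick usage you described might look like this (just a sketch, the names simply mirror the ones you mentioned):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal sketch of a nanoTime-based timer with mark/tick/printData
    class EventTimer
    {
        private final Map<String, Long> startTimes = new LinkedHashMap<String, Long>();
        private final Map<String, Long> durations = new LinkedHashMap<String, Long>();

        void mark(String id)
        {
            startTimes.put(id, System.nanoTime());
        }

        void tick(String id)
        {
            durations.put(id, System.nanoTime() - startTimes.get(id));
        }

        void printData()
        {
            for (Map.Entry<String, Long> e : durations.entrySet())
            {
                System.out.printf("%-25s: %10.3f ms%n", e.getKey(), e.getValue() / 1e6);
            }
        }
    }

With something like this, the „0.0 ms“ entries should at least show up as a few microseconds.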

Finally, clEnqueueBarrier does not make the host wait for the kernel to be completed. It just acts as a barrier inside the command queue, and I think it really only makes sense when the out-of-order execution mode is enabled. To wait for the enqueued commands to be completed, use
clFinish(commandQueue);
instead.
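Applied to your timing code, that would be roughly:

        t.mark("Enqueue Kernel");
        clEnqueueNDRangeKernel(commandQueue, clKernel, 2, null,
                globalWorkSize, localWorkSize, 0, null, null);
        clFinish(commandQueue); // block until all enqueued commands have completed
        t.tick("Enqueue Kernel");

Then the actual kernel execution time should show up in the „Enqueue Kernel“ entry, and „Read Output“ should mainly contain the transfer time.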

I’ll try to allocate some time for updating the example to use images and see how fast it can get with reasonable effort, but I cannot give a precise date for when this will be possible.

bye
Marco

Hi,
I just started playing around with JOCL and ran the JOCLSample.java (found on your website) once with the context configured for the GPU and a second time configured for the CPU. The result is that the CPU configuration is consistently twice as fast as the GPU one. I’d like to see an example that shows a clear advantage of the GPU over the CPU.
When I run JOCLDeviceQuery.java, it shows for the GeForce device:
CL_DEVICE_MAX_COMPUTE_UNITS: 4
Is this the number of cores available in the graphics card for processing? I found on NVIDIA’s website that this card has 32 stream processors. Is that normal? Can I configure the OpenCL driver to use more processors? This may be the reason it is slower.
I use a MacBook Pro with a 64-bit Intel Core 2 Duo, and my graphics card is a GeForce 8600M GT.
Thanks
Javi

Hello

One „compute unit“ has 8 „streaming processors“. The naming may be confusing; this may have something to do either with marketing, or with hardware details that I’m not familiar with.

Concerning the speed compared to the CPU: first of all, how are you measuring the speed? This can be tricky, depending on the setup.

In any case, the CPU is a pretty fast one, and the GPU is, as far as I know, a relatively slow one (a notebook graphics card is usually not one of these monsters … you know, the kind where the cooling fan needs its own power supply unit…). So the speedup may not be so dramatic in this case.

I’d like to see an example that shows clear advantage of the GPU over the CPU

The simple solution: Increase the size of the buffers in the example from 10 to 100000, and replace the line
c[gid] = a[gid] * b[gid]
with a line like
c[gid] = sin(cos(tan(a[gid]))) * sin(cos(tan(b[gid])));
:wink:
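In the sample, the kernel is just a String in the Java code, so the modification would look roughly like this (from memory, not a verbatim copy of the sample source):

    // Roughly the modified kernel source: the same element-wise kernel,
    // but with more expensive work per element
    private static String programSource =
        "__kernel void sampleKernel(__global const float *a,\n" +
        "                           __global const float *b,\n" +
        "                           __global float *c)\n" +
        "{\n" +
        "    int gid = get_global_id(0);\n" +
        "    c[gid] = sin(cos(tan(a[gid]))) * sin(cos(tan(b[gid])));\n" +
        "}\n";

    // ... and on the Java side, something like
    int n = 100000; // instead of 10

With a tiny buffer of 10 elements and a single multiplication per element, the time is completely dominated by setup and transfer overhead, so the CPU will always win.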

But seriously: OpenCL is not primarily about „taking the same code and executing it on the GPU because it is faster“. OpenCL offers the facilities for heterogeneous computing, and primarily data-parallel computing. It’s a different way of formulating problems and designing algorithms (and I’m also just a beginner, so don’t expect any reference solutions from me). But, for example, in the JOCLSimpleMandelbrot example it might be possible to see a speedup, because it is an „embarrassingly parallel“ problem and should fit nicely on the GPU.

bye