Hi all, I’ve just started working with JOCL and I’m trying to get my head around the processing speed. I’ve looked at the JOCLSimpleConvolution sample and I’m trying to understand exactly what is happening and how the speed compares between Java and JOCL.
When I run the edge detection kernel on a 1024x512 image, the timings are as follows:
Java : 18.75 ms
JOCL : 40.59 ms
I know there is some overhead involved in the OpenCL calls, so I wanted to find out exactly where the extra time in the JOCL version comes from. To do this I looked inside the JOCLConvolveOp class. The OpenCL calls are made within the
filter(BufferedImage src, BufferedImage dst)
method. I added some timing to it using a simple event timer tool that I have (you call mark(id) to start a timer and tick(id) to stop it).
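The timer itself is nothing special. Here is a minimal sketch of what it does, assuming System.nanoTime() as the clock. The class below is a simplified, hypothetical stand-in rather than the exact tool I used, and getMillis(id) is an extra helper added only for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the timer tool described above.
class EventTimer {
    // Start timestamps (ns) of timers that are currently running
    private final Map<String, Long> startNanos = new LinkedHashMap<>();
    // Recorded durations (ns), kept in the order the timers were first started
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    /** Starts (or restarts) the timer with the given id. */
    public void mark(String id) {
        elapsedNanos.putIfAbsent(id, 0L); // reserve the print position
        startNanos.put(id, System.nanoTime());
    }

    /** Stops the timer with the given id and records the elapsed time. */
    public void tick(String id) {
        long now = System.nanoTime();
        Long start = startNanos.remove(id);
        if (start == null) {
            throw new IllegalStateException("Timer was never started: " + id);
        }
        elapsedNanos.put(id, now - start);
    }

    /** Returns the recorded duration of a timer in milliseconds. */
    public double getMillis(String id) {
        Long nanos = elapsedNanos.get(id);
        if (nanos == null) {
            throw new IllegalArgumentException("Unknown timer: " + id);
        }
        return nanos / 1_000_000.0;
    }

    /** Prints one line per timer, in the order the timers were started. */
    public void printData() {
        for (Map.Entry<String, Long> e : elapsedNanos.entrySet()) {
            System.out.printf("%s : %.3f ms%n",
                e.getKey(), e.getValue() / 1_000_000.0);
        }
    }

    public static void main(String[] args) {
        EventTimer t = new EventTimer();
        t.mark("Busy loop");
        long sum = 0;
        for (int i = 0; i < 10_000_000; i++) {
            sum += i;
        }
        t.tick("Busy loop");
        System.out.println("sum = " + sum);
        t.printData();
    }
}
```

Note that this only measures how long the calling Java thread spends between mark and tick; it says nothing about work that is still pending somewhere else when tick is called.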
The code I used is:
EventTimer t = new EventTimer();
t.mark("Total Time");
t.mark("Validity checks");
// Validity checks for the given images
if (src.getType() != BufferedImage.TYPE_INT_RGB)
{
    throw new IllegalArgumentException(
        "Source image is not TYPE_INT_RGB");
}
if (dst == null)
{
    dst = createCompatibleDestImage(src, null);
} else if (dst.getType() != BufferedImage.TYPE_INT_RGB)
{
    throw new IllegalArgumentException(
        "Destination image is not TYPE_INT_RGB");
}
if (src.getWidth() != dst.getWidth()
    || src.getHeight() != dst.getHeight())
{
    throw new IllegalArgumentException(
        "Images do not have the same size");
}
t.tick("Validity checks");
int imageSizeX = src.getWidth();
int imageSizeY = src.getHeight();
t.mark("Grabbing Src Buffer");
// Create the memory object for the input-
// and output image
DataBufferInt dataBufferSrc = (DataBufferInt) src.getRaster()
    .getDataBuffer();
int dataSrc[] = dataBufferSrc.getData();
t.tick("Grabbing Src Buffer");
t.mark("Create Input CL Memory");
inputImageMem = clCreateBuffer(context, CL_MEM_READ_ONLY
    | CL_MEM_USE_HOST_PTR, dataSrc.length * Sizeof.cl_uint, Pointer.to(dataSrc), null);
t.tick("Create Input CL Memory");
t.mark("Create OUT CL Memory");
outputImageMem = clCreateBuffer(context, CL_MEM_WRITE_ONLY, imageSizeX
    * imageSizeY * Sizeof.cl_uint, null, null);
t.tick("Create OUT CL Memory");
t.mark("Get Kernal Parms");
// Set work sizes and arguments, and
// execute the kernel
int kernelSizeX = kernel.getWidth();
int kernelSizeY = kernel.getHeight();
int kernelOriginX = kernel.getXOrigin();
int kernelOriginY = kernel.getYOrigin();
long localWorkSize[] = new long[2];
localWorkSize[0] = kernelSizeX;
localWorkSize[1] = kernelSizeY;
long globalWorkSize[] = new long[2];
globalWorkSize[0] = round(localWorkSize[0], imageSizeX);
globalWorkSize[1] = round(localWorkSize[1], imageSizeY);
int imageSize[] = new int[] { imageSizeX, imageSizeY };
int kernelSize[] = new int[] { kernelSizeX, kernelSizeY };
int kernelOrigin[] = new int[] { kernelOriginX, kernelOriginY };
t.tick("Get Kernal Parms");
t.mark("Pass Kernal Args");
clSetKernelArg(clKernel, 0, Sizeof.cl_mem, Pointer.to(inputImageMem));
clSetKernelArg(clKernel, 1, Sizeof.cl_mem, Pointer.to(kernelMem));
clSetKernelArg(clKernel, 2, Sizeof.cl_mem, Pointer.to(outputImageMem));
clSetKernelArg(clKernel, 3, Sizeof.cl_int2, Pointer.to(imageSize));
clSetKernelArg(clKernel, 4, Sizeof.cl_int2, Pointer.to(kernelSize));
clSetKernelArg(clKernel, 5, Sizeof.cl_int2, Pointer.to(kernelOrigin));
t.tick("Pass Kernal Args");
t.mark("Enque Kernal");
clEnqueueNDRangeKernel(commandQueue, clKernel, 2, null, globalWorkSize, localWorkSize, 0, null, null);
clEnqueueBarrier(commandQueue);
t.tick("Enque Kernal");
t.mark("Get Output Holder");
// Read the pixel data into the
// BufferedImage
DataBufferInt dataBufferDst = (DataBufferInt) dst.getRaster()
    .getDataBuffer();
int dataDst[] = dataBufferDst.getData();
t.tick("Get Output Holder");
t.mark("Read Output");
clEnqueueReadBuffer(commandQueue, outputImageMem, CL_TRUE, 0, dataDst.length
    * Sizeof.cl_uint, Pointer.to(dataDst), 0, null, null);
t.tick("Read Output");
t.mark("Cleanup");
// Clean up
clReleaseMemObject(inputImageMem);
clReleaseMemObject(outputImageMem);
t.tick("Cleanup");
t.tick("Total Time");
t.printData(); // Prints the timing data
When I ran the program I got the following times:
Total Time             : 39.424 ms
Validity checks        : 0.0 ms
Grabbing Src Buffer    : 0.0 ms
Create Input CL Memory : 0.0 ms
Create OUT CL Memory   : 0.0 ms
Get Kernal Parms       : 0.0 ms
Pass Kernal Args       : 0.0 ms
Enque Kernal           : 0.0 ms
Get Output Holder      : 0.0 ms
Read Output            : 38.400 ms
Cleanup                : 0.512 ms
What I don’t understand is why the kernel only appears to run when I call clEnqueueReadBuffer, since that call accounts for almost all of the processing time.
I call clEnqueueBarrier(commandQueue) right after clEnqueueNDRangeKernel(), which I assumed would make the program wait until the kernel has finished, yet the "Enque Kernal" step shows 0 ms. Or is the program really taking 38 ms just to stream the data back from the GPU? It’s just confusing me.
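My current guess, which I’d like confirmed or corrected, is that the enqueue calls only queue the work and return straight away, so the waiting shows up in the first call that actually blocks. Here is a plain-Java sketch of that effect with no OpenCL involved. The single-thread ExecutorService is only a stand-in for the command queue, the 40 ms sleep is a made-up stand-in for the kernel, and the class and method names are hypothetical:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java analogy only: nothing here calls OpenCL or JOCL.
class AsyncQueueTimingDemo {

    /** Returns { millis spent submitting, millis spent in the blocking get }. */
    static double[] measure(long workMillis) {
        ExecutorService queue = Executors.newSingleThreadExecutor();
        try {
            long t0 = System.nanoTime();
            // "Enqueue": hands the work to another thread and returns at once
            Future<int[]> result = queue.submit(() -> {
                Thread.sleep(workMillis); // stand-in for the kernel run
                return new int[] { 1, 2, 3 };
            });
            long t1 = System.nanoTime();
            // "Blocking read": has to wait until the queued work is done
            result.get();
            long t2 = System.nanoTime();
            return new double[] { (t1 - t0) / 1e6, (t2 - t1) / 1e6 };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            queue.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] times = measure(40);
        System.out.printf("Enqueue       : %.3f ms%n", times[0]);
        System.out.printf("Blocking read : %.3f ms%n", times[1]);
    }
}
```

If JOCL behaves the same way, the 38 ms under "Read Output" would be the kernel execution plus the transfer, not the transfer alone. Timing a clFinish(commandQueue) call placed between the kernel enqueue and the read should separate the two, if I understand the API correctly.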
Any help on this would be much appreciated.
Regards
Joey