Explanation worksize



In one of the sample files, JOCLSimpleConvolution.java, starting at line 559, several different work sizes are mentioned: globalWorkSize and localWorkSize, plus a round function that is used for one of them.
I’m really confused about what this part of the code does in the program. I’m VERY new to this subject and I’m trying to do this for a school project, but that part of the code has been bugging me ever since I started.

Could someone explain this part of the code to me so I can get a good understanding of how it works?

Thx in advance!



Maybe the short answer first: In this example, the local work size and rounding of the global work size are not really necessary. You can set the localWorkSize to ‘null’, and as the global work size, simply use the image size:

clEnqueueNDRangeKernel(commandQueue, clKernel, 2, null, 
            new long[]{imageSizeX, imageSizeY}, null, 0, null, null);

The reason why the local work size is explicitly set to be the ‘convolution kernel’ size is that I experimented a little with local memory when I created this sample. I was playing around with different possible implementations of the convolution OpenCL kernel. The one that found its way onto the website is the simplest one (that’s why it does not really need a specification of the local work size). Other, more sophisticated kernels may need it.

I’m also not an expert at OpenCL/GPU programming, and some aspects are hard to grasp. Admittedly, the real details concerning warp sizes, half-warps and all that, as explained in the NVIDIA whitepapers (mainly those referring to CUDA), are bugging me as well - I think that, to some extent, this should be “hidden” by OpenCL. Otherwise you always risk programming/optimizing your kernel for a specific GPU architecture.

However, I’ll try a simple explanation of what is done in the current convolution example:

The local work size is currently set to be the same as the size of the convolution kernel. A more sophisticated implementation of the OpenCL kernel could exploit this information and load a small block of the image into local memory, namely a block that has the size of the convolution kernel. Once the data is in local memory, it should be possible to perform the convolution more efficiently.

The “globalWorkSize” is usually the size of the problem itself - for example, the size of the image. If the image has a size of 342 * 129 pixels, then the globalWorkSize could be [342, 129]. BUT there is a constraint for the global work size: when a local work size is given explicitly, the global work size has to be a multiple of the localWorkSize in each dimension (this is what OpenCL 1.x requires). Therefore, in the sample, the ‘round’ function is used to make sure that the global work size is at least as large as the image, but still a multiple of the local work size.
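
To make the rounding a bit more concrete, here is a minimal sketch of such a helper. The class name, the method name roundUp and the local work size of 16 are just assumptions for illustration; they are not copied from the sample, only the image size of 342 * 129 is taken from the example above.

```java
// Hypothetical helper, for illustration only (not the code from the sample).
class WorkSizeRounding {

    // Rounds globalSize up to the nearest multiple of localSize.
    // If globalSize already is a multiple, it is returned unchanged.
    static long roundUp(long localSize, long globalSize) {
        long remainder = globalSize % localSize;
        if (remainder == 0) {
            return globalSize;
        }
        return globalSize + localSize - remainder;
    }

    public static void main(String[] args) {
        long imageSizeX = 342; // image size from the example in the text
        long imageSizeY = 129;
        long localSizeX = 16;  // assumed local work size
        long localSizeY = 16;

        long[] localWorkSize = { localSizeX, localSizeY };
        long[] globalWorkSize = {
            roundUp(localSizeX, imageSizeX),
            roundUp(localSizeY, imageSizeY)
        };

        // Both arrays could then be passed to clEnqueueNDRangeKernel.
        System.out.println("local:  " + localWorkSize[0] + " x " + localWorkSize[1]);
        System.out.println("global: " + globalWorkSize[0] + " x " + globalWorkSize[1]);
    }
}
```

With these assumed values the global work size becomes 352 x 144. Note that this is larger than the image, so some work-items get coordinates outside of the image. A kernel that is launched this way usually has to compare its global ID with the image size and skip those pixels.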

If you are going to create a school project, please keep in mind that this is only a sample. Depending on what your goal is in this project, other approaches for a convolution may be more appropriate.

You might want to have a look at tutorials like Image Convolution Using OpenCL™ - A Step-by-Step Tutorial from AMD, do a web search for “separable convolution”, and look at the CUDA/OpenCL implementations of convolutions that come with the OpenCL SDKs.