BUG in JOCL?

Hi all,

Here is a strange behaviour:

    - working with XXX < 16 but not over (17, 18,..)
    - locaTab big enough
    - blockSize = 64

tab[get_local_id(0)] = 9;

if (get_local_id(0) < XXX)
localTab[get_local_id(0)] = localTab[get_local_id(0) + XXX];

barrier(CLK_LOCAL_MEM_FENCE);

a = localTab[get_local_id(0)];

To my mind, the only explanation is barrier(CLK_LOCAL_MEM_FENCE) don’t block all block threads.
With XXX = 16, warp implicit barrier ensure good behaviour…

Am I right?

any ideas??

with some corrections

  • localTab is declared in shared memory

localTab[get_local_id(0)] = 9;

if (get_local_id(0) < XXX)
localTab[get_local_id(0)] = localTab[get_local_id(0) + XXX];

barrier(CLK_LOCAL_MEM_FENCE);

a = localTab[get_local_id(0)];

Sorry, I did not really get the point… What exactly is your question? For a line like
localTab[get_local_id(0)] = localTab[get_local_id(0) + XXX];
where does the value in ‘localTab’ at index ‘get_local_id(0) + XXX’ come from?

Hi marco

So imagine all values in array are set to 9. Thread Ids start from 0.
What I highlight in this example is when XXX = 17, last thread in if condition (thread 16) read value at index 33
while thread 33 try to read it at the same time (barrier has no effect).

Finally, with XXX = 16, I can read a for each thread, while with XXX = 17, value a for thread 16 is corrupted

Hi

I might have understood what you mean, although I’m not entirely sure.

To my understanding, the code should be valid, and I can not think of any reason why the value ‘a’ should be corrupted: The barrier can only passed by any work-item when all others have reached the barrier, so all values in the localTab should be filled properly. IF it really is corrupted, and IF it really happens exactly when XXX>16, then it sounds like it might be related to the warp size, but of course, in OpenCL, you should not rely on something like that. The question whether this is related to the warp size goes into (possibly platform specific) details, and you might consider asking about this in the NVIDIA- or AMD forum.

In any case, since this seems to be related solely to the kernel code and the barrier behavior, I think it should not be a bug in JOCL. JOCL only passes the calls to the underlying OpenCL implementation, and does not touch any internals of the kernel.

However, I’d be interested in a small (preferably compileable and easily testable) example which could be used to reproduce this error. If you cannot provide one, I might try to build one, but I’m not sure when I will have the time.

bye
Marco

Thanks Marco for your quick answer…

It is really easy to reproduce it…

Just copy past it in one of your existing project…
You need a groupSize over warpSize (for instance 64 if NVIDIA GPU)
You need localArray (for instance length of 64) in shared memory


localArray[get_local_id(0)] = 9;

//16 OK 17 KO
if (get_local_id(0) < 17)
localArray[get_local_id(0)] = localArray[get_local_id(0) + 17];

barrier(CLK_LOCAL_MEM_FENCE);

int aReg = localArray[get_local_id(0)];

I can try to reproduce it later. A assume that ‘aReg’ is then written to global memory which is the copied back in order to verify its value?

Yes (or directly)

globalOutput[get_global_id(0)] = localArray[get_local_id(0)];

From a first test, it seems to work, but presumably that does not mean so much: It also worked without any barrier… As usual, any concurrency bugs are hard or impossible to reproduce, but in any case, I think with the barrier it should be formally correct (don’t pin me down to this, it’s only due to my understanding until now).

However, here’s the test case that I used, on a good old GeForce 8800. Maybe you can test it, spot potential differences or modify it so that the error occurs again

import static org.jocl.CL.*;

import java.util.Arrays;

import org.jocl.*;

public class LocalMemoryTest
{
    private static final String programSource =
        "__kernel void test(" +"
"+
        "     __global int *globalArray," +"
"+
        "    __local int *localArray)"+"
"+
        "{"+"
"+
        "    localArray[get_local_id(0)] = get_local_id(0);"+"
"+
        ""+"
"+
        "    barrier(CLK_LOCAL_MEM_FENCE);"+"
"+
        ""+"
"+
        "    int offset = 17; //16 OK 17 KO"+"
"+
        "    if (get_local_id(0) < offset)"+"
"+
        "        localArray[get_local_id(0)] = localArray[get_local_id(0) + offset];"+"
"+
        ""+"
"+
        "    barrier(CLK_LOCAL_MEM_FENCE);"+"
"+
        ""+"
"+
        "    int aReg = localArray[get_local_id(0)];"+"
"+        
        "    globalArray[get_local_id(0)] = aReg;"+"
"+
        "}";
    private static final String kernelName = "test";

    private static cl_context context;
    private static cl_command_queue commandQueue;
    private static cl_program program;
    private static cl_kernel kernel;
    
    public static void main(String args[])
    {
        defaultInitialization();
        
        int n = 64;
        int array[] = new int[n];
        cl_mem mem = clCreateBuffer(context, CL_MEM_USE_HOST_PTR, 
            n * Sizeof.cl_int, Pointer.to(array), null);
        
        int a = 0;
        clSetKernelArg(kernel, a++, Sizeof.cl_mem, Pointer.to(mem));
        clSetKernelArg(kernel, a++, n * Sizeof.cl_int, null);
        
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, 
            new long[]{n}, new long[]{n}, 0, null, null);
        
        clEnqueueReadBuffer(commandQueue, mem, CL_TRUE, 0, 
            n * Sizeof.cl_int, Pointer.to(array), 0, null, null);

        System.out.println("Result "+Arrays.toString(array));
    }
    
    private static void defaultInitialization()
    {
        // The platform, device type and device number
        // that will be used
        final int platformIndex = 0;
        final long deviceType = CL_DEVICE_TYPE_ALL;
        final int deviceIndex = 0;

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Obtain the number of platforms
        int numPlatformsArray[] = new int[1];
        clGetPlatformIDs(0, null, numPlatformsArray);
        int numPlatforms = numPlatformsArray[0];

        // Obtain a platform ID
        cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_platform_id platform = platforms[platformIndex];

        // Initialize the context properties
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);
        
        // Obtain the number of devices for the platform
        int numDevicesArray[] = new int[1];
        clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
        int numDevices = numDevicesArray[0];
        
        // Obtain a device ID 
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
        cl_device_id device = devices[deviceIndex];

        // Create a context for the selected device
        context = clCreateContext(
            contextProperties, 1, new cl_device_id[]{device}, 
            null, null, null);
        
        // Create a command-queue for the selected device
        commandQueue = 
            clCreateCommandQueue(context, device, 0, null);
        
        // Create the program
        cl_program program = clCreateProgramWithSource(context, 
            1, new String[]{ programSource }, null, null);
        
        // Build the program
        clBuildProgram(program, 0, null, null, null, null);
        
        // Create the kernel
        kernel = clCreateKernel(program, kernelName, null);
    }
    
}

SHAME ON ME

I forgotten your first barrier (but It was in my initial code)

Before closing the post, I check my code…

MANY thanks Marco

All work fine

Thank you marco