OpenCL loops in kernel

My kernel function is in the code block below. It essentially calculates which point is the farthest from all clusters. The results are saved in lengths[3] (the id of the point) and output[0] (the distance from the cluster it belongs to). The while loop does a simple sum reduction. I know this is not the best way to do it, but I need to understand why the code works properly with one cluster, whereas with two or more clusters it returns wrong values.

__kernel void computeDistances(__global t_cluster *points, __global t_cluster *clusters, __global float *output, __global t_cluster *support, __global short *lengths)
{
    int threadId = get_global_id(0);
    float bestVal = 0;
    int counter, offset;

    short idPoint, idCluster;
    for(idPoint = 0; idPoint < lengths[0]; idPoint++)
    {
        for(idCluster = 0; idCluster < lengths[2]; idCluster++)
        {
            support[0].attributes[threadId] = pow( (points[idPoint].attributes[threadId] - clusters[idCluster].attributes[threadId]) , 2 );

            counter = SIZE;
            offset = 1;

            while(counter != 1)
            {
                counter = counter / 2 + (counter % 2);

                if(threadId % (2*offset) == 0)
                {
                    if(threadId + offset < lengths[1])
                    {
                        support[0].attributes[threadId] = support[0].attributes[threadId] + support[0].attributes[threadId+offset];
                    }
                }

                offset = offset * 2;
            }

            if(support[0].attributes[threadId] > bestVal)
            {
                bestVal = support[0].attributes[threadId];
            }
        }
    }

    if(threadId == 0 && bestVal > output[threadId])
    {
        output[0] = bestVal;
        lengths[3] = idPoint;
    }
}


It’s hard to tell anything just by looking over the code, without knowing the structures and the input data. It seems like you have tried to perform a “parallel reduction” on arrays that are fields of structs that are themselves elements of an array. (This looks suspicious to me, and IF it works at all, I’m pretty sure that the global memory fences in there will cancel out any performance gains, but I’m not an expert and have not analyzed the kernel in detail, so I may be wrong here.) I also wonder how you fill these input arrays in the first place.

However, a first debugging step could be to replace this reduction with a simple for-loop, which should be easier to get right. Maybe I can test this on Monday/Tuesday, but I can’t promise.
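To make the for-loop suggestion concrete, here is a minimal sketch of a serial reference in plain C that the kernel output could be compared against on the host. It follows what the kernel appears to compute (the largest squared distance over all point/cluster pairs). The t_cluster struct is not shown in the question, so the flat float arrays and the names squared_distance and farthest_point are assumptions made only for illustration.

```c
#include <stddef.h>

/* Assumed flat layout: row i holds the n_attrs attributes of item i.
   The original t_cluster struct is not shown, so this is hypothetical. */
static float squared_distance(const float *a, const float *b, size_t n_attrs)
{
    float sum = 0.0f;
    for (size_t k = 0; k < n_attrs; k++) {  /* plain loop instead of the tree reduction */
        float d = a[k] - b[k];
        sum += d * d;
    }
    return sum;
}

/* Returns the largest squared point-to-cluster distance over all pairs and
   stores the index of the point that produced it in *best_point. */
float farthest_point(const float *points, size_t n_points,
                     const float *clusters, size_t n_clusters,
                     size_t n_attrs, size_t *best_point)
{
    float best = 0.0f;
    *best_point = 0;
    for (size_t p = 0; p < n_points; p++) {
        for (size_t c = 0; c < n_clusters; c++) {
            float d = squared_distance(points + p * n_attrs,
                                       clusters + c * n_attrs, n_attrs);
            if (d > best) {
                best = d;
                *best_point = p;  /* remember the index when the maximum is found */
            }
        }
    }
    return best;
}
```

Running this on the same input as the kernel, first with one cluster and then with two, should show whether the reduction or the final comparison is the part that goes wrong.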


(EDIT: Just as a cross-reference: )

Sorry, I forgot I posted something here!!

Now I would like to share my experience. The problem here (and in other kernels as well) came down to the scalar processors. My GPU has 8 SPs, so only 8 workgroups could run at the same time.
This means that 8 workgroups ran the whole kernel and finished by modifying a single address in global memory. My idea in this kernel was that, after the last barrier, each workgroup would compare its value and write the correct result. That is a mistake because, during the comparison phase, the other threads have not run yet and so do not have the right value.
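For anyone hitting the same thing: a barrier in OpenCL only synchronizes the work-items inside one workgroup, so it cannot order the final comparison across workgroups. One common way around this (not necessarily what was done here) is to let each workgroup write its own partial result to a separate slot and combine the slots in a second pass once the kernel has finished. A minimal host-side sketch in C, where partial_result and combine_partials are hypothetical names that do not appear in the original code:

```c
#include <stddef.h>

/* One partial result per workgroup (hypothetical type for illustration). */
typedef struct {
    float best_val;    /* largest distance seen by this workgroup */
    int   best_point;  /* index of the point that produced it */
} partial_result;

/* Final pass, run only after every workgroup has finished:
   pick the overall maximum among the per-workgroup results.
   Returns best_point == -1 if no partial has a value above zero. */
partial_result combine_partials(const partial_result *partials, size_t n_groups)
{
    partial_result best = { 0.0f, -1 };
    for (size_t g = 0; g < n_groups; g++) {
        if (partials[g].best_val > best.best_val) {
            best = partials[g];
        }
    }
    return best;
}
```

Because every workgroup writes to its own slot, no workgroup ever reads a value that another one has not produced yet.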

I hope that is clear.

Although I have to admit that I’m still not sure whether I understood your actual problem, maybe someone who has the same problem will stumble upon this (and the Stack Overflow question) and find some valuable information.

Maybe I should have been clearer. Anyway, this is how I resolved it.