OpenCL loops in kernel

system · April 5, 2014 at 05:24

My kernel function is in the code block below. It essentially calculates which point is the farthest from all clusters; the results are saved in lengths[3] (the id of the point) and output[0] (the distance from the cluster it belongs to). The while loop does a simple sum reduction. I know this is not the best way to do it, but I need to understand why the code works properly with one cluster, whereas with two or more clusters it returns wrong values.

__kernel void computeDistances(__global t_cluster *points, __global t_cluster *clusters, __global float *output, __global t_cluster *support, __global short *lengths)
{
    int threadId = get_global_id(0);
    float bestVal = 0;
    int counter, offset;

    short idPoint, idCluster;
    for(idPoint = 0; idPoint < lengths[0]; idPoint++)
    {

        for(idCluster = 0; idCluster < lengths[2]; idCluster++)
        {     
            support[0].attributes[threadId] = pow( (points[idPoint].attributes[threadId] - clusters[idCluster].attributes[threadId]) , 2 );

            counter = SIZE;
            offset = 1;

            while(counter != 1)
            {
                counter = counter / 2 + (counter % 2);

                barrier(CLK_GLOBAL_MEM_FENCE);

                if(threadId % (2*offset) == 0)
                    if(threadId + offset < lengths[1])
                        support[0].attributes[threadId] = support[0].attributes[threadId] + support[0].attributes[threadId+offset];

                offset = offset * 2;
            }

            barrier(CLK_GLOBAL_MEM_FENCE);

            if(support[0].attributes[threadId] > bestVal)
                bestVal = support[0].attributes[threadId];
        }

        barrier(CLK_GLOBAL_MEM_FENCE);

        if(threadId == 0 && bestVal > output[threadId])
        {
            output[0] = bestVal;
            lengths[3] = idPoint;
        }
    }
}

Marco13 · April 5, 2014 at 08:40

Hi

It’s hard to tell anything just by looking over the code, without knowing the structures and the input data. It seems like you have tried to perform a “parallel reduction” on arrays that are fields of structs that are themselves elements of an array. (This looks suspicious to me, and IF this works at all, I’m pretty sure that the global memory fences in there will hinder any performance gains, but I’m not an expert and have not analyzed the kernel in detail, so I may be wrong here.) I also wonder how you fill these input arrays in the first place.

However, a first debugging step could be to replace this reduction with a simple for-loop, which should be easier to get right. Maybe I can test this on Monday/Tuesday, but I can’t promise.
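For anyone following along, a minimal sketch of that debugging step in plain C (host side, not OpenCL C) might look like the following. It mirrors the two loops of the posted kernel but sums the squared differences with an ordinary for-loop, so the result can be compared against what the kernel writes to output[0] and lengths[3]. The `t_cluster` layout and the value of `SIZE` are assumptions, since the thread never shows them, and `squared_distance` and `farthest_point` are hypothetical names.

```c
#define SIZE 4  /* assumed number of attributes per element */

/* Assumed layout: the thread never shows the definition of t_cluster. */
typedef struct {
    float attributes[SIZE];
} t_cluster;

/* Squared Euclidean distance, summed with a plain loop instead of a
 * parallel reduction. */
float squared_distance(const t_cluster *a, const t_cluster *b)
{
    float sum = 0.0f;
    for (int i = 0; i < SIZE; i++) {
        float d = a->attributes[i] - b->attributes[i];
        sum += d * d;
    }
    return sum;
}

/* Walks every (point, cluster) pair like the posted kernel does and keeps
 * the largest squared distance seen. Returns the index of the point that
 * produced it, or -1 if no distance was greater than zero. */
int farthest_point(const t_cluster *points, int numPoints,
                   const t_cluster *clusters, int numClusters,
                   float *bestOut)
{
    float best = 0.0f;
    int bestPoint = -1;
    for (int p = 0; p < numPoints; p++) {
        for (int c = 0; c < numClusters; c++) {
            float d = squared_distance(&points[p], &clusters[c]);
            if (d > best) {
                best = d;
                bestPoint = p;
            }
        }
    }
    *bestOut = best;
    return bestPoint;
}
```

Running this on the same input as the kernel gives a reference value, which makes it easier to see whether the reduction or the final comparison is the part that goes wrong.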

bye
Marco

(EDIT: Just as a cross-reference: http://stackoverflow.com/questions/22880649/opencl-loops-in-kernel )

system · May 22, 2014 at 07:17

Sorry, I forgot I posted something here!!

Now, I would like to share my experience! The problem here (and in some of my other kernels) came down to the scalar processors. My GPU has 8 SPs, so once I went beyond that, only 8 workgroups could run at a time.
This means that 8 workgroups ran the whole kernel and finished, all modifying the same address in global memory. My idea in this kernel was that, thanks to the last barrier, each workgroup would compare its value and write the correct result. That is a mistake because, during the comparison phase, the other threads have not run yet and so do not have the right value.

I hope that is clear.
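To make the point about barriers concrete: `barrier()` in OpenCL only synchronizes the work-items of one work-group, so it cannot order work-groups against each other. One common way around this (not necessarily what was done here) is to let each work-group write its own partial result to a separate slot and combine the slots afterwards on the host. A minimal sketch of that combining step in plain C, where `t_partial` and `combine_partials` are hypothetical names that do not appear in the original kernel:

```c
/* Hypothetical per-work-group result: each work-group would write one of
 * these to its own slot instead of all groups sharing output[0]. */
typedef struct {
    float bestVal;  /* largest distance found by this work-group */
    short idPoint;  /* id of the point that produced it */
} t_partial;

/* Combines the per-work-group results once the kernel has finished.
 * Returns 1 and fills *result if numGroups > 0, otherwise returns 0. */
int combine_partials(const t_partial *partials, int numGroups,
                     t_partial *result)
{
    if (numGroups <= 0)
        return 0;

    *result = partials[0];
    for (int g = 1; g < numGroups; g++) {
        if (partials[g].bestVal > result->bestVal)
            *result = partials[g];
    }
    return 1;
}
```

Because the combining step only starts after the kernel has completed, every work-group has already written its slot, and no cross-group synchronization is needed inside the kernel.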

Marco13 · May 22, 2014 at 08:38

Although I have to admit that I’m still not sure whether I understood your actual problem, maybe someone who has the same problem will stumble upon this (and the Stack Overflow question) and find some valuable information.

manuel · May 22, 2014 at 08:42

Maybe I should have been clearer. Anyway, that is how I resolved it.