As the topic says, I'm trying to learn how to synchronize my threads, but it doesn't seem to work.
The test subject is a counting sort kernel. I don't want to use a secondary array for the output; I just want to suspend every thread after counting and, once every thread is done, write the output to the original array.
```
{
    int i = get_global_id(0);
    int n = get_global_size(0);
    uint iKey = in[i];
    int pos = 0;
    for (int j = 0; j < n; j++)
    {
        uint jKey = in[j];
        bool smaller = (jKey < iKey) || (jKey == iKey && j < i);
        pos += (smaller) ? 1 : 0;
    }
    barrier(CLK_GLOBAL_MEM_FENCE); // suspend here until all threads get here after counting
    //out[pos] = iKey; // normally I would write it to the output array
    in[pos] = iKey; // write to original array
}
```
Is there something I need to do about this on the host side, e.g. through the command queue API?
EDIT:
I found that barrier() only synchronizes the work-items within a single work-group, and the work-group size is limited and probably device/architecture-specific...
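
To make that finding concrete, here is a minimal sketch in plain sequential C (not OpenCL C) of what the barrier guarantees when the whole input fits in one work-group. The names `sort_one_group`, `MAX_ITEMS`, and the `key`/`pos` arrays are hypothetical: the arrays stand in for each work-item's private `iKey` and `pos`, and 256 is only an assumed limit, since the real one depends on the device.

```c
#include <stddef.h>

/* Assumed work-group size limit; the real value is device-specific. */
#define MAX_ITEMS 256

/* Sequential model of ONE work-group running the kernel above.
 * Returns 0 on success, -1 if n would need more than one work-group. */
int sort_one_group(unsigned int *in, size_t n)
{
    unsigned int key[MAX_ITEMS]; /* models each work-item's private iKey */
    size_t pos[MAX_ITEMS];       /* models each work-item's private pos  */

    if (n > MAX_ITEMS)
        return -1; /* several work-groups: barrier() no longer covers all items */

    /* Phase 1: everything before barrier(). in[] is only read here. */
    for (size_t i = 0; i < n; i++) {
        key[i] = in[i];
        pos[i] = 0;
        for (size_t j = 0; j < n; j++) {
            int smaller = (in[j] < key[i]) || (in[j] == key[i] && j < i);
            pos[i] += smaller ? 1 : 0;
        }
    }

    /* barrier(): every work-item of the group has finished phase 1. */

    /* Phase 2: everything after barrier(). in[] is only written here. */
    for (size_t i = 0; i < n; i++)
        in[pos[i]] = key[i];

    return 0;
}
```

Inside one work-group the two phases never overlap, so writing back into `in` is safe. With several work-groups there is no such boundary between groups: one group can already be in phase 2, overwriting `in`, while another is still reading it in phase 1.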