How to use barrier, mem fence or other sync methods?

Just as the topic says, I’m trying to learn how to synchronize my threads, but it doesn’t seem to work.

The test subject is a counting sort kernel. I don’t want to use a secondary array for the output; I just want to suspend every thread after counting and, once every thread is done with it, write the output to the original array.

```
{
  int i = get_global_id(0);
  int n = get_global_size(0);
  uint iKey = in[i];
  int pos = 0;
  for (int j=0;j<n;j++)
  {
    uint jKey = in[j];
    bool smaller = (jKey < iKey) || (jKey == iKey && j < i);
    pos += (smaller)?1:0;
  }
  barrier(CLK_GLOBAL_MEM_FENCE); //suspend here until all threads get here after counting
  //out[pos] = iKey; //normally I would write it to the output array
  in[pos] = iKey; //write to original array
}
```

Is there anything I should do about this outside the kernel, in the queue API or something?

EDIT:
I found that barrier only synchronizes work-items within a single work-group, yet the work-group size is limited and probably device/architecture specific...

Right, I think there is no method for synchronizing between work-groups, so this has to be solved differently. (I can’t give any hints off the top of my head, but I could have a closer look at this if necessary.)

I think I know now why it isn’t possible. As the ‘best practices guide’ from NVIDIA says, the GPU will hold up to 768/1024 live threads, so that probably defines the group size. So if a group is so big that two of them won’t fit within that limit, the groups will be run sequentially, one after another, and there is no way to sync them.
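Since there is no barrier across work-groups, one common workaround is to split the work into two kernels and let the kernel boundary act as the global sync point: on an in-order command queue, the second kernel does not start until the first has finished. The sketch below is only an illustration of that idea, not the original poster's code. The kernel names `rank_and_scatter` and `copy_back` and the `scratch` buffer are made up for this example, and it does need a temporary buffer, since the in-place write is only safe once every work-item has read its key. The block at the top (`sim_gid`, `sim_gsize` and the empty `__kernel`/`__global` defines) is a hypothetical stand-in that only matters when compiling the file as plain C for stepping through on the host; an OpenCL compiler skips it.

```c
#ifndef __OPENCL_VERSION__
/* Plain-C stand-ins (hypothetical, not part of OpenCL) so the kernels
   below can also be compiled and run on the host for experimenting. */
#include <stddef.h>
#define __kernel
#define __global
static size_t sim_gid, sim_gsize;  /* set by a host loop, one "work-item" at a time */
static size_t get_global_id(unsigned int dim)   { (void)dim; return sim_gid; }
static size_t get_global_size(unsigned int dim) { (void)dim; return sim_gsize; }
#endif

/* Pass 1: each work-item ranks its own key and scatters it into scratch.
   Only `in` is read and only `scratch` is written, so work-groups cannot
   disturb each other no matter in which order they run. */
__kernel void rank_and_scatter(__global const unsigned int *in,
                               __global unsigned int *scratch)
{
    size_t i = get_global_id(0);
    size_t n = get_global_size(0);
    unsigned int iKey = in[i];
    size_t pos = 0;
    for (size_t j = 0; j < n; j++) {
        unsigned int jKey = in[j];
        /* ties are broken by index so equal keys get distinct slots */
        pos += (jKey < iKey) || (jKey == iKey && j < i);
    }
    scratch[pos] = iKey;
}

/* Pass 2: copy the sorted keys back into the original array.
   Enqueued after pass 1, so every scratch slot is already written. */
__kernel void copy_back(__global const unsigned int *scratch,
                        __global unsigned int *in)
{
    size_t i = get_global_id(0);
    in[i] = scratch[i];
}
```

On the host side this would mean two `clEnqueueNDRangeKernel` calls on the same in-order queue, with the same global size for both. The second pass could probably be replaced by a `clEnqueueCopyBuffer`, or dropped entirely by swapping the two buffer handles for whatever runs next.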