atomicCAS not working

Hi guys,

I want to increment a variable “place” atomically, so I did this in my kernel file:

__device__ int place;
__device__ int* mutexPlace;
__device__ int* mutexNbTrajs;

extern "C" {
__global__ void sampleAndSave(double **fullTrajs, double **newTrajs,int nbNewTrajs, double *params,int* newTrajsStorIdxs){
    int tid = threadIdx.x;    
    
    printf("
%d : I started", tid);
    if(tid==0){
       mutexPlace = (int*)malloc(sizeof(int));
       mutexNbTrajs = (int*)malloc(sizeof(int));
       *mutexPlace = 0;
       *mutexNbTrajs = 0;
    }
    place = params[1];  /* params[1] == 0 */
    ...
    ...
    ...
    printf(" %d : i'm before atomicCAS
");
    while(atomicCAS(mutexPlace, 0, 1) != 0);
    printf(" %d : i'm after atomicCAS
");
    storageIdx = place;
    place++;
    printf("
 %d ->%d",tid, place);
    atomicExch(mutexPlace, 0);
    ...
    __syncthreads();

But when executing, the kernel takes a very long time (~15 seconds) and then prints that all threads are before the atomicCAS, but none of them gets past it (maybe the first thread did pass it, because with some extra printfs I saw that the mutex changes from 0 to 1, but in any case I never got “i’m after atomicCAS” printed). After that it fails with CUDA_ERROR_LAUNCH_TIMEOUT.

Any help will be appreciated.

Hello,

Admittedly, since I only recently updated from my old GeForce 8800 to a newer card, I have not yet really used atomics, and can’t tell what’s wrong there just from looking over the code. Web searches bring up some results that may be related (“CUDA, mutex and atomicCAS()” and “How to implement a critical section in CUDA?”, both on Stack Overflow), suggesting that the fact that all threads of one warp enter the critical section may be the reason for the “deadlock” you are observing. But I’ll have to run my own tests and do some more experiments in order to understand more thoroughly what’s going on there.

(Sorry, no real help for now.)
Marco

Thanks Marco, that was very helpful.
The comment in the first link explained it very well:


Avoid different execution paths within the same warp.

Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.
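
That explains it: with my original loop, the thread that wins the atomicCAS and the threads that keep spinning on the mutex are in the same warp, so the spinning branch can starve the winner, and it never reaches the atomicExch that releases the lock. The workaround usually suggested (I think this is also what the answers in those threads describe) is to keep the whole critical section inside the branch that wins the CAS, roughly like this (untested sketch, using the same mutexPlace/place/storageIdx variables as in my kernel above, and apparently still not guaranteed to work within one warp on every architecture):

bool done = false;
while (!done) {
    if (atomicCAS(mutexPlace, 0, 1) == 0) {
        /* only the thread that changed the mutex from 0 to 1 gets here */
        storageIdx = place;            /* take the current index...          */
        place++;                       /* ...and advance it for the next one */
        atomicExch(mutexPlace, 0);     /* release the lock                   */
        done = true;
    }
}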

I fixed it with atomicAdd
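
For the record, the relevant part now looks roughly like this (just a sketch, reduced to a minimal kernel with a simplified signature; place is assumed to be set to 0 from the host, e.g. with cudaMemcpyToSymbol, before the launch):

__device__ int place;

extern "C" {
__global__ void sampleAndSave(int* newTrajsStorIdxs){
    int tid = threadIdx.x;

    /* atomicAdd returns the old value of place and adds 1 to it in one
       atomic step, so every thread gets a unique storage index without
       any mutex, spin loop or atomicExch */
    int storageIdx = atomicAdd(&place, 1);

    newTrajsStorIdxs[tid] = storageIdx;
}
}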

Good to hear that you managed to resolve this. (If you have a nice, short example of using the atomic operations that could serve as a sample, I’d be happy to add it.)