I’m trying to use CUDA’s shared variables to share information across threads.
In the code below, each element of the debug array is 1 when the application finishes, regardless of how many threads are deployed.
Is this behaviour due to the threads running in parallel, or should the sharedInt variable be incremented by each thread?
extern "C"
__global__ void quicksort(int *array, int numElements, int *stackUnused, int* debug)
{
__shared__ int sharedInt;
__syncthreads();
sharedInt++;
__syncthreads();
debug[threadIdx.x] = sharedInt;
__syncthreads();
}```
At a quick glance, it's most likely because the threads are all incrementing the variable in parallel. (Also note that the variable is only shared among the threads of one block.) Such a "counter" could probably be implemented using atomic variables, but these may destroy parallelism. Implementing such a recursive algorithm on the GPU may be difficult; I would also have to do some web searching about possible approaches for that.
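For example, a minimal sketch (my own guess at what was intended, not your original code) of how the kernel could count the threads of one block with an atomic on the shared variable, assuming a device with compute capability >= 1.2 so that shared-memory atomics are available:

```
extern "C"
__global__ void quicksort(int *array, int numElements, int *stackUnused, int* debug)
{
    __shared__ int sharedInt;

    // One thread initializes the counter; shared memory has no default value
    if (threadIdx.x == 0)
        sharedInt = 0;
    __syncthreads();

    // Each thread increments the counter atomically, avoiding the race condition
    atomicAdd(&sharedInt, 1);
    __syncthreads();

    // Now every thread sees the same value: the number of threads in this block
    debug[threadIdx.x] = sharedInt;
}
```

With this, each element of debug should end up as blockDim.x instead of 1, although the increments themselves are serialized by the atomic.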
extern "C"
__global__ void quicksort(int *array, int numElements, int *stackUnused, int* debug)
{
__shared__ int sharedInt;
__syncthreads(); // unnecessary sync
sharedInt++; //race condition, and sharedInt was never initalized
__syncthreads();
debug[threadIdx.x] = sharedInt;
__syncthreads();
}```[/QUOTE]
Hint: uninitialized shared variables will lead to calculation errors!
A **naive** approach could use atomics, but atomics are by their nature executed sequentially.
Be aware that on devices with compute capability < 2.0 you only have a maximum of 512 threads per block!
(1 Element : 1 AtomicAdd)
```extern "C"
{
//GridDim.x = 1
//BlockDim.x = number of input elements;
__global__ void simpleCountNonZeroEl(unsigned int* input, unsigned int* count)
{
__shared__ unsigned int c;
//initialize shared memory shared memory has no default values
if(threadIdx.x == 0)
c = 0;
__syncthreads();
if(input[threadIdx.x] != 0)
atomicAdd(c, 1);
__syncthreads();
if(threadIdx.x == 0)
count[0] = c;
}
}```
**Less naive**
Every thread loads a TILE of the input array into its own register array,
counts the number of non-zero elements,
and performs one atomicAdd on the single shared counter variable (see the sketch below).
Attention: reset the TILE helper array before you use it, if your input size is not a multiple of TILE_SIZE.
(TILE_SIZE : 1 AtomicAdd)
This less naive algorithm is now bound by the number of registers (8192 per streaming multiprocessor on compute capability < 2.0).
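A rough sketch of what such a tiled kernel could look like; TILE_SIZE, the numElements parameter, and the launch configuration are my own assumptions here:

```
#define TILE_SIZE 8

//GridDim.x = 1
//BlockDim.x = number of input elements / TILE_SIZE (rounded up)
__global__ void tiledCountNonZeroEl(unsigned int* input, unsigned int numElements, unsigned int* count)
{
    __shared__ unsigned int c;

    //initialize the shared counter, shared memory has no default values
    if (threadIdx.x == 0)
        c = 0;
    __syncthreads();

    //reset the TILE helper array, in case numElements is not a multiple of TILE_SIZE
    unsigned int tile[TILE_SIZE];
    for (unsigned int i = 0; i < TILE_SIZE; i++)
        tile[i] = 0;

    //load one TILE of the input into the register array
    unsigned int base = threadIdx.x * TILE_SIZE;
    for (unsigned int i = 0; i < TILE_SIZE; i++)
        if (base + i < numElements)
            tile[i] = input[base + i];

    //count the non-zero elements of the TILE locally
    unsigned int localCount = 0;
    for (unsigned int i = 0; i < TILE_SIZE; i++)
        if (tile[i] != 0)
            localCount++;

    //one atomicAdd per TILE instead of one per element
    atomicAdd(&c, localCount);
    __syncthreads();

    if (threadIdx.x == 0)
        count[0] = c;
}
```

With TILE_SIZE elements handled per thread, only one atomicAdd per TILE is issued instead of one per element.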
Starting with such small examples, you need to work your way up until you have fully optimized your kernel functions.
A few days ago I started at the same point, and by now I have created a Matrix Column Vector Normalizing Kernel allowing dimensions of up to 65535^3 * 12.582.912 :)
Thanks for these hints.
To me, the original code looked like an attempt to count how many threads have reached a certain point, but I'm not sure - maybe it was also intended to count non-zero elements, or maybe I'm just not seeing the similarity between the two tasks.
Actually using shared memory for some form of inter-thread communication may still be tricky, however…
There was no similarity between the two code snippets; it was just an algorithm to perform a task like counting, or adding 1, or similar things using a helper array.
I could not figure out the goal of the algorithm in his post, so I just counted values, simply to do something with the shared variable.
My code is only suitable for compute capability 1.2 and higher. For compute capability 1.1, helper arrays are needed to count the values, to simulate atomic operations on shared memory (see the sketch below).
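For illustration, a rough sketch of what such a helper array could look like, under my own assumptions of a single block whose size is a power of two: every thread writes its partial result into its own slot, and a tree reduction sums the slots, so no atomics are needed at all:

```
//GridDim.x = 1
//BlockDim.x = number of input elements (a power of two)
//launched with blockDim.x * sizeof(unsigned int) bytes of dynamic shared memory
__global__ void countNonZeroElNoAtomics(unsigned int* input, unsigned int* count)
{
    extern __shared__ unsigned int helper[];

    //each thread writes its own 0/1 flag, so no two threads touch the same slot
    helper[threadIdx.x] = (input[threadIdx.x] != 0) ? 1 : 0;
    __syncthreads();

    //tree reduction over the helper array replaces the atomic counter
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            helper[threadIdx.x] += helper[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        count[0] = helper[0];
}
```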