Execution speed of For loop on GPU

Hello,

When I execute this on the GPU in a .cu file:
int a = threadIdx.x+blockDim.x*blockIdx.x; output[a]=cosf(a/N);
I get about a 3x speedup compared with the CPU.

But if I use a for loop in my .cu file:
int a = threadIdx.x+blockDim.x*blockIdx.x;
for(a=0;a<N;a++){output[a]=cosf(a/N); __syncthreads();}
it runs about 10 times slower on the GPU than on the CPU.

This is just an example. My real task is more complicated, and I cannot do it without the for loop.
How can I execute a for loop quickly on the GPU?

Thank you!

Hello

To my understanding, the idea of “Data Parallel Processing” could (in a slightly oversimplified way) be summarized as “avoiding for-loops”. The simplest approach is to replace a loop like
for (int i=0; i<N; i++) doSomething(i);
with
doSomething(threadIndex);

A little more context for the example might be helpful. It is not clear what the roles of ‘a’ and ‘N’ are (and, e.g., how large ‘N’ is). I also don’t see a reason for the __syncthreads() there. At first glance, it looks like the roles of ‘a’ and ‘N’ are just swapped - this might indicate that the problem would have to be reformulated in order to gain the most from the data-parallel approach…
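
For illustration, here is a minimal sketch of the loop-free pattern for the cosf example, assuming ‘output’ has N elements and the kernel is launched with at least N threads (the kernel name and the cast to float are my own additions):

extern "C"
__global__ void cosKernel(int N, float *output){
	// Each thread handles exactly one element - no for loop and no __syncthreads() needed
	int i = threadIdx.x + blockDim.x*blockIdx.x;
	if (i < N)                        // guard for threads beyond the end of the array
		output[i] = cosf((float)i/N); // cast to float - otherwise i/N is integer division
}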

bye
Marco

A more difficult example:


extern "C"
__global__ void add(int N, float two_pi, float delta, float *output){
	float sum=0;
	int eb = threadIdx.x+blockDim.x*blockIdx.x;
	for(eb=0;eb<N;eb++){
		sum=0;
		for(int ea=eb;ea<N;ea++)
			sum+=cosf(two_pi*(eb-ea));
		output[eb] = ((-2*sum*delta)/N);
		__syncthreads();
	}
}

Thank you!

Hm … I think I don’t get it: You are computing the thread index ‘eb’, but then ignoring it and using ‘eb’ as the loop variable instead - there’s no parallelism at all - except that all threads are executing this for loop, which IS slow, of course :wink:

Can you explain what you want to compute? (Because I think what you are computing does not make sense in this form - but maybe I’m wrong…)

It looks like I solved my problem: I wrote a __device__ function named loop that is called from the __global__ function add (which runs in parallel), and in its body I put what I wanted - the more complicated code:

extern "C" __device__ float loop(int b, int N, float two_pi, float delta){
	float sum = 0;
	for(int a=b; a<N; a++)
		sum += cosf(two_pi*(b-a));
	return (-2*sum*delta)/N;
}

extern "C" __global__ void add(int N, float two_pi, float delta, float *output){
	int eb = threadIdx.x + blockDim.x*blockIdx.x;
	output[eb] = loop(eb, N, two_pi, delta);
}
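
For completeness, a rough sketch of how this kernel could be launched from CUDA C/C++ (the block size of 256 and the device pointer d_output are just placeholders, not part of my real code):

// Hypothetical host-side launch; d_output must point to device memory holding N floats
int blockSize = 256;                            // arbitrary choice
int gridSize  = (N + blockSize - 1)/blockSize;  // round up so all N elements are covered
// Note: add() has no "if (eb < N)" guard, so the last block can write past the end
// of output unless N is a multiple of blockSize - adding such a guard is safer.
add<<<gridSize, blockSize>>>(N, two_pi, delta, d_output);
cudaDeviceSynchronize();                        // wait for the kernel to finish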

Now I get good execution speed.

Thank you!

I think the execution speed should not differ so much if both code samples computed the same thing in the same way - but at least in the first version, the parallelism is destroyed by the incorrect use of the thread index and the outer for loop…
However, when it’s now doing what it should do, and it’s doing it fast, everything is fine :wink: