OK… This is quite complex…
There certainly are limits on kernel size, but as far as I know they are not really specified, and are implementation-dependent anyway. You might consider splitting the kernel and pulling some parts out into separate helper functions (in OpenCL C these are plain functions; unlike CUDA, no __device__ qualifier is needed), maybe this helps…
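For example, one of the mass computations that show up further below could be pulled out like this. This is only a rough sketch - the function and kernel signatures are invented here:

float ringMass(float ro, float h, float d, float g)
{
    /* mass term of the form 0.25*Ro*PI*h*(d^2 - (d-g)^2) */
    float dmg = d - g;
    return 0.25f * ro * M_PI_F * h * (d * d - dmg * dmg);
}

__kernel void massKernel(__global const float *H1,
                         __global const float *G1,
                         __global float *M1,
                         float Ro, float D1)
{
    int gid = get_global_id(0);
    M1[gid] = ringMass(Ro, H1[gid], D1, G1[gid]);
}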
The kernel code itself is far from optimal. I don't know about the role of the 'Fw', 'Ro' etc. parameters which are inlined in the source code as constants. The code might become clearer when these are passed in as parameters to the kernel, or as "#define"d constants (#defines can be given as build options when building the program). That would at least allow you to write the code into a single file, so you would not have to define it as a String. But maybe this is just a subjective issue. And by the way: you should be careful with floating point constants. Some compilers will complain about constants like "0.25" when they should be treated as 'float' constants: the literal "0.25" represents a 'double' value. For a float value, you should write '0.25f'.
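On the host side, such build options could look roughly like this (plain C shown here; with JOCL it is the corresponding clBuildProgram call with the same options string - the names and values below are only placeholders, not your real ones):

#include <CL/cl.h>

/* Each "-D name=value" acts like a #define inside the kernel source.
   Note the 'f' suffixes, so the constants are 'float', not 'double'. */
cl_int buildWithConstants(cl_program program)
{
    const char *options = "-D Ro=7850.0f -D Fw=0.5f -D PI=3.1415927f";
    return clBuildProgram(program, 0, NULL, options, NULL, NULL);
}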
Concerning possible optimizations of the kernel:
Some possible improvements are quite obvious, and could also be applied to the CPU code. A clever compiler might recognize some of these and optimize them on the fly. But you could at least try to help it, and see if you can achieve a speedup. If not, the code may at least become simpler.
So these optimizations are only guesses - whether or not they bring an improvement has to be tested!
Something like
y = pow(x, 2)
will most likely be less efficient than a simple
y = x*x
Additionally, when you have already computed y=x*x, you could replace each
z = pow(x, 3)
with
z = y*x;
which might also be faster. This could be stated more generally: there are lots of (really lots of) common sub-terms. Just one small example:
float M1=(0.25*Ro*H1[gid]*PI*(pow(D1, 2)-pow((D1-G1[gid]), 2)));
float M2=(0.25*Ro*H2[gid]*PI*(pow(D2, 2)-pow((D2-G2[gid]), 2)));
float M3=(0.25*Ro*H3[gid]*PI*(pow(D3, 2)-pow((D3-G3[gid]), 2)));
could be written as something like
float constantA = 0.25f*Ro*PI;
float M1=(constantA*H1[gid]*(pow(D1, 2)-pow((D1-G1[gid]), 2)));
float M2=(constantA*H2[gid]*(pow(D2, 2)-pow((D2-G2[gid]), 2)));
float M3=(constantA*H3[gid]*(pow(D3, 2)-pow((D3-G3[gid]), 2)));
and possibly replacing the 'pow's, yielding
float constantA = 0.25f*Ro*PI;
float dmg1 = D1-G1[gid];
float dmg2 = D2-G2[gid];
float dmg3 = D3-G3[gid];
float M1=constantA*H1[gid]*(D1*D1-dmg1*dmg1);
float M2=constantA*H2[gid]*(D2*D2-dmg2*dmg2);
float M3=constantA*H3[gid]*(D3*D3-dmg3*dmg3);
Apart from that, there are some optimizations that are more specific to OpenCL (or GPUs in general):
You should reduce the number of global memory accesses! Each access to global memory has an overhead (latency) of about 600 clock cycles (!). The memory locations H1[gid], H2[gid] … G2[gid] are literally accessed hundreds of times. You should at least pull these accesses out into private variables, like
float h1 = H1[gid];
float h2 = H2[gid];
…
float g2 = G2[gid];
And use these values later on, replacing something like
while(h<(H1[gid]+H2[gid]))
with
while(h<(h1+h2))
Again, a compiler might detect and internally optimize this, but you should not place your bet on that…
This can even be optimized further by using local memory, but this is advanced, and I'm not familiar enough with the existing code (and OpenCL) to give more specific hints than: look this up in the OpenCL programming guide.
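Just to give a very rough idea of the pattern (all names are invented, and it only pays off when the work-items of a group read the SAME global data repeatedly - which may or may not be the case in your kernel): the work-group copies the shared data into local memory once, and afterwards everybody reads the fast local copy. A work-group size of at least TABLE_SIZE is assumed here.

#define TABLE_SIZE 64

__kernel void sumTable(__global const float *table,
                       __global float *out)
{
    __local float localTable[TABLE_SIZE];

    int lid = get_local_id(0);
    if (lid < TABLE_SIZE)
        localTable[lid] = table[lid];  /* one global read per item */
    barrier(CLK_LOCAL_MEM_FENCE);      /* wait for the whole group */

    float sum = 0.0f;
    for (int i = 0; i < TABLE_SIZE; i++)
        sum += localTable[i];          /* repeated reads, now local */
    out[get_global_id(0)] = sum;
}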
If possible, you should reduce the number of branches (loops). But this might be hard, depending on the exact semantics of the computation, so it should probably be done after all other optimization attempts.
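One generic illustration (not taken from your kernel): a data-dependent 'if' can sometimes be replaced by a built-in function, which avoids the branch entirely.

float branchy(float a, float b)
{
    float x;
    if (a > b) x = a;   /* data-dependent branch */
    else       x = b;
    return x;
}

float branchFree(float a, float b)
{
    return fmax(a, b);  /* same result, no branch */
}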
One important point concerning the correctness of the kernel: note that the whole kernel is executed by several hundred threads in parallel. So the update in the last few lines…
if(Nr>Output[0])
{
Output[0] = Nr;
Output[1] = H1[gid];
...
}
will definitely mess up the results, and yield a lot of garbage: one thread might update Output[0], and another thread might update it simultaneously. There is no guarantee that Output[0] will contain a valid number afterwards - not to mention that Output[0] might contain the result from one work-item, and Output[1] the result from another…
This could somehow be solved using "atomic functions" (see section "6.11.11 Atomic Functions" of the OpenCL spec), but this might cause the threads to process this block sequentially, destroying much of the parallelism. One option (sketched in code after this list) could be
- allocate an output area of globalSize*13 values
- let all threads write their results blindly into this memory (regardless of the quality of the solution)
- afterwards search these 'globalSize' results to find the best one
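A sketch of this approach, with invented names ('Nr' is only a stand-in for however the real quality value is computed): each work-item writes into its own slice of 13 floats, so no two threads ever touch the same memory.

__kernel void writeAllResults(__global const float *H1,
                              __global const float *H2,
                              __global float *Output) /* globalSize*13 floats */
{
    int gid = get_global_id(0);
    float Nr = H1[gid] + H2[gid];   /* placeholder for the real score */

    __global float *slot = Output + gid * 13;
    slot[0] = Nr;                   /* quality of this solution */
    slot[1] = H1[gid];              /* ...and its parameters    */
    slot[2] = H2[gid];
    /* slot[3] .. slot[12] analogously */
}

Afterwards, the host reads the buffer back and scans the globalSize slots for the largest slot[0] - a simple linear loop.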
Again: it's an art (and I'm not an artist in this sense, so consider all this just as hints and not as advice…)
bye
Marco