Calling a kernel from another kernel

lodato87 · 8. Januar 2011 um 05:13

@pre-condition: sorry for my English

Hi,
I have a question: I have an application that use a OpenCL kernel. However this kernel is not optimized for GPU processing so the question is: Can I modify tha application so:

I build other 4 kernel
the actual kernel call other kernel?

thank you in advance for any help

PS: in the actual kernel the difference between i5 and GTX280 is only 10 second!!!

Marco13 · 8. Januar 2011 um 11:54

Hello,

I’m not sure if I understood you right.

You can call one kernel from another kernel. But note that the global and local work-item diemensions can not be changed with these calls. So it would probably (!) not make sense to use one kernel only for “dispatching” to another kernel. Instead, you could make the decision about which kernel you want to call on the host side, roughly like


if (deviceTypeIsCPU)
{
    clEnqueueNDRangeKernel(commandQueue, **kernelA**, 1, null, **sizeA**, null, 0, null, null);
}
else
{
    clEnqueueNDRangeKernel(commandQueue, **kernelB**, 1, null, **sizeB**, null, 0, null, null);
}

Does this answer your question?

(BTW: You could even go much further. You can create the source code for the kernel depending on the properties of the device, but there probably are not many cases where this is worth the effort.)

bye

lodato87 · 8. Januar 2011 um 18:52

Yes you had answered my question.
inquiries concerning “not make sense to use one kernel…” That’s right! if I can’t change global dimension is useless, because is like a function call.

I give up.
I want to try Encog(that use your library) because my Neural Network want about 2 year for training finish :(. Encog use openCL, but the kernel is very big (instead of a Kernel for each Neuron it has a kernel for the single step processing).
So the simple method for repair the problem is modify di Kernel but if I can’t change global dimension is useless.

than you for your help,

it’s a shame, look how many nested for:(

#define POSITIVE_ETA 1.2f
#define NEGATIVE_ETA 0.5f
#define DELTA_MIN 0.00001f
#define MAX_STEP 50.0f	

#define PARRAY_INPUT_COUNT 0
#define PARRAY_OUTPUT_COUNT 1
#define PARRAY_LAYER_COUNT 2
#define PARRAY_LEARN 3 
#define PARRAY_START 4
#define PARRAY_ITEMS_PER 5
#define ITERATIONS 6

kernel void NetworkTrain(
    global read_only int *params,
    global write_only float *errors,
    global read_only int *layerIndex,
    global read_only int *layerCounts,
    global read_only int *layerFeedCounts,
    global read_only int *weightIndex,
    global read_only float* input,
    global read_only float* ideal,
    global read_only float* weightsIn,
    global write_only float* weightsOut,
    global write_only float *gradientsOut,
    global read_only int *activationType,
    global read_only float *tempDataIn,
    global read_only float *tempDataOut,
    global read_only float *gradientsIn
    )
{
	private float layerOutput[NEURON_COUNT];
	private float layerDelta[NEURON_COUNT];
			
	int taskIndex = get_global_id(0);
	int globalSize = get_global_size(0);
	
	int inputSize = params[PARRAY_INPUT_COUNT];
    int outputSize = params[PARRAY_OUTPUT_COUNT];
    int layerCount = params[PARRAY_LAYER_COUNT];
    int trainingOffset = params[PARRAY_START];
	int itemsPer = params[PARRAY_ITEMS_PER];
	
	int gradientOffset = (taskIndex*WEIGHT_COUNT);

	int iterations = params[ITERATIONS];
	
	while( (iterations--)> 0 )
	{
		// clear out the gradients and errors
		errors[taskIndex] = 0;
		for(int i=0;i<WEIGHT_COUNT;i++)
		{
			gradientsOut[gradientOffset+i] = 0;
		}
	
		for(int trainIndex=0;trainIndex<itemsPer;trainIndex++)
		{		 
			int subtaskIndex = (taskIndex*itemsPer)+trainIndex+trainingOffset;
		
			// part 1: forward pass
			int taskInputIndex = subtaskIndex * inputSize;
			int taskIdealIndex = subtaskIndex * outputSize;
	
			int sourceIndex = NEURON_COUNT - layerCounts[layerCount-1];
		
			for(int i=0;i<NEURON_COUNT;i++)
				layerOutput** = 1;
		
			// load the input into the layer output array, this feeds the first layer.
			for(int i=0;i<inputSize;i++)
				layerOutput[sourceIndex+i] = input[taskInputIndex+i];
				
			for (int currentLayer = layerCount - 1; currentLayer > 0; currentLayer--)
			{
				int inputIndex = layerIndex[currentLayer];
				int outputIndex = layerIndex[currentLayer - 1];
				int inputSize = layerCounts[currentLayer];
				int outputSize = layerFeedCounts[currentLayer - 1];
				int index = weightIndex[currentLayer - 1];

				global float *wptr = weightsIn+index;
				for (int x = 0; x < outputSize; x++)
				{
					float sum = 0;
					float *outputPtr = layerOutput+inputIndex;
					for (int y = 0; y < inputSize; y++)
					{
						sum += *(wptr++) * layerOutput[inputIndex + y];
					}
       
					layerOutput[outputIndex + x] = ACTIVATION(sum, 1.0);
				}
			}
		
			// part 2: backward pass
			// process the output layer first
	
			float e = 0;
   
			for(int i=0;i<outputSize;i++)
			{
				float diff = ideal[taskIdealIndex+i] - layerOutput**;
				e+=diff*diff;
				layerDelta** = diff * DERIVATIVE(layerOutput**, 1.0);
			}				

			errors[taskIndex] += e;

			// process hidden layers
		
			for(int currentLevel = 0; (currentLevel<layerCount-1); currentLevel++)
			{
            	int fromLayerIndex = layerIndex[currentLevel + 1];
            	int toLayerIndex = layerIndex[currentLevel];
            	int fromLayerSize = layerCounts[currentLevel + 1];
            	int toLayerSize = layerFeedCounts[currentLevel];

				int index = weightIndex[currentLevel];
	
				// handle weights
				int yi = fromLayerIndex;
				for (int y = 0; y < fromLayerSize; y++) 
				{
					float output = layerOutput[yi];
					float sum = 0;
					int wi = index+y;
					global float * gptr = gradientsOut+wi+gradientOffset;
					global float * wptr = weightsIn + wi;
					float * dptr = layerDelta + toLayerIndex;
					for (int x = 0; x < toLayerSize; x++) 
					{
						*gptr += output * (*dptr);
						sum += (*wptr) * (*dptr);
						wi+=fromLayerSize;
						wptr+=fromLayerSize;
						gptr+=fromLayerSize;
						dptr++;
					}
			
					layerDelta[yi] = sum * DERIVATIVE(
            	      layerOutput[yi],
                	  1.0);

					yi++;
				}
			}
		}	
	
		// now that the gradients have been calculated, update the network
		barrier(CLK_GLOBAL_MEM_FENCE);
	
		if( taskIndex==0 )
		{
			// loop over all gradients and sum them into the first "global" task
			for(int i=0;i<WEIGHT_COUNT;i++) 
			{
				gradientsOut**+=gradientsIn**;
				for(int j=1;j<globalSize;j++)
				{		 
					gradientsOut** += gradientsOut[(j*WEIGHT_COUNT)+i];
				}
			}
		}

		if( taskIndex==0 && params[PARRAY_LEARN]>0 )
		{
			// teach the weights
#ifdef LEARN_RPROP
			global float *wptr = weightsIn;
			global float *gptr = gradientsOut;		
			for(int i=0;i<WEIGHT_COUNT;i++)
			{
				int change = sign((*gptr) * tempDataIn**);
				float weightChange = 0;

				// if the gradient has retained its sign, then we increase the
				// delta so that it will converge faster
				if (change > 0) 
				{
					float delta = tempDataIn[i+WEIGHT_COUNT]
						* POSITIVE_ETA;
					delta = min(delta, MAX_STEP);
					weightChange = sign(*gptr) * delta;
					tempDataIn[i+WEIGHT_COUNT] = delta;
					tempDataIn** = *gptr;
				}
				else if (change < 0) 
				{
					// if change<0, then the sign has changed, and the last
					// delta was too big
					float delta = tempDataIn[i+WEIGHT_COUNT]
						* NEGATIVE_ETA;
					delta = max(delta, DELTA_MIN);
					tempDataIn[i+WEIGHT_COUNT] = delta;
					// set the previous gradient to zero so that there will be no
					// adjustment the next iteration
					tempDataIn** = 0;
				} 
				else if (change == 0) 
				{
					// if change==0 then there is no change to the delta
					float delta = tempDataIn**;
					weightChange = sign(*gptr) * delta;
					tempDataIn** = gradientsOut**;
				}
			
				*(wptr++)+=weightChange;
				gptr++;
			}	
#endif
#ifdef LEARN_BPROP		
			for(int i=0;i<WEIGHT_COUNT;i++)
			{
				float delta = (gradientsOut***tempDataIn[0]);
				weightsIn**+=delta+(tempDataIn[i+2]*tempDataIn[1]);
				tempDataIn[i+2] = delta;
			}
#endif
#ifdef LEARN_MANHATTAN
			for(int i=0;i<WEIGHT_COUNT;i++)
			{
				int direction = sign(gradientsOut**);
				weightsIn**+=tempDataIn[0]*direction;
			}
#endif
		}
	
		barrier(CLK_GLOBAL_MEM_FENCE);
	}
	
	if( taskIndex==0 )
	{
		// finally, after all is done, return the weights to the CPU
		for(int i=0;i<WEIGHT_COUNT;i++)
		{
			weightsOut** = weightsIn**;
		}
		
		for(int i=0;i<(WEIGHT_COUNT*2);i++)
		{
			tempDataOut** = tempDataIn**;
		}
	}
	
}

Marco13 · 9. Januar 2011 um 06:49

It’s hard to give any hints there, because the kernel is very complex. Unfortunately I could not yet have a closer look at how OpenCL is used in Encog. So I’m not sure how this kernel is called, and how you could modify the kernel (or how it is called) to suit your needs.

In any case, you might want to try breaking it down into several functions, e.g. pulling out the
for (int currentLayer =…
and
for(int currentLevel =…
loops into functions which are contained in the same source code file as the kernel. If this is possible without passing the whole set of parameters to each sub-function, this could make it easier to see possible optimizations.

From a first, short (!) glance at the soruce code:

for(int trainIndex=0;trainIndex<itemsPer;trainIndex++)
{        
     int subtaskIndex = (taskIndex*itemsPer)+trainIndex+trainingOffset;
...

This pattern, for computing a local index, looks like it might be possible to exploit the structure of workgroups and local IDs there. Roughly(!) like it might be possible to adopt this so that it turns into

int subtaskIndex = get_global_id(0)+trainingOffset;
...

But I don’t know whether this is really possible, or which sort of restructuring would be necessary for that.

The ‘params’ are known when the kernel is started, and do not change. So depending on the remaining architecture, it could be possible to define the kernel source code on Java side(!) as


String code = 
    "#define inputSize "+params[PARRAY_INPUT_COUNT]+"
"+
    "#define outputSize "+params[PARRAY_INPUT_COUNT]+"
"+
    ...
    + remainingCodeString;

According to your question, you may also have considered to break this down into several independent kernels, for example, one kernel for the forward pass, one for the backward pass, one for the hidden layers and one for the network update, but this is just a guess(!) - only someone who has a deep understanding of what is happening there can judge whether this might be helpful (or whether it might be possible at all).

Apart from that, it looks like there are many global memory accesses. It might also be possible to exploit local memory there. This certainly is an advanced optimization, but keeping it in mind during other attempts of restructuring or optimization may help to avoid modifications which prevent further optimizations of this kind.

bye
Marco