It’s hard to give specific hints here, because the kernel is very complex. Unfortunately, I have not yet been able to take a closer look at how OpenCL is used in Encog, so I’m not sure how this kernel is called, or how you could modify the kernel (or the way it is called) to suit your needs.
In any case, you might want to try breaking it down into several functions, e.g. pulling out the
for (int currentLayer =…
and
for(int currentLevel =…
loops into functions which are contained in the same source code file as the kernel. If this is possible without passing the whole set of parameters to each sub-function, this could make it easier to see possible optimizations.
From a first, short (!) glance at the source code:
for(int trainIndex=0;trainIndex<itemsPer;trainIndex++)
{
int subtaskIndex = (taskIndex*itemsPer)+trainIndex+trainingOffset;
...
This pattern for computing an index looks like it might be possible to exploit the structure of workgroups and work-item IDs there. Roughly(!), it might be possible to restructure this so that each work-item handles a single training item and the loop turns into
int subtaskIndex = get_global_id(0)+trainingOffset;
...
But I don’t know whether this is really possible, or which sort of restructuring would be necessary for that.
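To make the idea a bit more concrete, here is a small plain-Java simulation (no OpenCL involved) that compares the indices visited by the loop pattern with those of the flat pattern. It assumes that taskIndex corresponds to the work-item ID and that the global work size would be raised to taskCount*itemsPer; neither is confirmed by the Encog source, and the class and method names are made up for illustration.

```java
import java.util.Arrays;

class IndexMappingCheck {

    // Indices visited by the original pattern: each task loops over itemsPer items.
    // Assumption: taskIndex plays the role of the work-item ID.
    static int[] loopIndices(int taskCount, int itemsPer, int trainingOffset) {
        int[] result = new int[taskCount * itemsPer];
        int n = 0;
        for (int taskIndex = 0; taskIndex < taskCount; taskIndex++) {
            for (int trainIndex = 0; trainIndex < itemsPer; trainIndex++) {
                result[n++] = (taskIndex * itemsPer) + trainIndex + trainingOffset;
            }
        }
        return result;
    }

    // Indices visited by the flat pattern: one work-item per training item.
    // globalId stands in for get_global_id(0).
    static int[] flatIndices(int globalSize, int trainingOffset) {
        int[] result = new int[globalSize];
        for (int globalId = 0; globalId < globalSize; globalId++) {
            result[globalId] = globalId + trainingOffset;
        }
        return result;
    }

    public static void main(String[] args) {
        int taskCount = 3, itemsPer = 4, trainingOffset = 10;
        int[] a = loopIndices(taskCount, itemsPer, trainingOffset);
        int[] b = flatIndices(taskCount * itemsPer, trainingOffset);
        System.out.println("Same indices: " + Arrays.equals(a, b));
    }
}
```

This only shows that the same indices are covered. If each task accumulates a result over its itemsPer items (which I would guess, but have not checked), the flat version would additionally need some kind of reduction step to combine the per-item results.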
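The ‘params’ are known when the kernel is started, and do not change. So depending on the rest of the architecture, it could be possible to assemble the kernel source code on the Java side(!), with these values inserted as preprocessor constants. This may give the compiler fixed values to work with, for example for loop bounds: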
String code =
"#define inputSize "+params[PARRAY_INPUT_COUNT]+"\n"+
"#define outputSize "+params[PARRAY_OUTPUT_COUNT]+"\n"+
...
+ remainingCodeString;
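A slightly more complete sketch of this string assembly could look as follows. The class name, the method name and the values of the index constants are hypothetical (the real indices are defined by Encog's kernel), and I am assuming that the output size is stored under PARRAY_OUTPUT_COUNT.

```java
class KernelSourceBuilder {

    // Hypothetical index values; the real ones are defined alongside the Encog kernel.
    static final int PARRAY_INPUT_COUNT = 0;
    static final int PARRAY_OUTPUT_COUNT = 1;

    // Prepends one #define per constant parameter to the remaining kernel source.
    static String withDefines(int[] params, String remainingCodeString) {
        StringBuilder code = new StringBuilder();
        code.append("#define inputSize ").append(params[PARRAY_INPUT_COUNT]).append('\n');
        code.append("#define outputSize ").append(params[PARRAY_OUTPUT_COUNT]).append('\n');
        code.append(remainingCodeString);
        return code.toString();
    }

    public static void main(String[] args) {
        int[] params = { 4, 2 };
        // Placeholder kernel body, only here to show where the defines end up.
        String kernel = "__kernel void train() { /* uses inputSize, outputSize */ }";
        System.out.println(withDefines(params, kernel));
    }
}
```

The kernel itself would then have to use inputSize and outputSize directly instead of reading them from the params array, and it would have to be rebuilt whenever the network layout changes. As an alternative to editing the source string, the same definitions can be passed as "-D inputSize=4" style build options when the program is compiled.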
Judging from your question, you may already have considered breaking this down into several independent kernels, for example, one kernel for the forward pass, one for the backward pass, one for the hidden layers and one for the network update. But this is just a guess(!): only someone who has a deep understanding of what is happening there can judge whether this would be helpful (or whether it is possible at all).
Apart from that, it looks like there are many global memory accesses. It might also be possible to exploit local memory there. This is certainly an advanced optimization, but keeping it in mind during other restructuring or optimization attempts may help to avoid modifications that would prevent further optimizations of this kind.
bye
Marco