Hello,
As I mentioned, there are several aspects that may influence the speedup that can be achieved. When you have Java code like
static void cal(DoubleMatrix1D Ri_Y) {
    for (int r = Ri_Y.size() - 1; r >= 0; r--) {
        Ri_Y.setQuick(r, expE(Ri_Y.getQuick(r)));
    }
}
and convert it to use OpenCL into something like
static void cal(DoubleMatrix1D Ri_Y)
{
    // Copy matrix into array
    float array[] = new float[Ri_Y.size()];
    for (int r = Ri_Y.size() - 1; r >= 0; r--) {
        array[r] = (float)Ri_Y.getQuick(r);
    }
    // Create memory object from the array
    cl_mem mem = clCreateBuffer(... Pointer.to(array), ...);
    // Set up arguments and call the kernel
    ...
    // Copy back the result to the array
    clEnqueueReadBuffer(..., mem, Pointer.to(array)...);
    // Array into matrix
    for (int r = Ri_Y.size() - 1; r >= 0; r--) {
        Ri_Y.setQuick(r, array[r]);
    }
}
then it will most likely be slower than the plain Java implementation. The GPU is especially fast for computations that require lots of arithmetic (or built-in functions, like the trigonometric ones). In the example above, the computation is memory bound: each element needs only a single function call, so most of the time will be spent copying the memory between the host and the device. That's why I mentioned that it would be good if you do NOT have to copy the data between the host and the device in each step.
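To make the "memory bound" point a bit more concrete, here is a rough back-of-the-envelope sketch in plain Java. It only compares the time for copying n floats to the device and back with the time for the computation itself. Note that the bandwidth, the throughput and the cost of one exp call are made-up placeholder values and NOT measurements, and that the class and method names are just for illustration. Fixed overheads like the kernel launch are ignored as well, so you would have to plug in the numbers for your own hardware.

```java
public class TransferCostSketch {

    // Seconds spent copying n floats to the device and back again
    static double transferSeconds(long n, double bytesPerSecond) {
        return 2.0 * n * Float.BYTES / bytesPerSecond;
    }

    // Seconds spent on the computation itself
    static double computeSeconds(long n, double opsPerElement, double opsPerSecond) {
        return n * opsPerElement / opsPerSecond;
    }

    public static void main(String[] args) {
        long n = 1_000_000;

        // ASSUMPTIONS: placeholder values, not measured on any real device
        double busBytesPerSecond = 5e9;   // host <-> device bandwidth
        double deviceOpsPerSecond = 5e11; // arithmetic throughput of the device
        double opsPerElement = 20;        // rough cost of one exp call

        double transfer = transferSeconds(n, busBytesPerSecond);
        double compute = computeSeconds(n, opsPerElement, deviceOpsPerSecond);

        System.out.printf("transfer: %.6f s, compute: %.6f s%n", transfer, compute);
    }
}
```

With placeholder values like these, the copying dominates by a wide margin. The picture only changes when the kernel does much more work per element, or when the data can stay on the device for several steps.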
I have a question: how can I use and reuse the same OpenCL program at each iteration without re-initializing the OpenCL context?
Yes, of course you can call the same program multiple times. And you should definitely do that. The initialization of a new context might be very time-consuming. The basic structure of your program could probably roughly (!) look like this:
class CLCode
{
    // Private CL specific variables
    private cl_command_queue commandQueue;
    private cl_context context;
    private cl_kernel kernel;
    // Possibly you could also declare the cl_mem object here
    cl_mem mem;

    public void initialize()
    {
        // Initialize the context, command queue and kernel here
        ...
        // If the size of the cl_mem does not change between
        // the calls, you could also initialize the memory object here
        ...
    }

    public void compute(float array[])
    {
        // Write the array data into the cl_mem object
        clEnqueueWriteBuffer(..., mem, Pointer.to(array)...);
        // Set up the arguments and execute the kernel
        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(mem));
        ...
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
            global_work_size, local_work_size, 0, null, null);
        // Read the cl_mem object back into the array
        clEnqueueReadBuffer(..., mem, Pointer.to(array)...);
    }
}
That way, in the actual "compute" method you only have to copy the data to the device, execute the kernel, and copy the data back to Java.
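The calling side could then look roughly like the following sketch. It is plain Java without any real OpenCL calls, so that it runs on its own. FakeCLCode is only a hypothetical stand-in for the CLCode class above, Math.exp stands in for the kernel, and the shutdown method is an assumption of mine (in real code it would be the place to release the kernel, memory object, command queue and context). The counter is only there to show that the expensive setup runs once, no matter how many iterations follow.

```java
public class ReuseSketch {

    // Hypothetical stand-in for CLCode: no real OpenCL calls are made here
    static class FakeCLCode {
        int initCount = 0;            // how often the expensive setup ran
        boolean initialized = false;

        void initialize() {
            // Real code: create context, command queue, kernel (and cl_mem)
            initCount++;
            initialized = true;
        }

        void compute(float array[]) {
            if (!initialized) {
                throw new IllegalStateException("call initialize() first");
            }
            // Real code: write buffer, execute kernel, read buffer.
            // Here Math.exp is computed on the host as a placeholder.
            for (int i = 0; i < array.length; i++) {
                array[i] = (float) Math.exp(array[i]);
            }
        }

        void shutdown() {
            // Real code: release the CL objects that initialize() created
            initialized = false;
        }
    }

    // Initializes once, calls compute in each iteration, then cleans up.
    // Returns how many times the setup was executed.
    static int runIterations(FakeCLCode cl, float data[], int iterations) {
        cl.initialize();
        try {
            for (int i = 0; i < iterations; i++) {
                cl.compute(data);
            }
        } finally {
            cl.shutdown();
        }
        return cl.initCount;
    }

    public static void main(String[] args) {
        float data[] = { 0f, 0f, 0f };
        int inits = runIterations(new FakeCLCode(), data, 3);
        System.out.println("initializations: " + inits);
    }
}
```

The important part is only that initialize is called outside of the loop, and compute inside of it.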
bye