Cern.colt.matrix and JOCL

system · 9 July 2010, 02:01

Hi,

I want to parallelize this matrix calculation with OpenCL / JOCL:

static void cal(DoubleMatrix1D Ri_Y, DoubleMatrix2D Mi_YY) {
    for (int r = Ri_Y.size() - 1; r >= 0; r--) {
        Ri_Y.setQuick(r, expE(Ri_Y.getQuick(r)));
        if (Mi_YY != null) {
            for (int c = Mi_YY.columns() - 1; c >= 0; c--) {
                Mi_YY.setQuick(r, c, expE(Mi_YY.getQuick(r, c)));
            }
        }
    }
}

The matrices are from cern.colt.matrix.

My first question is how to map this Java matrix object to an OpenCL structure.

regards

Marco13 · 9 July 2010, 05:42

Hello

So you want to replace all elements of a Vector and a Matrix with the value expE(element) ?!

Note that, as far as I know, there are currently no implementations of OpenCL that support double precision (except, maybe, on MacOS?). So the values of the matrix will probably have to be converted to a 1D array of float values for the computation.

There may be several ways to achieve this, and it’s hard to tell which is the best one beforehand. A first approach would be to simply walk through the matrix and write the values into a float array


for (int r=0; r<rows; r++)
{
    for (int c=0; c<cols; c++)
    {
        floatArray[c+r*cols] = (float)matrix.getQuick(r,c);
    }
}

then copy this array into a cl_mem object, and pass this to the OpenCL kernel, which may be executed by “floatArray.length” threads. Afterwards, the values may be written back from the cl_mem into the array, and finally back to the Matrix using setQuick.
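As a rough sketch of the host-side part of these steps, the round trip could look like the following. It is self-contained on purpose: a plain double[][] stands in for the Colt matrix (instead of getQuick/setQuick), and a CPU loop stands in for the kernel, with one iteration per work-item id. The class and method names (RowMajorRoundTrip, pack, expInPlace, unpack) are made up for this illustration and are not part of Colt or JOCL.

```java
import java.util.Arrays;

final class RowMajorRoundTrip {

    // Packs a rows x cols matrix into a row-major float array (index = c + r * cols).
    // Assumption: double[][] stands in for a Colt DoubleMatrix2D.
    static float[] pack(double[][] matrix, int rows, int cols) {
        float[] flat = new float[rows * cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                flat[c + r * cols] = (float) matrix[r][c];
            }
        }
        return flat;
    }

    // CPU stand-in for the kernel: one loop iteration per work-item id (gid).
    static void expInPlace(float[] flat) {
        for (int gid = 0; gid < flat.length; gid++) {
            flat[gid] = (float) Math.exp(flat[gid]);
        }
    }

    // Writes the flat results back into the matrix using the same index mapping.
    static void unpack(float[] flat, double[][] matrix, int rows, int cols) {
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                matrix[r][c] = flat[c + r * cols];
            }
        }
    }

    public static void main(String[] args) {
        double[][] m = { {0.0, 1.0, 2.0}, {-1.0, 0.5, 3.0} };
        float[] flat = pack(m, 2, 3);
        expInPlace(flat);   // with JOCL, this is where the cl_mem copy and kernel call would go
        unpack(flat, m, 2, 3);
        System.out.println(Arrays.deepToString(m));
    }
}
```

The important detail is that pack and unpack use the same index formula, so element (r, c) ends up in the same place it came from.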

Since you did not use anything like getNonZeros in the existing code, I assume that the matrices are dense (or more specifically, that they are of the specific type DenseDoubleMatrix1D/2D). IF the OpenCL implementation supported double values, you could even consider using a specific subclass of DenseDoubleMatrix, which exposes the internally used array of values (via a get method; this is possible since this array is only protected and not private). This would save the effort of the loop from above, since you could copy this array directly into a cl_mem object.

bye

system · 12 July 2010, 23:29

Hi Marco,

First, thanks a lot for your very detailed response.

So you want to replace all elements of a Vector and a Matrix with the value expE(element) ?!

Yes, this task is called 200,000 times per hour in a machine learning program.

Thanks for everything.

Marco13 · 13 July 2010, 02:42

Some more details might be helpful, e.g.

- whether this is a sparse or a dense matrix
- whether it HAS to be stored and/or computed in double precision
- whether this step or additional operations may be processed solely on the graphics card

For example, if you have a large sparse matrix which HAS to be in double precision, and the operation you described is the only one that may be performed on the GPU, the speedup might not be so great. But if you have a dense matrix with float entries, and you do NOT have to copy the data between the host and the device in each step, this could be more beneficial.
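To get a feeling for the double precision question, one could estimate on the CPU how much accuracy is lost when the input and the result of exp are rounded to float. The following is only a sketch under assumptions: the class name FloatExpError and the sampled range are made up for illustration, and Math.exp rounded to float stands in for the device's exp, which may add a little rounding error of its own.

```java
final class FloatExpError {

    // Largest relative error of a float-based exp over [lo, hi],
    // sampled at steps + 1 evenly spaced points (steps must be > 0).
    static double maxRelativeError(double lo, double hi, int steps) {
        double worst = 0.0;
        for (int i = 0; i <= steps; i++) {
            double x = lo + (hi - lo) * i / steps;
            double exact = Math.exp(x);
            // Mimic the float path: round the input, compute, round the result.
            float approx = (float) Math.exp((float) x);
            worst = Math.max(worst, Math.abs(approx - exact) / exact);
        }
        return worst;
    }

    public static void main(String[] args) {
        // The range [-10, 10] is an arbitrary example; use the range of your own data.
        System.out.println("max relative error on [-10, 10]: "
                + maxRelativeError(-10.0, 10.0, 100_000));
    }
}
```

Since rounding the input x to float perturbs it by a relative amount of roughly 2^-24, the relative error of exp(x) grows roughly in proportion to |x|. Whether that is acceptable depends on how the values are used in the rest of the program.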

system · 15 July 2010, 02:32

Hi Marco,

I use dense matrices.
The matrices use double; I need to estimate whether I can use float instead of double.

I made a very simple benchmark:

1. Convert the 1D matrix to a float array.

2. Run the exp calculation on the GPU with this OpenCL code:

private static String programSource =
    "__kernel void "
    + "sampleKernel(__global const float *a,"
    + "             __global float *c)"
    + "{"
    + "    int gid = get_global_id(0);"
    + "    c[gid] = exp(a[gid]);"
    + "}";

The first result is bad: the OpenCL code is 10 times slower, but my configuration is poor (two pseudo GPUs, ATI Stream).

The code behind this execution time:

    // Set the arguments for the kernel
    clSetKernelArg(kernel, 0,
            Sizeof.cl_mem, Pointer.to(memObjects[0]));
    clSetKernelArg(kernel, 1,
            Sizeof.cl_mem, Pointer.to(memObjects[1]));
    System.out.println("clSetKernelArg");

    // Set the work-item dimensions
    long global_work_size[] = new long[]{nb};
    long local_work_size[] = new long[]{1};

    // Execute the kernel
    clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
            global_work_size, local_work_size, 0, null, null);
    System.out.println("Execute the kernel");
    // Read the output data
    clEnqueueReadBuffer(commandQueue, memObjects[1], CL_TRUE, 0,
            n * Sizeof.cl_float, dst, 0, null, null);

I have a question: how can I reuse the same OpenCL program at each iteration without reinitializing the OpenCL context?

I will run this benchmark with an NVIDIA card.

Thanks a lot

kim

Marco13 · 15 July 2010, 11:00

Hello,

As I mentioned, there are several aspects that may influence the speedup that can be achieved. When you have Java code like

static void cal(DoubleMatrix1D Ri_Y) {
    for (int r = Ri_Y.size() - 1; r >= 0; r--) {
        Ri_Y.setQuick(r, expE(Ri_Y.getQuick(r)));
    }
}

and convert it to use OpenCL into something like

static void cal(DoubleMatrix1D Ri_Y)
{
    // Copy matrix into array
    float array[] = new float[Ri_Y.size()];
    for (int r = Ri_Y.size() - 1; r >= 0; r--) {
        array[r] = (float)Ri_Y.getQuick(r);
    }

    // Create memory object from the array
    cl_mem mem = clCreateBuffer(... Pointer.to(array), ...);

    // Set up arguments and call the kernel
    ...
    // Copy back the result to the array
    clEnqueueReadBuffer(..., mem, Pointer.to(array)...);

    // Array into matrix
    for (int r = Ri_Y.size() - 1; r >= 0; r--) {
        Ri_Y.setQuick(r, array[r]);
    }
}

then it will most likely be slower than the plain Java implementation. The GPU is especially fast for computations that require a lot of arithmetic (or built-in functions, like the ones for trigonometry). In the example above, the computation is memory bound, and most of the time will be spent copying memory between the host and the device. That's why I mentioned that it would be good when you do NOT have to copy the data between the host and the device in each step.

I have a question: how can I reuse the same OpenCL program at each iteration without reinitializing the OpenCL context?

Yes, of course you can call the same program multiple times, and you should definitely do that. The initialization of a new context might be very time-consuming. The basic structure of your program could roughly (!) look like this:

class CLCode
{
    // Private CL specific variables
    private cl_command_queue commandQueue;
    private cl_context context;
    private cl_kernel kernel;

    // Possibly you could also declare the cl_mem object here
    cl_mem mem;

    public void initialize()
    {
        // Initialize the context, command queue and kernel here
        ...
        // If the size of the cl_mem does not change between 
        // the calls, you could also initialize the memory object here
        ...
    }    

    public void compute(float array[])
    {
        // Write the array data into the cl_mem object
        clEnqueueWriteBuffer(..., mem, Pointer.to(array)...);

        // Set up the arguments and execute the kernel
        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(mem));
        ...        
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
            global_work_size, local_work_size, 0, null, null);
        
        // Read the cl_mem object back into the array
        clEnqueueReadBuffer(..., mem, Pointer.to(array)...);
    }

}

This way, in the actual "compute" method, you only have to copy the data to the device, execute the kernel, and copy the data back to Java.
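To make the intended lifecycle concrete without depending on an OpenCL installation, here is a small self-contained sketch of the same pattern: the expensive setup runs once, and the per-call path only copies data in, computes, and copies data out. Everything in it is a stand-in chosen for illustration: ReusableExp is a hypothetical class, the float array plays the role of the cl_mem object, and a CPU loop plays the role of the kernel.

```java
final class ReusableExp {
    private float[] staging;    // reused buffer, stands in for the cl_mem object
    private int initCount = 0;  // counts how often the expensive setup ran

    // One-time setup; in the real class this would create the context,
    // the command queue, the kernel and the memory object.
    void initialize(int capacity) {
        staging = new float[capacity];
        initCount++;
    }

    // Cheap per-call path: copy in, "run the kernel", copy out.
    void compute(double[] values) {
        if (staging == null || staging.length < values.length) {
            throw new IllegalStateException("call initialize() with enough capacity first");
        }
        for (int i = 0; i < values.length; i++) {
            staging[i] = (float) values[i];                 // host -> "device"
        }
        for (int gid = 0; gid < values.length; gid++) {
            staging[gid] = (float) Math.exp(staging[gid]);  // CPU stand-in for the kernel
        }
        for (int i = 0; i < values.length; i++) {
            values[i] = staging[i];                         // "device" -> host
        }
    }

    int initCount() {
        return initCount;
    }

    public static void main(String[] args) {
        ReusableExp backend = new ReusableExp();
        backend.initialize(4);                      // once, outside the hot loop
        for (int iter = 0; iter < 3; iter++) {
            double[] v = {0.0, 1.0, -1.0, 2.0};     // fresh input for each iteration
            backend.compute(v);                     // many times
        }
        System.out.println("initialized " + backend.initCount() + " time(s)");
    }
}
```

With JOCL, the three loops in compute would correspond to clEnqueueWriteBuffer, clEnqueueNDRangeKernel and clEnqueueReadBuffer, while the object itself lives for the whole run of the program.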

bye