Best practice for using a CUDA class?

Hi,

I run regular Java code and use jCUDA for the part of the computation that calculates a dot product on two vectors of size 10,000. Precisely:

looping through a matrix of size (10000, 10000):

for (int i = 0; i < 10000; i++) {
    for (int j = 0; j < 10000; j++) {
        // col(i) / col(j) stand for column i and column j of the matrix
        myjCUDAdotProductObject DP = new myjCUDAdotProductObject(col(i), col(j));
    }
}

=> for the moment, the Java app crashes after about 1% of the loop iterations have completed (see my previous post on this forum)
=> but my question here is: is this the best way to send data to the GPU? I suspect it would be better to send the data in batches, do the calculations, and send the results back to the main program. Could the way I do it now even be a reason for the crash?

Any help or advice would be appreciated!

Thanks,

Clement

PS: here is the class myjCUDAdotProductObject, in case that makes things clearer:

import cern.colt.matrix.DoubleMatrix1D;
import cern.colt.matrix.impl.SparseDoubleMatrix1D;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas;


public class JCublaDotProduct
{

static double dotProductCUDA;
static double dotProductJava;
static int n;         


    public static double dotProduct(double A[],double B[])
    {

        double h_A[] = A;
        double h_B[] = B;
        n = A.length;
       
        
//        Clock javaClock = new Clock("Performing SDot with Java");
        SDotJava(h_A, h_B);
//        javaClock.closeAndPrintClock();
//        Clock JCublasClock = new Clock("Performing SDot with JCublas");
        SdotCuda(h_A, h_B);
//        JCublasClock.closeAndPrintClock();
//        System.out.println("CUDA: "+dotProductCUDA);
//        System.out.println("JAVA: "+dotProductJava);
        return dotProductCUDA;
//        System.out.println("dot Product with Java is "+ dotProductJava);
//        System.out.println("dot Product with CUDA is "+ dotProductCUDA);
    }



    private static void SdotCuda(double A[], double B[])
    {


        // Initialize JCublas
        JCublas.cublasInit();
        JCublas.setExceptionsEnabled(true);
        

        // Allocate memory on the device
        Pointer d_A = new Pointer();
        Pointer d_B = new Pointer();
        JCublas.cublasAlloc(n, Sizeof.DOUBLE, d_A);
        JCublas.cublasAlloc(n, Sizeof.DOUBLE, d_B);

        // Copy the memory from the host to the device
        JCublas.cublasSetVector(n, Sizeof.DOUBLE, Pointer.to(A), 1, d_A, 1);
        JCublas.cublasSetVector(n, Sizeof.DOUBLE, Pointer.to(B), 1, d_B, 1);


        // Execute Ddot (double precision dot product)
        dotProductCUDA = JCublas.cublasDdot(n, d_A, 1, d_B, 1);

        

        // Clean up
        JCublas.cublasFree(d_A);
        JCublas.cublasFree(d_B);

        JCublas.cublasShutdown();
    }




     // this function is useful when the vectors have a length of 1000 or less
     private static void SDotJava(double A[], double B[]){
         

//    double[] A2 = new double[A.length];
//    double[] B2 = new double[B.length];
//    for (int i = 0; i < A.length; i++)
//    {
//        A2[i] = A[i];
//        B2[i] = B[i];
//    }

         DoubleMatrix1D sourceDoc = new SparseDoubleMatrix1D(A);
         //sourceDoc.assign(A2);
         DoubleMatrix1D targetDoc = new SparseDoubleMatrix1D(B);
         //targetDoc.assign(B2);
         
         dotProductJava = sourceDoc.zDotProduct(targetDoc);
         
         
     }

}

Hello

Concerning the crash, I wrote a few words in the other thread ( http://forum.byte-welt.de/showthread.php?p=17215#post17215 )

Concerning the best practices: Indeed, this question was asked several times recently, and I wrote a little bit in the respective threads (e.g. http://forum.byte-welt.de/showthread.php?t=3618 or http://forum.byte-welt.de/showthread.php?t=3800 ), but the information there is rather cluttered and specific to those problems, so it won’t be very helpful for you right now.
I intended to write a small “tutorial” about general approaches, but at the moment I’m far from having enough time for that -_-

However, concerning the specific task you are performing: Of course, you should avoid repeated initializations and shutdowns. Admittedly, this has the slightly undesirable effect that a CUDA method can not so easily be used as a “drop-in” replacement for a Java method, but one has to assume that things like the initialization of CUBLAS and the memory allocations eat up a lot of the performance.

So instead of using a pattern like

for (int i=0; i<n; i++)
{
    DotProductComputer.computeDotProduct(A[i], B[i]); // Contains cublasInit, cudaMalloc etc...
}

you should consider using a pattern like

DotProductComputer.initialize(); // Contains cublasInit
// If possible
DotProductComputer.prepare(vectorSize); // Contains cudaMalloc for the given vector size
for (int i=0; i<n; i++)
{
    // This call ONLY does the minimal work: Copying
    // the data to the GPU and calling cublasDdot
    DotProductComputer.computeDotProduct(A[i], B[i]);
}
DotProductComputer.shutdown(); // Contains cublasShutdown
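
To make this more concrete, here is a rough sketch of what such a “DotProductComputer” could look like. The class and method names are only illustrative, and it only uses the JCublas calls that already appear in your class, so take it as a starting point rather than a finished implementation:

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas;

public class DotProductComputer
{
    private static Pointer d_A;
    private static Pointer d_B;
    private static int vectorSize;

    // Called ONCE, before the loop
    public static void initialize()
    {
        JCublas.cublasInit();
        JCublas.setExceptionsEnabled(true);
    }

    // Called ONCE: allocate the device buffers for a fixed vector size
    public static void prepare(int n)
    {
        vectorSize = n;
        d_A = new Pointer();
        d_B = new Pointer();
        JCublas.cublasAlloc(n, Sizeof.DOUBLE, d_A);
        JCublas.cublasAlloc(n, Sizeof.DOUBLE, d_B);
    }

    // Called inside the loop: only copies the data and calls cublasDdot
    public static double computeDotProduct(double A[], double B[])
    {
        JCublas.cublasSetVector(vectorSize, Sizeof.DOUBLE, Pointer.to(A), 1, d_A, 1);
        JCublas.cublasSetVector(vectorSize, Sizeof.DOUBLE, Pointer.to(B), 1, d_B, 1);
        return JCublas.cublasDdot(vectorSize, d_A, 1, d_B, 1);
    }

    // Called ONCE, after the loop
    public static void shutdown()
    {
        JCublas.cublasFree(d_A);
        JCublas.cublasFree(d_B);
        JCublas.cublasShutdown();
    }
}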

Alternatively, you may try to encapsulate your task in a method on a higher level. Roughly speaking: not a “DotProductComputer” class, but a “DotProductsOfMatrixComputer” that only receives the matrices, performs any necessary initializations and allocations internally, and does as few of them as possible…
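
A very rough sketch of such a “DotProductsOfMatrixComputer” could look like the one below. It assumes the matrix is given as a single column-major double[] array and that it fits into device memory (for 10000x10000 doubles that is already about 800 MB, so you may have to process the columns in chunks), but the idea is that there is only one initialization, one allocation and one host-to-device copy for the whole task:

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas;

public class DotProductsOfMatrixComputer
{
    // matrix: column-major array of size rows*cols
    public static double[][] computeAllDotProducts(double matrix[], int rows, int cols)
    {
        JCublas.cublasInit();
        JCublas.setExceptionsEnabled(true);

        // One allocation and one copy for the whole matrix
        Pointer d_M = new Pointer();
        JCublas.cublasAlloc(rows * cols, Sizeof.DOUBLE, d_M);
        JCublas.cublasSetVector(rows * cols, Sizeof.DOUBLE, Pointer.to(matrix), 1, d_M, 1);

        // Note: for a large 'cols' this result matrix itself becomes huge -
        // adapt this to whatever you actually need (e.g. compute the cosine
        // distances directly instead of storing all dot products)
        double result[][] = new double[cols][cols];
        for (int i = 0; i < cols; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                // Column k starts at element k*rows of the device buffer
                Pointer col_i = d_M.withByteOffset((long)i * rows * Sizeof.DOUBLE);
                Pointer col_j = d_M.withByteOffset((long)j * rows * Sizeof.DOUBLE);
                result[i][j] = JCublas.cublasDdot(rows, col_i, 1, col_j, 1);
            }
        }

        JCublas.cublasFree(d_M);
        JCublas.cublasShutdown();
        return result;
    }
}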

bye
Marco

Hi Marco,

Thank you so much for your help on the two threads. I moved the .init(), .shutdown(), alloc and free calls outside the loops, and this fixed the crashes (and sped things up a bit too, though not by much).

Now I’d like to see how I could send “batches” of vectors to the GPU, instead of one by one as I do at the moment. My code is not parallel at all! I know you have been very helpful already, but do you have any advice?

[as a reminder, I am looping through all the elements of a matrix and computing a cosine distance for each element]

Best,

Clement

Hello

I’m not sure what you mean by “batches”. JCublas2 contains some “batched” methods, but there is no batched version of Ddot.

BTW: You should consider switching to JCublas2 (which corresponds to the native CUDA “cublas_v2.h” header). The differences are not so big so far, but JCublas2 contains some methods that may be relevant when you intend to run tasks on multiple GPUs. (Maybe this is what you meant by “batching”?)
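
For reference, the Ddot call from above would look roughly like this with JCublas2. It is handle-based, and the result is written into a pointer instead of being returned; I’m writing this from memory, so please double-check the signatures against the JCublas2 javadoc:

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas2;
import jcuda.jcublas.cublasHandle;
import jcuda.runtime.JCuda;

public class JCublas2DotExample
{
    public static void main(String args[])
    {
        int n = 10000;
        double A[] = new double[n];
        double B[] = new double[n];
        for (int i = 0; i < n; i++) { A[i] = 1.0; B[i] = 2.0; }

        // Create a CUBLAS handle (replaces cublasInit/cublasShutdown)
        cublasHandle handle = new cublasHandle();
        JCublas2.cublasCreate(handle);
        JCublas2.setExceptionsEnabled(true);

        // Allocate and fill the device vectors
        Pointer d_A = new Pointer();
        Pointer d_B = new Pointer();
        JCuda.cudaMalloc(d_A, n * Sizeof.DOUBLE);
        JCuda.cudaMalloc(d_B, n * Sizeof.DOUBLE);
        JCublas2.cublasSetVector(n, Sizeof.DOUBLE, Pointer.to(A), 1, d_A, 1);
        JCublas2.cublasSetVector(n, Sizeof.DOUBLE, Pointer.to(B), 1, d_B, 1);

        // The result is written into a host pointer (default pointer mode)
        double result[] = new double[1];
        JCublas2.cublasDdot(handle, n, d_A, 1, d_B, 1, Pointer.to(result));
        System.out.println("dot product: " + result[0]);

        // Clean up
        JCuda.cudaFree(d_A);
        JCuda.cudaFree(d_B);
        JCublas2.cublasDestroy(handle);
    }
}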

bye