JCuda with an existing Java program and ArrayLists

Hello,

I am a beginner with CUDA, and therefore with JCuda, and I am trying to use JCuda to accelerate a program that I did not write myself. I have read through the tutorial, and through the examples and explanations in other forum postings, and I am now attempting to tackle this task. The problem is that the function that I would like to parallelize, which would essentially become the kernel as far as I currently understand, takes ArrayLists of objects as parameters. Is this still possible?

Thanks!

Hello

Indeed, the tutorial only covers the basic setup, and not (yet) any “best practices”. Most of the general strategies are the same for CUDA and JCuda, so the official CUDA Programming Guide and Best Practices Guide are the most reliable sources of information. Beyond that, I have occasionally written a few words in forum threads that could be assembled into a quick introduction.

In general, it may be quite challenging to map an arbitrary program to CUDA, especially when the program was originally written in Java, because the high-level, object-oriented style of Java may differ significantly from the style implied by the low-level nature of data-parallel CUDA programs. For now, I can only say that when you have an ArrayList of objects, you have to convert the objects into a representation that is adequate for CUDA. In most cases, this will basically be a set of arrays that contain the relevant information, and that may be sent to CUDA in the form of pointers to device memory. Of course, care has to be taken that the conversion and memory copies do not eat up all potential performance gains.
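As a rough illustration of such a conversion, here is a minimal sketch that flattens a list of objects into one primitive array. The Sample class, its getValue and getLabel methods, and the helper names are all hypothetical placeholders, not part of any real program or of JCuda, so they would have to be adapted to the actual classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FlattenSketch {

    // Hypothetical stand-in for the objects stored in the ArrayList
    static class Sample {
        private final String label;
        private final float[] values;

        Sample(String label, float[] values) {
            this.label = label;
            this.values = values;
        }

        float getValue(int k) { return values[k]; }
        String getLabel() { return label; }
    }

    // Copies the values of all samples into one row-major array:
    // value k of sample i ends up at index i * valuesPerSample + k.
    static float[] flattenValues(List<Sample> samples, int valuesPerSample) {
        float[] flat = new float[samples.size() * valuesPerSample];
        for (int i = 0; i < samples.size(); i++) {
            Sample s = samples.get(i);
            for (int k = 0; k < valuesPerSample; k++) {
                flat[i * valuesPerSample + k] = s.getValue(k);
            }
        }
        return flat;
    }

    // Strings stay on the host; entry i belongs to row i of the flat array.
    static String[] labels(List<Sample> samples) {
        String[] result = new String[samples.size()];
        for (int i = 0; i < samples.size(); i++) {
            result[i] = samples.get(i).getLabel();
        }
        return result;
    }

    public static void main(String[] args) {
        List<Sample> samples = new ArrayList<>();
        samples.add(new Sample("a", new float[] {1f, 2f}));
        samples.add(new Sample("b", new float[] {3f, 4f}));

        float[] flat = flattenValues(samples, 2);
        System.out.println(Arrays.toString(flat));
        System.out.println(Arrays.toString(labels(samples)));
    }
}
```

Only the flat float array would be copied to the device. The labels remain in Java and are looked up by row index once the results come back.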

The information provided so far is not really sufficient for more detailed, focused advice. So the first question is whether the problem can really be transformed into a data-parallel problem…?

bye
Marco

Hello and thank you for the response!!

I am not sure how familiar you are with biometrics and biometric systems, but the project is to use JCuda within the LBP program in order to speed up the computational bottleneck of the LBP. Here is the function that I wanted to run with CUDA:

    private static int processProbesToGallery(ArrayList<FaceSubject> fp, ArrayList<FaceSubject> fg, InputObject input){
        int probeSize = fp.size();
        int gallerySize = fg.size();
        int numErrors = 0;

        String bestSubject = "unknown";
        double bestNormalDistance = 2000000.0;
        double normalDistance = 0.0;

        /*
        //Calculate worst distances
        for(int i = 0; i < probeSize; i++){
            for(int j = 0; j < gallerySize; j++){
                //Calculate face distance
                for(int k = 0; k < input.faceSize; k++){
                    faceDistance += Math.abs(fp.get(i).getHisto(k) - fg.get(j).getHisto(k)); //City Block (Manhattan) Distance
                }
                if(faceDistance > worstFaceDistance){
                    worstFaceDistance = faceDistance;
                }
                faceDistance = 0;
            }
        }
        */
        for(int i = 0; i < probeSize; i++){
            for(int j = 0; j < gallerySize; j++){
                //Calculate face distance
                for(int k = 0; k < input.faceSize; k++){
                    normalDistance += Math.abs(fp.get(i).getHisto(k) - fg.get(j).getHisto(k)); //City Block (Manhattan) Distance
                }
                if(normalDistance < bestNormalDistance){
                    bestNormalDistance = normalDistance;
                    bestSubject = fg.get(j).getSubject();
                }

                //Reset values...
                normalDistance = 0;
            }
            if(fp.get(i).getSubject().compareTo(bestSubject) != 0){
                numErrors++;
                System.out.println("Error    "+numErrors+": Current Subject: "+fp.get(i).getSubject()+"  Best Subject: "+bestSubject);
            }else{
                System.out.println("Non-Error - Current Subject: "+fp.get(i).getSubject()+"  Best Subject: "+bestSubject);
            }
            bestNormalDistance = 2000000.0;
            bestSubject = "unknown";
        }
        return numErrors;
    }

I created a function that converts the ArrayLists to arrays that will be copied to the device. How do I write the kernel so that it accepts these as parameters, given that the kernel has to be written in C++ (I am using VS 2010)? In other words, how will it recognize the pointers to these objects, which are created in and native to the Java program?

Thank you!

Hello

It will probably be hard to get a good speedup for this task. Unfortunately, you can’t just take an arbitrary problem/program, port it to CUDA and expect it to run significantly faster…

However, I’ve tried to compile the code locally, and by the way modified it slightly…

import java.util.List;

public class BioTest {

    private static int processProbesToGallery(
            List<FaceSubject> probe, 
            List<FaceSubject> gallery, 
            InputObject input) {

        int numErrors = 0;
        for (int i = 0; i < probe.size(); i++) {

            String bestSubject = "unknown";
            double bestNormalDistance = Double.MAX_VALUE;

            FaceSubject probeSubject = probe.get(i);
            
            for (int j = 0; j < gallery.size(); j++) {
                FaceSubject gallerySubject = gallery.get(j);

                // Calculate face distance
                double normalDistance = 0.0;
                for (int k = 0; k < input.faceSize; k++) {
                    float hp = probeSubject.getHisto(k);
                    float hg = gallerySubject.getHisto(k);
                    normalDistance += Math.abs(hp - hg);
                    
                    // TODO: Test whether this helps...
                    if (normalDistance >= bestNormalDistance)
                    {
                        break;
                    }
                    
                }
                if (normalDistance < bestNormalDistance) {
                    bestNormalDistance = normalDistance;
                    bestSubject = gallerySubject.getSubject();
                }
            }
            if (probeSubject.getSubject().compareTo(bestSubject) != 0) {
                numErrors++;
                System.out.println("Error    " + numErrors
                        + ": Current Subject: " + probeSubject.getSubject()
                        + "  Best Subject: " + bestSubject);
            } else {
                System.out.println("Non-Error - Current Subject: "
                        + probeSubject.getSubject() + "  Best Subject: "
                        + bestSubject);
            }
        }
        return numErrors;
    }
}

// Dummy classes
class FaceSubject {
    public float getHisto(int k) {
        return 0;
    }
    public String getSubject() {
        return null;
    }
}

class InputObject {
    public int faceSize;
}

(BTW: You could consider an ‘early stop’, marked with ‘TODO’, and see whether this already helps to speed things up a little).

The following is only valid if I understood this correctly:

You basically have two 2D arrays, namely the values that are returned by
probe.get(x).getHisto(y)
and
gallery.get(x).getHisto(y)
and are computing the Manhattan distance of each pair of columns of these 2D arrays? This is purely memory-bound. It does not involve any complex computation (only a few arithmetic operations, no trigonometry, …) and will therefore hardly become faster with CUDA. Additionally, you are only interested in the minimum distance for each probe, and computing the minimum of an array with CUDA is difficult, since it is inherently a non-parallel task. (It can only be done in parallel when an efficient reduction is performed with the ‘min’-operator…)

But if you want to give it a try, you could consider building these 2D arrays (basically implemented as 1D arrays), copying them to the device, using your own kernel to compute the Manhattan distances and store them in an output array, and maybe using cublasIsamin to find the index of the column with the minimum distance.
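To make the data layout a bit more concrete, here is a plain Java reference of what such a kernel would have to compute, written against flat 1D arrays. It is only a sketch under assumptions: the class and method names are made up, the distances are accumulated as float rather than double, and the row-major layout (value k of subject i at index i * faceSize + k) is just one possible choice. Since the posted loop keeps the smallest distance for each probe, the sketch searches for the minimum in each row. A reference like this can also be used to check the results that come back from the device:

```java
import java.util.Arrays;

final class DistanceReference {

    // The work of one hypothetical GPU thread: the Manhattan distance
    // between probe i and gallery entry j, both stored row-major.
    static float manhattan(float[] probes, float[] gallery,
                           int faceSize, int i, int j) {
        float sum = 0.0f;
        for (int k = 0; k < faceSize; k++) {
            sum += Math.abs(probes[i * faceSize + k] - gallery[j * faceSize + k]);
        }
        return sum;
    }

    // Fills the output array: distances[i * numGallery + j] is the
    // distance between probe i and gallery entry j.
    static float[] distanceMatrix(float[] probes, int numProbes,
                                  float[] gallery, int numGallery,
                                  int faceSize) {
        float[] distances = new float[numProbes * numGallery];
        for (int i = 0; i < numProbes; i++) {
            for (int j = 0; j < numGallery; j++) {
                distances[i * numGallery + j] =
                    manhattan(probes, gallery, faceSize, i, j);
            }
        }
        return distances;
    }

    // Index of the smallest distance in row i (requires numGallery >= 1).
    // On ties the first index wins, like the strict '<' in the posted loop.
    static int bestGalleryIndex(float[] distances, int numGallery, int i) {
        int best = 0;
        for (int j = 1; j < numGallery; j++) {
            if (distances[i * numGallery + j] < distances[i * numGallery + best]) {
                best = j;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int faceSize = 3;
        float[] probes = {1, 2, 3,  4, 4, 4};            // 2 probes
        float[] gallery = {4, 4, 5,  1, 2, 2,  0, 0, 0}; // 3 gallery entries

        float[] distances = distanceMatrix(probes, 2, gallery, 3, faceSize);
        System.out.println(Arrays.toString(distances));
        for (int i = 0; i < 2; i++) {
            System.out.println("probe " + i + " -> gallery "
                + bestGalleryIndex(distances, 3, i));
        }
    }
}
```

In a kernel, the two outer loops of distanceMatrix would disappear and each thread would derive i and j from its thread and block indices. The search in bestGalleryIndex is the part that would need a reduction (or a library call) on the device.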

But please do not expect too much from this. It’s not unlikely that copying the input data to the device will eat up most of the time that can be saved by the parallel computation of the Manhattan distances.

bye
Marco