JCuda: Calling the Invert-Image Function Multiple Times

Hi, I’m new to JCuda, but I’m good with Java and have a moderate level of knowledge of CUDA. I’m following this example to invert an image: https://github.com/jcuda/jcuda-imagej-example. I modified it a little: instead of showing the result on screen, I’m saving the inverted images to disk.

My system has 8 processors, so if I write this program in plain Java, I will use 8 threads to convert 8 images at a time. Please assume the total number of images is 1000.

The step that takes most of the time is the pixel-inverting loop, which I want to make parallel, e.g.:

for (int x = 0; x < width; x++) {
    for (int y = 0; y < height; y++) {
        // TODO: inverting method
    }
}

Now I want to understand: can I send data to the GPU using 8 threads?

So instead of the for loop above, I want to execute my parallel code, which involves loading the PTX or CU file, allocating memory, calling the kernel function, and getting the output.

  1. Do I have to load the PTX/CU file every time?
  2. If not, do I just have to allocate the memory for the pixels and call the function?

Thank you
–Aqeel Haider

Hello,

First a general remark: Depending on the overall workflow, it might be hard to achieve a good speedup there. It’s not entirely clear what the “inverting method” actually does. If it is really just a pixel-wise inversion like the one in JCudaImageJExampleKernel.cu in the jcuda-imagej-example repository, then there is not much computation involved. Copying the memory from the host to the device might then be more expensive than the actual computation. (The sample is basically only supposed to show how to write an ImageJ plugin in general, and the kernel was only a basic example.)

But you can give it a try - particularly if the inversion involves a more complex computation, or if you later want to move other processing steps to the GPU as well.


Some details may depend on the overall architecture: Is this supposed to be an ImageJ plugin in the end, or did you just use the code as an example for “a kernel that does some image manipulation”? How large will the images be? 100x100 pixels, or 10000x10000 pixels? …


Now I want to understand: can I send data to the GPU using 8 threads?

You could do that, but you should carefully think about whether this really makes sense. The GPU will, at this point, be like a “queue”: it can only process one image at a time.

In principle, you could set up a really sophisticated infrastructure there. You could add support for multiple GPUs, use multiple CUDA contexts, and define your own streams and stream callbacks to notify the Java threads when a computation is finished. But this is not entirely trivial. When you want to access the GPU with multiple Java threads, you always have to make sure that the respective context is “current” for the Java thread that uses it.

To put it that way: You should only invest the time here when you are reasonably sure that it will be worth the effort.


In any case, the rules of thumb that apply for plain Java programming also apply for CUDA/JCuda: You should not do things twice when it is sufficient to do them once.

This particularly refers to the setup and initialization: You should load the .PTX file only once. This initialization is really expensive (compared to everything else). You can basically do it wherever you initialize CUDA: Where you create the context, you can load the CUmodule and obtain the CUfunction, and then use this function again and again, until you’re done with the computation.
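
To make that concrete, a minimal sketch of this one-time setup could look as follows (assuming the kernel was compiled to a file named “invert.ptx” and the function is called “invert” - both names are just placeholders):

import static jcuda.driver.JCudaDriver.*;
import jcuda.driver.*;

// One-time setup: initialize CUDA, create a context, and load the PTX.
// All of this happens once; the resulting "function" is reused afterwards.
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);

// Load the PTX once and look up the kernel function.
CUmodule module = new CUmodule();
cuModuleLoad(module, "invert.ptx");              // placeholder file name
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "invert"); // placeholder kernel name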

(I’m not deeply familiar with the “lifecycle” of ImageJ plugins. I’d have to look it up and experiment to see how this could best be done for an ImageJ plugin.)


Whether or not you have to re-allocate memory depends on whether the images all have the same size, or whether you know the maximum size. Memory allocation can also be expensive. If you really have some sort of “batch processing” for images, you could use a pattern like this (pseudocode):

for (Image image : images) {
    Pointer data = allocateMemoryFor(image.size());
    copyToGpu(image.pixels, data);
    executeKernel(data, image.size());
    copyToHost(data, image.pixels);
    free(data);
}

If you already know that the images will all have the same size, or you already know the “maximum” image size, then you could pull the allocation out of the loop:

Size maximumImageSize = maxSizeOf(images);
Pointer data = allocateMemoryFor(maximumImageSize);
for (Image image : images) {
    copyToGpu(image.pixels, data);
    executeKernel(data, image.size());
    copyToHost(data, image.pixels);
}
free(data);
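
In JCuda, that second pattern could look roughly like this - a sketch, not taken from the sample: it reuses the function from the setup sketch above, and assumes that each image provides its pixels as an int[], and that the kernel has the signature invert(int* data, int size):

// "maxPixelCount" is the pixel count of the largest image (computed beforehand)
CUdeviceptr deviceData = new CUdeviceptr();
cuMemAlloc(deviceData, maxPixelCount * Sizeof.INT);   // allocate once

for (Image image : images) {
    int size = image.pixels.length;

    // Copy the pixels of the current image to the device.
    cuMemcpyHtoD(deviceData, Pointer.to(image.pixels), size * Sizeof.INT);

    // Launch the kernel; the grid size is derived from the pixel count.
    Pointer kernelParams = Pointer.to(
        Pointer.to(deviceData),
        Pointer.to(new int[] { size }));
    int blockSize = 256;
    int gridSize = (size + blockSize - 1) / blockSize;
    cuLaunchKernel(function,
        gridSize, 1, 1,     // grid dimensions
        blockSize, 1, 1,    // block dimensions
        0, null,            // shared memory size, stream
        kernelParams, null);
    cuCtxSynchronize();

    // Copy the inverted pixels back into the image.
    cuMemcpyDtoH(Pointer.to(image.pixels), deviceData, size * Sizeof.INT);
}
cuMemFree(deviceData);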

bye
Marco

Thank you, Marco, for replying. The main goal is not to create an ImageJ plugin; I used it as a reference because my program was almost exactly the same, just the invert function was different.

You were right: the CPU code was fast because of the 8 CPUs. Loading the images from disk and getting their pixels was also taking time, and so was the original computation. We want to port the computational code to the GPU, but we don’t want to lose the current speed, so the goal was that the app should be multi-threaded for loading the image data, while the pixel algorithm runs on the GPU.

We solved this problem with a BlockingQueue in Java and the producer/consumer concept: 8 threads were responsible for loading the image data and putting it into the queue, and the GPU code took the image data from the queue and processed it, which is super fast. Before, it took around 100 seconds for 1000 images; now it takes 13 seconds for 1000 images.
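
Simplified, the skeleton looks roughly like this (the loadImageData and processOnGpu helpers, and the imageFiles list, are placeholders for our actual code):

import java.io.File;
import java.util.concurrent.*;

BlockingQueue<ImageData> queue = new ArrayBlockingQueue<>(64);

// Producers: 8 threads load images from disk and put them into the queue.
ExecutorService producers = Executors.newFixedThreadPool(8);
for (File file : imageFiles) {
    producers.submit(() -> {
        try {
            queue.put(loadImageData(file)); // blocks if the queue is full
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    });
}
producers.shutdown();

// Consumer: one thread takes images from the queue and runs the GPU kernel.
Thread gpuWorker = new Thread(() -> {
    try {
        while (true) {
            ImageData image = queue.take(); // blocks until an image is ready
            processOnGpu(image);
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});
gpuWorker.start();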

One thing about CUDA is confusing, though: in threads, I was getting a context error, so I had to initialize the GPU code in the run method. I didn’t find any example or sample code of JCuda running in threads. Kindly point me to some if you know of any.

Thank you
–Aqeel Haider

So there is one thread responsible for taking the tasks out of the queue and forwarding them to the GPU? Then it should be sufficient to make the respective CUDA context current for this thread. In the best case, this boils down to calling cuCtxSetCurrent(context) at the right place. You also have to make sure that all resources are associated with this context, but if you only have one context, then this should not be too difficult either.
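
In code, that could be as simple as this (a sketch; “context” is the CUcontext that was created during initialization):

public void run() {
    // First CUDA-related call in this thread: make the shared context current.
    cuCtxSetCurrent(context);
    // ... then take images from the queue and launch the kernel as before ...
}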

An example that could be rather similar to your task is JCudaDriverStreamCallbacks.java in the jcuda-samples repository. As the name suggests, it is about stream callbacks, which you may not actually need here, but it should contain some building blocks that are similar to your setup: it also creates an executor service and multiple threads to feed the GPU with a workload (an “artificial” one in the example, though).

Thank you, I get it. So I need to call cuCtxSetCurrent(context) in the run method of the thread that is currently performing the action.

So if I have two GPU devices in my system, do I have to create one context for each GPU?

Thank you for all your help and for pointing me in the right direction. I’m attaching the source file; kindly have a look and let me know if you think I need to modify or correct something.

Thank you
Aqeel Haider

ColorPercentageWithWorker.txt (12.9 KB)

The context handling in CUDA can be a bit difficult. I have read a lot about it, and am still looking for an answer in the Stack Overflow question “How to implement handles for a CUDA driver API library?”.

However, in the thread “Context and Threading” on the NVIDIA Developer Forums (CUDA Programming and Performance), one of the designers of all this said:

The programming model that I generally recommend is one context per device per process. In 4.0, it’s really trivial to share these; just create them (either with driver or runtime API, doesn’t matter) and use them from whichever thread you want.

So, basically, and to my understanding: When you have multiple GPUs, you would create one context per GPU. Things may become a bit trickier then. As mentioned above, you always have to make sure that all operations remain in one context. So you cannot allocate memory in context A and use it in context B.
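
As a sketch, the multi-GPU setup could then look like this (assuming deviceCount devices):

// One context per device: create each context on its own device.
CUcontext[] contexts = new CUcontext[deviceCount];
for (int i = 0; i < deviceCount; i++) {
    CUdevice device = new CUdevice();
    cuDeviceGet(device, i);
    contexts[i] = new CUcontext();
    cuCtxCreate(contexts[i], 0, device);
}
// Each worker thread then makes "its" context current:
// cuCtxSetCurrent(contexts[deviceIndex]);
// Memory allocated in contexts[0] must not be used in contexts[1].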

A detailed review of the code that you attached might take some time, and I’m not sure when I can allocate the time for that, but will try to at least have a short look ASAP to see whether I notice something that can “obviously” be improved.

I had a short look at the code. I could go into details (unused methods, a missing ImageData class, etc.), but can mainly give general hints right now.

I noticed that a lot of the code was built around your own “Task” implementations, with quite a few locks and atomics. Some of this could probably be solved a bit more simply. The “rule of thumb” that one should usually implement Runnable instead of extending Thread might seem like a detail here, but I think that much of the task queue infrastructure could be handled with a standard ExecutorService: it already maintains all the blocking queue machinery, and offers some convenience functionality.
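
For example, most of the hand-written queue and worker management could collapse into something like this (a sketch; ImageData, Result, and process are stand-ins for your own classes and methods):

import java.util.*;
import java.util.concurrent.*;

List<Result> processAll(List<ImageData> images) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(8);
    List<Future<Result>> futures = new ArrayList<>();
    for (ImageData image : images) {
        // submit() queues the task; the pool's internal blocking queue
        // replaces the hand-written one.
        futures.add(executor.submit(() -> process(image)));
    }
    List<Result> results = new ArrayList<>();
    for (Future<Result> f : futures) {
        results.add(f.get()); // blocks until that task has finished
    }
    executor.shutdown();
    return results;
}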

However, the PercentageTask is the main class that is relevant for JCuda itself. And I did not spot any “obvious” error there.

At one point, you wrote

cuInit(deviceNo);

but this should be cuInit(0) (the parameter is only a flag, and has to be 0).

The number of devices can be determined with this snippet:

    cuInit(0);
    int[] deviceCountArray = { 0 };
    cuDeviceGetCount(deviceCountArray);
    int deviceCount = deviceCountArray[0];
    System.out.println("Found " + deviceCount + " devices");

Beyond that, of course, some of the (indeed rather verbose and somewhat cumbersome) CUDA-related code is hidden in the CudaUtils class. I can imagine that many people create such a set of utility functions, and in many cases, they are probably assembled from the samples. I really should update https://github.com/jcuda/jcuda-utils: this could be a nice place to collect more of the frequently used functions. Particularly, the KernelLauncher class could be handy, once it is extended to work with the NVRTC.

After reading the code, I have already started playing around, aiming to update these utilities and create some new ones, including a wrapper around an ExecutorService that maintains the CUcontext instances for its threads as a ThreadLocal. But I definitely have to allocate more time for that (much more than I have right now).