Hello,
First, a general remark: Depending on the overall workflow, it might be hard to achieve a good speedup there. It's not entirely clear what the "inverting method" actually does. If it is really just a pixel-wise inversion like the one in jcuda-imagej-example/src/main/resources/JCudaImageJExampleKernel.cu in the jcuda/jcuda-imagej-example repository on GitHub, then there is not much computation involved. Copying the memory from the host to the device might then be more expensive than the actual computation. (The sample is mainly supposed to show how to write an ImageJ plugin in general, and the kernel was only a basic example.)
But you can give it a try - particularly if the inversion involves a more complex computation, or if you later want to move other processing steps to the GPU as well.
Some details may depend on the overall architecture: Is this supposed to be an ImageJ plugin in the end, or did you just use the code as an example for "a kernel that does some image manipulation"? How large will the images be? 100x100 pixels, or 10000x10000 pixels? …
"Now I want to understand, can I send data using 8 threads to the GPU?"
You could do that, but you should carefully think about whether this really makes sense. The GPU will, at this point, behave like a "queue": it can only process one image at a time.
In principle, you could set up a really sophisticated infrastructure there: You could add support for multiple GPUs, use multiple CUDA contexts, and define your own streams and stream callbacks to notify the Java threads when a computation is finished. But this is not entirely trivial. When you want to access the GPU from multiple Java threads, you always have to make sure that the calling Java thread is the "current" thread for the respective context.
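A minimal sketch of that threading aspect, using the JCuda driver API (the worker body is only a placeholder; in a real setup, the context would of course be shared with the code that loads the kernel):

    import jcuda.driver.CUcontext;
    import jcuda.driver.CUdevice;
    import static jcuda.driver.JCudaDriver.*;

    public class MultiThreadedContextSketch
    {
        public static void main(String[] args) throws InterruptedException
        {
            // One-time setup: create a context on the first device
            cuInit(0);
            CUdevice device = new CUdevice();
            cuDeviceGet(device, 0);
            CUcontext context = new CUcontext();
            cuCtxCreate(context, 0, device);

            // Each worker thread has to make the context "current"
            // before it issues any CUDA calls
            Runnable worker = () ->
            {
                cuCtxSetCurrent(context);
                // ... copy data, launch the kernel, copy the result back ...
            };
            Thread thread = new Thread(worker);
            thread.start();
            thread.join();
        }
    }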
To put it plainly: You should only invest the time here when you are reasonably sure that it will be worth the effort.
In any case, the rules of thumb that apply to plain Java programming also apply to CUDA/JCuda: You should not do things twice when it is sufficient to do them once.
This particularly refers to the setup and initialization: You should load the .PTX file only once. This initialization is really expensive (compared to everything else). You can basically do it wherever you initialize CUDA: Where you create the context, you can load the CUmodule and obtain the CUfunction, and then use this function, again and again, until you're done with the computation.
(I'm not deeply familiar with the "lifecycle" of ImageJ plugins. I'd also have to look up and experiment with how this could best be done for an ImageJ plugin.)
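For example, the one-time loading part could look like this (a sketch; the PTX file name and the kernel name "invert" are placeholders for whatever your actual setup uses):

    import jcuda.driver.CUfunction;
    import jcuda.driver.CUmodule;
    import static jcuda.driver.JCudaDriver.*;

    class KernelHolder
    {
        CUmodule module;
        CUfunction function;

        // Call this once, right after the CUDA context has been created
        void loadKernel()
        {
            module = new CUmodule();
            cuModuleLoad(module, "JCudaImageJExampleKernel.ptx");
            function = new CUfunction();
            cuModuleGetFunction(function, module, "invert");
            // The "function" can now be used for every launch
        }
    }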
Whether or not you have to re-allocate memory depends on whether the images all have the same size, or whether you know the maximum size. Memory allocation can also be expensive. If you really have some sort of "batch processing" for images, you could have a pattern like this (pseudocode):
for (Image image : images) {
    Pointer data = allocateMemoryFor(image.size());
    copyToGpu(image.pixels, data);
    executeKernel(data, image.size());
    copyToHost(data, image.pixels);
    free(data);
}
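Translated into actual JCuda driver API calls, this might roughly look like the following (a sketch: it assumes int-valued pixel data in image.pixels, a kernel that receives the data pointer and the pixel count, and the function that was obtained during the initialization; the Image class is a placeholder):

    import jcuda.Pointer;
    import jcuda.Sizeof;
    import jcuda.driver.CUdeviceptr;
    import static jcuda.driver.JCudaDriver.*;

    for (Image image : images)
    {
        int n = image.size(); // number of pixels
        CUdeviceptr data = new CUdeviceptr();
        cuMemAlloc(data, (long)n * Sizeof.INT);
        cuMemcpyHtoD(data, Pointer.to(image.pixels), (long)n * Sizeof.INT);

        // Set up the kernel parameters: the device pointer and the pixel count
        Pointer kernelParameters = Pointer.to(
            Pointer.to(data),
            Pointer.to(new int[] { n }));
        int blockSize = 256;
        int gridSize = (n + blockSize - 1) / blockSize;
        cuLaunchKernel(function,
            gridSize, 1, 1,   // grid dimension
            blockSize, 1, 1,  // block dimension
            0, null,          // shared memory size and stream
            kernelParameters, null);
        cuCtxSynchronize();

        cuMemcpyDtoH(Pointer.to(image.pixels), data, (long)n * Sizeof.INT);
        cuMemFree(data);
    }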
If you already know that the images will all have the same size, or you already know the "maximum" image size, then you could pull the allocation out of the loop:
Size maximumImageSize = maxSizeOf(images);
Pointer data = allocateMemoryFor(maximumImageSize);
for (Image image : images) {
    copyToGpu(image.pixels, data);
    executeKernel(data, image.size());
    copyToHost(data, image.pixels);
}
free(data);
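In JCuda terms, only the allocation and the cuMemFree call move out of the loop (again a sketch, with the same assumptions as above, and assuming maximumImageSize is the pixel count of the largest image):

    CUdeviceptr data = new CUdeviceptr();
    cuMemAlloc(data, (long)maximumImageSize * Sizeof.INT); // allocate once
    for (Image image : images)
    {
        long bytes = (long)image.size() * Sizeof.INT;
        cuMemcpyHtoD(data, Pointer.to(image.pixels), bytes);
        // ... launch the kernel for image.size() pixels, as above ...
        cuMemcpyDtoH(Pointer.to(image.pixels), data, bytes);
    }
    cuMemFree(data); // free once, at the end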
bye
Marco