First of all: Great that you provided the project as a package that can simply be opened and run. It’s always difficult when people just dump a random code snippet and as “Why this not worx?”. Even if one can allocate the time to set it up so that it can be run and tested, there’s always the chance that people will afterwards tell details that they didn’t mention before, and the effort was useless. So I really appreciate a test case where the problem can easily be reproduced!
Now, investigating the problem took a bit longer than I expected, though: I started it a few times, and the output seemed to be random. That’s why I took a wrong path of debugging: I suspected a race condition, and looked at the atomicAdd
and the output.
But in the end, the problem is in the input:
When the image is loaded, the resulting BufferedImage
will have the type TYPE_3BYTE_BGR
. This means that each pixel consists of three bytes instead of four (and not even in the expected order). See, for example,
System.out.println("pixels "+pixels.length);
System.out.println("3? "+(raster.getWidth()*raster.getHeight()*3));
System.out.println("4? "+(raster.getWidth()*raster.getHeight()*4));
which prints
pixels 786432
3? 786432
4? 1048576
The whole downstream pipeline assumed that the pixels consisted of four bytes: The memory allocations, the copies, and eventually the uchar4
in the kernel. (So the “random” results had been caused by some random, uninitialized memory…)
The problem does not appear in the CPU version, because
raster.getPixel(x, y, rgb);
will do the magic conversion to RGB internally.
Now, how to solve this?
That basically depends on which image type you’d like to operate on. When loading an image with ImageIO
, you never know the exact image type. (I think that ImageJ will do a sensible conversion, and expose the pixels
as an RGB int[]
array)
One solution that worked in the given example was:
Convert the input image to a 4 byte format, for example, with this method:
public static BufferedImage convertToTarget(BufferedImage image)
{
BufferedImage newImage = new BufferedImage(
image.getWidth(), image.getHeight(),
BufferedImage.TYPE_4BYTE_ABGR);
Graphics2D g = newImage.createGraphics();
g.drawImage(image, 0, 0, null);
g.dispose();
return newImage;
}
And in the kernel, as you converted to TYPE_4BYTE_ABGR
, you could do
// pixel.x is the alpha channel - ignored here
b = pixel.y;
g = pixel.z;
r = pixel.w;
A side note:
Usually, I use this method
public static BufferedImage convertToARGB(BufferedImage image)
{
BufferedImage newImage = new BufferedImage(
image.getWidth(), image.getHeight(),
BufferedImage.TYPE_INT_ARGB);
Graphics2D g = newImage.createGraphics();
g.drawImage(image, 0, 0, null);
g.dispose();
return newImage;
}
to convert all images that I load with ImageIO into INT_ARGB images. This type is usually the fastest for drawing- and pixel manipulation operations. From this image type, you can fetch a DataBufferInt
and its int[]
array (you’d have to change the execute
method accordingly).
Based on the int
pixels, you could also do
int r = (pixels[i] >> 16) & 0xFF;
int g = (pixels[i] >> 8) & 0xFF;
int b = (pixels[i] ) & 0xFF;
to extract the RGB components on Java side, which should be significantly faster than raster.getPixel
.
EDIT: With the changes described above, the output for me was
Skintone Using CPU: 75417
Total Pixel Of CPU: 262144
SKinTone Percentage Using CPU: 29%
Skintone Using GPU: 75418
Total Pixel Of GPU: 262144
SKinTone Percentage Using GPU: 29%
Where this off-by-one error comes from? Well, maybe I’ll try to figure that out later…
In any case: You should also clear the output memory that you allocated (otherwise, the output[0]
might already contain a non-zero value, and you’re only adding values to this in the kernel). For example:
// Allocate device output memory
CUdeviceptr deviceOutput = new CUdeviceptr();
cuMemAlloc(deviceOutput, numElements * Sizeof.INT);
// Clear the output memory with zeros:
int hostOutput[] = new int[numElements];
cuMemcpyHtoD(deviceOutput, Pointer.to(hostOutput), numElements * Sizeof.INT);
...
// Allocate host output memory and copy the device output
// to the host.
cuMemcpyDtoH(Pointer.to(hostOutput), deviceOutput, numElements * Sizeof.INT);