Multiple image2d_t as arguments

Hi.

Just started to explore the nice world of GPU development.

I created a small program that resizes all my vacation pictures. It works: it reads a picture from disk, resizes it, and saves it back to disk again.

As of now I pass 10 image2d_t objects as 10 separate arguments to a kernel. I did not find a way to send the image2d_t objects as a list or something similar to the kernel.

Is there a good way to send a list of image2d_t to the kernel?

Let's say I want to resize 200 pictures in a folder; then I don't want 200 arguments to the kernel.

Thanks!

There are several points to consider:

  1. You will hit a hard limit there sooner or later. With clGetDeviceInfo you may query several device properties. Among them are CL_DEVICE_MAX_READ_IMAGE_ARGS and CL_DEVICE_MAX_SAMPLERS, which eventually impose a limit on the number of images that you can reasonably use in a kernel (see the sketch below this list).

  2. These limits should hardly be relevant: The overhead of a single kernel launch should be negligible compared to all other operations that are performed there. So if you have many images in GPU memory, it should be perfectly reasonable to define the kernel to scale a single image, and call this kernel 200 times.

  3. You have to consider that there are many operations involved:
    a. Data has to be read from the disc (which is VERY slow, maybe 20ms)
    b. A JPG decompression is done (which is slow, maybe 10ms)
    c. The memory is copied to the device (which takes some time, maybe 1ms)
    d. The actual resizing operation is done (with OpenCL - this is fast, maybe a few µs (!))
    e. The resulting memory is copied back to the host (taking some time, maybe 1ms),
    f. JPG compression (slow, maybe 15ms)
    g. Writing back to the disc (extremely slow, maybe 30ms).
    Again, the timings are completely made up, but the key point is that you have something like a fixed duration of ~70ms for I/O. Doing the rescaling with Java might already be pretty fast, depending on the parameters (I did some “benchmarks” for image scaling operations - for example, scaling a 2560x1706 image down to thumbnail size usually takes less than 1ms). So I’m not sure whether OpenCL brings a noticeable speedup here. But it might be worth a try, and it is a good way to get started :slight_smile:
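For completeness, here is a minimal sketch of how the limit from point 1 could be queried, assuming the JOCL binding (org.jocl) that is usually used from Java; device stands for whatever cl_device_id was obtained during setup:

```java
import org.jocl.*;

public class ImageArgLimit
{
    // Queries how many read-only image arguments (e.g. image2d_t) a single kernel
    // may receive on the given device.
    public static int maxReadImageArgs(cl_device_id device)
    {
        int result[] = new int[1];
        CL.clGetDeviceInfo(device, CL.CL_DEVICE_MAX_READ_IMAGE_ARGS,
            Sizeof.cl_uint, Pointer.to(result), null);
        return result[0];
    }
}
```

The minimum that the spec requires here for devices with image support is 128, so passing 200 images to a single kernel launch could already exceed a minimum-spec device.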

Hi, thanks for fast answer :slight_smile:

OK, I will do some refactoring and execute the kernel once for each picture. That seems like the best way.

I will look into your benchmarking, it seems very interesting. I did some simple benchmarking myself yesterday, where I removed the I/O part and just resized a picture
over and over to 50% of the original size, just to check the power of the GPU. The result was crazy: a picture of 4800x3200 was resized by 50% 25,000 times/second.
Just amazing.

*** Edit ***

Hi.

Just reworked it. Now executing a kernel for each pic :slight_smile:

Took 131 sec in total for 203 pictures, 4800x3200 → 2400x1600.
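For readers who want to reproduce this, the reworked structure is roughly the sketch below. The folder names are made up, and resizeOnGpu is only a placeholder for the OpenCL upload, kernel launch and read-back:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class ResizeAllPictures
{
    public static void main(String[] args) throws Exception
    {
        File inputFolder = new File("vacation");   // made-up folder names
        File outputFolder = new File("resized");
        outputFolder.mkdirs();
        for (File file : inputFolder.listFiles(
            (dir, name) -> name.toLowerCase().endsWith(".jpg")))
        {
            BufferedImage input = ImageIO.read(file);        // slow: disk + JPG decode
            BufferedImage output = resizeOnGpu(input, 0.5);  // fast: one kernel launch per image
            ImageIO.write(output, "jpg",
                new File(outputFolder, file.getName()));     // slow: JPG encode + disk
        }
    }

    // Placeholder for the OpenCL part (upload, kernel launch, read-back)
    private static BufferedImage resizeOnGpu(BufferedImage input, double factor)
    {
        throw new UnsupportedOperationException("OpenCL resize goes here");
    }
}
```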

I know OpenCL is not needed for this type of task; the I/O is simply too slow. I would get almost the same speed with the CPU.

But it's good learning, and still, the CPU is free for other tasks while the GPU is busy.

Some logging:

Init GPU:0.70000 sec

Read from disc:0.52500 sec
Exec kernel and read result to host:0.06500 sec
Saving pic(IO):0.32700 sec

Read from disc:0.50900 sec
Exec kernel and read result to host:0.03900 sec
Saving pic(IO):0.16300 sec

Read from disc:0.40700 sec
Exec kernel and read result to host:0.03800 sec
Saving pic(IO):0.18900 sec

Read from disc:0.46300 sec
Exec kernel and read result to host:0.03800 sec
Saving pic(IO):0.17700 sec

and more…

The interesting part may now be a comparison of the duration of

  • copying the data from host to device
  • executing the kernel
  • copying the data from device to host

to the duration of

  • using Java (e.g. AffineTransformOp, as described in the linked answer) to scale the image, as sketched below
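The Java side of that comparison can be as simple as the following bilinear AffineTransformOp downscale (the class and method names are just for illustration):

```java
import java.awt.geom.AffineTransform;
import java.awt.image.AffineTransformOp;
import java.awt.image.BufferedImage;

public class JavaScaling
{
    // Plain-Java bilinear scaling with AffineTransformOp, as a baseline for the OpenCL kernel.
    public static BufferedImage scaleWithJava(BufferedImage input, double factor)
    {
        BufferedImage output = new BufferedImage(
            (int) (input.getWidth() * factor),
            (int) (input.getHeight() * factor),
            BufferedImage.TYPE_INT_ARGB);
        AffineTransformOp op = new AffineTransformOp(
            AffineTransform.getScaleInstance(factor, factor),
            AffineTransformOp.TYPE_BILINEAR);
        op.filter(input, output);
        return output;
    }
}
```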

There likely are configurations where the OpenCL version is faster. For example, the case of
“Bilinearly scaling an INT_RGB image up to the size of 10000x???”
which took 1 second with plain Java. However, the larger the image is (10000 pixels wide in this case), the smaller the advantage that can be achieved for the scaling operation becomes, compared to the cost of writing the resulting image to disc.

I’m still curious: How do you make sure that the input image has the right BufferedImage.TYPE_? Do you manually convert it to INT_ARGB? How do you access the pixel data in the end? (For this task, it would probably be reasonable to access the DataBufferInt directly).
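For reference, the direct access mentioned in the parentheses would look like this; it only works when the image is actually backed by a DataBufferInt (e.g. TYPE_INT_RGB or TYPE_INT_ARGB):

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferInt;

public class DirectPixelAccess
{
    public static void main(String[] args)
    {
        BufferedImage image = new BufferedImage(2400, 1600, BufferedImage.TYPE_INT_ARGB);
        // Direct access to the backing pixel array, without any copying.
        int pixels[] = ((DataBufferInt) image.getRaster().getDataBuffer()).getData();
        System.out.println("Got " + pixels.length + " pixels");
    }
}
```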

BTW: There recently was a thread about a JPEG Decoder (in German) by @Spacerat . Although it referred to a “general” decoder, and was not related to OpenCL, I wondered whether it may be possible to achieve a speedup there by using the GPU. I also had a short look at things like https://github.com/richgel999/jpeg-compressor , but in any case, I guess this would be something that one couldn’t just implement without diving deeper into the JPG format specs…

@Marco13 :
Noooo, you misunderstood! :wink:
It's not the JPEG encoder itself that is general-purpose, but the DT_Lib it was written for.

In the code you’ll find a class „Service“ with a method called „doIDCT“. The DCT (compression) and IDCT (decompression) are the most performance-expensive parts of JPEG and can/should possibly be delegated to languages other than Java (although I never had any JPEG that needed longer than a second to decompress), e.g. OpenCL. In my DT_Lib I managed those things with services. If someone understands my code and wants to implement it elsewhere… feel free.

BTW: Whatever I do with BufferedImages, I never touch their DataBuffer! I create an int array (width*height) instead and use the getRGB and setRGB methods. First, this is format-independent, and second, the DataBuffer will stay STABLE at all times.
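The format-independent approach described here boils down to the bulk getRGB/setRGB calls, roughly like this:

```java
import java.awt.image.BufferedImage;

public class FormatIndependentAccess
{
    // getRGB converts the image's internal format into packed ARGB ints,
    // setRGB converts them back; the DataBuffer itself is never touched.
    public static int[] readPixels(BufferedImage image)
    {
        return image.getRGB(0, 0, image.getWidth(), image.getHeight(),
            null, 0, image.getWidth());
    }

    public static void writePixels(BufferedImage image, int pixels[])
    {
        image.setRGB(0, 0, image.getWidth(), image.getHeight(),
            pixels, 0, image.getWidth());
    }
}
```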

[QUOTE=Marco13]I’m still curious: How do you make sure that the input image has the right BufferedImage.TYPE_? Do you manually convert it to INT_ARGB? How do you access the pixel data in the end? (For this task, it would probably be reasonable to access the DataBufferInt directly).

BTW: There recently was a thread about a JPEG Decoder (in German) by @Spacerat . Although it referred to a „general“ decoder, and was not related to OpenCL, I wondered whether it may be possible to achieve a speedup there by using the GPU. I also had a short look at things like https://github.com/richgel999/jpeg-compressor , but in any case, I guess this would be something that one couldn’t just implement without diving deeper into the JPG format specs…[/QUOTE]

Yes, I'm just setting it manually to BufferedImage.TYPE_INT_ARGB.
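In case someone wonders what "setting it manually" can look like: a common way (not necessarily the exact code used here) is to draw the decoded image into a fresh TYPE_INT_ARGB image:

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class ImageConversion
{
    // Converts an arbitrary BufferedImage (e.g. whatever ImageIO.read returned)
    // into TYPE_INT_ARGB so that the pixel layout matches the expected OpenCL image format.
    public static BufferedImage toIntArgb(BufferedImage input)
    {
        BufferedImage output = new BufferedImage(
            input.getWidth(), input.getHeight(), BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = output.createGraphics();
        g.drawImage(input, 0, 0, null);
        g.dispose();
        return output;
    }
}
```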

*** Edit ***

[QUOTE=Marco13;130940]How do you access the pixel data in the end? (For this task, it would probably be reasonable to access the DataBufferInt directly).
[/QUOTE]

Yes, I'm doing just that :slight_smile:

And I use clEnqueueReadImage(…) to read it back to the host again.
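For anyone following along, the read-back step looks roughly like the sketch below, again assuming the JOCL binding; the parameter names are illustrative and the exact signature should be checked against the JOCL version in use:

```java
import org.jocl.*;

public class ReadBack
{
    // Reads a width x height 2D image from the device back into an int[] on the host.
    // commandQueue and image are assumed to come from the usual OpenCL setup.
    public static int[] readImagePixels(cl_command_queue commandQueue, cl_mem image,
        int width, int height)
    {
        int pixels[] = new int[width * height];
        long origin[] = new long[]{ 0, 0, 0 };
        long region[] = new long[]{ width, height, 1 };
        CL.clEnqueueReadImage(commandQueue, image, true /* blocking */,
            origin, region, 0, 0, Pointer.to(pixels), 0, null, null);
        return pixels;
    }
}
```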

With „general“, I meant that it is just a regular JPEG decoder, to be run on the CPU (and able to decompress all(?) sorts of JPEG files).

I think there’s nothing wrong with using the DataBuffer in this case. But only in this particular case. Usually (namely, when the image should still be painted on the screen) obtaining the DataBuffer is not such a good idea. But I also asked this because of the point that you referred to: when you read an image, you never know what kind of DataBuffer it will have. So using an int[] may be more format-agnostic.

[QUOTE=Marco13]and able to decompress all(?) sorts of JPEG files[/QUOTE]No, it can't. Only non-hierarchical, non-differential baseline and progressive JPEGs with Huffman tables at the moment.

The problems with using the DataBuffers of BufferedImages are that it is habit-forming and format-dependent: you'd need to convert all images to TYPE_INT_ARGB just to get a single int[].