2D Array Parallel Sum

bhupesh · 1. Februar 2014 um 11:23

Hi new poster on the forum, and have just started playing with JOCL. Having read quite a few samples, and lots out there for parallel summing arrays and using algorithms such as warpReduce to best make use of GPU to sum - I’ll get there and playing with that too, but to help get going I’m trying to implement something haven’t seen samples for yet. I have a vectors of arrays which I want to sum and reduce down to a single array. Seems most recommended advice is to convert to a one dimensional array and then sum treating as a single array, which is what I have done. Seems to work for small test arrays, but as I up the number of arrays to sum (keeping x elements the same, e.g. increase na to a higher number in sample) it fails. With the attached at 50, it sometimes succeeds and sometimes fails, which I assume is because the kernel hasn’t finished before reading results, although I do have a cl_finish in there. I’m sure it’s something simple, any help appreciated!

thanks

Marco13 · 1. Februar 2014 um 16:29

Hi

Disclaimer: It’s late and I’m tired, probably I shouldn’t answer but instead think about it once more :o

But from quickly looking over the source code, I think (!) I know what’s wrong there: You’re overwriting the values with multiple threads. The kernel contains


unsigned int gid = get_global_id(0);
unsigned int gidY = get_global_id(1);
c[gid] += a[gid+(gidY*numElements)];

You have to imagine that this is executed by nxna threads in parallel. So the threads with the “coordinates” (gid,gidY)=


(0,0)
(0,1)
...
(0,48)
(0,49)
(1,0)
(1,1)
...
(1,48)
(1,49)
...
(9,0)
(9,1)
...
(9,48)
(9,49)

will all execute these lines of code at the same time. Thus, in the computation


c[0] += a[0+(0*numElements)];
c[0] += a[0+(1*numElements)];
...
c[0] += a[0+(48*numElements)];
c[0] += a[0+(49*numElements)];
...

the value of ‘c[0]’ will be written by multiple threads concurrently.

I’m also not entirely sure whether ‘n’ and ‘na’ are always used correctly, but I’ll have to take a closer look at the code for this.

A GPU-friendly reduction is a little bit tricky. Something like http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/ may be a starting point. (Actually, I also needed a reduction kernel a few days ago - I’ll probably create a sample showing how the “Two-stage Reduction” that is mentioned there can be implemented in JOCL).

bye
Marco

BTW: When you are NOT using local memory, you can pass ‘null’ to the kernel as the ‘local_work_size’. OpenCL will then choose a local work size that is appropriate. Maybe I should also update the basic JOCL sample: The ‘local_work_size’ was mainly intended to make clear that this IS the local work size, but actually, passing {1} in there does not make sense…

bhupesh · 2. Februar 2014 um 06:32

Marco you’re spot on, should have spotted it myself given I was looking at the warp reduce algorithm for summing up a single array! A sample would be great I already started to convert the warpReduce sample to JOCL (it’s a bit quick and dirty). I’ve attached that, and will look at converting that to do adding multiple arrays rather than summing a single number, if you come up with a sample before that - would be great, I’m sure yours will be cleaner than mine!

Agree on the local work size thing, have been playing with that as well, but getting the examples for having this work for 2D confusing to start so I ended up playing with it a bunch of ways - would be good to put up this 2D (a proper working one not mine ) up on the main JOCL site.

Many thanks for the super quick response, I’ll keep checking for that sample if you can throw it up. Attached is my warpReduce converted sample, TODO is making it take the command line parameters and having the kernel fully templatised rather than hardcoded int etc (kernel file needs to be renamed to .cl not sure why upload didn’t allow me to do it)!

thanks
Bhupesh

Marco13 · 3. Februar 2014 um 01:41

The example that you posted it ported from the NVIDIA samples, right? (At least, the kernels are exactly that from NVIDIA). I see the intention of writing this as a benchmark with multiple data types, but am not sure whether a smaller example (e.g. only woring on floats, for simplicity) might not be more helpful. At least, I remember when I just wanted to do a “simple reduction”, and did not find any simple example. The NVIDIA samples have the tendency to be overly complicated due to different cases and approaches being messed together, “hiding” the basic idea (and in general, giving a sample a “nice and clean” structure while still keeping it a single, standalone class is difficult). However, I’ll probably start with a simple, 1D reduction in JOCL, and then see how this can be extended for 2D.

bhupesh · 3. Februar 2014 um 02:24

Hi Marco,

Thanks for your response, it is indeed from the NVIDIA samples (looks like the headers of munged when copying pasting to txt file when I uploaded :-(). Agree these can be complicated (took me a while to work through and port from the c++).

I think a simple 1D and 2D reduction kernel would be great (especially one that worked on arrays :-)) so if you are able to do that would be super helpful to a lot of people.

In the meantime I’ll plug away on this as well.

Thank you so much for taking the time to respond and help, appreciate it!

Thanks
Bhupesh

Marco13 · 3. Februar 2014 um 10:38

I already created a reduction sample for JCuda ( http://jcuda.org/samples/samples.html , JCudaReduction.java ), maybe you want to have a look, it’s conceptually rather similar to OpenCL. But in any case, I’ll try to polish and upload the OpenCL example today or tomorrow

*** Edit ***

The basic reduction example has been uploaded to http://jocl.org/samples/samples.html :
Main: http://jocl.org/samples/JOCLReduction.java
Kernel: http://jocl.org/samples/reduction.cl

I’ll have to check whether I can create a similar example for 2D. I’m not sure about the use-cases. The “Two-stage-reduction” works by computing a reduction on the GPU, reducing N input elements to numWorkGroups values (where each chunk is treated by localWorkSize threads, giving numWorkGroups*localWorkSize threads in total). The resulting values are then summed up on host side. For a 2D array, the size of the array in x-direction may be…

very small: Then the benefit of reducing these few elements on the GPU may be not so great
very large: Then one could manually reduce each “row” of the array with a 1D-reduction
But it might be worth a try, anyhow…

bhupesh · 3. Februar 2014 um 12:42

yeah i think the reduction determination could be made based on the size of the array of vectors. So lets say i had and array of vectors with each vector being 100 elements, and i had 100,000 of such vectors. A reduction array that treated each row of that vector to essentially reduce down to a single vector which was the sum of all vectors i think would reduce very well as many of the intermediate results could be calculated in parallel without any locking, waiting etc. etc. so would work as nicely as the 1D reduction. I guess you’re saying the same thing in terms of treating it as a 1D reduction, and that is where I was going with my sample initially by turning it into a 1D array, but then I need to work out the element indexing to reduce to a single vector rather than a single number, this is what you mean right?

I’ll dig into your sample now as well… thank you so much for posting

thanks

Marco13 · 3. Februar 2014 um 14:17

OK, maybe that’s the crucial point: I think that when someone wants to reduce a 2D-array row-wise (so that the result is an array, where each entry is the sum of one row), then the difference is between

reducing a 2D array with 100 rows and 100000 columns
reducing a 2D array with 100000 rows and 100 columns and
is important.
When there are “few” rows, and each row is a “large” array, then performing a 1D reduction for each row would be feasible.
When there are “many” rows, and each row is a “small” array, then performing a 1D reduction for each row would be NOT feasible.
But in general, what you described sounds as if your intention was to perform a “simple” reduction, but instead of summing up single float values, you wanted to sum up whole vectors. Is that correct? I think that this might impose some pressure concerning local memory…

bhupesh · 7. Februar 2014 um 15:58

Hi Marco,

Sorry for the late reply here, but yes you’re right, depending on the dimension the optimisation is different. In my cause the use case I was thinking of was the 2nd one you mention. So I have very large number of rows say 1million, and small columns, around the 100 mark, and I was wanting to reduce this to 1 row with 100 columns, and i thought doing this in parallel would be very good GPU use case here. The only other way I could do this is simple reduction but 100 times, for each element in the vector creating 1 array of a million vectors. It basically means restructuring the shape of the data in a different way but would work I think… I think I see what you’re saying on the local memory given the high number of reductions but thought that may be the most elegant solution? Did you have any further thoughts on this?

Many thanks
Bhupesh

Marco13 · 7. Februar 2014 um 17:46

Also sorry for that I have not yet tried to continue with this. There’s so much in the queue.
However, there are certainly some tuning parameters. In the current implementation, the amount of local memory that is required may be chosen rather arbitrarily, namely by choosing the work group size. But of course, it does not make sense to reduce the work group size to a “very small” number just in order to be able to treat larger vectors. In any case, IF one wanted to implement something like this, the goal would probably be to maximize the work group size in view of the required local memory size, and I’ll probably do some experiments for this in the next few days.

Marco13 · 13. Februar 2014 um 11:18

OK, I tried to implement such a 2D reduction. But even for relatively “small” column widths, the amount of available local memory severly limits the possible local work size. Performing a reduction on the GPU with 16 threads or so does not really make sense. Maybe I’ll try to continue with that later. One could probably split the columns in a way that allows increasing the work size. But this would require additional synchronization of local memory writes in the kernel, and is most likely too much effort if the intention only is to create some sample code. For a real use-case, I’d go with the column-wise 1D reduction for now.