Parallel programming with OpenCL?

duymap · June 28, 2013, 22:18

Hi all,

Now my laptop has 3 items:

1 Intel CPU ( core i5 )
1 Intel GPU ( internal HD graphics 4000)
1 Nvidia GPU ( external )

I tested each of these devices using JOCLSample, and everything works fine.

Now I have a problem with parallel programming: I don’t know how to implement it in JOCL. OpenCL has this ability, but I don’t see any example for it. Does anyone have an idea about this? I had the idea of putting all devices into one context, as described at http://dhruba.name/2012/10/14/opencl-cookbook-how-to-leverage-multiple-devices-in-opencl/, but it does not work in JOCL. I would appreciate it if someone could show me how, because I am stuck on my project assignment “how to apply parallel processing in OpenCL” :(.

Many thanks,
Duy.

Marco13 · June 29, 2013, 06:14

Hi

It seems like I did not understand your question. Every example from jocl.org - Samples already IS using „parallel programming with OpenCL“. In fact, it’s hard to use OpenCL and to NOT do „parallel programming“.

Or was your question specifically about how to use multiple devices?

bye
Marco

duymap · June 29, 2013, 18:44

Hi Marco,

Thanks for your quick response. I know that OpenCL supports parallel programming. But these samples use just one device for computing (deviceIndex). So my questions are:

If we run OpenCL on just one selected device, the computation already runs in parallel on that device, right?
What I meant was: how do we use multiple devices in OpenCL? I tried to put multiple devices into one context, but it looks like my approach is incorrect. And if we can put multiple devices into one context, do we then issue one command for each device?

I tried JavaCL and realized that your JOCL works well with each device. But I don’t know how to use multiple devices at the same time for computing.

I have just started with OpenCL and everything is new to me, so if I am wrong about something, please correct me.

Thanks,
Duy.

Marco13 · July 1, 2013, 03:50

That’s right.

[QUOTE=duymap]I meant how we use multiple devices in OpenCL? I tried to put multiple devices in one context, but it looks like my way is incorrect. And if we can put multiple devices in one context, do we issue one command for each device?[/QUOTE]

Yes, when using multiple devices, you usually create one command queue for each device. Admittedly, I have not yet made extensive experiments with multiple devices, only some first tests with a 2-GPU-Machine. However, the details of properly handling multiple devices may be tricky. One aspect is that you never really know where memory is allocated. When you create an OpenCL buffer, it is „virtually“ (!) created on ALL devices, and thus may have to be copied or updated between the devices when they are actually concurrently using this buffer.

Apart from that, using multiple devices with JOCL is exactly the same as with plain OpenCL. The ‚Samples‘ page currently contains no simple example that really uses multiple devices, but maybe I can add one when I find the time.

bye

duymap · July 1, 2013, 07:43

Hi Marco,

Thanks for your comments. I will keep researching how to implement that on multiple devices. Since you said that you have done some tests with 2 GPUs, could you please give me some instructions on how to test with 2 GPUs? My laptop has 2 GPUs, so I can try it. If I find out something about working with multiple devices, I will let you know.

Thanks,
Duy.

Marco13 · July 1, 2013, 16:30

I can try to create a small multi-device-sample tomorrow (i.e. later today), I think I already have most of the relevant code somewhere, just have to polish it a little.

duymap · July 1, 2013, 18:40

Thanks a lot, Marco !!!

Marco13 · July 2, 2013, 09:05

Hello

I have uploaded a very basic example showing how to use multiple devices at http://jocl.org/samples/JOCLMultiDeviceSample.java .

BTW: I have seen your post at the JavaCL mailing list about combining GPU and CPU devices. The main problem is that it is not possible to create a single context that contains devices from different platforms. If you want to employ multiple devices from different platforms, you probably have to create multiple contexts, and probably have to copy data between those devices via the host, but I have not yet tried this, and it might involve a little more effort.

The sample linked above creates a context containing all devices of one specific platform. Thus, for example, if you have two devices in platform 0 (e.g. the Intel GPU and the Intel CPU) it should employ both of them (although the usage pattern in the sample is a very simple one, which will hardly be the case in real-world applications).

Maybe I can try to create a sample using devices from multiple platforms (managed in different contexts), but I’ll first have to try this out myself, and can not even remotely say when I can do this.

bye
Marco

duymap · July 2, 2013, 16:09

[QUOTE=Marco13]Hello

I have uploaded a very basic example showing how to use multiple devices at http://www.jocl.org/samples/JOCLEventSample.java .

BTW: I have seen your post at the JavaCL mailing list about combining GPU- and CPU devices. The main problems is that it is not possible to create a single context that contains devices from different platforms. If you want to employ multiple devices from different platforms, you probably have to create multiple contexts, and probably have to copy data between those devices via the host, but I have not yet tried this, and it might involve a little more effort.

The sample linked above creates a context containing all devices of one specific platform. Thus, for example, if you have two devices in platform 0 (e.g. the Intel GPU and the Intel CPU) it should employ both of them (although the usage pattern in the sample is a very simple one, which will hardly be the case in real-world applications).

Maybe I can try to create a sample using devices from multiple platforms (managed in different contexts), but I’ll first have to try this out myself, and can not even remotely say when I can do this.

bye
Marco[/QUOTE]

Hi Marco,

I took a look, and your code appears to use one context for the selected device, not for multiple devices.

```
// Obtain a device ID
cl_device_id devices[] = new cl_device_id[numDevices];
clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
cl_device_id device = devices[deviceIndex];

// Create a context for the selected device
cl_context context = clCreateContext(
    contextProperties, 1, new cl_device_id[]{device},
    null, null, null);
```

I’m not sure if I missed something, but I looked at it 3 times and only saw one context created from one device. I think you made changes but maybe forgot to update this file? If I am wrong about something, please let me know.

Thanks,
Duy.

Marco13 · July 3, 2013, 01:27

Oh, sorry, I linked the wrong file. The right one is http://jocl.org/samples/JOCLMultiDeviceSample.java (it’s also linked properly from the samples main page http://jocl.org/samples/samples.html ). I’ll update the link in the previous post accordingly.

duymap · July 6, 2013, 08:33

Hi Marco,

I did some tests and comparisons with your files http://jocl.org/samples/JOCLMultiDeviceSample.java and JOCLSample.java. I modified them to run the same kernel, which multiplies two vectors of size 10000. Here are the results:

JOCLMultiDeviceSample.java : total running time: 1.942587ms
JOCLSample.java ( Intel HD graphics 4000 ): 0.029147ms

So why is OpenCL faster when we use just one device? I took another look at your multi-device example, and it looks like we have 2 devices, but they run separately, not in parallel. That is, device0 processes the kernel and device1 processes the same kernel separately; they do not work together on one kernel in parallel. Is that right?

Thanks,
Duy.

Marco13 · July 6, 2013, 11:17

Hello

There are many, many things to consider here.

First of all, I’m not entirely sure what you have measured. The „JOCLMultiDeviceSample“ and the „JOCLSample“ are similar, but have some structural differences that may make it difficult to compare them properly.

The next thing is: Measuring a time span of 0.03 ms is difficult in general, and especially in Java. Let the program run 10 times, and you will receive 10 different results. Adding at least a slight amount of „reliability“ to a microbenchmark is difficult as well. You might have noticed that the MultiDevice-Example actually performs a reduction (and not a simple multiplication), mainly in order to have a duration that is not <1ms, but „several“ milliseconds, so that one really has a noticeable delay caused by the computation.

In any case, what you have measured in the JOCLMultiDeviceSample might also be the execution time of the slowest device. So it might be that the GPU takes 1ms, and the CPU takes 10ms, and it reported 10ms as the „total“ duration. I also think that you might have made a general mistake with the measurement (more on that below).

Also, in the MultiDeviceSample, profiling was enabled (with the CL_QUEUE_PROFILING_ENABLE-flag). In the simple version this was not the case. Although this should not have a large influence, it might also contribute to some irritation.

Concerning the question about whether the kernels run in parallel: I’m not sure what happens on your machine. One important point here might be that you seem to have one GPU and one CPU. I have no idea how Intel manages the case that both of these devices are used at the same time. I know that in some previous versions of NVIDIA’s OpenCL, it was not possible to run kernels in parallel on multiple devices on Windows machines, so there are many, many unknowns that have to be taken into account here.

However, in the recent version of NVIDIA’s OpenCL, it seems to be possible to run kernels in parallel: I tested the JOCLMultiDeviceSample on a 2-GPU machine, with a larger problem size. Both kernel executions took about 460ms, and the total execution time of the kernels was not 2*460ms, but only about 461ms, so the kernels have probably been executed in parallel. (This could possibly be verified with the CUDA Visual Profiler, but unfortunately, the last version of this profiler does not work properly with Java applications.)

But I think that the question about whether the kernels run in parallel also shows a misunderstanding: The call to
clEnqueueNDRangeKernel
does what the name says: It enqueues the kernel for execution. It does NOT execute the kernel, but only says: „Execute this kernel as soon as possible“. The call returns immediately. So for a call sequence like

long before = System.nanoTime();
clEnqueueNDRangeKernel(/* some very complex and time-consuming kernel */);
long after = System.nanoTime();
System.out.println("Total duration: "+(after-before)/1e6+"ms");

it will always print a ridiculously low time, for example, „0.029147ms“ (this was the possible mistake that I mentioned above). In order to measure the execution time of a kernel, one has to wait until the computation has really finished. This is possible in several ways, for example, by using events, as it is done in the JOCLMultiDeviceSample. There, the loop

for (int i=0; i<numDevices; i++)
{
    ....
    clEnqueueNDRangeKernel(..., events[i]);
}

will enqueue the kernel for device0, device1 … deviceN. Then they will start the computation immediately (in parallel, in the best case).
Then, with

clWaitForEvents(events.length, events);

the program is waiting until ALL devices have finished the computation.
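The difference between "enqueued" and "finished" can be reproduced without any OpenCL runtime. The following is only an analogy in plain Java, not JOCL code: each single-threaded `ExecutorService` stands in for one command queue, `submit` stands in for `clEnqueueNDRangeKernel`, and `Future.get` stands in for `clWaitForEvents`. The class name `EnqueueTimingSketch`, the method `measure`, and the sleep that replaces a real kernel are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Plain-Java analogy (no OpenCL involved): submit() plays the role of an enqueue call. */
final class EnqueueTimingSketch {

    /** Returns {nanos until all work is enqueued, nanos until all work has finished}. */
    static long[] measure(int numQueues, long taskMillis) {
        // One single-threaded executor per "device", like one command queue per device.
        List<ExecutorService> queues = new ArrayList<>();
        for (int i = 0; i < numQueues; i++) {
            queues.add(Executors.newSingleThreadExecutor());
        }
        try {
            long before = System.nanoTime();
            List<Future<?>> events = new ArrayList<>();
            for (ExecutorService queue : queues) {
                // submit() returns immediately; the sleep (our fake kernel) runs asynchronously.
                events.add(queue.submit(() -> {
                    try {
                        Thread.sleep(taskMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }));
            }
            long afterEnqueue = System.nanoTime();

            // Block until every task has finished (the analogue of waiting for all events).
            for (Future<?> event : events) {
                event.get();
            }
            long afterWait = System.nanoTime();
            return new long[] { afterEnqueue - before, afterWait - before };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            for (ExecutorService queue : queues) {
                queue.shutdown();
            }
        }
    }

    public static void main(String[] args) {
        long[] t = measure(2, 100);
        System.out.printf("Measured right after enqueueing: %.3f ms%n", t[0] / 1e6);
        System.out.printf("Measured after waiting:          %.3f ms%n", t[1] / 1e6);
    }
}
```

The first number only reflects how long it takes to hand the work over, which is the trap described above. Only the second number includes the work itself. If the two fake kernels overlap, the second number stays close to the duration of one task rather than the sum of both.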

In any case, programming with multiple devices is certainly very challenging, and the synchronization of buffer reads/writes using events may be complicated. Nothing to easily get started with.

bye

duymap · July 7, 2013, 19:10

Hi Marco,

I double-checked and realized that I had made the mistake you described. I measured the time only around clEnqueueNDRangeKernel, which caused the error. I should actually compute the time based on events. I modified the code accordingly and tested it, and the total running time now seems reasonable. Thanks for correcting me; you helped me a lot.

BTW, I have a question: why do we read the output data only from the first device? We have 2 outputMems, so why do we read only from the first one?

```
clEnqueueReadBuffer(commandQueues[0], outputMems[0], CL_TRUE, 0,
    n * Sizeof.cl_float, Pointer.to(output), 0, null, null);
```

Thanks,
Duy.

Marco13 · July 8, 2013, 00:13

Well, they could be read both. I thought it simply does not really matter here.

But again: This is only a sample or a basic test. It does not really make sense: Here, one set of input/output data is created for each device. In a real application, one would rather have a single data block, and like to distribute it among the available devices. The task of really copying the required data to the devices is then done transparently by OpenCL, and the synchronization between the devices becomes more challenging.
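As a rough illustration of "distributing a single data block among the available devices", here is a minimal sketch in plain Java that only computes which slice of the data each device would get. It does not touch OpenCL at all. The class `RangeSplitSketch` and the method `split` are hypothetical names, and an even split is just one possible policy (devices of different speed would likely want uneven shares).

```java
/** Hypothetical helper: splits n elements into contiguous chunks, one per device. */
final class RangeSplitSketch {

    /** Returns one {offset, count} pair per device; the counts differ by at most one. */
    static int[][] split(int n, int numDevices) {
        if (n < 0 || numDevices <= 0) {
            throw new IllegalArgumentException("n must be >= 0 and numDevices must be > 0");
        }
        int[][] chunks = new int[numDevices][2];
        int base = n / numDevices;      // every device gets at least this many elements
        int remainder = n % numDevices; // the first 'remainder' devices get one extra
        int offset = 0;
        for (int i = 0; i < numDevices; i++) {
            int count = base + (i < remainder ? 1 : 0);
            chunks[i][0] = offset;
            chunks[i][1] = count;
            offset += count;
        }
        return chunks;
    }

    public static void main(String[] args) {
        int[][] chunks = split(10000, 3);
        for (int i = 0; i < chunks.length; i++) {
            System.out.println("device" + i + ": offset=" + chunks[i][0]
                + ", count=" + chunks[i][1]);
        }
    }
}
```

Each pair could then serve as the global work offset and global work size of that device's kernel launch, or as the origin and size of a sub-buffer. How the results are merged and synchronized afterwards is the hard part mentioned above and is not covered by this sketch.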

ag123 · October 7, 2019, 21:59

actually opencl does more than just ‚parallel‘ programming. conventionally, to run tasks in parallel we run multiple threads, possibly with each thread running on its own cpu core. the magic of opencl is vector programming; the magic is in the kernel. if that can be vectorised well, it could very much achieve the TFlops (mostly single precision) throughput advertised for those high performance GPUs
https://handsonopencl.github.io/
there is a tradeoff though: if your workpiece is very small, the interfacing overheads can be so large that there is no advantage in doing so. it doesn’t make sense to call the gpu 1 million times to add 2 numbers; that is probably slower than running it on the cpu alone. but if you have 2 arrays of 1 million numbers, you can send them to the gpu, and the gpu can probably complete the additions in a fraction of a millisecond

Marco13 · October 7, 2019, 23:05

@ag123 Thanks for the link - maybe I can port some of the exercises and solutions to become samples for JOCL

Regarding the tradeoff that you mentioned: I wrote a bit more about „data parallel“ and „task parallel“ programming in the stackoverflow answer that I already linked to. But in fact, even having a „large number“ of numbers is not sufficient for a good speedup. When you have 2 arrays, each with 1 Million numbers, and only want to do something as simple as adding these numbers, then doing this on the GPU will still not bring an advantage: Copying the data between the host and the device is just too expensive. (There are approaches to alleviate this problem, e.g. SVM, but these are not always so trivial to use…)
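The copy-cost argument can be made concrete with a back-of-the-envelope calculation. The sketch below is plain Java arithmetic, not a benchmark. Both bandwidth figures are made-up round numbers chosen only for illustration and should be replaced with values for the machine in question. It also assumes that a simple addition on the CPU is limited by memory traffic rather than by the arithmetic. `TransferCostSketch` and `millis` are hypothetical names.

```java
/** Back-of-the-envelope estimate; every bandwidth figure below is an assumption. */
final class TransferCostSketch {

    /** Time in milliseconds to move the given number of bytes at the given bandwidth. */
    static double millis(long bytes, double bytesPerSecond) {
        return bytes / bytesPerSecond * 1000.0;
    }

    public static void main(String[] args) {
        long n = 1_000_000;             // elements per array, as in the post above
        long bytesPerArray = n * 4;     // 4 bytes per float
        long moved = 3 * bytesPerArray; // two inputs to the device, one result back

        double assumedHostDeviceBandwidth = 6e9;  // HYPOTHETICAL: 6 GB/s host <-> device
        double assumedHostMemoryBandwidth = 12e9; // HYPOTHETICAL: 12 GB/s CPU <-> RAM

        // GPU path: the copies alone, even if the kernel itself took no time at all.
        double copyMs = millis(moved, assumedHostDeviceBandwidth);
        // CPU path: assumed memory-bound, so it reads two arrays and writes one.
        double cpuMs = millis(moved, assumedHostMemoryBandwidth);

        System.out.println("GPU path, copies only: " + copyMs + " ms");
        System.out.println("CPU path, whole add:   " + cpuMs + " ms");
    }
}
```

Under these assumptions the transfer alone costs at least as much as doing the whole addition on the host, regardless of how fast the kernel is. The balance only shifts when the kernel performs much more work per byte transferred, or when the data can stay on the device across several kernels.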

ag123 · October 8, 2019, 00:15

actually i’m very much a novice with opencl too. these days jvms with their jit compilers are very fast; for some ops i read articles saying that they are comparable to compiled c code, though not in all cases. for the general bulk, if everything is written in java, there are still various overheads.
i’d think opencl would really help if the problem at hand benefits from vectorization. this would be relevant to cases like various statistical analyses, maybe clustering (e.g. k-means), and things like convolutional neural nets where the arrays are very large. one of the ways the speedups are achieved, i think, is by means of sgemm or similar. but using opencl is simply better, as the interface and api are very well defined, and on top of that one has access to the gpu, which can probably do much better vector processing for massive / large arrays

ag123 · October 8, 2019, 08:29

i did some rather naive / crude matrix multiplication tests linked here