There are many, many things to consider here.
First of all, I'm not entirely sure what you have measured. The "JOCLMultiDeviceSample" and the "JOCLSample" are similar, but have some structural differences that may make it difficult to compare them properly.
The next thing is: Measuring a time span of 0.03 ms is difficult in general, and especially in Java. Let the program run 10 times, and you will receive 10 different results. Adding even a slight amount of "reliability" to a microbenchmark is difficult as well. You might have noticed that the MultiDevice example actually performs a reduction (and not a simple multiplication), mainly so that the duration is not <1 ms but several milliseconds, and there really is a noticeable delay caused by the computation.
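To illustrate the point about repeated runs, here is a minimal sketch in plain Java (no OpenCL involved). The class name RepeatedTiming and the workload method are made up for this example. The idea is only to do a few untimed warm-up runs, repeat the measurement several times, and look at the spread instead of trusting a single number.

```java
import java.util.Arrays;

class RepeatedTiming {
    // Written to a volatile field so the JIT cannot discard the workload.
    static volatile long blackhole;

    // Hypothetical stand-in workload: sums the squares of 0 .. n-1.
    static long workload(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    // Runs the workload `runs` times after `warmups` untimed runs and
    // returns the measured durations in milliseconds, sorted ascending.
    static double[] measure(int warmups, int runs, int n) {
        for (int i = 0; i < warmups; i++) {
            blackhole = workload(n);
        }
        double[] ms = new double[runs];
        for (int i = 0; i < runs; i++) {
            long before = System.nanoTime();
            blackhole = workload(n);
            long after = System.nanoTime();
            ms[i] = (after - before) / 1e6;
        }
        Arrays.sort(ms);
        return ms;
    }

    public static void main(String[] args) {
        double[] ms = measure(5, 10, 1_000_000);
        System.out.println("min:    " + ms[0] + " ms");
        // For an even number of runs this is the upper of the two middle values.
        System.out.println("median: " + ms[ms.length / 2] + " ms");
        System.out.println("max:    " + ms[ms.length - 1] + " ms");
    }
}
```

The values will differ from run to run and from machine to machine. If min and max are far apart, a single measurement should not be trusted.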
In any case, what you have measured in the JOCLMultiDeviceSample might also be the execution time of the slowest device. So it might be that the GPU takes 1ms, and the CPU takes 10ms, and it reported 10ms as the "total" duration. I also think that you might have made a general mistake with the measurement (more on that below).
Also, in the MultiDeviceSample, profiling was enabled (with the CL_QUEUE_PROFILING_ENABLE flag). In the simple version this was not the case. Although this should not have a large influence, it might also contribute to some confusion.
Concerning the question about whether the kernels run in parallel: I'm not sure what happens on your machine. One important point here might be that you seem to have one GPU and one CPU. I have no idea how Intel manages the case that both of these devices are used at the same time. I know that in some previous versions of NVIDIA's OpenCL, it was not possible to run kernels in parallel on multiple devices on Windows machines, so there are many, many unknowns that have to be taken into account here.
However, in the recent version of NVIDIA's OpenCL, it seems to be possible to run kernels in parallel: I tested the JOCLMultiDeviceSample on a 2-GPU machine, with a larger problem size. Both kernel executions took about 460ms, and the total execution time of the kernels was not 2*460ms, but only about 461ms, so the kernels have probably been executed in parallel. (This could possibly be verified with the CUDA Visual Profiler, but unfortunately, the last version of this profiler does not work properly with Java applications.)
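The reasoning behind "about 461ms instead of 2*460ms" can be sketched without any OpenCL at all. The following is only an analogy in plain Java: each "device" is simulated by a thread that sleeps for a given time, and the class and method names are made up for this example. When the simulated devices really run in parallel, the total wall-clock time is close to the duration of the slowest one, not to the sum.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelDevicesSketch {
    // Stand-in for a kernel: just sleeps for the given time.
    static void simulatedKernel(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Starts one simulated kernel per entry of kernelMillis, each on its own
    // thread, and returns the wall-clock time in ms until ALL have finished.
    static double totalMillis(long[] kernelMillis) {
        ExecutorService devices = Executors.newFixedThreadPool(kernelMillis.length);
        try {
            List<Future<?>> events = new ArrayList<>();
            long before = System.nanoTime();
            for (long millis : kernelMillis) {
                events.add(devices.submit(() -> simulatedKernel(millis)));
            }
            for (Future<?> event : events) {
                event.get(); // wait for this simulated device
            }
            long after = System.nanoTime();
            return (after - before) / 1e6;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            devices.shutdown();
        }
    }

    public static void main(String[] args) {
        // Two simulated devices with 460 ms each, as in the test described above.
        System.out.println("Total: " + totalMillis(new long[] {460, 460}) + " ms");
        // A fast and a slow device: the slow one dominates the total.
        System.out.println("Total: " + totalMillis(new long[] {1, 10}) + " ms");
    }
}
```

Whether real OpenCL devices behave like this depends on the driver and the platform, which is exactly the unknown mentioned above.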
But I think that the question about whether the kernels run in parallel also shows a misunderstanding: The call to
clEnqueueNDRangeKernel
does what the name says: It enqueues the kernel for execution. It does NOT execute the kernel, but only says: "Execute this kernel as soon as possible". The call returns immediately. So for a call sequence like
long before = System.nanoTime();
clEnqueueNDRangeKernel(/* some very complex and time-consuming kernel */);
long after = System.nanoTime();
System.out.println("Total duration: "+(after-before)/1e6+"ms");
it will always print a ridiculously low time, for example, "0.029147ms" (this was the possible mistake that I mentioned above). In order to measure the execution time of a kernel, one has to wait until the computation has really finished. This is possible in several ways, for example, by using events, as it is done in the JOCLMultiDeviceSample. There, the loop
for (int i=0; i<numDevices; i++)
will enqueue the kernel for device0, device1 ... deviceN. The devices will then start the computation immediately (in parallel, in the best case).
By then waiting on the events (for example, with clWaitForEvents), the program waits until ALL devices have finished the computation.
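The difference between "enqueued" and "finished" can be shown with a small analogy in plain Java. This is not OpenCL code: an ExecutorService plays the role of the command queue, the returned Future plays the role of the event, and all names are made up for this example. Stopping the clock right after the submit call gives a tiny number, while stopping it after waiting gives the real duration.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class EnqueueVersusWait {
    // Stand-in for a time-consuming kernel: just sleeps.
    static void simulatedKernel(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns { ms until submit() returned, ms until the work was finished }.
    static double[] measure(long kernelMillis) {
        ExecutorService queue = Executors.newSingleThreadExecutor();
        try {
            long before = System.nanoTime();
            // Like an enqueue call: hands the work over and returns immediately.
            Future<?> event = queue.submit(() -> simulatedKernel(kernelMillis));
            long afterEnqueue = System.nanoTime();
            // Blocks until the work has really finished.
            event.get();
            long afterWait = System.nanoTime();
            return new double[] {
                (afterEnqueue - before) / 1e6,
                (afterWait - before) / 1e6
            };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            queue.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] ms = measure(500);
        System.out.println("After enqueue only: " + ms[0] + " ms");
        System.out.println("After waiting:      " + ms[1] + " ms");
    }
}
```

The first number only says how long it took to hand the work over. The second one is the number that actually describes the computation.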
In any case, programming with multiple devices is certainly very challenging, and the synchronization of buffer reads/writes using events may be complicated. It is not something to easily get started with.