Function parameters as Java arrays vs direct ByteBuffers

Piotr · 21 September 2014 at 11:24

Hi Marco,

I noticed JOCL.org is using Java arrays for passing and retrieving parameters, while e.g. the JOGAMP.org JOCL is using direct ByteBuffers. Java arrays are more convenient, but FWIU JOGAMP is using cached (or rather pooled?) direct ByteBuffers for performance reasons. I wonder what the performance hit is of using Java arrays instead of direct ByteBuffers during JNI calls. Did you measure that? Do you plan to allow direct ByteBuffers instead of Java arrays in the future?
IIRC using plain Java arrays during JNI calls prevents garbage collection and replacing Java arrays with direct ByteBuffers solves that problem.

BTW I’ve stumbled upon a paper evaluating Java OpenCL bindings: http://e-archivo.uc3m.es/handle/10016/17183 It seems that the author mixed something up in his paper (i.e. it seems he mistook JOCL for JOGAMP in one part of the paper), but overall it seems plausible.

Marco13 · 22 September 2014 at 11:08

Hello Piotr,

There are two different usage patterns where one has the choice between arrays or (direct) ByteBuffers:

The first one is for the basic API usage, where the arrays involve only a „few objects“. For example, when obtaining the available platforms with

clGetPlatformIDs(n, platformsArray, null);

The second one is the transfer of the actual data that is processed by a kernel. For example, in

clEnqueueReadBuffer(..., pointerToArray...);

These uses are rather different.

Regarding 1.:

When I started JOCL, I already knew JOGL (I think the Jogamp-JOCL did not exist back then, but I am not sure) and other JNI-based libraries. And I always found it a bit inconvenient having to create direct ByteBuffers. This can be particularly annoying for „small“ arrays, considering the choice between

float data[] = new float[] { 1, 2, 3 };
someMethod(data);

and

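float data[] = new float[] { 1, 2, 3 };
ByteBuffer bb = ByteBuffer.allocateDirect(data.length * 4);
bb.order(ByteOrder.nativeOrder());
FloatBuffer fb = bb.asFloatBuffer();
fb.put(data);   // copy the values into the direct buffer
fb.rewind();    // reset the position so the callee reads from element 0
someMethod(fb);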

Of course, the latter calls for some convenience/utility methods. But this may have the drawback that a potentially very large number of small, temporary direct ByteBuffers would have to be created, causing the garbage collector to run mad. (You mentioned that Jogamp-JOCL does some caching/pooling, which is the obvious „solution“ to this problem, but I can’t tell off the top of my head how exactly they solved this.) In any case, I thought that plain Java arrays are a bit more convenient. (Still, they are a bit of a hassle compared to the simplicity of the original C API: The places where JOCL uses these arrays (like the platforms example above) usually correspond to pointers in C. So in C you’d not necessarily pass an array to such a function, but simply the address of a single variable.)
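The kind of convenience method alluded to here could look roughly like the following. This is only a sketch using the standard java.nio API; the class name `DirectBuffers` and the method `toDirectFloatBuffer` are made up for illustration and are not part of JOCL or Jogamp-JOCL.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical helper, not part of any OpenCL binding.
final class DirectBuffers {
    private DirectBuffers() {}

    /** Copies the array into a newly allocated direct buffer in native byte order. */
    static FloatBuffer toDirectFloatBuffer(float[] data) {
        FloatBuffer fb = ByteBuffer
            .allocateDirect(data.length * Float.BYTES)
            .order(ByteOrder.nativeOrder())
            .asFloatBuffer();
        fb.put(data);
        fb.rewind(); // reset the position so a consumer reads from element 0
        return fb;
    }

    public static void main(String[] args) {
        FloatBuffer fb = toDirectFloatBuffer(new float[] { 1, 2, 3 });
        System.out.println("direct: " + fb.isDirect() + ", elements: " + fb.remaining());
    }
}
```

Note that every call allocates a fresh direct buffer, which is exactly the "many small temporary buffers" drawback described above.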

Concerning the performance: I actually can’t remember having made detailed, dedicated performance tests for these „small-array-cases“. I’m rather sure that the performance difference will not be significant, but even if using arrays causes an overhead here, I think that this can be justified, particularly for JOCL: The most time-consuming part of an OpenCL program will usually not consist of millions of calls to functions like clGetPlatformIDs. Instead, most of the time will be spent copying memory and running kernels.

One could probably imagine usage patterns where any potential overhead that is imposed by a function call becomes more important. (Even for these cases, the difference between using arrays and using ByteBuffers will probably still be negligible compared to the overhead that is imposed by the JNI call itself.) If I become aware of such a pattern, and find out that it helps to offer one particular method (or several methods) in an overloaded form that alternatively accepts ByteBuffers, I’ll certainly consider adding these methods.

Until then, I’d rather create a dedicated test to find out how much difference there actually is between small arrays and ByteBuffers. (Something similar has been on my „to do“ list for JCuda for years now, but it never really had high priority.)

Regarding 2.:

For the actual data transfer, direct ByteBuffers can already be used. So you can already write

clEnqueueReadBuffer(..., pointerToByteBuffer...);

using either a direct ByteBuffer or a heap-based ByteBuffer. So for this case of the actual „data blocks“, one has the option to use either arrays or direct ByteBuffers.
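The difference between the two buffer kinds can be inspected with the standard java.nio API alone. The following sketch does not touch JOCL at all and says nothing about how JOCL treats either kind internally; the class name `BufferKinds` is made up for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class; uses only standard java.nio queries.
final class BufferKinds {
    private BufferKinds() {}

    /** Describes where a buffer's storage lives. */
    static String describe(ByteBuffer buffer) {
        String kind = buffer.isDirect() ? "direct (off-heap)" : "heap";
        // hasArray() reports whether a backing byte[] is accessible via array()
        return kind + ", accessible backing array: " + buffer.hasArray();
    }

    public static void main(String[] args) {
        System.out.println(describe(ByteBuffer.allocateDirect(1024)));
        System.out.println(describe(ByteBuffer.allocate(1024)));
    }
}
```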

However, it is correct that using arrays can cause some headaches in combination with garbage collection, as you mentioned in

IIRC using plain Java arrays during JNI calls prevents garbage collection and replacing Java arrays with direct ByteBuffers solves that problem.

Direct ByteBuffers are allocated outside the heap and are not directly touched by the garbage collector, which makes it easier to use their actual addresses on the native side. In contrast, a Java array has to be „pinned“ on the native side, to prevent the garbage collector from moving or reclaiming it while native code accesses it. Whether or not „pinning“ is supported depends on the JVM, and there is no way to find this out reliably. This also means that some function calls that involve Java arrays have to be blocking, because it is not really feasible to pin an array across multiple JNI calls.

But I think that particularly for these „data blocks“, there is one rather compelling reason why I tried hard to support plain Java arrays as well: namely, the interoperability with existing programs. A typical Java program that does some number crunching will never use (direct) ByteBuffers/FloatBuffers. All the methods will return plain float[] arrays, or accept them as their arguments. I think that it can be advantageous to have the possibility to pass these arrays directly to OpenCL, without first having to copy them into a direct ByteBuffer, and afterwards copying the results from a ByteBuffer back into an array. (Still, there are some unknowns, namely the actual handling of the arrays in JNI concerning pinning, but it’s at least a best-effort approach to avoid unnecessary copies of larger memory blocks.)

Thanks for the link to the paper, that looks interesting, and I’ll definitely have a look at this!

bye
Marco

Marco13 · 23 September 2014 at 05:30

A short side note concerning the paper/report that you mentioned: I did not read it in full detail. But from what I have seen so far, it seems to be questionable (in fact, very questionable) in many ways.

Of course, it seems to be „only“ a report about a student project. I don’t want to argue about whether the (rather generic) process of adding a JAR to a project is worth such a detailed description. I certainly do not care about things like spelling mistakes etc. But when source code is inserted as a sequence of screenshots, this already looks a bit dubious.

However, there are some really serious issues. The first one is the one that you already pointed out: It indeed looks like Jogamp-JOCL and JOCL have been mixed up at some point. One could argue that the names „JOCL“ and „JOCL“ don’t make it particularly easy to distinguish the libraries, but something like this simply should not happen. Not much is said about how the time measurements were made, except for the code that is shown in the screenshots. And if this is the basis for the time measurement, then one has to say that this is not only questionable, but plainly wrong. The Java-based time measurement will mainly measure the duration of creating new Random instances, without taking the JIT into account. The OpenCL-based measurements do not differentiate between the execution time and the time for the memory transfers (nor the method invocation overhead, which you implicitly asked about, and which may in fact be interesting as a comparison between JNI- and JNA-based approaches). And finally, a single, element-wise vector multiplication is certainly not a very sensible benchmark for OpenCL.
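To illustrate the two Java-side points (input generation inside the timed region, and no JIT warm-up), a more careful measurement could be structured roughly as follows. This is a minimal sketch and not the code from the report; the class and method names are hypothetical, and for serious measurements a dedicated harness would be the better choice.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of a Java-side timing loop; not taken from the report.
final class VectorMultiplyTiming {
    private VectorMultiplyTiming() {}

    /** The operation under test: element-wise multiplication. */
    static void multiply(float[] a, float[] b, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] * b[i];
        }
    }

    /** Creates the input once, outside of any timed region. */
    static float[] randomVector(int n, long seed) {
        Random random = new Random(seed); // a single generator, not one per element
        float[] v = new float[n];
        for (int i = 0; i < n; i++) {
            v[i] = random.nextFloat();
        }
        return v;
    }

    /** Median of 'runs' timed calls in nanoseconds, after 'warmup' untimed calls. */
    static long medianNanos(float[] a, float[] b, float[] out, int warmup, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        for (int i = 0; i < warmup; i++) {
            multiply(a, b, out); // give the JIT a chance to compile the loop
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            multiply(a, b, out);
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        int n = 1 << 20;
        float[] a = randomVector(n, 1L);
        float[] b = randomVector(n, 2L);
        float[] out = new float[n];
        System.out.println("median ns: " + medianNanos(a, b, out, 50, 21));
    }
}
```

An OpenCL-side measurement would additionally have to time the memory transfers and the kernel execution separately, which this sketch does not attempt.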

So I’m not sure what the intention of this report was. Maybe it is in line with the tasks/goals that have been defined in the respective assignment. But I frankly have to say that I, personally, would not consider it as a really credible source of information.

Piotr · 25 September 2014 at 11:24

Thanks for the answer. I looked at the API again and there are quite a few places where buffers are used. However, e.g. clEnqueueNDRangeKernel takes a few Java arrays. Competing APIs either use buffers exclusively or introduce some overloading like clEnqueue1RangeKernel, clEnqueue2RangeKernel, etc. JOCL doesn’t do that, but OTOH JOCL’s API is very similar to the plain C API, and that makes porting code snippets a breeze.

I think the array passing can be accelerated without API changes. You could use a ThreadLocal direct ByteBuffer of size e.g. 4 KiB (one page) and, for functions whose parameters fit within that buffer, use it for parameter passing. That mechanism could be transparent to the user and could also be enabled or disabled at runtime. If there are a few small Java arrays among the parameters, create sub-buffers from the thread-local byte buffer and release them before returning from the JNI call.
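On the Java side, the proposed mechanism might be sketched like this. It is an illustration of the idea only, not JOCL code: the class `ScratchBuffer` and its methods are hypothetical, long[] is used as the example parameter type, and the JNI part that would consume the staged views is left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;

// Hypothetical per-thread staging area for small array parameters.
final class ScratchBuffer {
    static final int CAPACITY = 4096; // 4 KiB, the size suggested above

    private static final ThreadLocal<ByteBuffer> SCRATCH = ThreadLocal.withInitial(
        () -> ByteBuffer.allocateDirect(CAPACITY).order(ByteOrder.nativeOrder()));

    private ScratchBuffer() {}

    /** Resets the calling thread's scratch buffer; call once before staging the parameters of a call. */
    static ByteBuffer begin() {
        ByteBuffer scratch = SCRATCH.get();
        scratch.clear();
        return scratch;
    }

    /**
     * Copies the values into the scratch buffer and returns a direct view of just
     * that region, or null if they do not fit (the caller would then fall back
     * to the plain array path).
     */
    static LongBuffer stage(ByteBuffer scratch, long[] values) {
        int bytes = values.length * Long.BYTES;
        int start = (scratch.position() + 7) & ~7; // keep offsets multiples of 8
        if (start + bytes > scratch.capacity()) {
            return null;
        }
        scratch.position(start);
        // slice() always yields big-endian order, so the order is set again
        ByteBuffer region = scratch.slice().order(ByteOrder.nativeOrder());
        region.limit(bytes);
        LongBuffer view = region.asLongBuffer();
        view.put(values);
        view.rewind();
        scratch.position(start + bytes);
        return view;
    }

    public static void main(String[] args) {
        ByteBuffer scratch = begin();
        LongBuffer globalSize = stage(scratch, new long[] { 1024, 1024 });
        LongBuffer localSize = stage(scratch, new long[] { 16, 16 });
        System.out.println(globalSize.isDirect() + " " + localSize.get(0));
    }
}
```

The views are only valid until the next `begin()` on the same thread, so they must not outlive the call they were staged for.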

Since clEnqueueNDRangeKernel does not use buffers for its parameters (i.e. the API takes Java arrays), it could be used for benchmarking. A naive version of bitonic sort requires O(lg n * lg n) kernel invocations. Multiply that by sorting M blocks at once and you have quite a lot of clEnqueueNDRangeKernel calls to push. There could also be many command queues operated from different threads, to check how using Java arrays vs direct ByteBuffers affects concurrency.
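To put a number on the invocation count: assuming that "naive" means one kernel launch per compare-exchange pass, and that n is a power of two with k = lg n, a bitonic sort needs k(k+1)/2 passes. The small sketch below computes this in closed form and cross-checks it against the usual nested stage/step loop; the class and method names are made up for illustration.

```java
// Hypothetical helper for estimating the number of kernel launches.
final class BitonicLaunchCount {
    private BitonicLaunchCount() {}

    private static int log2(long n) {
        if (n < 1 || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("n must be a power of two");
        }
        return Long.numberOfTrailingZeros(n);
    }

    /** Closed form: k(k+1)/2 passes for n = 2^k elements. */
    static long passes(long n) {
        int k = log2(n);
        return (long) k * (k + 1) / 2;
    }

    /** The same count, obtained by walking the stage/step loop structure. */
    static long passesByLoop(long n) {
        int k = log2(n);
        long count = 0;
        for (int stage = 1; stage <= k; stage++) {       // merge runs of length 2^stage
            for (int step = stage; step >= 1; step--) {  // compare distance 2^(step-1)
                count++;
            }
        }
        return count;
    }

    /** Launches when each of 'blocks' blocks of n elements is sorted with its own launches. */
    static long launches(long n, long blocks) {
        return blocks * passes(n);
    }

    public static void main(String[] args) {
        System.out.println(passes(1024) + " passes, " + launches(1024, 8) + " launches for 8 blocks");
    }
}
```

For n = 1024 this gives 55 passes, so sorting 8 such blocks separately would mean 440 calls to clEnqueueNDRangeKernel under the stated assumption.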

What do you think about it?

Update:
BTW, AMD just released an OpenCL 2.0 driver: http://support.amd.com/en-us/kb-articles/Pages/OpenCL2-Driver.aspx

Marco13 · 25 September 2014 at 15:22

Yes, “clEnqueueNDRangeKernel” could obviously be called many times for many small kernel invocations, and its overhead should be as small as possible. I’ll try to benchmark this, and IF it turns out to be a potential bottleneck, consider some of the alternatives that you mentioned. But now, a new task with higher priority has entered my queue: The update for OpenCL 2.0 (thanks for this hint!). I already started the update based on the Spec and the Khronos header files, but without the ability to test the new functions, it’s hard to be confident that the results are right…

Piotr · 28 September 2014 at 11:27

OK, no problem, OpenCL 2.0 support is also very good.

In addition to the call overhead, the blocking aspect should be explored, as I’ve said. In particular: what if some thread executes a blocking OpenCL call that involves arrays which are mapped to Java arrays by the JOCL API? Would that stall the entire JVM with all other threads? I don’t really know what the downsides are of passing objects from the Java heap (and Java arrays are such objects) to JNI.

Marco13 · 29 September 2014 at 02:18

The blocking aspect refers to the „data blocks“ that are read from or written to cl_mem objects. For the small arrays (like the parameters that are passed to clEnqueueNDRangeKernel) there is no blocking.

The exact behavior of a JVM concerning pinning is rather hard to grasp. The actual documentation only contains some rather fuzzy, general statements such as „some JVMs support pinning, and others don’t“. All this also depends on which garbage collector is used. In any case, concerning the question

Would that stall entire JVM with all other threads?

No. Pinning only means that the respective memory block may not be moved or reclaimed by the garbage collector while it is pinned. Again, it’s hard to say how this is done exactly. But it will (or should) not needlessly block other threads.

Piotr · 27 October 2014 at 11:31

Regarding AMD’s OpenCL 2.0 driver (RN-WINLIN-OPENCL2-14.41): they’ve updated the product compatibility list and it seems my card (Radeon HD 7870) is not on the list. Perhaps the original GCN 1.0 architecture isn’t compatible with OpenCL 2.0 (which is weird and sad).

Marco13 · 27 October 2014 at 14:07

They are still working on it. AMD is usually eager to provide support/updates for OpenCL (at least, compared to NVIDIA :rolleyes: for them, CUDA should prevail…). So one can assume that this list will still be extended.