GetPrimitiveArrayCritical and pointers to pointers

EDIT: This post has been a response in http://forum.byte-welt.net/threads/11248-Deadlock?p=82674&viewfull=1#post82674 , and the resulting discussion has been moved to this thread, because it was only remotely related to the original topic

Google is pretty empty on these topics. The behavior of Get*Critical differs significantly among GCs, and its use in general is discouraged, but it’s hard to find information on it. I agree this would probably be a bug in the JVM. If I can get more detailed info using a native debugger, I’ll report it to Oracle. I’m experiencing stalls in another JNI library I’m using, and I wonder if it’s related. Of course, Java 8 isn’t released yet, so it wouldn’t be too surprising to find such bugs.

Otherwise, use of Get*Critical can be avoided with direct byte buffers and copying. Large host arrays should be allocated through CUDA and kept in a direct buffer anyway.
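
To make that concrete, here is a minimal sketch of such an array->directBuffer->device copy with the JCuda driver API (this assumes the usual cuMemAlloc/cuMemcpyHtoD calls and Pointer.to(Buffer); in practice the staging buffer would be cached and reused rather than allocated per call):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUdeviceptr;
import static jcuda.driver.JCudaDriver.*;

public class DirectBufferCopy
{
    // Copy a float[] to the device via a direct staging buffer, instead of
    // letting the JNI layer pin the array with GetPrimitiveArrayCritical
    public static CUdeviceptr copyToDevice(float[] hostData)
    {
        // Stage the array contents in a direct buffer with native byte order
        ByteBuffer staging = ByteBuffer
            .allocateDirect(hostData.length * Sizeof.FLOAT)
            .order(ByteOrder.nativeOrder());
        staging.asFloatBuffer().put(hostData);

        // Copy from the direct buffer to the device: no critical section is
        // needed, because the buffer's memory lives outside the Java heap
        CUdeviceptr devicePtr = new CUdeviceptr();
        cuMemAlloc(devicePtr, hostData.length * Sizeof.FLOAT);
        cuMemcpyHtoD(devicePtr, Pointer.to(staging), hostData.length * Sizeof.FLOAT);
        return devicePtr;
    }
}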

Sure, there are reasons for the “Critical” part in the function name -_- However, I have never seen it really being “discouraged” - are there any particular reasons for that? Of course, the Get*Critical calls can be avoided with direct buffers. And many JNI libraries (primarily referring to things like JOGL and LWJGL) enforce the use of direct buffers and don’t allow arrays at all. This may not be so much of a problem when one designs an application from scratch. But … love’m or hate’m - arrays are part of the Java language, and they are used extensively, especially in performance-critical applications, where large amounts of data are plainly stored in some float[] arrays. And I always thought that the possibility to copy a Java array “directly” to the device (i.e. without the additional overhead of first copying it into a direct buffer, and then copying that to the device) is an important feature. Omitting this possibility would have made many things MUCH easier, but I thought that it should be worth it (although I have to admit that I only made a few small tests in the beginning, and no elaborate benchmarks for the overhead that an additional array->directBuffer copy causes in different application patterns).

The most certain drawback of the *Critical functions is that they pause compacting garbage collectors. Compaction means the GC can move objects around to eliminate fragmentation and make room for large objects. A non-compacting collector will throw an OutOfMemoryError when there is too much fragmentation, even if plenty of the heap (e.g. 30%) is free. G1 is a compacting collector; Concurrent Mark Sweep is not.

Because the GC is paused, any other thread that tries to allocate an object while a GC is required will stall until the critical section is exited (i.e., until ReleasePrimitiveArrayCritical is called). If the thread holding the critical section itself tries to allocate a Java object (perhaps indirectly) while a GC is required, or if it waits on a Java thread that is already stalled, it will deadlock.

From the javadoc:

The semantics of these two functions are very similar to the existing Get/ReleaseArrayElements functions. If possible, the VM returns a pointer to the primitive array; otherwise, a copy is made. However, there are significant restrictions on how these functions can be used.

After calling GetPrimitiveArrayCritical, the native code should not run for an extended period of time before it calls ReleasePrimitiveArrayCritical. We must treat the code inside this pair of functions as running in a „critical region.“ Inside a critical region, native code must not call other JNI functions, or any system call that may cause the current thread to block and wait for another Java thread. (For example, the current thread must not call read on a stream being written by another Java thread.)

These restrictions make it more likely that the native code will obtain an uncopied version of the array, even if the VM does not support pinning. For example, a VM may temporarily disable garbage collection when the native code is holding a pointer to an array obtained via GetPrimitiveArrayCritical.

Unfortunately, I can’t point to a resource describing the „significant restrictions“ in more detail.

I’m basically aware of the restrictions that are mentioned in the documentation, of course. But unfortunately, as you also mentioned, they are somehow vague:

  • What exactly is an “extended period of time”? I assume that the time necessary to copy the array contents cannot be considered an “extended period” (otherwise, these methods would be pretty much useless anyhow, wouldn’t they?).
  • The restriction that one “must not call other JNI functions” is also vague. Whether this rule is violated can be checked automatically by passing -Xcheck:jni as a VM argument. And for JCuda, it reports that other methods are called. Of course it does - because I’m diligently checking whether the *Critical calls caused an exception! (At least the exception-checking JNI calls MUST be possible, otherwise one could not do anything in case of an error except bailing out in a ‘System.exit’ style …). I could try to check again whether other JNI methods are called as well, but since I know this restriction and tried to avoid other JNI calls, this should not be the case.

I know that there are several different GCs, that this is a wide field with lots of research, and that new implementations appear in the JVM remarkably often. So I have to admit that I was not aware of the different behaviors concerning compaction - and their consequences. But your explanation sounds perfectly reasonable. Still, I’m not entirely sure how to cope with this. According to your description, this could only (?) happen when the thread waits on a blocked Java thread, or when a Java (!) object is to be allocated - but as long as there are no callbacks to Java from JNI, I think the latter cannot happen…?!

Well, I do see that pretty much any code in PointerUtils can be executing within a critical region. You’re calling initPointerData and getPointer from PointersArrayPointerData in a loop, so if the first parameter is an array, the initialization of all other parameters happens within the critical region of the first. A solution might be to delay calling PointerData::getPointer until all other JNI work is done and you’re ready to call cuLaunchKernel.

However, from the logging printout, I can’t be sure that’s the problem. Between the “initialization is deferred” and “finished initialization” there aren’t really any JNI calls except for GetPrimitiveArrayCritical itself. It could be a JVM bug. It could be a delayed issue from having called other JNI functions.

I need to load debug symbols for both JCuda and the JVM to have a better clue.

Yes, besides the general „TODO The PointerData handling should be cleaned up“, the PointersArrayPointerData is the most difficult case. The „deferred“ initialization of the ArrayBufferPointerData of course aimed at avoiding JNI calls in the critical region, and in general, the blocks between
initPointerData + getPointer and
(releasePointer) + releasePointerData
are kept as small as possible for the same reason. But when someone creates a

Pointer p = Pointer.to(
    Pointer.to(someArray0),
    Pointer.to(someArray1),
    Pointer.to(someArray2));

then the actual schedule for acquiring/releasing the inner primitive arrays may become tricky. I cannot guarantee that there are no usage patterns where other JNI calls (except for the exception checks) are made in a critical section - and strictly speaking, the problem would then not be a JVM bug, of course. Although I currently cannot imagine a realistic use-case for „a pointer to an array of pointers to Java arrays“, I probably have to tackle the above-mentioned ‚TODO‘ as soon as possible - including an analysis of possible limitations (and possibly including the information about the blocking/unblocking behavior of CUDA calls, which was finally specified in one of the more recent CUDA versions - mapping this to JCuda has also been on my ‚todo‘ list for quite a while now…).
In any case: Thanks for having a closer look at this, I really appreciate your comments and support! :slight_smile:

„a pointer to an array of pointers to Java arrays“ is, in fact, what my Matrix code requires. I usually pass three arrays, representing three matrices.

I think you should simply get rid of GetPrimitiveArrayCritical in cuLaunchKernel. There is no reason for it, since a user can’t pass a large (or even medium-sized) array by value to a kernel. For tiny arrays, I expect the performance of *Critical to be worse.

In the docs there is no indication that *Critical throws any Java exceptions other than OOM, and the code sample does not call ExceptionCheck. I would simply check for NULL, like in the sample.

EDIT: The following paragraphs caused a parallel discussion (pun intended :smiley: ) that has been copied to its own thread: http://forum.byte-welt.net/threads/11262-Java-GPUs-and-parallel-programming-in-general . The posts here have been edited to remove the parts that have been moved to the other thread, but no text has been modified.

I really appreciate you putting together this library. I used to meddle in C++ and CUDA, but hadn’t touched them for years, instead working in C# and then Java. When I had a project that required CUDA, I tried going back to C++. Yet things that I’ve come to think of as basic OOP were missing (especially since VS10 doesn’t support C++11), and lots of other minute rules and crud were everywhere. I felt like I was in bizarro world, and couldn’t write the code I wanted. Then JCuda came along. It’s turned out to be really great. I can’t imagine why anyone would want to write CUDA in C++. Thank you!

I’ve also tried JavaCL, and that seemed great too, but I don’t have a background in OpenCL, was confused by the availability of BLAS libraries, and read that I wouldn’t be able to debug the kernel. I did like, though, how it came with a Maven plugin for compiling kernels and creating Java wrappers. I wrote my own Maven plugin that takes care of compilation, but I create the wrappers manually. I also checked out Aparapi, and concluded that the authors had the wrong goals. The concept (compiling Java bytecode for the GPU) I think is good, but they didn’t share the emphasis on performance. They want to hide essential architectural details and make it „easy.“ In fact, I think they (i.e., AMD) want to pipe pure Java through the GPU, such as standard Java arrays-of-pointers-to-objects. Of course, that is ridiculous from a performance perspective (e.g., no memory coalescing on any level).

**EDIT: This discussion is continued at http://forum.byte-welt.net/threads/11262-Java-GPUs-and-parallel-programming-in-general **

Umm… the cuLaunchKernel call should not involve any Get*Critical calls. A PointerData is initialized there, but this is a ‚PointersArrayPointerData‘ - that is, it contains an Object array with additional pointers. For each of these pointers, another PointerData is created. But these are not pointers to primitive arrays either (usually, they will just be 'NativePointerObjectPointerData’s, that is, essentially accessors for the ‚long‘ value that is stored in the Java object). The only place where the *Critical calls occur frequently (and, BTW, the only place where this makes sense for me ATM) is in memcopy operations, where a Pointer.to(someArray) is used as the source or destination pointer.
In any case, I’d consider any special handling of Pointers for certain methods only as a last resort, and would introduce something like this only when I can definitely pin a certain bug down to the respective method and don’t find another (general) solution.

In the docs there is no indication that *Critical throws any Java exceptions other than OOM, and the code sample does not call ExceptionCheck. I would simply check for NULL, like in the sample.

Yes, I noticed that, and I will definitely consider doing this, to avoid the JNI warnings (it then also becomes easier to find places where a „really“ wrong method is called). The behavior should be the same, according to the docs.

I thought you understood my Matrix code. http://forum.byte-welt.net/threads/11250-Obtaining-native-pointer-from-Pointer-object?p=82645&viewfull=1#post82645 It involves *Critical calls in cuLaunchKernel.

I understood the matrix example. But it was never intended (and is still questionable) that some values are stored in a Java primitive array and interpreted as a pointer on the native side. This was achieved by drilling a hole into the “API”. (And even a real API that was designed by religiously following the SOLID principles can be misused, not to mention the “not-API” of CUDA/JCuda). It might be possible to use the “deferred initialization” not only for pointers, but also for pointers to pointers. But I’ll have to think about how this might be implemented, which side-effects it might have, and which other, non-intended uses might already exist that would break after such a change. At the moment, however, I don’t see this as a critical issue, and it should not be “fixed” hastily. Hint:

ByteBuffer bb = ByteBuffer.allocateDirect(20); // <- !!!
bb.order(ByteOrder.nativeOrder()); // the buffer has to use the native byte order
bb.putLong(0, deviceAddress);
bb.putInt(8, width);
bb.putInt(12, height);
bb.putInt(16, pitch);
return bb;

I’m aware that there are limitations, and “things that (c/sh)ould be better”. I have already thought about possible problems related to use cases like yours. Particularly, I have an uncomfortable feeling when imagining that someone could create a Pointer.to(arrayOfOneMillionOtherPointers), which in C would be perfectly feasible, but in Java/JCuda would cause a significant (and probably unexpected) overhead. Other difficulties might arise with even more deeply nested pointers. The kernel parameters already are ***pointers, but what if someone wants to use *******pointers? What if someone wants to write the value of one of the innermost fields of such an array? There are limitations, and when I’m made aware of them, I’ll think about how to handle them. But at a certain point, I probably have to accept the consequences of the seemingly trivial fact: Java is not C.

Using a direct buffer is an option, though a bit messy because you have to cache it (it’s slow to allocate).
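
For what it’s worth, the caching can be as simple as one (growable) staging buffer per thread. A rough sketch, with made-up class and field names, just to illustrate the idea:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper that reuses one direct staging buffer per thread,
// so the (slow) allocateDirect call is not paid on every copy
final class StagingBuffers
{
    private static final ThreadLocal<ByteBuffer> CACHE = new ThreadLocal<ByteBuffer>();

    static ByteBuffer acquire(int requiredBytes)
    {
        ByteBuffer bb = CACHE.get();
        if (bb == null || bb.capacity() < requiredBytes)
        {
            bb = ByteBuffer.allocateDirect(requiredBytes)
                .order(ByteOrder.nativeOrder());
            CACHE.set(bb);
        }
        bb.clear(); // reset position and limit for reuse
        return bb;
    }
}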

The support for arbitrarily nested pointers that you describe isn’t necessary. In fact, I think there is no need for nested Pointers. Memcpy can’t use them. cuLaunchKernel uses them, but it shouldn’t; in fact it’s very confusing that it does. This is one situation where the CUDA API shouldn’t be copied exactly. Instead, you should make a KernelParams object that takes primitive types, CUdeviceptrs, and int/long arrays or (non-direct) ByteBuffers (for emulating pass-by-value structs). Calling a kernel shouldn’t involve taking a pointer to anything, because that’s not how method calls normally work. The KernelParams class can use the builder pattern, or it can expose an Object varargs and check the types at runtime. The Pointer.to(Pointer…) method should be removed.
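
Something along these lines, perhaps - only a sketch of the proposed (hypothetical) KernelParams class, not part of the actual JCuda API; internally it still builds the pointer-to-pointer, but the user never sees it:

import java.util.ArrayList;
import java.util.List;

import jcuda.Pointer;
import jcuda.driver.CUdeviceptr;

// Hypothetical builder for kernel arguments: the caller only adds values,
// and the builder assembles the pointer-to-pointer internally
public final class KernelParams
{
    private final List<Pointer> args = new ArrayList<Pointer>();

    public KernelParams put(int value)
    {
        args.add(Pointer.to(new int[] { value }));
        return this;
    }

    public KernelParams put(float value)
    {
        args.add(Pointer.to(new float[] { value }));
        return this;
    }

    public KernelParams put(CUdeviceptr pointer)
    {
        args.add(Pointer.to(pointer));
        return this;
    }

    // What cuLaunchKernel would eventually receive as its 'kernelParams' argument
    Pointer toKernelParams()
    {
        return Pointer.to(args.toArray(new Pointer[0]));
    }
}

A launch would then assemble its arguments as new KernelParams().put(n).put(deviceIn).put(deviceOut) and hand the result of toKernelParams() to cuLaunchKernel.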

Gymnastics using GPU pointers (like what I’m doing) don’t involve the Pointer class, or Pointer arrays.

*** Edit ***

(Feel free to move this thread)

I’m not sure whether you know how the kernel invocation worked in CUDA 1.0-3.2: It was much more complicated and error-prone, because one had to specify the arguments individually, each with its size AND its alignment - it was a hassle. And it was one of the main reasons why I created the KernelLauncher utility class, which seems to be close to the „Builder style“ invocation of kernels and the varargs kernel call that you proposed :wink:

However, the ‚kernelArgs‘ pointer of CUDA 4.0 greatly simplified the kernel argument setup. Of course, I could have simplified it even further, and could have introduced an own ‚KernelArgs‘ class to encapsulate (and hide) this pointer to pointers. But this is not a general solution. There are other methods that use Pointers to Pointers. For example, cuModuleLoadDataEx - and unfortunately, for this method I felt the necessity to introduce the JITOptions class. I’m really not happy with that. Other methods need arrays of pointers, like nppiBGRToYCrCb420_709CSC_8u_AC4P3R from JNpp, and of course, more such methods could be introduced in future versions, even in the „core“ CUDA API. It should be possible to treat all these methods equally, as far as possible, and to offer the same possibilities as the C API. Simplifications here could be dangerous, because it’s hard to foresee how NVIDIA will change the CUDA API, and how such a „simplification“ could be adapted to such changes.
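
For reference, a minimal sketch of what this CUDA-4.0-style setup looks like on the Java side, following the pattern of the JCuda vector-add sample (the function, device pointers and sizes are placeholders):

import jcuda.Pointer;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;

class KernelLaunchSketch
{
    // The kernel arguments are passed as a single Pointer to an array of
    // Pointers, each pointing to the memory that holds the actual value -
    // no sizes or alignments have to be specified any more
    static void launch(CUfunction function, int numElements,
        CUdeviceptr deviceInputA, CUdeviceptr deviceInputB, CUdeviceptr deviceOutput,
        int gridSizeX, int blockSizeX)
    {
        Pointer kernelParameters = Pointer.to(
            Pointer.to(new int[] { numElements }),
            Pointer.to(deviceInputA),
            Pointer.to(deviceInputB),
            Pointer.to(deviceOutput)
        );
        cuLaunchKernel(function,
            gridSizeX, 1, 1,   // grid dimensions
            blockSizeX, 1, 1,  // block dimensions
            0, null,           // shared memory size and stream
            kernelParameters, null);
    }
}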

For the cuLaunchKernel case, I recently stumbled over something like that: I have an old GPU, with a low Compute Capability. And I did not consider the fact that newer devices with higher Compute Capability can allocate device memory IN kernels! The need to write back the pointer values into the pointer-to-pointers caused some headaches, but I hope that - although I’m not really satisfied with the current solution, and it needs to be cleaned up - this should now also work for cases where other methods might modify the „inner“ pointer values of a pointer-to-pointer.

In any case, I see your point about simplification (or maybe just about adhering to Java conventions), but completely omitting the possibility to have pointer-to-pointers would reduce the expressiveness of the API in a way that simply cannot be accepted.

One thing that I learned in my life as a software developer: When it seems that there is an easy solution, this solution is likely to break later :wink:

EDIT: In fact, even THIS thread could be split further, because it’s not about deadlocks any more, but about Pointer-to-Pointer and API aspects… I’ll consider this… :wink:

I see. I have not used these other functions, but I’ll accept that they should use Pointer-to-Pointers.

But… would they not be susceptible to the same bug in PointersArrayPointerData? You should either fix the bug (i.e., delay the call to getPointer), or make two versions of PointerData::getPointer, a non-critical and a critical version. Only memcpy should call the critical version.

Regarding in-kernel malloc, I don’t believe it matters. The parameters (i.e., everything in the Pointer-to-Pointers) are passed by value.


Pointer.to(Pointer.to((CUdeviceptr)devicePtr))

becomes


__global__ void func(void* memory)
{
    // memory = malloc(...);               // wrong: only modifies the local parameter copy
    *(float**)memory = (float*)malloc(...); // correct: writes the new pointer into device memory
}

The pointer itself will not be modified; only the device memory is modified (as it always is). The only new element is that the device memory may hold pointers, and we need a method like CUdeviceptr.from(long) to do:


// ... memcpy devicePtr to hostBuffer ...
CUdeviceptr newDevicePointer = CUdeviceptr.from(hostBuffer.getLong());

Anyway, I think you can keep Pointer-to-Pointers for the other use-cases (which few people use), but you should simplify cuLaunchKernel (which everyone has to use).

*** EDIT ***

I’m sure you realize that this is non-sensical:


Pointer.to(Pointer.to(Pointer.to(whatever)))

Since this will pass a CPU pointer to the GPU.

Although I do not (yet) consider the current implementation a „bug“, I see that it has to be reviewed, and I’ll try to do this soon.

Regarding in-kernel malloc, I don’t believe it matters. The parameters (i.e., everything in the Pointer-to-Pointers) are passed by value.

When this bug showed up the first time, I created an example/test case; maybe I can post it at the beginning of next week.

I’m sure you realize that this is non-sensical:

Pointer.to(Pointer.to(Pointer.to(whatever)))

Since this will pass a CPU pointer to the GPU.

Not necessarily. It will only pass the pointer to a native function, and this function might expect host pointers. This frequently appeared in JNpp. Unfortunately, in many cases it was simply not specified whether host or device pointers were required. And even worse: Some functions were specified as needing host pointers, although they needed device pointers. And the worst: In some cases, this behavior changed (silently!) from one version to another. It’s not easy (and there’s a reason why JNpp is marked as an „early beta“ version :wink: )

For cuLaunchKernel, at least, it is non-sensical.

Yes, there are now several construction sites (or at least things that should be reviewed):

  • Replace exception tests after Get*Critical calls with NULL checks, to prevent (unnecessary) warnings
  • Clean up pointer data handling, particularly PointersArrayPointerData
  • Consider “deferring” the Get*Critical calls as long as possible
  • Consider a dedicated utility class for the kernel arguments (in any case: In addition to the one that mimics the original API)
  • Consider a special treatment of the kernelArgs pointer-to-pointer

I’m not sure when I can tackle each of these points, and the progress partially depends on schedules that I cannot influence (CUDA 6.0 will come soon), but I’ll try to allocate some time for that.

Thanks!