Multiple threads -- GetPrimitiveArrayCritical


#1

JCuda makes extensive use of GetPrimitiveArrayCritical and related JNI methods. However, it violates the contract of the method and this causes deadlocks. The contract states:

After calling GetPrimitiveArrayCritical, the native code should not run for an extended period of time before it calls ReleasePrimitiveArrayCritical. We must treat the code inside this pair of functions as running in a “critical region.” Inside a critical region, native code must not call other JNI functions, or any system call that may cause the current thread to block and wait for another Java thread.

JCuda calls JNI functions inside the critical regions. For example, method initPointerData can call GetPrimitiveArrayCritical back-to-back (calling it inside a critical region). initMemcpy2DData calls initPointerData followed by more JNI calls. I can’t figure out how cuMemcpyHtoDAsync works (would love if you explained it), but presumably it leaves the critical region open until the memcpy finishes. That would be really bad.

More information about Get*Critical: https://shipilev.net/jvm-anatomy-park/9-jni-critical-gclocker/

Currently, I am experiencing JVM crashes when running multiple copies of some neural network code in their own threads, and I suspect this to be the cause. I’m using JCuda 8, jblas, and jcudnn.

  • Would it be possible to create a version of JCuda that doesn’t use Get*Critical methods?
  • Do you have JCuda binaries with enabled logging?
  • Does Logger::log need to call fflush?

#2
  • Does Logger::log need to call fflush?

Adding fflush in the logger should not strictly be necessary, but could indeed be helpful. I think the exact behavior is implementation/OS-dependent anyhow. But I’m aware that the log messages do, for example, not immediately appear in the Eclipse console. Maybe this can be achieved with fflush, I’ll try this out.


  • Do you have JCuda binaries with enabled logging?

The log level that is set on Java side is directly passed to the native side. So it is possible to call

JCuda.setLogLevel(LogLevel.LOG_DEBUGTRACE)

to obtain really detailed log output (so detailed that it is not really specified and recommended, but really only intended for debugging things exactly like the lowest-level memory handling. Some more hints on the Critical parts will follow below)


Regarding the JVM crashes: These should usually cause a dump file (hs_err...txt) to be written. It could be helfpul if you provided the relevant section of this file (you may likely omit the part that contains information about the libs that are loaded on your system etc). It would be even better if you had some small sample/test program where the problem could be reproduced. But I see that this may be difficult due to the potential complexity of the application and the hard-to-predict elements that are involved here (mainly the GC).


The main topic. namely the Critical methods:

In general, I’m aware of the implications of using the Critical methods. And I tried my best to cope with them. Although it’s hard to make guarantees here. (Certain aspects of the behavior are not specified or depend on subtle, unpredicable things like the GC). For example, I know that running a JCuda application with the -Xcheck:jni flag causes warnings to be printed, about JNI functions being called inside critical regions. But to the best of my knowledge, this is not necessarily an error, but a rather general hint that something might be wrong.

I also tried to minimize the actual critical sections. For example, initPointerData will not call any Critical methods directly. They are only supposed to be called when the actual pointer is obtained. Some pseudocode:

// NO Critical section is opened here
PointerData pointerData = initPointerData(somePointerToJavaArray); 

// These calls are safe
someJniCall();
someJniCallThatMayEvenBeBlocking();

// Here, the critical section is opened
useTheData(pointerData->getPointer());

// JNI calls should be avoided here, but as long as 
// the are not IO-calls or otherwise blocking, they
// SHOULD still be safe...

// Here, the critical section is closed
releasePointerData(pointerData);

I can’t figure out how cuMemcpyHtoDAsync works (would love if you explained it), but presumably it leaves the critical region open until the memcpy finishes. That would be really bad.

Despite its name, the cuMemcpyHtoDAsync function is indeed blocking with respect to the host. I once dissected the statements that had been made in one of the CUDA specs, and summarized them at http://jcuda.org/jcuda/JCuda.html#Asynchronous :

Asynchronous memory copy operations:

“For transfers from pageable host memory to device memory, host memory is copied to a staging buffer immediately (no device synchronization is performed). The function will return once the pageable buffer has been copied to the staging memory. The DMA transfer to final destination may not have completed.”

This means that for the host, the function behaves like a synchronous function: When the function returns, the host memory has been copied (and will only be transferred to the device internally).

So: Yes, the critical section is left open until the memcpy finishes. After all, this is the intended usage pattern of the Critical methods. I assume your statement that this would be “really bad” referred to some loss of asynchronity, but since cuMemcpyHtoDAsync is actually blocking, this should not be the case.


The most critical part (no pun intended) seems to be the initMemcpy2DData call. This does have to open the critical section, and does a considerable amount of work before the critical section is closed. But this work only consists of “trivial” operations (namely, obtaining field values), so I think it should be safe.


Disclaimer: All the statements that I made here are to the best of my knowledge. This might still mean that they are outright wrong, or that I overlooked a certain application pattern or caveat. The interdependencies between async operations and the GC are tricky, and although JCuda has been under development for ~8 years now, there may still be issues.

So if you encounter any problem (like the crash), I’ll try to figure out what might cause it. And if you encounter something in the implementation that is obviously wrong, of course, I’ll try to fix it.


As mentioned above: Running the application with JCuda.setLogLevel(LogLevel.LOG_DEBUGTRACE) enabled should also print information about the places where a Critical call actually takes place. This should only be the case when obtaining the pointer to a Java array, as indicated by the line https://github.com/jcuda/jcuda-common/blob/master/JCudaCommonJNI/src/PointerUtils.hpp#L685 , and the corresponding releasePointer call. Maybe this already gives some hints.


As for the point of offering a version of JCuda that does not use Critical methods: I’ll certainly consider this. But this would touch some of the implementations where I’d really have to take great care to not break something that worked until now. Maybe it can be solved with some compile-time/#define trickery, but I’ll have to allocate some time to take another look at this.


#3

Thank you for your response. I will try setting debug level. I hope debug statements won’t get swallowed. I will also try to get a dump and hopefully even a reproducible example.


I think the problematic thing inside Critical methods is not to block, but to allocate on the heap (or block on an operation which tries to allocate on the heap). Of course, there is no documentation about which JNI methods allocate, but it seems that GetObjectField does. It creates a local reference to the returned object so that it doesn’t get collected.

Here’s what I tracked down. GetObjectField calls jni_GetObjectField in jni.cpp. It calls JNIHandles::make_local in jniHandles.cpp which calls allocate_handle. The handle class is documented as: “In order to preserve oops during garbage collection, they should be allocated and passed around via Handles within the VM. A handle is simply an extra indirection allocated in a thread local handle area.” An oop is “An object pointer. Specifically, a pointer into the GC-managed heap.”


Regarding cuMemcpyHtoDAsync: The documentation may have cleared this up only recently, but now it states:

Memcpy
In the reference documentation, each memcpy function is categorized as synchronous or
asynchronous, corresponding to the definitions below.

Asynchronous

  1. For transfers from device memory to pageable host memory, the function will return
    only once the copy has completed.
  2. For transfers from any host memory to any host memory, the function is fully
    synchronous with respect to the host.
  3. For all other transfers, the function is fully asynchronous. If pageable memory
    must first be staged to pinned memory, this will be handled asynchronously with a
    worker thread.

So this method is indeed async and broken. I haven’t used cuMemcpyHtoDAsync with pageable memory. WIth pinned host memory (manually allocated via cuMemAllocHost, of course), it worked fine.


#4

Thanks for these pointers. I’ll have to take a closer look at all this ASAP.

The documentation indeed has been updated. The emphasized statement

If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread.

actually was already part of the initial version. But the initial versions contained additional points, mentioning clearly what I already quoted, about the pageable memory. These statements seem a bit contradictory for me (to say the least) - but maybe that’s one of the reasons of why they reworded it, assuming that the behavior did not actually change.

Whether or not the allocation in GetObjectField actually is a problem (or: under which conditions) is not entirely clear for me. I’ll have to re-read the article that you mentioned (and the ones that it links to).

If you have any ideas or recommendations from your research so far, I’d be glad to hear them.

I always tried to support standard Java arrays in JCuda as far as reasonably possible. Other JNI libraries usually just bail out when they receive anything that is not a direct byte buffer, but this is often just inconvenient: Data that is already available e.g. as a float[] array (and usually, as a large array!) would have to be copied to a direct ByteBuffer, only to pass it into the JNI layer, where it is only copied to the device. Directly copying the Java array to the device (by using the Critical functions) seemed worthwhile, considering the performance- and memory overhead that it could avoid.

(Although, admittedly, I didn’t expect it to be so tricky, back when I started JCuda in 2008…)


#5

The article I linked doesn’t go into details, and internet searches turn up only ignorance and confusion. This is not the first time I researched the issue, but it’s become clearer in my head and reading the HotSpot source helped a lot.


Let me just summarize my findings.

  • It is ok to call *Critical inside a critical section.
  • Critical sections will cause other threads to stall either because those threads need to do allocations or because they call Get*Critical.
  • The flags -XX:+PrintJNIGCStalls -XX:+PrintGCDetails will print useful log messages about stalled allocations and collections. (See gcLocker.cpp for exact messages.)

I think it is possible to use GetPrimitveArrayCritical safely, but the stalls could cause big slowdowns in multithreaded code because the critical regions can be left open for a long time waiting for CUDA methods to return.

It should be easy to replace GetPrimitiveArrayCritical calls with Get*ArrayElements calls, possibly with a flag at runtime. No reason to bail out. I like the Pointer mechanism, especially as a library user.


Async memcopies aren’t compatible with *Critical. Moreover, they require a clean up mechanism for Release*ArrayElements but also for direct buffers. The reason is that you need to create and hold a reference to the buffer in case the Java code does not keep one, and you need to destroy it afterwards. A cleanup mechanism probably won’t fit into a thin wrapper.

I think it makes sense to just restrict Async memcopies to manually-managed, CUDA-allocated host or unified memory.


My analysis of the HotSpot code:

We know GetObjectField does allocations. GetPrimitiveArrayCritical itself does not seem to do allocations.

jni_GetPrimitiveArrayCritical in jni.cpp calls GC_locker::lock_critical(thread) before doing anything, so everything else must be ok with being inside a critical region. In fact, the code of this method is simple. It locks the GC and just returns the pointer that is stored in the reference.

However, what does lock_critical do and can it cause problems itself? lock_critical in gcLocker.inline.hpp doesn’t do anything except increment a counter (via thread->enter_critical()) if the thread is already in a critical region (proving that the JVM is designed for back-to-back calls to *Critical), but it will block in GC_locker::jni_lock in gcLocker.cpp if “we know at least one thread is in a JNI critical region and we need a GC.” I guess if it didn’t do that, the GC might be delayed indefinitely.

gcLocker.cpp includes interesting logging code.


#6

It has been a while since these points had been directly relevant. Before CUDA 4.1, there had not been any details in the spec about the async behavior at all. But when this information became part of the spec, I reviewed the relevant parts (and IIRC, this was also around the time when I refactored the memory handling to use the “deferred” calls to the Critical methods). I thought (or rather hoped) to have reached a relatively stable state now.

Looking at the hotspot code may help at some points. Fortunately, things like the gcLocker.hpp header contain some comments, but the HotSpot is a complex beast, and even with comments, this feels a lot like reverse engineering and guesswork. (And the current implementation is still only one implementation - as long as it’s not specified, it’s always brittle to rely on that).

For example, (disclaimer: I only started to look at the code, but) I don’t see a reason why allocating in a critical section will necessarily cause problems. (It might cause problems, intuitively, when allocation requires a GC, but there’s certainly much (much) more behind that).

It should be easy to replace GetPrimitiveArrayCritical calls with Get*ArrayElements calls

This will of course always be possible. But the arrays in a GPU application may be large. Using the Get...ArrayElements methods would (from a user perspective) be much more convenient than manually copying arrays into direct ByteBuffers, but would still suffer from the same problem: In order to get the elements of a 1GB array, a new 1GB array has to be allocated and filled - and again, just in order to copy this new array to the device, and then delete it. Considering the fact that memory transfers already are the bottleneck in most GPU applications, I’d really like to avoid this…

Async memcopies aren’t compatible with *Critical. Moreover, they require a clean up mechanism for Release*ArrayElements but also for direct buffers. The reason is that you need to create and hold a reference to the buffer in case the Java code does not keep one, and you need to destroy it afterwards. A cleanup mechanism probably won’t fit into a thin wrapper.

I think it makes sense to just restrict Async memcopies to manually-managed, CUDA-allocated host or unified memory.

This is true. And in fact, in JOCL, I went some extra miles for creating some cleanup mechanisms, and for making sure that there is no interference between aync operations and GC - although the only solution that seems viable here is the “sledgehammer approach”: Any call to an async copy that involves Java arrays will result in an UnsupportedOperationException. The async copies are only possible with direct- page-locked- or device-memory.

(In OpenCL, the async behavior is controlled via a flag, and not via dedicated ...Async methods, but that’s only a minor difference to CUDA. More importantly: The behavior in OpenCL was specified in detail. Maybe I can transfer some of the approaches of JOCL to JCuda for this case, but I still have to analyze it further…)

For JCuda, everything seemed safe, because only device-to-device-copies had been really asynchronous. Everything else was essentially “synchronous for the host”. (At least, I thought so, until the updated spec…).


So the safest ways would probably be to

  • either replace the Critical methods with the Get...Elements ones - with the potential drawback of large allocations+copies happening internally, or
  • throw an exception when trying to use Java arrays in async operations

I think that the latter may be the better approach here. The impact that this may have to existing applications seems to be more foreseeable: They might throw an exception after the update, but only if they used operations that may be unsafe. The exception would make this unambiguously clear, and the fix for the library user would be trivial: “Just remove the ...Async suffix”.

I think that disallowing async operations should only be necessary for real Java arrays, though. Async operations with direct buffers should still be safe (or do you think that they should be disallowed, too?)

I hope that I can allocate some more time soon, to investigate this further, read a bit more of the code, and try out the logging options that you mentioned. If you gain any further insights, or have other suggestions, I’d really appreciate your input!


EDIT: BTW: I just looked at some older related threads. Things like “deferring” the criticial section (when using pointers to java arrays) had in fact been done in response to these threads. It’s a bit depressing that there still seem to be more deeply hidden flaws, but maybe they can eventually be ironed out. (At least, until NVIDIA introduces a new CUDA version with some new concepts that haven’t been anticipated until now…)


#7

Well, what is not reverse engineering is to heed the contract of Get*Critical.

To summarize:

  • Do not call JNI methods.
  • Do not make long pauses
  • You can nest multiple Get*Critical
  • Check the return value for NULL
  • Set mode correctly as in Get<PrimitiveType>ArrayElements because we may receive a copy.

According to the doc, you shouldn’t be using Get*Critical at all, because CUDA functions make long pauses. By trying to understand how Get*Critical works, I’m actually trying to justify its (ab)use. The safe thing is not to use it. You can try contacting Aleksey Shipilev, the author of the article. He’s a JVM developer and probably wouldn’t mind answering questions. I would suggest to him that JNI calls should check if they are in a critical section, making all library authors aware of this problem.

You should give users the option to use Get<PrimitiveType>ArrayElements. JCuda encourages passing small arrays for many things like parameters, and it’s inconvenient to get rid of arrays completely. For all other things, the user can use page-locked memory, which is faster. I will take this option. Even after correcting the deadlock in initMemcpy2DData, Get*Crticial-induced thread stalls are undesirable for me.


I think that disallowing async operations should only be necessary for real Java arrays, though. Async operations with direct buffers should still be safe (or do you think that they should be disallowed, too?)

Async operations with GC-managed direct buffers will not be safe unless you use NewGlobalRef and DeleteGlobalRef. Otherwise, the Java caller can lose their reference to the buffer and GC can collect it. Calling DeleteGlobalRef would require a cleanup mechanism.


Oh yeah, I’ve already complained about all this :wink: I appreciate that you made corrections. I haven’t had deadlocks until recently (probably because my usage pattern changed a bit), but stalls have been a problem that I’ve been brushing off.


#8

Sure, I’ve tried to obey the contract of the Critical methods, carefully considering the implications that it has regarding (async and sync) memory copies. And some points of your checklist are already considered:

  • Set mode correctly:
    • Done
  • You can nest multiple Critical:
  • Check the return value for NULL
    • Done. This was not done in the early days of JCuda, but has been changed to a NULL-check (mainly based on your feedback in other threads - thanks again, I really appreciate that!)

The remaining points are

  • Do not make long pauses
  • Do not call JNI methods

Regarding the first point, the documentation says:

After calling GetPrimitiveArrayCritical, the native code should not run for an extended period of time before it calls ReleasePrimitiveArrayCritical.

First of all: It says that the code should not run for an extended period of time. This is different from saying that it must not run for an extended period of time, at least based on my understanding of these words, in the sense of RFC2119.

But possible nitpicking aside:

The example that is shown in the documentation does two (nested) Critical calls, with the intention to copy the data from one array to the other.

If copying the contents of such an array already counted as “running for an extended period of time”, then I wonder what use-case the Critical methods should actually have. To my understanding, they have exactly this goal: I want the whole array, for a short time (namely, as long as it takes to copy the contents of the array), and the JVM should try hard to not create an additional copy just for this operation. (Or to put it that way: There is no operation that needs the whole array and that still takes less time than just reading/copying the whole array once…)

Regarding the second point, “Do not call JNI methods”, the documentation says:

We must treat the code inside this pair of functions as running in a “critical region.” Inside a critical region, native code must not call other JNI functions, or any system call that may cause the current thread to block and wait for another Java thread. (For example, the current thread must not call read on a stream being written by another Java thread.)

Here, the word “must” is used. However, I’m not entirely sure whether this really means that there must not be any JNI calls at all, or whether it only excludes those that “…that may cause the current thread to block and wait for another Java thread”. I cannot imagine the first one, from a technical point of view. I mean, not even an env->ExceptionCheck() should be allowed? Even though the error check for the Critical method itself is done with a NULL-check, I cannot imagine this to be true.

(But admittedly, some hints from the code that you linked to indicate that this might really be the case. For example, in gcLocker.cpp Line 131, there is a comment

ResourceMark rm; // JavaThread::name() allocates to convert to UTF8

which might mean that the ::name() function is not used there because of some deeply hidden allocation. Whether this directly implies a strict limitation for JNI code is hard to say)


The idea of asking Алексей for input here is probably a good one. (Although I’m in the dilemma between potentially receiving helpful and profound input, and potentially being told that the whole JNI part of JCuda is fundamentally broken :confused: At least, I’m already aware of some points that could and should be improved, so it’s not unlikely that he’d be able to review the code into oblivion). I really liked his in-depth discussion of other topics, e.g. https://shipilev.net/blog/2015/black-magic-method-dispatch/ , but hadn’t seen his most recent posts about the JVM anatomy.

Letting JNI calls automatically emit warnings when they are called in a critical section should not be necessary: This can already be checked with the Xcheck:jni argument. Furthermore, the documentation of this flag (that I already referred to in one of the earlier posts) says

The general recommendation is not to use other JNI functions within a JNI critical section, and in particular any JNI function that could potentially cause a deadlock. The warning printed above by the -Xcheck:jni option is thus an indication of a potential issue; it does not always indicate an application bug.

Which (for me) means that it <handwaving> may sometimes be OK to call JNI methods </handwaving>, although the details are not specified.


There are not so many places in JCuda where small primitive arrays are used. For the kernel arguments, there are arrays of pointer objects, but the main use case of primitive arrays is that of passing e.g. a 1000x1000-matrix (which is given as a float[] array) to the JNI side, to let it be copied to the device.

And again, this was (supposed to be) safe even for the Async case, because the CUDA function was supposed to be blocking when host memory was involved. This will have to be updated, meaning that Async operations should be disallowed for Java arrays, and as you said, probably also for direct buffers - so it can probably be summarized by saying that Async calls should be disallowed for any pageable memory.


#9

There are not so many places in JCuda where small primitive arrays are used.

I pass primitives (ints, floats) to kernels as single-element arrays. How do you do it?


Regarding long pauses: Yes, it’s a “should.” How long is long? Long enough for the GC to notice. How quickly will the GC notice? It depends on allocation pressure. But suffice it to say that CUDA calls can block for ages (ie, seconds). CUDA methods block when the command queue fills up with small kernel launches, and they block even longer when it fills up with large kernel launches. Meanwhile, a modern 16-core machine can allocate objects very quickly, requiring very frequent young generation GCs.

Waiting for CUDA to do a memcpy can be much longer than the time it takes for the memory to be copied. Although you raise an interesting question–does a large System.arraycopy lock out the GC?


Inside a critical region, native code must not call other JNI functions, or any system call that may cause the current thread to block and wait for another Java thread.

However, I’m not entirely sure whether this really means that there must not be any JNI calls at all, or whether it only excludes those that “…that may cause the current thread to block and wait for another Java thread”.

Yes, it includes all JNI calls. The English here is pretty clear.

The general recommendation is not to use other JNI functions within a JNI critical section

Which (for me) means that it may sometimes be OK to call JNI methods , although the details are not specified.

Yeah, it might be ok if you’re a JVM developer and don’t care about tying yourself to the current version of a particular JVM. Who in their right mind would write documentation like that?

Earlier I said that GetObjectField allocaties a local reference. Now I see there is a method EnsureLocalCapacity which ensure that a specified number of local references can be allocated (meaning, I guess, without triggering a GC). Presumably it could be used to guarantee that GetObjectField won’t deadlock in a critical region. But nowhere is it documented how many local references GetObjectField needs. All that the doc says is “the VM may give the user warnings that too many local references are being created.” Gee, thanks. The responsible programmer must conclude, no it is not ok to call any JNI functions.

I mean, not even an env->ExceptionCheck() should be allowed?

There is no reason to call ExceptionCheck inside your critical region. You can’t call Java code.

The idea of asking Алексей for input here is probably a good one. Although I’m in the dilemma between potentially receiving helpful and profound input, and potentially being told that the whole JNI part of JCuda is fundamentally broken :confused:

I hope you’re joking. It’s clear we need to contact Shepilev. I’m tired of arguing with library devs about this, and with the proliferation of cores in modern CPUs, the problem is becoming more acute. I really hope Shepilev can write a good, long article clearing this up once and for all.


it can probably be summarized by saying that Async calls should be disallowed for any pageable memory

Technically, all manually-managed buffers are OK. Is it possible to distinguish a GC buffer created with ByteBuffer.allocateDirect and a JNI buffer created with NewDirectByteBuffer?


#10

(Some of the pending tasks make it difficult to focus on the core topic here, but some comments regarding the branches of the discussion:)


I pass primitives (ints, floats) to kernels as single-element arrays. How do you do it?

The crucial question is whether these arrays are wrapped in a Pointer, and the kernel parameters are (from the tip of my head) the only function where this is the case. In other cases, primitive arrays are only used for emulating a pass-by-reference via single-element-pointers of the C API, without using the Pointer class, or for trivial, local initializations.

How long is long?

Waiting for CUDA to do a memcpy can be much longer than the time it takes for the memory to be copied. Although you raise an interesting question–does a large System.arraycopy lock out the GC?

Of course, that’s not something to argue about. The current approach is broken in any case.

(The arraycopy question might be derivable from objArrayKlass.cpp and copy.cpp, but even for seemingly trivial operations, there’s a lot going on in the backround, and without a deeper understanding of the mechanisms, I won’t hazard a guess).


Inside a critical region, native code must not call other JNI functions, or any system call that may cause the current thread to block and wait for another Java thread.

However, I’m not entirely sure whether this really means that there must not be any JNI calls at all, or whether it only excludes those that “…that may cause the current thread to block and wait for another Java thread”.

Yes, it includes all JNI calls. The English here is pretty clear.

I’m pretty sure that without the comma at "…functions, or any system…", it would have an entirely different meaning, but not being a native English speaker, I once more cannot argue about that.

Who in their right mind would write documentation like that?

It may be intentionally blurry. I’ve read quite a bit in the JLS and JVMS (but not so much in the OpenJDK source code yet), and these are pretty much going down to the core. The fact that some parts of the JNI seem to be a bit underspecified may be due to the JNI drilling a hole into the (specified) JVM, and maybe they just didn’t want to “over”-specify the JNI, to later have the freedom to change implementation details.


Earlier I said that GetObjectField allocaties a local reference. Now I see there is a method EnsureLocalCapacity which ensure that a specified number of local references can be allocated (meaning, I guess, without triggering a GC). Presumably it could be used to guarantee that GetObjectField won’t deadlock in a critical region. But nowhere is it documented how many local references GetObjectField needs. All that the doc says is “the VM may give the user warnings that too many local references are being created.” Gee, thanks. The responsible programmer must conclude, no it is not ok to call any JNI functions.

When you mentioned this the first time, the EnsureLocalCapacity came to my mind - although I would not have considered the possibility that something like GetObjectField (or any other method that creates a local reference) would need memory for more than one local reference. But after I started looking at the code (and seeing you explicitly questioning this now), I’m not so sure about this any more.

However, running the application with -verbose:jni should be sufficient to figure out whether the default number of 16 local references is exceeded at any point in time.

I haven’t checked this systematically yet. For complex things, e.g. the methods that deal with a CUDA_MEMCPY3D structure, this might actually be the case. That’s one of the points that I’ll investigate in the context of this issue.


I mean, not even an env->ExceptionCheck() should be allowed?

There is no reason to call ExceptionCheck inside your critical region. You can’t call Java code.

The ExceptionCheck is not related to Java code, but to any exception that may be pending due to a JNI call (regardless of the open question whether the other JNI call that may have caused the exception would be allowed in a critical region or not). So assuming that one could be sure that there is there’s enough room for the object fields and the reference, one could do

jobjectArray someObjectArray = ...;

// Open critical region
void *nativeArray = env->GetPrimitiveArrayCritical(array, NULL);

// This should AT MOST create one local reference, and not cause a GC
// if this does not exceed the number of (16) available local references 
// (So I think this call should be allowed...)
jobject object = env->GetObjectArrayElement(someObjectArray, someIndex);

// This call must then also be allowed, to check whether the 
// above call caused an ArrayIndexOutOfBoundsException
if (env->ExceptionCheck()) { ... }

// Close critical region
env->ReleasePrimitiveArrayCritical(array, nativeArray , 0);

Again, the open issue would still be whether one may call something like GetObjectArrayElement - considering that it should not cause any blocking, allocation or GC under the given conditions.


I hope you’re joking. It’s clear we need to contact Shepilev. I’m tired of arguing with library devs about this, and with the proliferation of cores in modern CPUs, the problem is becoming more acute. I really hope Shepilev can write a good, long article clearing this up once and for all.

I’m not joking. Some points to consider:

  • He likely won’t be willing to dig through a large, old, existing codebase (created by a not-C++ - expert) and look for flaws, just for the sake of it
  • The information we’re seeking may already available. Certainly not in the official docs, but maybe in one of his articles (I have read many of them, but not all - particularly, my Russian is not the best…), or maybe in articles by other JVM experts, and in the end, the truth is in the pudding (aka the JDK implementation). I’d like to avoid coming across as “Hey, I’m too lazy to look this up, so please answer my questions”.
  • More specifically: We cannot expect him to invest time for this and write up a “good, long article”. We could hope for that, but not expect it.

All that said: I think we (and many other developers) agree that parts of the JNI spec are a bit vague regarding these points, and they are becoming increasingly important. But when we contact him, we should have a set of clear questions and politely ask for hints (which may well be just pointers to existing articles). Right now, the most crucial (and clear) question would be:

  • Is it allowed to call other JNI functions in critical regions? / Is it allowed to call GetObjectField?

(which, stated like this, may still be too vague to be clearly answerable…)

EDIT>

If the manage to write this as a “good, canonical” question (not too focused on JCuda, of course), we could consider writing it at stackoverflow as well. Aleksey Shipliev is also active there: https://stackoverflow.com/users/2613885/aleksey-shipilev

<EDIT


Technically, all manually-managed buffers are OK. Is it possible to distinguish a GC buffer created with ByteBuffer.allocateDirect and a JNI buffer created with NewDirectByteBuffer?

I don’t think it’s possible now. (It might be possible with some reflection trickery, though). I assume that you referred to the necessity to do something like …

void copySomethingAsync(Pointer somePointer...) {
    if (somePointer.pointsToPageableMemory()) throw Up(); // This check
    ...
}

… to prevent pageable memory from being involved in async operations? This may indeed be a bit tricky…


#11

I pass primitives (ints, floats) to kernels as single-element arrays. How do you do it?

The crucial question is whether these arrays are wrapped in a Pointer, and the kernel parameters are (from the tip of my head) the only function where this is the case. …

more words

I feel like sometimes you get distracted from the point of the discussion. Anyway, tiny arrays are passed to kernel launches. If these arrays are accessed with Get*Critical, they cause all the problems of Get*Critical.

What I’m getting at is I can’t stop using arrays. I need the option to use Get*ArrayElements.


I’m going to contact Shipilev myself because this is getting frustrating. Our questions are general. He doesn’t need to critique your code.


What Async method needs to ensure is that the pointer is not GC-collectable, not that it isn’t pageable. Pageable memory created through JNI is ok, because it isn’t GC-collectable. If JNI NewDirectByteBuffer returns a special class, you can check this. Actually, you can allow all direct buffers with the caveat that the caller must not loose their reference. It’s an easy requirement, just a surprising one that is easy to miss.


#12

I didn’t mean to bury the core issue in a “frustrating” discussion.

(I’m not at my main dev PC right now, and could not yet dive deeper into this - that’s what I tried to say in the small comment)

The miscellaneous branches still seemed worthwhile talking about. Sorry if you percieved this differently.

It’s fine if you want to contact Alexey, but I hoped that we could flesh out the actual questions that we wanted to ask. If the question only is: “Which JNI methods may be called in a critical section?”, then this might lead to discussions or answers that are (also) too far from the core issue.

(The issue of which it is not yet entirely clear whether it is actually caused by the Critical methods, by the way. It might still be something completely different. A clarification about the Critical/JNI-call point would be nice to have, nevertheless)

If you contact him, it would be nice if you could set me CC, because I assume that the first answer will start with “It depends…”, followed by a discussion, illuminating points that are similar to the ones that we discussed in this thread.

Or to put it that way: What exactly would you ask him now?


#13

What exactly would you ask him now?

I want him to tell you (and everyone else) that you can’t call JNI methods and that you need to give users the option of using Get*ArrayElements so that their threads don’t stall.


#14

Then he will likely tell you that you can call JNI methods in critical sections, but only certain ones, under certain (complicated) conditions.

However, I thought about adding some sort of threshold for the array access. For the cases where only “few” elements are contained in the array, obtaining the array elements may be fine. For large chunks (like the 1000x1000 matrix as a float[] array), the Critical method may be used. But that would be a magic variable.

In any case, just changing from Critical to Get Elements seems likely to cause trouble - namely people running out of memory and/or seeing performance degradation. There must be a better solution for that.

But let’s see what Alexey Shipilev says.


#15

He didn’t do that. At least, he didn’t explain the conditions. He said the documentation is fine. He basically defended the JNI specification being vague. Sigh. I tried explaining again, and he just didn’t get that there was a documentation problem. He said if he wrote an article clearing stating that one can’t call JNI functions and that one will cause thread stalls by spending a long time in critical regions, people will ignore him. That library devs who are looking for performance gains are blind and don’t care.


I need an option that converts all Get*Critical to Get*Elements. You can leave leave Get*Critical as the default (after you fix JNI calls), and it’s a good choice for single-threaded code. But multi-threaded code must use Get*Elements.


#16

I removed one sentence from your post. Sometimes it’s hard to restrain oneself, because there are too many of them out there, but I doubt that this was appropriate here. The response is basically what I was afraid of (when you asked me whether I was joking). He certainly is a busy man, and probably just not up to answering arbitrary questions in unsolicited mails. I’m still curious how you approached him, but this likely does not matter here any more.

In any case:

The fact that some of the current patterns have to be fixed has become clear: No Java-managed memory in async ops.

The fact that the Critical functions may cause problems is also undoubted. Whether they indeed are the reason for the crashes in your specific case is not yet entirely sure. There are many possible reasons for JVM crashes when using JCuda, and I’d still be curious to see one of the hs_err files - these might at least give a hint.

So regarding your request to generally change the Critical to GetElements calls: In doubt, I could do it in an own branch. I mean, even if I’m not sure whether I’m fixing what is actually broken, the change itself should be fairly simple, and this could at least help to just try out…

  1. whether it fixes your problem and
  2. whether it causes other problems

When we know that, we can see how the fix can be integrated into the master branch. (In the best case, it could already be part of the JCuda 0.9.0 release, which I hopefully can also tackle soon).

Sorry, I may have asked this before, but: You’re on Windows, right? I could provide the binaries of the branched version for Windows (For other OSes, it would be necessary for you to compile it on your own)


#17

I’m on Linux now.


#18

I ran some experiments today (I needed to be in the office with my colleague). The version of my code that uses netlib-java throws up many “is stalled by JNI critical section” logs, and instead of 16 active threads I often see 2. I am setting OPENBLAS_NUM_THREADS=1 to disable multithreading within OpenBLAS. The version that uses JCublas deadlocks (not crashes, sorry). The Java debugger shows that threads are stuck in cuMemcpy2DNative. Interestingly, nvidia-smi shows GPU pegged at 100%. So it is all inline with theory, except I guess the GPU usage.


#19

So you have two completely distinct versions, one only with netlib-java and one only with JCublas?

The “is stalled by JNI critical section” message stems from gcLocker.cpp, line 86, so it seems like they are also using some Critical methods.

(BTW: I stumbled over http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2007-December/000072.html , which may contain some relevant information, but may well be horribly out-dated, and in the end, the thread was once more inconclusive and lacking clear advice…)

Let’s see how the version without Critical calls behaves. I’ll hopefully be able to create the branch pretty soon and ping you (here) when it is ready to test.


#20

Yes, the code paths are completely separate. I had messaged the netlib-java author a while back about not using Get*Critical, but I think I talked about offering a version that used ByteBuffers and he didn’t want to put in the significant work. I’ll try asking him again to offer Get*ArrayElements.

However, there is very exciting (and completely under-the-radar) news about Hotspot supporting autovectorization. In Java 9 it’s supposedly fairly robust and could be used to write a not-completely-inefficient GEMM. Java 9 doesn’t support the FMA instruction, only MUL+ADD. But, FMA was added in recent Java 10 commits. With Oracle promising frequent Java updates, we may live to see an efficient pure-Java BLAS. See the comments here: https://stackoverflow.com/a/20267515/1151521