(Some pending tasks make it difficult to focus on the core topic here, but here are a few comments on the branches of the discussion:)
I pass primitives (ints, floats) to kernels as single-element arrays. How do you do it?
The crucial question is whether these arrays are wrapped in a Pointer, and the kernel parameters are (off the top of my head) the only place where this is the case. In other cases, primitive arrays are only used for emulating pass-by-reference for the single-element pointers of the C API, without using the Pointer class, or for trivial, local initializations.
How long is long?
Waiting for CUDA to do a memcpy can take much longer than the time it takes for the memory to be copied. Although you raise an interesting question: does a large System.arraycopy lock out the GC?
Of course, that’s not something to argue about. The current approach is broken in any case.
(The arraycopy question might be answerable from objArrayKlass.cpp and copy.cpp, but even for seemingly trivial operations, there’s a lot going on in the background, and without a deeper understanding of the mechanisms, I won’t hazard a guess.)
Inside a critical region, native code must not call other JNI functions, or any system call that may cause the current thread to block and wait for another Java thread.
However, I’m not entirely sure whether this really means that there must not be any JNI calls at all, or whether it only excludes those “…that may cause the current thread to block and wait for another Java thread”.
Yes, it includes all JNI calls. The English here is pretty clear.
I’m pretty sure that without the comma in “…functions, or any system…”, it would have an entirely different meaning, but since I’m not a native English speaker, once again I can’t really argue about that.
Who in their right mind would write documentation like that?
It may be intentionally blurry. I’ve read quite a bit of the JLS and JVMS (but not so much of the OpenJDK source code yet), and these go right down to the core. The fact that some parts of the JNI seem a bit underspecified may be due to the JNI drilling a hole into the (specified) JVM; maybe they just didn’t want to “over”-specify the JNI, so that they would later have the freedom to change implementation details.
Earlier I said that GetObjectField allocates a local reference. Now I see there is a method EnsureLocalCapacity which ensures that a specified number of local references can be allocated (meaning, I guess, without triggering a GC). Presumably it could be used to guarantee that GetObjectField won’t deadlock in a critical region. But nowhere is it documented how many local references GetObjectField needs. All the doc says is “the VM may give the user warnings that too many local references are being created.” Gee, thanks. The responsible programmer must conclude: no, it is not OK to call any JNI functions.
When you first mentioned this, EnsureLocalCapacity came to mind, although I would not have considered the possibility that something like GetObjectField (or any other method that creates a local reference) could need memory for more than one local reference. But after looking at the code (and seeing you explicitly question this now), I’m not so sure anymore.
However, running the application with -verbose:jni should be sufficient to figure out whether the default number of 16 local references is exceeded at any point in time.
I haven’t checked this systematically yet. For complex cases, e.g. the methods that deal with a CUDA_MEMCPY3D structure, this might actually be the case. That’s one of the points that I’ll investigate in the context of this issue.
I mean, not even an env->ExceptionCheck() should be allowed?
There is no reason to call ExceptionCheck inside your critical region. You can’t call Java code.
The ExceptionCheck is not related to Java code, but to any exception that may be pending due to a JNI call (regardless of the open question whether the JNI call that may have caused the exception would be allowed in a critical region at all). So, assuming one could be sure that there is enough room for the object fields and the reference, one could do the following:
jobjectArray someObjectArray = ...;
jarray array = ...;
jsize someIndex = ...;
// Open critical region
void *nativeArray = env->GetPrimitiveArrayCritical(array, NULL);
// This should create AT MOST one local reference, and should not cause a GC
// as long as the number of available local references (16) is not exceeded
// (so I think this call should be allowed...)
jobject object = env->GetObjectArrayElement(someObjectArray, someIndex);
// This call must then also be allowed, to check whether the
// above call caused an ArrayIndexOutOfBoundsException
if (env->ExceptionCheck()) { ... }
// Close critical region
env->ReleasePrimitiveArrayCritical(array, nativeArray, 0);
Again, the open issue is still whether one may call something like GetObjectArrayElement at all, considering that it should not cause any blocking, allocation or GC under the given conditions.
I hope you’re joking. It’s clear we need to contact Shipilev. I’m tired of arguing with library devs about this, and with the proliferation of cores in modern CPUs, the problem is becoming more acute. I really hope Shipilev can write a good, long article clearing this up once and for all.
I’m not joking. Some points to consider:
- He likely won’t be willing to dig through a large, old codebase (created by a non-C++ expert) and look for flaws just for the sake of it
- The information we’re seeking may already be available. Certainly not in the official docs, but maybe in one of his articles (I have read many of them, but not all, particularly since my Russian is not the best…), or maybe in articles by other JVM experts, and in the end, the proof is in the pudding (aka the JDK implementation). I’d like to avoid coming across as “Hey, I’m too lazy to look this up, so please answer my questions”.
- More specifically: We cannot expect him to invest time in this and write up a “good, long article”. We could hope for that, but not expect it.
All that said: I think we (and many other developers) agree that parts of the JNI spec are a bit vague regarding these points, and they are becoming increasingly important. But when we contact him, we should have a set of clear questions and politely ask for hints (which may well be just pointers to existing articles). Right now, the most crucial (and clear) question would be:
- Is it allowed to call other JNI functions in critical regions? / Is it allowed to call GetObjectField?
(which, stated like this, may still be too vague to be clearly answerable…)
EDIT>
If we manage to write this up as a “good, canonical” question (not too focused on JCuda, of course), we could consider posting it on Stack Overflow as well. Aleksey Shipilev is also active there: User Aleksey Shipilev - Stack Overflow
<EDIT
Technically, all manually managed buffers are OK. Is it possible to distinguish a GC buffer created with ByteBuffer.allocateDirect from a JNI buffer created with NewDirectByteBuffer?
I don’t think it’s possible right now (though it might be possible with some reflection trickery). I assume you are referring to the need to do something like …
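void copySomethingAsync(Pointer somePointer /* , ... */) {
    // This check (pointsToPageableMemory() is a hypothetical method)
    if (somePointer.pointsToPageableMemory()) {
        throw new IllegalArgumentException("Pageable memory is not allowed for async operations");
    }
    ...
}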
… to prevent pageable memory from being involved in async operations? This may indeed be a bit tricky…
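Since the origin of a buffer apparently can’t be recovered reliably after the fact, one possible way around it would be to record the origin when the pointer is created. The following is a minimal, self-contained sketch of that idea, not a proposal for JCuda’s actual Pointer class: TrackedPointer, MemoryKind and the factory methods are hypothetical names introduced only for illustration.

```java
import java.util.Objects;

public class PinnedCheckSketch {

    // Hypothetical: how the memory behind a pointer was obtained.
    enum MemoryKind { PAGEABLE_HEAP, PINNED_HOST, DEVICE }

    // Hypothetical minimal pointer that remembers its origin at creation time,
    // so that the check does not have to reconstruct it later.
    static final class TrackedPointer {
        private final MemoryKind kind;

        private TrackedPointer(MemoryKind kind) {
            this.kind = Objects.requireNonNull(kind);
        }

        // A pointer to a plain Java array refers to pageable memory.
        static TrackedPointer toJavaArray(float[] array) {
            Objects.requireNonNull(array);
            return new TrackedPointer(MemoryKind.PAGEABLE_HEAP);
        }

        // Would be returned by a (hypothetical) page-locked host allocation.
        static TrackedPointer pinnedHost() {
            return new TrackedPointer(MemoryKind.PINNED_HOST);
        }

        // Would be returned by a (hypothetical) device allocation.
        static TrackedPointer device() {
            return new TrackedPointer(MemoryKind.DEVICE);
        }

        boolean pointsToPageableMemory() {
            return kind == MemoryKind.PAGEABLE_HEAP;
        }
    }

    // Async operations reject pageable memory up front,
    // instead of discovering the problem later.
    static void copySomethingAsync(TrackedPointer source) {
        if (source.pointsToPageableMemory()) {
            throw new IllegalArgumentException(
                "Async operations require page-locked host memory or device memory");
        }
        // ... the actual asynchronous copy would be enqueued here ...
    }

    public static void main(String[] args) {
        copySomethingAsync(TrackedPointer.pinnedHost()); // accepted
        copySomethingAsync(TrackedPointer.device());     // accepted
        try {
            copySomethingAsync(TrackedPointer.toJavaArray(new float[16]));
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The trade-off is that every code path that creates a pointer has to tag it correctly; a pointer created without going through such a factory (e.g. one wrapping a buffer that came from elsewhere) would still need some fallback rule.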