It has been a while since these points were directly relevant. Before CUDA 4.1, the spec did not contain any details about the async behavior at all. But when this information became part of the spec, I reviewed the relevant parts (and IIRC, this was also around the time when I refactored the memory handling to use the "deferred" calls to the Critical methods). I thought (or rather hoped) that I had reached a relatively stable state.
Looking at the HotSpot code may help at some points. Fortunately, files like the gcLocker.hpp header contain some comments, but HotSpot is a complex beast, and even with comments, this feels a lot like reverse engineering and guesswork. (And the current implementation is still only one implementation - as long as the behavior is not specified, it is always brittle to rely on it.)
For example (disclaimer: I have only started to look at the code), I don't see a reason why allocating in a critical section will necessarily cause problems. (Intuitively, it might cause problems when the allocation requires a GC, but there is certainly much (much) more behind that.)
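For reference, a minimal sketch of the constraint in question (function and parameter names are illustrative, not JCuda's actual code): the JNI spec only restricts what may run between the Get/Release pair, because the VM may keep the GC (partially) disabled in that window.

```c
#include <jni.h>
#include <string.h>

/*
 * Minimal sketch of the pattern under discussion: between
 * GetPrimitiveArrayCritical and ReleasePrimitiveArrayCritical, the JNI
 * spec only allows code that does not call back into the JVM and does
 * not block - the GC may be (partially) disabled in this window.
 * Names are illustrative, not JCuda's actual code.
 */
void copyFromJavaArray(JNIEnv *env, jfloatArray array, float *dst, size_t bytes)
{
    void *elements = (*env)->GetPrimitiveArrayCritical(env, array, NULL);
    if (elements == NULL)
    {
        return; /* OutOfMemoryError is already pending */
    }

    /* OK: plain native work on the pinned (or copied) elements */
    memcpy(dst, elements, bytes);

    /* NOT OK in this window (illustrative): NewObject, FindClass,
       Get...ArrayElements, or anything else that may allocate or block */

    (*env)->ReleasePrimitiveArrayCritical(env, array, elements, JNI_ABORT);
}
```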
It should be easy to replace GetPrimitiveArrayCritical calls with Get*ArrayElements calls
This will of course always be possible. But the arrays in a GPU application may be large. Using the Get...ArrayElements methods would (from a user perspective) be much more convenient than manually copying arrays into direct ByteBuffers, but it would still suffer from the same problem: In order to get the elements of a 1GB array, a new 1GB array may have to be allocated and filled - only to copy this new array to the device and then delete it. Considering that memory transfers are already the bottleneck in most GPU applications, I'd really like to avoid this…
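To make the trade-off concrete, here is a rough sketch of what the Get...ArrayElements path looks like at the JNI level (illustrative names, not JCuda's actual code); the VM is free to hand back a copy instead of pinning the array:

```c
#include <jni.h>
#include <cuda.h>

/*
 * Sketch of the Get...ArrayElements alternative (illustrative, not
 * JCuda's actual code): the VM may return a copy of the array, so for
 * a 1GB float[] this may allocate and fill another 1GB on the host
 * before the host-to-device copy even starts.
 */
CUresult copyArrayToDevice(JNIEnv *env, jfloatArray array, CUdeviceptr dDst)
{
    jsize length = (*env)->GetArrayLength(env, array);
    jboolean isCopy = JNI_FALSE;

    /* May pin the array - or may allocate a full copy of it */
    jfloat *elements = (*env)->GetFloatArrayElements(env, array, &isCopy);
    if (elements == NULL)
    {
        return CUDA_ERROR_OUT_OF_MEMORY; /* exception is already pending */
    }

    CUresult result = cuMemcpyHtoD(dDst, elements, (size_t)length * sizeof(jfloat));

    /* JNI_ABORT: we only read, so don't copy the (possibly huge) data back */
    (*env)->ReleaseFloatArrayElements(env, array, elements, JNI_ABORT);
    return result;
}
```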
Async memcopies aren’t compatible with *Critical. Moreover, they require a cleanup mechanism, not only for Release*ArrayElements but also for direct buffers. The reason is that you need to create and hold a reference to the buffer in case the Java code does not keep one, and you need to destroy it afterwards. A cleanup mechanism probably won’t fit into a thin wrapper.
I think it makes sense to just restrict async memcopies to manually managed, CUDA-allocated host or unified memory.
This is true. And in fact, in JOCL, I went some extra miles to create such cleanup mechanisms, and to make sure that there is no interference between async operations and GC - although the only solution that seems viable here is the "sledgehammer approach": Any call to an async copy that involves Java arrays will result in an UnsupportedOperationException. The async copies are only possible with direct, page-locked, or device memory.
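As a rough sketch of what such a guard could look like on the native side (illustrative, not JOCL's or JCuda's actual code; how the "is this a Java array" information is obtained from the Pointer object is left out here):

```c
#include <jni.h>

/*
 * Sketch of the "sledgehammer" guard (illustrative, not JOCL's actual
 * code): before an async copy touches any data, reject pointers that
 * wrap a Java array. How 'srcIsJavaArray' is determined (by inspecting
 * the Pointer object) is omitted.
 */
static void throwUnsupportedOperation(JNIEnv *env, const char *message)
{
    jclass cls = (*env)->FindClass(env, "java/lang/UnsupportedOperationException");
    if (cls != NULL)
    {
        (*env)->ThrowNew(env, cls, message);
    }
}

/* Called at the start of each ...Async copy implementation */
static int validateAsyncPointer(JNIEnv *env, int srcIsJavaArray)
{
    if (srcIsJavaArray)
    {
        throwUnsupportedOperation(env,
            "Async copies are not supported for pointers to Java arrays - "
            "use direct, page-locked or device memory, or the synchronous copy");
        return 0;
    }
    return 1;
}
```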
(In OpenCL, the async behavior is controlled via a flag, and not via dedicated ...Async methods, but that's only a minor difference from CUDA. More importantly: The behavior in OpenCL was specified in detail. Maybe I can transfer some of the approaches from JOCL to JCuda for this case, but I still have to analyze it further…)
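For illustration, the flag in question is the blocking parameter of calls like clEnqueueWriteBuffer (the variable names here are just placeholders):

```c
#include <CL/cl.h>

/*
 * In OpenCL, the same entry point covers both cases: the blocking flag
 * decides whether the call returns only after the host memory may be
 * reused (CL_TRUE) or immediately (CL_FALSE). Names are illustrative.
 */
void writeBufferExample(cl_command_queue queue, cl_mem deviceBuffer,
                        const float *hostData, size_t bytes)
{
    /* Blocking: hostData may be touched again right after the call */
    clEnqueueWriteBuffer(queue, deviceBuffer, CL_TRUE, 0, bytes,
                         hostData, 0, NULL, NULL);

    /* Non-blocking: hostData must stay valid until the command completes */
    cl_event event = NULL;
    clEnqueueWriteBuffer(queue, deviceBuffer, CL_FALSE, 0, bytes,
                         hostData, 0, NULL, &event);
    clWaitForEvents(1, &event);
    clReleaseEvent(event);
}
```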
For JCuda, everything seemed safe, because only device-to-device copies had been really asynchronous. Everything else was essentially "synchronous for the host". (At least, that's what I thought until the updated spec…)
So the safest ways would probably be to
- either replace the Critical methods with the Get...Elements ones - with the potential drawback of large allocations+copies happening internally, or
- throw an exception when trying to use Java arrays in async operations
I think that the latter may be the better approach here. The impact that this may have on existing applications seems to be more foreseeable: They might throw an exception after the update, but only if they used operations that may be unsafe. The exception would make this unambiguously clear, and the fix for the library user would be trivial: "Just remove the ...Async suffix".
I think that disallowing async operations should only be necessary for real Java arrays, though. Async operations with direct buffers should still be safe (or do you think that they should be disallowed, too?)
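For what it's worth, here is a rough sketch (illustrative names, error handling omitted - not JCuda's actual code) of why direct buffers are easier to handle: their native address is stable, and a global reference plus a stream callback can keep them alive until the copy has completed - essentially the kind of cleanup mechanism mentioned above.

```c
#include <jni.h>
#include <cuda.h>
#include <stdlib.h>

/*
 * Sketch (illustrative, not JCuda's actual code): the native address of
 * a direct ByteBuffer does not move, and a global reference keeps the
 * buffer alive until a stream callback releases it after the async copy
 * has completed. Error handling is omitted.
 */
typedef struct
{
    JavaVM *vm;
    jobject bufferRef; /* global reference to the direct ByteBuffer */
} AsyncCleanup;

static void CUDA_CB releaseBufferCallback(CUstream stream, CUresult status, void *userData)
{
    AsyncCleanup *cleanup = (AsyncCleanup*)userData;
    JNIEnv *env = NULL;

    /* The callback runs on a CUDA-internal thread, so attach it first */
    (*cleanup->vm)->AttachCurrentThread(cleanup->vm, (void**)&env, NULL);
    (*env)->DeleteGlobalRef(env, cleanup->bufferRef);
    (*cleanup->vm)->DetachCurrentThread(cleanup->vm);
    free(cleanup);
}

void asyncCopyFromDirectBuffer(JNIEnv *env, jobject directBuffer,
                               CUdeviceptr dDst, size_t bytes, CUstream stream)
{
    void *hostAddress = (*env)->GetDirectBufferAddress(env, directBuffer);

    AsyncCleanup *cleanup = (AsyncCleanup*)malloc(sizeof(AsyncCleanup));
    (*env)->GetJavaVM(env, &cleanup->vm);
    cleanup->bufferRef = (*env)->NewGlobalRef(env, directBuffer);

    cuMemcpyHtoDAsync(dDst, hostAddress, bytes, stream);
    cuStreamAddCallback(stream, releaseBufferCallback, cleanup, 0);
}
```

For a pointer to a Java array, no such stable address exists, which is exactly why rejecting it in the async case seems cleaner. (Note that a plain allocateDirect buffer is still pageable memory, so the copy may not be fully asynchronous anyway unless the memory is also page-locked.)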
I hope that I can allocate some more time soon to investigate this further, read a bit more of the code, and try out the logging options that you mentioned. If you gain any further insights, or have other suggestions, I'd really appreciate your input!
EDIT: BTW: I just looked at some older related threads. Things like "deferring" the critical section (when using pointers to Java arrays) had in fact been done in response to these threads. It's a bit depressing that there still seem to be more deeply hidden flaws, but maybe they can eventually be ironed out. (At least, until NVIDIA introduces a new CUDA version with some new concepts that haven't been anticipated until now…)