I’m not entirely up to date here and not sooo deeply involved in the JDK internals (and … the stackoverflow answer is already 4 years old, so there definitely did happen a lot in the meantime). But I think that topics like (auto)vectorization and FMA are also tackled in the context of the Panama project - at least (or rather: going further), things like the vectors (from one of the latest commits at http://hg.openjdk.java.net/panama/panama/jdk/rev/d83170db025b ) are supposed to be implemented with intrinsics.
The first pass of removing the critical array accesses are contained in
- https://github.com/jcuda/jcuda-common/tree/no_critical
- https://github.com/jcuda/jcuda/tree/no_critical
But note that this does NOT yet cover the critical accesses in the Pointer
data.
Until now, this only refers to the other places where the Critical
methods had been used. (These are cases where only few (1…4) array elements are read or written, so using the Critical
methods there did not really make sense anyhow…). These have now been changed to use the Get...Elements
methods. And these changes will very likely be moved into the master branch after some more testing.
So the current changes probably will not (or rather: should not!) affect the behavior that you observed.
The change to replace the Critical
methods with the GetElements
also for the Pointer
-related memory handling will likely have a greater impact, and will be done in a second step.
Explicit vectorization is nice, but Project Panama is still a ways off. But auto-vectorization is here and is very cool. Read the comments (they are from this year) and the linked slides. Especially note the restrictions on autovectorizable loops. (Main one is don’t manually unroll.)
You gotta love the JNI spec. Documentation for Release*ArrayElements
:
In most cases, programmers pass “0” to the mode argument to ensure consistent behavior for both pinned and copied arrays. The other options give the programmer more control over memory management and should be used with extreme care.
Was your care extreme? Will my computer blow up because you didn’t pass 0
?
Thanks, will wait for further updates.
I see you use JNI_COMMIT
in JCudaRuntime. This must be a mistake, because JNI_COMMIT
does not release memory. It’s something to use for concurrency, I guess (with no documented semantics, of course).
Thanks, I’ll review this. JNI_COMMIT
certainly is a mistake. In cases where I’m only reading, I use JNI_ABORT
, but otherwise, 0
should be used everywhere.
There indeed had been a few places with JNI_COMMIT
(also in JCublas and JCurand - fortunately, only for small (4-element) arrays). This will be part of the fixes in the no_critical
branch that will go into master
in any case.
I have updated the no_critical
branch of the jcuda-common
project via https://github.com/jcuda/jcuda-common/commit/8a0d5a404b1554a0dc260c1bb92f77acbd3f1d1b to no longer use critical array accesses in the PointerData-classes.
(Right now, the implementation is a bit awkward: With the critical methods, it was possible to simply obtain the array address. With the GetElements
-methods, there is some type-based dispatching involved to the GetByte
… … GetDouble
versions. The current solution is mainly intended to check whether it solves the problem. If it does, I’ll see whether I find a more elegant approach).
I’m trying to compile JCuda under Arch Linux using GCC 6.4.1 and Java 8. I’m following the Linux script (not executing it, just following). I cloned all the repositories, I checked out branch master except for jcuda-common
and jcuda
which are on branch no_critical
. I ran the following line:
env CXX=/usr/bin/c++-6 CC=/usr/bin/cc-6 cmake -DCUDA_nvrtc_LIBRARY=/opt/cuda/lib64/libnvrtc.so ./jcuda-main
followed by
make all
and I’m stuck on
~/jcuda> make all
[ 3%] Building CXX object jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/src/JNIUtils.cpp.o
In file included from /home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.cpp:29:0:
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.hpp: In function ‘NativeElement* getLongArrayContentsGeneric(JNIEnv*, jlongArray, int*)’:
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.hpp:78:50: error: there are no arguments to ‘ThrowByName’ that depend on a template parameter, so a declaration of ‘ThrowByName’ must be available [-fpermissive]
"Out of memory during array creation");
^
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.hpp:78:50: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.hpp: In function ‘NativeElement* getIntArrayContentsGeneric(JNIEnv*, jintArray, int*)’:
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.hpp:111:50: error: there are no arguments to ‘ThrowByName’ that depend on a template parameter, so a declaration of ‘ThrowByName’ must be available [-fpermissive]
"Out of memory during array creation");
^
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.cpp: In function ‘void readFloatArrayContents(JNIEnv*, jfloatArray, float*, int*)’:
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.cpp:391:42: error: ‘memcpy’ was not declared in this scope
memcpy(target, a, len * sizeof(float));
^
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.cpp: In function ‘void writeFloatArrayContents(JNIEnv*, float*, jfloatArray, int*)’:
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.cpp:416:42: error: ‘memcpy’ was not declared in this scope
memcpy(a, source, len * sizeof(float));
^
make[2]: *** [jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/build.make:63: jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/src/JNIUtils.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:143: jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/all] Error 2
make: *** [Makefile:84: all] Error 2
Building with GCC 7 gave almost the exact error message, which is why I tried GCC 6.
Output of cmake:
-- The C compiler identification is GNU 6.4.1
-- The CXX compiler identification is GNU 6.4.1
-- Check for working C compiler: /usr/bin/cc-6
-- Check for working C compiler: /usr/bin/cc-6 -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++-6
-- Check for working CXX compiler: /usr/bin/c++-6 -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /opt/cuda/bin/nvcc
-- Found JNI: /usr/lib/jvm/default/jre/lib/amd64/libjawt.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/user/jcuda
The memcpy part was due to a missing header include.
The message about ThrowByName
is a bit odd. Obviously, there is a declaration, but below the calling function. Most websearch results of this message seem to refer to some unrelated type deduction in class definitions.
I tried to fix this by moving the declaration up in https://github.com/jcuda/jcuda-common/commit/12ec08db49455ed86de1c14f533b605aec66df11
If this does not help, I’ll check whether there is another fix (or in doubt, how the -fpermissive
flag may be sneaked into GCC…)
Thanks, that worked. Some more warnings and one error as follows:
~/jcuda> make all
Scanning dependencies of target JCudaCommonJNI
[ 3%] Building CXX object jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/src/JNIUtils.cpp.o
[ 7%] Building CXX object jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/src/Logger.cpp.o
[ 10%] Building CXX object jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/src/PointerUtils.cpp.o
[ 14%] Building CXX object jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/src/CallbackUtils.cpp.o
[ 17%] Linking CXX static library ../../../../lib/libJCudaCommonJNI.a
[ 17%] Built target JCudaCommonJNI
Scanning dependencies of target JCudaDriver
[ 21%] Building CXX object jcuda/JCudaDriverJNI/bin/CMakeFiles/JCudaDriver.dir/src/JCudaDriver.cpp.o
/home/user/jcuda/jcuda/JCudaDriverJNI/src/JCudaDriver.cpp: In function ‘bool getOptionValue(JNIEnv*, jobject, CUjit_option, void*&)’:
/home/user/jcuda/jcuda/JCudaDriverJNI/src/JCudaDriver.cpp:1112:28: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
value = (void*)v;
^
/home/user/jcuda/jcuda/JCudaDriverJNI/src/JCudaDriver.cpp:1124:28: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
value = (void*)iv;
^~
/home/user/jcuda/jcuda/JCudaDriverJNI/src/JCudaDriver.cpp: In function ‘jint Java_jcuda_driver_JCudaDriver_cuIpcOpenMemHandleNative(JNIEnv*, jclass, jobject, jobject, jint)’:
/home/user/jcuda/jcuda/JCudaDriverJNI/src/JCudaDriver.cpp:2893:31: warning: converting to non-pointer type ‘CUdeviceptr {aka long long unsigned int}’ from NULL [-Wconversion-null]
CUdeviceptr nativePdptr = NULL;
^~~~
[ 25%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCudaDriver-0.8.0-linux-x86_64.so
[ 25%] Built target JCudaDriver
Scanning dependencies of target JCudaRuntime
[ 28%] Building CXX object jcuda/JCudaRuntimeJNI/bin/CMakeFiles/JCudaRuntime.dir/src/JCudaRuntime.cpp.o
[ 32%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCudaRuntime-0.8.0-linux-x86_64.so
[ 32%] Built target JCudaRuntime
Scanning dependencies of target JNvrtc
[ 35%] Building CXX object jcuda/JNvrtcJNI/bin/CMakeFiles/JNvrtc.dir/src/JNvrtc.cpp.o
[ 39%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJNvrtc-0.8.0-linux-x86_64.so
[ 39%] Built target JNvrtc
Scanning dependencies of target JCublas
[ 42%] Building CXX object jcublas/JCublasJNI/bin/CMakeFiles/JCublas.dir/src/JCublas.cpp.o
[ 46%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCublas-0.8.0-linux-x86_64.so
[ 46%] Built target JCublas
Scanning dependencies of target JCublas2
[ 50%] Building CXX object jcublas/JCublas2JNI/bin/CMakeFiles/JCublas2.dir/src/JCublas2.cpp.o
[ 53%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCublas2-0.8.0-linux-x86_64.so
[ 53%] Built target JCublas2
Scanning dependencies of target JCufft
[ 57%] Building CXX object jcufft/JCufftJNI/bin/CMakeFiles/JCufft.dir/src/JCufft.cpp.o
[ 60%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCufft-0.8.0-linux-x86_64.so
[ 60%] Built target JCufft
Scanning dependencies of target JCurand
[ 64%] Building CXX object jcurand/JCurandJNI/bin/CMakeFiles/JCurand.dir/src/JCurand.cpp.o
[ 67%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCurand-0.8.0-linux-x86_64.so
[ 67%] Built target JCurand
Scanning dependencies of target JCusparse
[ 71%] Building CXX object jcusparse/JCusparseJNI/bin/CMakeFiles/JCusparse.dir/src/JCusparse.cpp.o
[ 75%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCusparse-0.8.0-linux-x86_64.so
[ 75%] Built target JCusparse
Scanning dependencies of target JCusolver
[ 78%] Building CXX object jcusolver/JCusolverJNI/bin/CMakeFiles/JCusolver.dir/src/JCusolver.cpp.o
[ 82%] Building CXX object jcusolver/JCusolverJNI/bin/CMakeFiles/JCusolver.dir/src/JCusolverDn.cpp.o
[ 85%] Building CXX object jcusolver/JCusolverJNI/bin/CMakeFiles/JCusolver.dir/src/JCusolverRf.cpp.o
[ 89%] Building CXX object jcusolver/JCusolverJNI/bin/CMakeFiles/JCusolver.dir/src/JCusolverSp.cpp.o
[ 92%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCusolver-0.8.0-linux-x86_64.so
[ 92%] Built target JCusolver
Scanning dependencies of target JNvgraph
[ 96%] Building CXX object jnvgraph/JNvgraphJNI/bin/CMakeFiles/JNvgraph.dir/src/JNvgraph.cpp.o
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp: In function ‘bool releaseNativeTopologyDataCSC32I(JNIEnv*, nvgraphTopologyData*&, _jobject*&)’:
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp:184:26: warning: deleting ‘void*’ is undefined [-Wdelete-incomplete]
delete nativeObject->nativeTopologyData;
^~~~~~~~~~~~~~~~~~
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp: In function ‘bool releaseNativeTopologyDataCSR32I(JNIEnv*, nvgraphTopologyData*&, _jobject*&)’:
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp:245:26: warning: deleting ‘void*’ is undefined [-Wdelete-incomplete]
delete nativeObject->nativeTopologyData;
^~~~~~~~~~~~~~~~~~
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp: In function ‘bool releaseNativeTopologyDataCOO32I(JNIEnv*, nvgraphTopologyData*&, _jobject*&)’:
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp:306:26: warning: deleting ‘void*’ is undefined [-Wdelete-incomplete]
delete nativeObject->nativeTopologyData;
^~~~~~~~~~~~~~~~~~
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp: In function ‘nvgraphTopologyData* initNativeTopologyData(JNIEnv*, _jobject*&)’:
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp:341:12: error: cannot convert ‘bool’ to ‘nvgraphTopologyData*’ in return
return false;
^~~~~
make[2]: *** [jnvgraph/JNvgraphJNI/bin/CMakeFiles/JNvgraph.dir/build.make:63: jnvgraph/JNvgraphJNI/bin/CMakeFiles/JNvgraph.dir/src/JNvgraph.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:638: jnvgraph/JNvgraphJNI/bin/CMakeFiles/JNvgraph.dir/all] Error 2
make: *** [Makefile:84: all] Error 2
Thanks for these hints. I’ll fix them ASAP. (And I should probably at least set up a VirtualBox here to catch these GCC compile errors earlier…)
However, the error only seems to refer to NVGraph
. Do you need this for your tests? The other relevant libs (mainly the driver- and runtime libs) should have been built properly, so you might already want to give these a try.
Sorry for being insistive here, but did you have a chance to try it out? (Or is the missing JNvgraph
lib a “blocker” here? If so, I’d sort it up in my TODO list)
Sorry, I got held up. There are no JNI stall messages, but it is still deadlocking in cuMemcpy2DNative. Since my C++ dev knowledge is rather weak these days, can you tell me how I can build with debug symbols and launch a debugger on Linux to get a stack trace?
Unfortunately, I’m not so familiar with debugging on Linux. And although JCuda is a rather thin layer around CUDA, the path from the JVM into a loaded DLL is not so easy to trace. (There have been some reports that it is possible to use some of the NVIDIA debugging tools, but I haven’t investigated all options here).
Did you already activate the “baseline debugging”, namely, setting setExceptionsEnabled(true)
for all involved libraries? You might also try running the application with cuda-memcheck
, as described on http://jcuda.org/debugging/Debugging.html . If this doesn’t help, we can try to figure out which other options exist.
I was able to get the stack trace. It seems I’m experiencing a problem very similar to the one described here https://devtalk.nvidia.com/default/topic/892595/random-execution-times-and-freezes-with-concurent-kernels-2/ It may be a driver/hardware bug, because things get printed in the kernel log.
The thread is 2 years old, and this may justify some doubts that it may be related to the driver version: The one that is reported there is 340.58, and the current one is 385.54. Similarly, they are talking about CUDA <6.5 there.
You mentioned that it seemed to be stuck at cuMemcpy2DNative
, which might be more similar to the thread that is linked there, https://devtalk.nvidia.com/default/topic/860775/cpu-hangs-when-calling-thrust-copy_if/ - but this was supposed to be resolved as well.
Are there any further similarities between the problem that is reported there and the one that you observed? (The OS, the actual device, or others…?)
The thread seems somewhat inconclusive, and there are still many unknowns. Do you have the chance to run the program on a different OS/device, to narrow down the search space? (If necessary/desired, I could provide “snapshot” DLLs of the no_critical
branch - although I’m currently working on the CUDA 9 update, and some other stuff still has to be fixed).
On further investigation (cuda-gdb and cuda-memcheck), it looks like it’s getting stuck in CUBLAS, in maxwell_sgemm_128x64_raggedMn_nn_splitK_EPILOG_SPIN_WAIT()
After commenting out matrix multiplies in my code, I don’t see the hangs. Must be a bug in CUBLAS. I need to try CUDA 9.
Some results of testing the no_critical
branch after commenting out CUBLAS.
-
(this was frustrating) the native libraries in
/tmp
do not get overwritten. It took me a while to figure out why I was still seeing “stalled by JNI critical section” messages. They need to be deleted manually. -
No more “stalled by JNI critical section” messages.
-
Comparing
no_critical
to the public JCuda, runtime becomes much worse from 3 minutes to over 10. This is very strange. The code does a lot more than host->device copies. Even a microbenchmark of copies shouldn’t perform so badly, since CPU->CPU copies are much faster than CPU->GPU.
I see why the code is slow. I am doing frequent, small (4K) copies from a large bytebuffer (16M), and Get*ArrayElements
copies then entire 16M. Using Get*ArrayRegion
would help, but I’m planning to switch to pinned memory so it is not a problem for me.
It’s always hard to pin down the culprit in a large codebase, particularly with CUDA: Basically all the methods return error codes, but can “also return error codes from previous async operations”, so the hang in a CUBLAS function might only be a symptom of an earlier error.
-
Sorry about that. I already considered to add some sort of
-forceOverwrite
flag for cases where the DLLs (with the same name) are supposed to be replaced -
That sounds good, but has to be evaluated against other possible drawbacks, like 3.
-
The reason that you mentioned in the last post may explain this. The current
no_critical
state was mainly intended to figure out where the error comes from. If it makes its way intomaster
, there are certainly some points that have to be reviewed. One potential drawback of theRegion
methods is that they will always copy the data. TheGetElements
methods may still return the original array, but of course, the conditions under which each event occurs are hard to analyze…
I meant that cuda-gdb shows that the GPU is stuck inside the CUBLAS kernel. I think I just realized what the problem is. CUBLAS v1 API is not reentrant. I need to use CUBLAS v2.
I’ve implemented pinned memory and now performance is on-par. Next I need to implement CUBLAS 2 and streams-per-thread. Then I will be able to give a definitive verdict on performance.