Multiple threads -- GetPrimitiveArrayCritical

I’m trying to compile JCuda under Arch Linux using GCC 6.4.1 and Java 8. I’m following the Linux build script (not executing it, just following it step by step). I cloned all the repositories and checked out the master branch, except for jcuda-common and jcuda, which are on the no_critical branch. Then I ran the following line:

 env CXX=/usr/bin/c++-6 CC=/usr/bin/cc-6 cmake -DCUDA_nvrtc_LIBRARY=/opt/cuda/lib64/libnvrtc.so ./jcuda-main

followed by

make all

and I’m stuck on

~/jcuda> make all
[  3%] Building CXX object jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/src/JNIUtils.cpp.o
In file included from /home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.cpp:29:0:
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.hpp: In function ‘NativeElement* getLongArrayContentsGeneric(JNIEnv*, jlongArray, int*)’:
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.hpp:78:50: error: there are no arguments to ‘ThrowByName’ that depend on a template parameter, so a declaration of ‘ThrowByName’ must be available [-fpermissive]
             "Out of memory during array creation");
                                                  ^
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.hpp:78:50: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.hpp: In function ‘NativeElement* getIntArrayContentsGeneric(JNIEnv*, jintArray, int*)’:
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.hpp:111:50: error: there are no arguments to ‘ThrowByName’ that depend on a template parameter, so a declaration of ‘ThrowByName’ must be available [-fpermissive]
             "Out of memory during array creation");
                                                  ^
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.cpp: In function ‘void readFloatArrayContents(JNIEnv*, jfloatArray, float*, int*)’:
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.cpp:391:42: error: ‘memcpy’ was not declared in this scope
     memcpy(target, a, len * sizeof(float));
                                          ^
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.cpp: In function ‘void writeFloatArrayContents(JNIEnv*, float*, jfloatArray, int*)’:
/home/user/jcuda/jcuda-common/JCudaCommonJNI/src/JNIUtils.cpp:416:42: error: ‘memcpy’ was not declared in this scope
     memcpy(a, source, len * sizeof(float));
                                          ^
make[2]: *** [jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/build.make:63: jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/src/JNIUtils.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:143: jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/all] Error 2
make: *** [Makefile:84: all] Error 2

Building with GCC 7 gave almost exactly the same error messages, which is why I tried GCC 6.

Output of cmake:

-- The C compiler identification is GNU 6.4.1
-- The CXX compiler identification is GNU 6.4.1
-- Check for working C compiler: /usr/bin/cc-6
-- Check for working C compiler: /usr/bin/cc-6 -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++-6
-- Check for working CXX compiler: /usr/bin/c++-6 -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /opt/cuda/bin/nvcc
-- Found JNI: /usr/lib/jvm/default/jre/lib/amd64/libjawt.so  
-- Configuring done
-- Generating done
-- Build files have been written to: /home/user/jcuda

The memcpy errors were due to a missing header include (memcpy is declared in <cstring> / <string.h>).

The message about ThrowByName is a bit odd. There obviously is a declaration, but it appears below the calling function template, so it is not visible at the point where the template is defined. Most web search results for this message refer to some unrelated type deduction in class definitions.

I tried to fix this by moving the declaration up in https://github.com/jcuda/jcuda-common/commit/12ec08db49455ed86de1c14f533b605aec66df11

If this does not help, I’ll check whether there is another fix (or, if in doubt, how the -fpermissive flag may be sneaked into the GCC flags…)

Thanks, that worked. Some more warnings and one error as follows:

~/jcuda> make all
Scanning dependencies of target JCudaCommonJNI
[  3%] Building CXX object jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/src/JNIUtils.cpp.o
[  7%] Building CXX object jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/src/Logger.cpp.o
[ 10%] Building CXX object jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/src/PointerUtils.cpp.o
[ 14%] Building CXX object jcuda/JCudaDriverJNI/bin/bin/CMakeFiles/JCudaCommonJNI.dir/src/CallbackUtils.cpp.o
[ 17%] Linking CXX static library ../../../../lib/libJCudaCommonJNI.a
[ 17%] Built target JCudaCommonJNI
Scanning dependencies of target JCudaDriver
[ 21%] Building CXX object jcuda/JCudaDriverJNI/bin/CMakeFiles/JCudaDriver.dir/src/JCudaDriver.cpp.o
/home/user/jcuda/jcuda/JCudaDriverJNI/src/JCudaDriver.cpp: In function ‘bool getOptionValue(JNIEnv*, jobject, CUjit_option, void*&)’:
/home/user/jcuda/jcuda/JCudaDriverJNI/src/JCudaDriver.cpp:1112:28: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
             value = (void*)v;
                            ^
/home/user/jcuda/jcuda/JCudaDriverJNI/src/JCudaDriver.cpp:1124:28: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
             value = (void*)iv;
                            ^~
/home/user/jcuda/jcuda/JCudaDriverJNI/src/JCudaDriver.cpp: In function ‘jint Java_jcuda_driver_JCudaDriver_cuIpcOpenMemHandleNative(JNIEnv*, jclass, jobject, jobject, jint)’:
/home/user/jcuda/jcuda/JCudaDriverJNI/src/JCudaDriver.cpp:2893:31: warning: converting to non-pointer type ‘CUdeviceptr {aka long long unsigned int}’ from NULL [-Wconversion-null]
     CUdeviceptr nativePdptr = NULL;
                               ^~~~
[ 25%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCudaDriver-0.8.0-linux-x86_64.so
[ 25%] Built target JCudaDriver
Scanning dependencies of target JCudaRuntime
[ 28%] Building CXX object jcuda/JCudaRuntimeJNI/bin/CMakeFiles/JCudaRuntime.dir/src/JCudaRuntime.cpp.o
[ 32%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCudaRuntime-0.8.0-linux-x86_64.so
[ 32%] Built target JCudaRuntime
Scanning dependencies of target JNvrtc
[ 35%] Building CXX object jcuda/JNvrtcJNI/bin/CMakeFiles/JNvrtc.dir/src/JNvrtc.cpp.o
[ 39%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJNvrtc-0.8.0-linux-x86_64.so
[ 39%] Built target JNvrtc
Scanning dependencies of target JCublas
[ 42%] Building CXX object jcublas/JCublasJNI/bin/CMakeFiles/JCublas.dir/src/JCublas.cpp.o
[ 46%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCublas-0.8.0-linux-x86_64.so
[ 46%] Built target JCublas
Scanning dependencies of target JCublas2
[ 50%] Building CXX object jcublas/JCublas2JNI/bin/CMakeFiles/JCublas2.dir/src/JCublas2.cpp.o
[ 53%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCublas2-0.8.0-linux-x86_64.so
[ 53%] Built target JCublas2
Scanning dependencies of target JCufft
[ 57%] Building CXX object jcufft/JCufftJNI/bin/CMakeFiles/JCufft.dir/src/JCufft.cpp.o
[ 60%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCufft-0.8.0-linux-x86_64.so
[ 60%] Built target JCufft
Scanning dependencies of target JCurand
[ 64%] Building CXX object jcurand/JCurandJNI/bin/CMakeFiles/JCurand.dir/src/JCurand.cpp.o
[ 67%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCurand-0.8.0-linux-x86_64.so
[ 67%] Built target JCurand
Scanning dependencies of target JCusparse
[ 71%] Building CXX object jcusparse/JCusparseJNI/bin/CMakeFiles/JCusparse.dir/src/JCusparse.cpp.o
[ 75%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCusparse-0.8.0-linux-x86_64.so
[ 75%] Built target JCusparse
Scanning dependencies of target JCusolver
[ 78%] Building CXX object jcusolver/JCusolverJNI/bin/CMakeFiles/JCusolver.dir/src/JCusolver.cpp.o
[ 82%] Building CXX object jcusolver/JCusolverJNI/bin/CMakeFiles/JCusolver.dir/src/JCusolverDn.cpp.o
[ 85%] Building CXX object jcusolver/JCusolverJNI/bin/CMakeFiles/JCusolver.dir/src/JCusolverRf.cpp.o
[ 89%] Building CXX object jcusolver/JCusolverJNI/bin/CMakeFiles/JCusolver.dir/src/JCusolverSp.cpp.o
[ 92%] Linking CXX shared library ../../nativeLibraries/linux/x86_64/lib/libJCusolver-0.8.0-linux-x86_64.so
[ 92%] Built target JCusolver
Scanning dependencies of target JNvgraph
[ 96%] Building CXX object jnvgraph/JNvgraphJNI/bin/CMakeFiles/JNvgraph.dir/src/JNvgraph.cpp.o
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp: In function ‘bool releaseNativeTopologyDataCSC32I(JNIEnv*, nvgraphTopologyData*&, _jobject*&)’:
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp:184:26: warning: deleting ‘void*’ is undefined [-Wdelete-incomplete]
     delete nativeObject->nativeTopologyData;
                          ^~~~~~~~~~~~~~~~~~
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp: In function ‘bool releaseNativeTopologyDataCSR32I(JNIEnv*, nvgraphTopologyData*&, _jobject*&)’:
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp:245:26: warning: deleting ‘void*’ is undefined [-Wdelete-incomplete]
     delete nativeObject->nativeTopologyData;
                          ^~~~~~~~~~~~~~~~~~
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp: In function ‘bool releaseNativeTopologyDataCOO32I(JNIEnv*, nvgraphTopologyData*&, _jobject*&)’:
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp:306:26: warning: deleting ‘void*’ is undefined [-Wdelete-incomplete]
     delete nativeObject->nativeTopologyData;
                          ^~~~~~~~~~~~~~~~~~
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp: In function ‘nvgraphTopologyData* initNativeTopologyData(JNIEnv*, _jobject*&)’:
/home/user/jcuda/jnvgraph/JNvgraphJNI/src/JNvgraph.cpp:341:12: error: cannot convert ‘bool’ to ‘nvgraphTopologyData*’ in return
     return false;
            ^~~~~
make[2]: *** [jnvgraph/JNvgraphJNI/bin/CMakeFiles/JNvgraph.dir/build.make:63: jnvgraph/JNvgraphJNI/bin/CMakeFiles/JNvgraph.dir/src/JNvgraph.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:638: jnvgraph/JNvgraphJNI/bin/CMakeFiles/JNvgraph.dir/all] Error 2
make: *** [Makefile:84: all] Error 2

Thanks for these hints. I’ll fix them ASAP. (And I should probably at least set up a VirtualBox here to catch these GCC compile errors earlier…)

However, the error only seems to affect NVGraph. Do you need this for your tests? The other relevant libraries (mainly the driver and runtime libraries) should have been built properly, so you might already want to give these a try.

Sorry for being insistent here, but did you have a chance to try it out? (Or is the missing JNvgraph lib a “blocker” here? If so, I’d move it up in my TODO list.)

Sorry, I got held up. There are no JNI stall messages, but it is still deadlocking in cuMemcpy2DNative. Since my C++ dev knowledge is rather weak these days, can you tell me how I can build with debug symbols and launch a debugger on Linux to get a stack trace?

Unfortunately, I’m not so familiar with debugging on Linux. And although JCuda is a rather thin layer around CUDA, the path from the JVM into a loaded DLL is not so easy to trace. (There have been some reports that it is possible to use some of the NVIDIA debugging tools, but I haven’t investigated all options here).

Did you already activate the “baseline debugging”, namely, setting setExceptionsEnabled(true) for all involved libraries? You might also try running the application with cuda-memcheck, as described on http://jcuda.org/debugging/Debugging.html . If this doesn’t help, we can try to figure out which other options exist.
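
For reference, this is what I mean by “baseline debugging”, in a minimal, self-contained sketch (the class is only a placeholder for your application’s entry point; include only the libraries that you actually use):

import jcuda.driver.JCudaDriver;
import jcuda.jcublas.JCublas;
import jcuda.jcublas.JCublas2;
import jcuda.runtime.JCuda;

public class EnableExceptions
{
    public static void main(String args[])
    {
        // Turn CUDA error codes into Java exceptions, so that the first
        // failing call is reported immediately instead of being silently
        // ignored and only showing up later as a hang or a wrong result
        JCuda.setExceptionsEnabled(true);
        JCudaDriver.setExceptionsEnabled(true);
        JCublas.setExceptionsEnabled(true);
        JCublas2.setExceptionsEnabled(true);

        // ... the actual application code follows here ...
    }
}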

I was able to get the stack trace. It seems I’m experiencing a problem very similar to the one described here: https://devtalk.nvidia.com/default/topic/892595/random-execution-times-and-freezes-with-concurent-kernels-2/ It may be a driver/hardware bug, because messages show up in the kernel log.

The thread is two years old, which justifies some doubt that it is related to the same driver version: the driver reported there is 340.58, while the current one is 385.54. Similarly, they are talking about CUDA versions older than 6.5 there.

You mentioned that it seemed to be stuck at cuMemcpy2DNative, which might be more similar to the thread that is linked there, https://devtalk.nvidia.com/default/topic/860775/cpu-hangs-when-calling-thrust-copy_if/ - but this was supposed to be resolved as well.

Are there any further similarities between the problem that is reported there and the one that you observed? (The OS, the actual device, or others…?)

The thread seems somewhat inconclusive, and there are still many unknowns. Do you have the chance to run the program on a different OS/device, to narrow down the search space? (If necessary/desired, I could provide “snapshot” DLLs of the no_critical branch - although I’m currently working on the CUDA 9 update, and some other stuff still has to be fixed).

On further investigation (cuda-gdb and cuda-memcheck), it looks like it’s getting stuck in CUBLAS, in maxwell_sgemm_128x64_raggedMn_nn_splitK_EPILOG_SPIN_WAIT(). After commenting out the matrix multiplies in my code, I don’t see the hangs. It must be a bug in CUBLAS. I need to try CUDA 9.

Some results from testing the no_critical branch with CUBLAS commented out:

  1. (This was frustrating.) The native libraries in /tmp do not get overwritten; they need to be deleted manually. It took me a while to figure out why I was still seeing “stalled by JNI critical section” messages.

  2. No more “stalled by JNI critical section” messages.

  3. Comparing no_critical to the public JCuda, runtime gets much worse, from 3 minutes to over 10. This is very strange: the code does a lot more than host->device copies, and even a microbenchmark of copies shouldn’t perform this badly, since CPU->CPU copies are much faster than CPU->GPU copies.

I see why the code is slow. I am doing frequent, small (4 KB) copies from a large ByteBuffer (16 MB), and Get*ArrayElements copies the entire 16 MB each time. Using Get*ArrayRegion would help, but I’m planning to switch to pinned memory, so it is not a problem for me.
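
Roughly, the pinned-memory approach I have in mind looks like this - only a sketch, with my buffer sizes plugged in, and assuming that Pointer#getByteBuffer can be used to access memory allocated with cudaHostAlloc in this JCuda version:

import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

import java.nio.ByteBuffer;

import jcuda.Pointer;
import jcuda.runtime.cudaStream_t;

public class PinnedCopySketch
{
    public static void main(String args[])
    {
        final long TOTAL = 16L * 1024 * 1024; // the large host buffer (16 MB)
        final long CHUNK = 4L * 1024;         // size of the frequent small copies (4 KB)

        // Allocate pinned (page-locked) host memory once, instead of
        // passing a Java array through Get*ArrayElements on every copy
        Pointer hostPinned = new Pointer();
        cudaHostAlloc(hostPinned, TOTAL, cudaHostAllocDefault);

        // Fill the pinned buffer from Java (assumes getByteBuffer is
        // available for host-allocated memory in this JCuda version)
        ByteBuffer hostData = hostPinned.getByteBuffer(0, TOTAL);
        // ... write data into hostData ...

        Pointer devicePtr = new Pointer();
        cudaMalloc(devicePtr, TOTAL);

        cudaStream_t stream = new cudaStream_t();
        cudaStreamCreate(stream);

        // The frequent small copies: each one only transfers CHUNK bytes,
        // asynchronously, from the pinned buffer at the given offset
        long offset = 0; // example offset into the 16 MB buffer
        cudaMemcpyAsync(devicePtr.withByteOffset(offset),
            hostPinned.withByteOffset(offset), CHUNK,
            cudaMemcpyHostToDevice, stream);

        cudaStreamSynchronize(stream);

        // Clean up
        cudaStreamDestroy(stream);
        cudaFree(devicePtr);
        cudaFreeHost(hostPinned);
    }
}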

It’s always hard to pin down the culprit in a large codebase, particularly with CUDA: Basically all the methods return error codes, but can “also return error codes from previous async operations”, so the hang in a CUBLAS function might only be a symptom of an earlier error.
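
One way to narrow such things down is to temporarily insert explicit synchronization points: cudaDeviceSynchronize also returns (and, with exceptions enabled, throws) errors from preceding asynchronous work, so the error surfaces right after the block that caused it. A rough sketch (the placeholder comments stand for your actual calls):

import static jcuda.runtime.JCuda.cudaDeviceSynchronize;

public class LocalizeAsyncErrors
{
    public static void main(String args[])
    {
        // ... some asynchronous work: kernel launches, async copies, CUBLAS calls ...

        // With exceptions enabled, an error from the work above is thrown
        // here by the synchronization, instead of by some unrelated call
        // much later in the program
        cudaDeviceSynchronize();

        // ... the next block of work, again followed by a synchronization ...
        cudaDeviceSynchronize();
    }
}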

  1. Sorry about that. I have already considered adding some sort of -forceOverwrite flag for cases where the DLLs (with the same name) are supposed to be replaced.

  2. That sounds good, but has to be evaluated against other possible drawbacks, like 3.

  3. The reason that you mentioned in the last post may explain this. The current no_critical state was mainly intended to figure out where the error comes from. If it makes its way into master, there are certainly some points that have to be reviewed. One potential drawback of the Region methods is that they will always copy the data. The GetElements methods may still return a pointer to the original array without copying, but of course, the conditions under which each case occurs are hard to analyze…

I meant that cuda-gdb shows that the GPU is stuck inside the CUBLAS kernel. I think I just realized what the problem is: the CUBLAS v1 API is not reentrant. I need to use CUBLAS v2.
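
The essential change should be to give each thread its own CUBLAS handle (and, together with the streams-per-thread work, its own stream), roughly like this - a sketch with placeholder names, not my actual code:

import static jcuda.jcublas.JCublas2.cublasCreate;
import static jcuda.jcublas.JCublas2.cublasSetStream;
import static jcuda.runtime.JCuda.cudaStreamCreate;

import jcuda.jcublas.cublasHandle;
import jcuda.runtime.cudaStream_t;

public class PerThreadCublas
{
    // Each worker thread gets its own CUBLAS v2 handle, bound to its own
    // stream, so that calls from different threads do not share hidden
    // state as they do with the CUBLAS v1 API
    private static final ThreadLocal<cublasHandle> HANDLES =
        ThreadLocal.withInitial(() ->
        {
            cublasHandle handle = new cublasHandle();
            cublasCreate(handle);

            cudaStream_t stream = new cudaStream_t();
            cudaStreamCreate(stream);
            cublasSetStream(handle, stream);

            return handle;
        });

    static void sgemmOnThisThread(/* matrix arguments omitted */)
    {
        cublasHandle handle = HANDLES.get();
        // ... call JCublas2.cublasSgemm(handle, ...) with this thread's handle ...
    }
}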

I’ve implemented pinned memory and now performance is on par. Next I need to implement CUBLAS v2 and streams-per-thread. Then I will be able to give a definitive verdict on performance.

After fixing CUBLAS, there are no more deadlocks. I fully parallelized computation across threads using streams, with full use of pinned memory and async memcopies.

G1GC: Performance on a test workload running 16 threads (16-core CPU) is essentially identical between critical and no_critical (within 0.5%). However, the heap had to be increased from 1100 MB (no_critical) to 1700 MB (critical). Otherwise, G1 throws OutOfMemoryErrors.

ParallelOldGC: Performance is identical on the same workload. Heap size can be reduced to 700 MB for both versions.

The workload launches very small kernels and uses moderate CPU. I’ll update this thread when I test on more workloads. But I don’t launch very long kernels. Those should cause longer GC locks and have a greater impact on performance.

Those are some interesting insights. The transition from CUBLAS to “CUBLAS v2” was handled a little more transparently and silently on the CUDA side: they just introduced a “cublas_v2.h” header, which at some point basically replaced the “cublas.h” header. In favor of some sort of backward compatibility, I decided to reflect this in JCuda with “JCublas” and “JCublas2”.

In view of the original issue that caused this thread, I’m not (yet) entirely sure about the conclusions. From what I understand now:

  • The locks and crashes have not been directly related to the Critical methods
  • The fixes/changes of the no_critical version are not strictly necessary. Nevertheless, some of them are certainly appropriate, because using the Critical methods for things like 3-element arrays is simply not necessary (and might(!) increase the GC stalls)
  • Some other points (referring to the async methods) still have to be fixed separately

I’m a bit confused that the heap had to be increased for the critical version - or did you mix up the two when writing the above post? Also, the role of the pinned memory (also in view of the Get..Elements vs. Get..Region issue) is not yet entirely clear. And finally, I wonder whether there are other application patterns where the no_critical change might have a greater impact on performance or memory consumption. This still has to be sorted out.

In any case, I’m curious about further results of your local tests.

The switch to CUBLAS v2 isn’t that transparent. I noticed that SGEMM performance fell by 30-50% for my small matrices, and that the kernels actually launched switched from maxwell_* to magma_* (running on Kepler). I note that CUBLAS v2 allows choosing the GEMM algorithm, so I’ll refrain from complaining until I have tested all the algos.

Yes, the heap on G1 increased for the critical version, probably because of how the GC handles memory locking. (Perhaps it locks regions, and throws an OOM when all regions are locked instead of blocking.)

The role of pinned memory is to enable async memcopies and avoid the use of Get*Elements. All my host<->device memcopies are now async, using pinned memory. (Get*Elements/Get*Critical is presumably only used for passing parameters to kernels.)

Playing with CUBLAS v2 SGEMM algos proved useless. None of them work, much less fix the performance. CUBLAS v1 works fine as long as I synchronize the calls. It might be nice to document this fact.
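
By “synchronize the calls” I simply mean serializing them on the Java side, along these lines (a sketch; the wrapper class and method are only an illustration):

import jcuda.Pointer;
import jcuda.jcublas.JCublas;

public class SynchronizedSgemm
{
    // A single lock that all threads share: since the CUBLAS v1 API is
    // not reentrant, only one thread at a time may be inside cublasSgemm
    private static final Object CUBLAS_LOCK = new Object();

    static void sgemm(char transa, char transb, int m, int n, int k,
        float alpha, Pointer dA, int lda, Pointer dB, int ldb,
        float beta, Pointer dC, int ldc)
    {
        synchronized (CUBLAS_LOCK)
        {
            JCublas.cublasSgemm(transa, transb, m, n, k,
                alpha, dA, lda, dB, ldb, beta, dC, ldc);
        }
    }
}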

Regarding the performance drop: this is “unusual”, to say the least. However, CUBLAS uses some very, very black magic internally to use different kernels for different target platforms, as indicated by the kernel name change that you mentioned - so this might be some sort of regression…

The question regarding the pinned memory: This referred to the fact that there might be drawbacks of the current changes in the no_critical branch that you no longer notice, because you switched to pinned memory.

Playing with CUBLAS v2 SGEMM algos proved useless. None of them work, much less fix the performance.

It is not entirely clear (and not really documented, from what I see) what exactly the difference between these algorithms is. They are likely some sort of “implementation detail”. There seem to be some constraints on when each algorithm may be used. (For example, there might be an algorithm that only works for certain data types, or only for square matrices, or only for matrices whose size is a power of two, or whatnot…)

Here on my GTX970/Win8.1, the following sample works, but the performance differences are not really significant:

/*
 * JCuda - Java bindings for NVIDIA CUDA
 *
 * Copyright 2008-2016 Marco Hutter - http://www.jcuda.org
 */
package jcuda.jcublas.samples;

import static jcuda.cudaDataType.CUDA_R_32F;
import static jcuda.jcublas.JCublas2.cublasCreate;
import static jcuda.jcublas.JCublas2.cublasDestroy;
import static jcuda.jcublas.JCublas2.cublasGemmEx;
import static jcuda.jcublas.JCublas2.cublasGetVector;
import static jcuda.jcublas.JCublas2.cublasSetVector;
import static jcuda.jcublas.cublasGemmAlgo.CUBLAS_GEMM_ALGO0;
import static jcuda.jcublas.cublasGemmAlgo.CUBLAS_GEMM_ALGO2;
import static jcuda.jcublas.cublasGemmAlgo.CUBLAS_GEMM_ALGO4;
import static jcuda.jcublas.cublasGemmAlgo.CUBLAS_GEMM_ALGO5;
import static jcuda.jcublas.cublasGemmAlgo.CUBLAS_GEMM_ALGO6;
import static jcuda.jcublas.cublasGemmAlgo.CUBLAS_GEMM_ALGO7;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_N;
import static jcuda.runtime.JCuda.cudaDeviceSynchronize;
import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaMalloc;

import java.util.Arrays;
import java.util.List;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas2;
import jcuda.jcublas.cublasHandle;
import jcuda.samples.utils.JCudaSamplesUtils;

/**
 * This is a sample class demonstrating the application of JCublas2 for
 * performing a BLAS 'sgemm' operation, i.e. for computing the matrix <br>
 * <code>C = alpha * A * B + beta * C</code> <br>
 * for single-precision floating point values alpha and beta, and matrices 
 * A, B and C, using the extended CUBLAS GEMM function
 */
public class JCublas2SgemmExSample
{
    public static void main(String args[])
    {
        JCublas2.setExceptionsEnabled(true);
        testSgemm(2000);
    }
    
    // The list of CUBLAS GEMM algorithms to use. Note that the set of
    // supported algorithms will likely depend on the platform, the
    // size of the matrix, and other factors.
    private static final List<Integer> GEMM_ALGORITHMS = Arrays.asList(
        CUBLAS_GEMM_ALGO2,
        CUBLAS_GEMM_ALGO4,
        CUBLAS_GEMM_ALGO5,
        CUBLAS_GEMM_ALGO6,
        CUBLAS_GEMM_ALGO7
    );
    private static int GEMM_ALGO = CUBLAS_GEMM_ALGO0;

    /**
     * Test the JCublas sgemm operation for matrices of size n x n
     * 
     * @param n The matrix size
     */
    public static void testSgemm(int n)
    {
        float alpha = 0.3f;
        float beta = 0.7f;
        int nn = n * n;

        System.out.println("Creating input data...");
        float h_A[] = JCudaSamplesUtils.createRandomFloatData(nn);
        float h_B[] = JCudaSamplesUtils.createRandomFloatData(nn);
        float h_C[] = JCudaSamplesUtils.createRandomFloatData(nn);

        System.out.println("Performing Sgemm with JCublas...");
        for (int i : GEMM_ALGORITHMS)
        {
            GEMM_ALGO = i;
            try
            {
                sgemmJCublas(n, alpha, h_A, h_B, beta, h_C);
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }

    }

    /**
     * Implementation of sgemm using JCublas
     */
    private static void sgemmJCublas(
        int n, float alpha, float A[], float B[], float beta, float C[])
    {
        int nn = n * n;

        // Create a CUBLAS handle
        cublasHandle handle = new cublasHandle();
        cublasCreate(handle);

        // Allocate memory on the device
        Pointer d_A = new Pointer();
        Pointer d_B = new Pointer();
        Pointer d_C = new Pointer();
        cudaMalloc(d_A, nn * Sizeof.FLOAT);
        cudaMalloc(d_B, nn * Sizeof.FLOAT);
        cudaMalloc(d_C, nn * Sizeof.FLOAT);

        // Copy the memory from the host to the device
        cublasSetVector(nn, Sizeof.FLOAT, Pointer.to(A), 1, d_A, 1);
        cublasSetVector(nn, Sizeof.FLOAT, Pointer.to(B), 1, d_B, 1);
        cublasSetVector(nn, Sizeof.FLOAT, Pointer.to(C), 1, d_C, 1);

        // Execute sgemm
        Pointer pAlpha = Pointer.to(new float[] { alpha });
        Pointer pBeta = Pointer.to(new float[] { beta });
        
        long before = System.nanoTime();
        
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, 
            pAlpha, d_A, CUDA_R_32F, n, d_B, CUDA_R_32F, n, 
            pBeta, d_C, CUDA_R_32F, n, CUDA_R_32F, GEMM_ALGO);
        
        cudaDeviceSynchronize();
        
        long after = System.nanoTime();
        double durationMs = (after - before) / 1e6;
        System.out.println(
            "Algorithm " + GEMM_ALGO + " took " + durationMs + " ms");

        // Copy the result from the device to the host
        cublasGetVector(nn, Sizeof.FLOAT, d_C, 1, Pointer.to(C), 1);

        // Clean up
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        cublasDestroy(handle);
    }

}

The output is

Creating input data...
Performing Sgemm with JCublas...
Algorithm 2 took 6.009423 ms
Algorithm 4 took 5.844668 ms
Algorithm 5 took 6.196667 ms
Algorithm 6 took 5.925823 ms
Algorithm 7 took 6.093024 ms

Maybe the differences are larger for other configurations…?


In any case, you mentioned that you have small matrices. If you have many of them: did you already consider using the batched SGEMM? I just added a sample that I still had lying around, at jcuda-samples/JCudaSamples/src/main/java/jcuda/jcublas/samples/JCublas2SgemmBatched.java in the jcuda-samples repository on GitHub (but of course, I have not done any benchmarks here yet…)
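
In essence, the batched call takes device-side arrays of pointers to the individual matrices. A condensed, hypothetical sketch of the call pattern (not the actual sample; the sizes are arbitrary, and filling the matrices and reading back the results are omitted):

import static jcuda.jcublas.JCublas2.*;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_N;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.cublasHandle;

public class SgemmBatchedSketch
{
    public static void main(String args[])
    {
        int n = 64;      // size of each (small) square matrix
        int batch = 100; // number of matrices in the batch

        cublasHandle handle = new cublasHandle();
        cublasCreate(handle);

        // One device allocation per matrix (filling them is omitted here)
        Pointer dA[] = new Pointer[batch];
        Pointer dB[] = new Pointer[batch];
        Pointer dC[] = new Pointer[batch];
        for (int i = 0; i < batch; i++)
        {
            dA[i] = new Pointer();
            dB[i] = new Pointer();
            dC[i] = new Pointer();
            cudaMalloc(dA[i], n * n * Sizeof.FLOAT);
            cudaMalloc(dB[i], n * n * Sizeof.FLOAT);
            cudaMalloc(dC[i], n * n * Sizeof.FLOAT);
        }

        // The batched call expects *device* arrays containing the pointers
        // to the individual matrices, so the pointer arrays themselves
        // have to be copied to the device as well
        Pointer dAarray = new Pointer();
        Pointer dBarray = new Pointer();
        Pointer dCarray = new Pointer();
        cudaMalloc(dAarray, batch * Sizeof.POINTER);
        cudaMalloc(dBarray, batch * Sizeof.POINTER);
        cudaMalloc(dCarray, batch * Sizeof.POINTER);
        cudaMemcpy(dAarray, Pointer.to(dA), batch * Sizeof.POINTER, cudaMemcpyHostToDevice);
        cudaMemcpy(dBarray, Pointer.to(dB), batch * Sizeof.POINTER, cudaMemcpyHostToDevice);
        cudaMemcpy(dCarray, Pointer.to(dC), batch * Sizeof.POINTER, cudaMemcpyHostToDevice);

        Pointer pAlpha = Pointer.to(new float[] { 1.0f });
        Pointer pBeta = Pointer.to(new float[] { 0.0f });

        // One call performs all 'batch' multiplications
        cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
            pAlpha, dAarray, n, dBarray, n, pBeta, dCarray, n, batch);

        cudaDeviceSynchronize();

        // Clean up
        for (int i = 0; i < batch; i++)
        {
            cudaFree(dA[i]);
            cudaFree(dB[i]);
            cudaFree(dC[i]);
        }
        cudaFree(dAarray);
        cudaFree(dBarray);
        cudaFree(dCarray);
        cublasDestroy(handle);
    }
}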

Thank you for the sample. It turns out that cublasSetAtomicsMode(handle, cublasAtomicsMode.CUBLAS_ATOMICS_ALLOWED) breaks cublasGemmEx. I had blindly included it because the cuBLAS docs said it could boost performance.

I expanded the sample to be a bit more thorough and to use CUDA events:

/*
 * JCuda - Java bindings for NVIDIA CUDA
 *
 * Copyright 2008-2016 Marco Hutter - http://www.jcuda.org
 */
package jcuda.jcublas.samples;

import static jcuda.cudaDataType.CUDA_R_32F;
import static jcuda.jcublas.JCublas2.*;
import static jcuda.jcublas.cublasGemmAlgo.*;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_N;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_T;
import static jcuda.runtime.JCuda.*;

import java.util.Arrays;
import java.util.List;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas;
import jcuda.jcublas.JCublas2;
import jcuda.jcublas.cublasHandle;
import jcuda.jcublas.cublasAtomicsMode;
import jcuda.jcublas.cublasPointerMode;
import jcuda.runtime.cudaStream_t;
import jcuda.runtime.cudaEvent_t;
import jcuda.runtime.JCuda;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.JCudaDriver;
import jcuda.samples.utils.JCudaSamplesUtils;

/**
 * This is a sample class demonstrating the application of JCublas2 for
 * performing a BLAS 'sgemm' operation, i.e. for computing the matrix <br>
 * <code>C = alpha * A * B + beta * C</code> <br>
 * for single-precision floating point values alpha and beta, and matrices 
 * A, B and C, using the extended CUBLAS GEMM function
 */
public class JCublas2SgemmExSample
{
    public static void main(String args[])
    {
        JCuda.setExceptionsEnabled(true);
        
        CUdevice device = new CUdevice();
        CUcontext context = new CUcontext();
    
        JCudaDriver.cuInit(0);
        JCudaDriver.cuDeviceGet(device, 0);
        JCudaDriver.cuCtxCreate(context, 0, device);
            
        testSgemm();
        
        JCudaDriver.cuCtxDestroy (context);
    }
    
    // Remember, matrices are in column-major order
    // and pitch is also column-major
    final static boolean TRANSPOSE_A = true;
    final static boolean TRANSPOSE_B = true;
    final static int M = 2048; // number of rows of matrix op(A) and rows of matrix C
    final static int N = 2048; // number of columns of matrix op(B) and number of columns of C
    final static int K = 2048; // number of columns of matrix op(A) and number of rows of op(B)
    final static int LDA = 2048; // pitch of matrix A (before the transpose)
    final static int LDB = 2048; // pitch of matrix B (before the transpose)
    final static int LDC = 2048; // pitch of matrix C
    final static float ALPHA = 1f;
    final static float BETA = 0f;
    
    // The list of CUBLAS GEMM algorithms to use. Note that the set of
    // supported algorithms will likely depend on the platform, the
    // size of the matrix, and other factors.
    private static final List<Integer> GEMM_ALGORITHMS = Arrays.asList(
        CUBLAS_GEMM_DFALT,
        CUBLAS_GEMM_ALGO0, // doesn't seem to ever work
        CUBLAS_GEMM_ALGO1, // doesn't seem to ever work
        CUBLAS_GEMM_ALGO2,
        CUBLAS_GEMM_ALGO3,
        CUBLAS_GEMM_ALGO4,
        CUBLAS_GEMM_ALGO5,
        CUBLAS_GEMM_ALGO6,
        CUBLAS_GEMM_ALGO7
    );

    /**
     * Test the JCublas sgemm operation for the matrix sizes and
     * transpose modes that are defined by the constants above
     */
    public static void testSgemm ()
    {
        System.out.println("Creating input data...");
        float h_A[] = JCudaSamplesUtils.createRandomFloatData(K * LDA);
        float h_B[] = JCudaSamplesUtils.createRandomFloatData(N * LDB);
        float h_C[] = JCudaSamplesUtils.createRandomFloatData(N * LDC);

        System.out.println("Performing Sgemm with cuBLAS v1...");
        {
            try
            {
                sgemmJCublas (h_A, h_B, h_C, false, 0);
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }
        System.out.println("Performing Sgemm with cuBLAS v2...");
        for (int algo : GEMM_ALGORITHMS)
        {
            try
            {
                sgemmJCublas (h_A, h_B, h_C, true, algo);
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }
        System.out.println();
        System.out.println("Timing is inaccurate because the GPU doesn't have time to enter high-performance state.");
        System.out.println("Abnormally quick execution time indicates failure and incompatibility of the given algorithm.");
    }
    
    /**
     * Implementation of sgemm using JCublas
     */
    private static void sgemmJCublas (float A[], float B[], float C[], boolean useJCublas2, int algo)
    {
        // Create a CUBLAS handle
        cublasHandle handle = null;
        if (useJCublas2)
        {
            handle = new cublasHandle();
            cublasCreate(handle);
            // Causes CUBLAS v2 to fail (Linux, CUDA 8)
//            cublasSetAtomicsMode (handle, cublasAtomicsMode.CUBLAS_ATOMICS_ALLOWED);
            cublasSetPointerMode (handle, cublasPointerMode.CUBLAS_POINTER_MODE_HOST);
        }
        else
            JCublas.cublasInit();

        // Allocate memory on the device
        Pointer d_A = new Pointer();
        Pointer d_B = new Pointer();
        Pointer d_C = new Pointer();
        cudaMalloc(d_A, K * LDA * Sizeof.FLOAT);
        cudaMalloc(d_B, N * LDB * Sizeof.FLOAT);
        cudaMalloc(d_C, N * LDC * Sizeof.FLOAT);

        // Copy the memory from the host to the device
        cublasSetVector (K * LDA, Sizeof.FLOAT, Pointer.to(A), 1, d_A, 1);
        cublasSetVector (N * LDB, Sizeof.FLOAT, Pointer.to(B), 1, d_B, 1);
        cublasSetVector (N * LDC, Sizeof.FLOAT, Pointer.to(C), 1, d_C, 1);

        // Execute sgemm
        Pointer pAlpha = Pointer.to(new float[] { ALPHA });
        Pointer pBeta = Pointer.to(new float[] { BETA });
        
        cudaEvent_t startEvent = new cudaEvent_t();
        cudaEvent_t stopEvent = new cudaEvent_t();
        
        cudaEventCreate (startEvent);
        cudaEventCreate (stopEvent);
        
        // run twice for a bit more stable numbers
        for (int i = 0; i < 2; i++)
        {
            cudaEventRecord (startEvent, cudaStreamLegacy);
            if (useJCublas2)
                cublasGemmEx(handle, 
                    TRANSPOSE_A ? CUBLAS_OP_T : CUBLAS_OP_N, 
                    TRANSPOSE_B ? CUBLAS_OP_T : CUBLAS_OP_N, 
                    M, N, K, 
                    pAlpha, d_A, CUDA_R_32F, LDA, d_B, CUDA_R_32F, LDB, 
                    pBeta, d_C, CUDA_R_32F, LDC, CUDA_R_32F, algo);
            else
                JCublas.cublasSgemm
                    (TRANSPOSE_A ? 't' : 'n',
                    TRANSPOSE_B ? 't' : 'n',
                    M, N, K, 
                    ALPHA, d_A, LDA, d_B, LDB, 
                    BETA, d_C, LDC);
            cudaEventRecord (stopEvent, cudaStreamLegacy);
        }
        cudaEventSynchronize (stopEvent);
        
        float[] time = new float [1];
        cudaEventElapsedTime (time, startEvent, stopEvent);
        
        if (useJCublas2)
            System.out.println(
                "cuBLAS v2 algorithm " + algo + " took " + time[0] * 1000 + " us");
        else
            System.out.println(
                "cuBLAS v1 took " + time[0] * 1000 + " us");

        // Copy the result from the device to the host
        cublasGetVector (N * LDC, Sizeof.FLOAT, d_C, 1, Pointer.to(C), 1);

        // Clean up
        cudaEventDestroy (startEvent);
        cudaEventDestroy (stopEvent);
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        if (useJCublas2)
            cublasDestroy(handle);
        else
            JCublas.cublasShutdown();
    }

}

There does seem to be a bit of a performance difference between the algos, especially when matrices are transposed:

Performing Sgemm with cuBLAS v1...
cuBLAS v1 took 1600.096 us
Performing Sgemm with cuBLAS v2...
cuBLAS v2 algorithm -1 took 1571.264 us
cuBLAS v2 algorithm 0 took 7.328 us
cuBLAS v2 algorithm 1 took 6.784 us
cuBLAS v2 algorithm 2 took 1523.84 us
cuBLAS v2 algorithm 3 took 1593.0559 us
cuBLAS v2 algorithm 4 took 1719.872 us
cuBLAS v2 algorithm 5 took 1660.3201 us
cuBLAS v2 algorithm 6 took 2125.248 us
cuBLAS v2 algorithm 7 took 2459.808 us

Timing is inaccurate because the GPU doesn't have time to enter high-performance state.
Abnormally quick execution time indicates failure and incompatibility of the given algorithm.

Good catch: the transposed/non-transposed cases are likely to influence the performance of the different algorithms here as well. If you don’t mind, I’ll consider extending the example and maybe adding it to the samples.

More generally: a benchmark of certain (J)CUBLAS functions has already been discussed in the “JCublas DSYRK and DGEMM benchmark” thread. Maybe it contains some information that is relevant for you, but it’s a rather long thread with a lot of discussion. It would probably be good to condense this into a sample and/or a “Benchmarks” page on the website.

There, I could also try to cover the batched version. I’d really like to know where the “break-even” point between individual SGEMMs and batched SGEMM is.

It’s difficult to allocate the time, though, but it might be of general interest after all.