JCublas DSYRK and DGEMM benchmark

I think we are coming to an end of this discussion! Yes, your summary of what I am trying to do is roughly correct, I will send you some papers for the background. No, I am not interested in GFLOPS, I am interested in time. Yes, I am only interested in using existing libraries, not in coding CUDA kernels. Yes, SYRK should take roughly half the time of GEMM, as it does half the arithmetic, but the data transfer is similar, so it takes more than half the time. Yes, data transfer time is important, although I haven’t yet quantified data transfer time v. processing time.

But if you compare your times for

Dgemm; 200;40000; 25; 105.98647;30.19253;2854.37769;2802.38354;

and

Dgemm; 400;10000; 25; 112.71124;28.39114;2852.60962;2779.98804;

the number of processing operations is the same (200 x 200 x 40000 x 2 vs. 400 x 400 x 10000 x 2) but the data transfer time is different (200 x 40000 x 8 bytes vs. 400 x 10000 x 8 bytes). I'm not sure what your csv columns are exactly, so I don't know whether your figures confirm this or not. It appears that the data transfer in MB/s and the GFLOPS are similar, so from those the total time can be determined. My guess is:

benchmark;cols;rows ;iterations;GFLOPS ;avg.ms ;MB/s HtoD ;MB/s DtoH ;

Dgemm; 200;40000; 25; 105.98647;30.19253;2854.37769;2802.38354;
Dgemm; 400;10000; 25; 112.71124;28.39114;2852.60962;2779.98804;

Single iteration (secs)    processing   data transfer   total

Dgemm; 200;40000;          0.3          0.05            0.35
Dgemm; 400;10000;          0.3          0.1             0.4

but I am probably completely wrong (by factors of 25 and/or 2), and I guess you have the timing figures? As you said some time back, extracting data transfer and processing times from my original benchmark would be useful.
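
As a rough sanity check (assuming the MB/s columns refer to copying the full input matrix host to device and the small result matrix device to host, at 8 bytes per double, and that the GFLOPS and MB/s figures can simply be inverted into times), the per-iteration times can be estimated directly from the table; a small sketch of the arithmetic:

public class RoughTimeEstimate
{
    public static void main(String[] args)
    {
        // Values for "Dgemm; 200;40000" taken from the benchmark table above
        double gflops = 105.98647, htodMBs = 2854.37769, dtohMBs = 2802.38354;

        double flop   = 2.0 * 200 * 200 * 40000;                    // 3.2e9 operations
        double procMs = flop / (gflops * 1e9) * 1000;               // ~30.2 ms (matches avg.ms)
        double htodMs = (200.0 * 40000 * 8 / 1e6) / htodMBs * 1000; // 64 MB   -> ~22.4 ms
        double dtohMs = (200.0 * 200 * 8 / 1e6) / dtohMBs * 1000;   // 0.32 MB -> ~0.1 ms

        System.out.printf("processing %.1f ms, transfer %.1f ms, total %.1f ms%n",
            procMs, htodMs + dtohMs, procMs + htodMs + dtohMs);
    }
}

If that reading of the columns is right, processing (~30 ms) and the host-to-device transfer (~22 ms) would be of the same order of magnitude for these sizes.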

That was one of the reasons for the question. It may be important to know when which data has to be transferred where, in order to estimate the overall effort. Again, a bit oversimplified: it does not make sense to compute the timing for

  • Host → Device copy
  • Scale with diagonal matrix
  • Device → Host copy
  • Host → Device copy
  • Gemm
  • Device → Host copy

when the actual process (that has to compete with the pure Java timings) in the end should only be

  • Host → Device copy
  • Scale with diagonal matrix
  • Gemm
  • Device → Host copy

Maybe the papers will bring some insights here.

[QUOTE=NigelEssence]But if you compare your times for […] I'm not sure what your csv columns are exactly, so I don't know whether your figures confirm this or not. It appears that the data transfer in MB/s and the GFLOPS are similar, so from that the total time can be determined. […] but I am probably completely wrong (by factors of 25 and/or 2), and I guess you have the timing figures? As you said some time back, extracting data transfer and processing times from my original benchmark would be useful.[/QUOTE]

The columns do not (yet) contain the total time - and also not the time for the individual memory copies. The GFLOPs and the MB/s columns may be what is interesting when someone wants to compare the hardware. The MB/s should in most cases be similar for „reasonably large“ memory chunks, but also see the Bandwidth Test at http://jcuda.org/samples/JCudaBandwidthTest.java .
In the last table, I added the „avg.ms“ column, which is the average time for a single execution of the operation. I’ll add some more timings (for the memory transfers, and the total (average) time), because these are likely more interesting when comparing the algorithmic approaches.

We’re getting there!

In the actual algorithm, it is more like:

  • Host → Device copy large matrix A 200 x 150,000 (CUDA)
  • FOR i = 1 to 5 DO
      • do some stuff (Java)
      • construct weighting diagonal matrix W (1 x 150,000) (Java)
      • scale A with diagonal matrix W using dgmm (400 x 150,000 operations) (CUDA)
      • construct upper triangular matrix using syrk (150,000 x 400 x 400 / 2 operations) (CUDA)
      • Device → Host copy upper triangular matrix (400 x 400)
      • factorise upper triangular matrix (Java)
      • back substitute (Java)
      • do some stuff (Java)
  • ENDDO

so being able to move the large host-to-device copy of the matrix outside the loop will make a big difference.
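
A minimal sketch of how that restructuring might look with the JCuda runtime API (the dgmm/syrk calls themselves are only indicated as comments - the point here is just hoisting the large copy out of the loop; names and sizes are placeholders):

import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.*;

import jcuda.Pointer;
import jcuda.Sizeof;

public class HoistedCopySketch
{
    public static void solve(double hostA[], double hostC[], int nRows, int nCols)
    {
        Pointer dA = new Pointer();
        Pointer dC = new Pointer();
        cudaMalloc(dA, (long) nRows * nCols * Sizeof.DOUBLE);
        cudaMalloc(dC, (long) nCols * nCols * Sizeof.DOUBLE);

        // Copy the large matrix A to the device ONCE, outside the loop
        cudaMemcpy(dA, Pointer.to(hostA),
            (long) nRows * nCols * Sizeof.DOUBLE, cudaMemcpyHostToDevice);

        for (int i = 0; i < 5; i++)
        {
            // ... Java: do some stuff, construct the diagonal weights W
            // (only the small W would have to be copied to the device here) ...

            // CUDA: scale A with W (e.g. dgmm) and accumulate the upper
            // triangle with syrk into dC - calls omitted in this sketch

            // Copy only the small nCols x nCols result back each iteration
            cudaMemcpy(Pointer.to(hostC), dC,
                (long) nCols * nCols * Sizeof.DOUBLE, cudaMemcpyDeviceToHost);

            // ... Java: factorise, back substitute, do some stuff ...
        }
        cudaFree(dA);
        cudaFree(dC);
    }
}

Only the small weight vector and the nCols x nCols result then cross the bus inside the loop.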

Let me go implement this in my main code.

(Not sure when I’ll have a chance to invest more time here, so just a short note: I noticed some factorization snippets in the code that you provided, and now you again mentioned the factorization - did you see the JCudaMatrixCgSample20141023.zip and the JCublasMatrixInvert.java on the samples page? They may contain some snippets that may be useful here, although I certainly would have to refresh my knowledge about what they are doing internally - just a small pointer, maybe it’s interesting for you)

It’s OK, contrary to ‘common wisdom’, which you will find all over the place in books and articles, factorisation is not necessarily the bottleneck. In this particular case it is far from the bottleneck. For 150,000 x 200, to construct the matrix takes 2 x 150,000 x 200 x 200 /2 = 6,000,000,000 operations, to factorise takes (I think) 200 x 200 x 200 /6 ≈ 1,300,000 operations.

You often see this misconception.

Maybe one of the topics is ‘how do you optimise code’. The process is:

While you still have time and patience and enthusiasm {

  • find the bottleneck through experimentation and measurement
  • optimise the bottleneck
    }

Note that profilers can be VERY misleading; some Java compilers seem to inline code. I always use explicit code to gather timings, it is the only reliable way (unless anybody else knows better). My IDE is Netbeans. Similarly, you have to be careful what libraries like JCuda are doing - don’t assume that when a method returns, all the work has been done.

Of course (as you all know from books like Programming Pearls, and articles by Knuth and others), think long and hard about the data structures you are going to use.

In my 40 years of software development, I can hardly recall any occasion when my initial guess at the bottleneck has been correct. Sometimes in Java it is an obscure synchronisation which is the problem, deep inside some library method. I had this once with the Jide charting library (which is excellent), and they were kind enough to supply me with a revised version which removed the problem.

PS: the JCublas library implements some CUBLAS extensions which have some factorisation and back-substitution methods.

See cuBLAS :: CUDA Toolkit Documentation

[QUOTE=NigelEssence]It’s OK, contrary to ‘common wisdom’, which you will find all over the place in books and articles, factorisation is not necessarily the bottleneck. In this particular case it is far from the bottleneck. For 150,000 x 200, to construct the matrix takes 2 x 150,000 x 200 x 200 /2 = 6,000,000,000 operations, to factorise takes (I think) 200 x 200 x 200 /6 ≈ 1,300,000 operations.[/QUOTE]

OK, sorry, I see, the factorization only applies to the (small) result matrix, so there’s probably not so much to gain.

[QUOTE=NigelEssence]Note that profilers can be VERY misleading; some Java compilers seem to inline code. I always use explicit code to gather timings, it is the only reliable way (unless anybody else knows better). My IDE is Netbeans. Similarly, you have to be careful what libraries like JCuda are doing - don’t assume that when a method returns, all the work has been done.[/QUOTE]

The latter is part of the API spec. There are explicit ...Async functions, and with CUDA 5 (I think) they added support for streams to the runtime libraries. So in order to obtain timing information, one always has to use some form of synchronization.
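
A minimal sketch of that pattern (the deviceWork() method is just a hypothetical stand-in for some asynchronous operation, e.g. a kernel launch or a JCublas call):

import static jcuda.runtime.JCuda.cudaDeviceSynchronize;

public class SyncTimingSketch
{
    public static void main(String[] args)
    {
        long before = System.nanoTime();
        deviceWork();             // may return before the GPU has finished
        cudaDeviceSynchronize();  // block until all queued device work is done
        long after = System.nanoTime();
        System.out.printf("elapsed: %.3f ms%n", (after - before) / 1e6);
    }

    // Hypothetical placeholder for some asynchronous device operation
    private static void deviceWork()
    {
        // e.g. a kernel launch or an (asynchronous) JCublas call
    }
}

Without the cudaDeviceSynchronize, the nanoTime difference may only measure how long it takes to queue the work, not to execute it.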

But the Java part confuses me. The Java compiler itself does basically no optimization at all. You can look at the bytecode of a class with javap -c MyClass.class. But of course, the Hotspot Just-In-Time compiler does massive optimizations (inlining and unrolling being the "least crazy" ones here). This can lead to quite unexpected results, and care has to be taken that timings in artificial benchmarks are not distorted by some JIT optimizations (see "java - Why is 'while (i++ < n) {}' significantly slower than 'while (++i < n) {}'" on Stack Overflow, or more recently, "Java integer ++i not changing the value" on Stack Overflow). Things like this are to some extent covered by benchmarking frameworks like Caliper or JMH. (But even a "simple" microbenchmark can give rough indications of the performance, when some basic rules are kept in mind.)

Are you using any profilers, beyond the manual "currentTimeMillis" timing? For a while, I thought that JVisualVM was the only "real" option here (unless you'd like to spend big $$$s for such a profiler). But recently, Oracle has made Java Mission Control freely available, and it looks really interesting. I have not yet used it extensively, but the profiler seems to produce much more detailed and reliable results than JVisualVM (it might be a wrong impression, but I'll definitely test it more thoroughly).

However, the confusing part: you mentioned "I always use explicit code…" - I hope this does not mean that you are running this with -Xint? That, I think, also distorts the results massively. Nobody is using a JVM that runs purely in interpreted mode…

[QUOTE=NigelEssence]Of course (as you all know from books like Programming Pearls, and articles by Knuth and others), think long and hard about the data structures you are going to use.[/QUOTE]

Sure. Inlining or unrolling may bring a few % here and there, but choosing the right data structures and algorithms can make the real difference - even beyond obvious O(n^2)-vs-O(nlogn) choices. (Nevertheless, this „optimization loop“ that you mentioned is somehow rewarding, and even if a different approach may sometimes be more beneficial, it may be fun to see how to squeeze out the last % for a specific implementation).

By explicit code, I simply mean I wrap some timing gathering around some methods to accumulate times. My Java runtime options are:

-Djava.library.path=d:\JCuda -XX:+AggressiveOpts -XX:CompileThreshold=1000 -XX:+UseBiasedLocking -XX:+UseCompressedOops -Xms128m -Xmx8192m -server -verbosegc

The server option is probably the most important.
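
For what it's worth, a minimal sketch of the kind of timing wrapper I mean, accumulating per-section totals (the section names here are just placeholders):

import java.util.LinkedHashMap;
import java.util.Map;

public class SectionTimings
{
    private static final Map<String, Long> TOTALS_NS = new LinkedHashMap<String, Long>();

    static void add(String section, long elapsedNs)
    {
        Long current = TOTALS_NS.get(section);
        TOTALS_NS.put(section, (current == null ? 0 : current) + elapsedNs);
    }

    public static void main(String[] args)
    {
        for (int i = 0; i < 1000; i++)
        {
            long t0 = System.nanoTime();
            // ... e.g. the syrk call and its copies would go here ...
            add("syrk", System.nanoTime() - t0);

            long t1 = System.nanoTime();
            // ... e.g. the Java factorisation would go here ...
            add("factorise", System.nanoTime() - t1);
        }
        for (Map.Entry<String, Long> e : TOTALS_NS.entrySet())
        {
            System.out.printf("%-12s %10.3f ms%n", e.getKey(), e.getValue() / 1e6);
        }
    }
}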

Yes, I didn’t really mean the compiler, I meant the run-time optimisations.

I am not going to fiddle about with these options - as you and I know, the JVM is intensely complex. I was just adding a note of caution in case anybody thinks the Netbeans profiler gives accurate method timings.

The latest changes have sped things up by maybe 25% compared with the previous CUBLAS version, where I only used syrk, and a couple of hours’ work will see a further significant increase in performance, as other methods have started to emerge as the bottlenecks and can benefit from JCublas.

OK, then this “explicit code” is what I’d roughly call a “manual microbenchmark”. It may be tricky to get this right, but

  • iterated execution
  • keeping an eye on the GC
  • not throwing away the computation results

are the most important points to get a somewhat reliable and realistic measure (a minimal skeleton is sketched below).
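
For illustration, a very reduced skeleton along those lines (the workload is just a stand-in; the point is that it runs repeatedly and the result is consumed, so the JIT cannot optimize the measured code away):

public class ManualMicrobenchmarkSketch
{
    public static void main(String[] args)
    {
        int runs = 100;
        double checksum = 0;      // keep the results alive
        long totalNs = 0;
        for (int i = 0; i < runs; i++)
        {
            long before = System.nanoTime();
            double result = work(1000000);
            totalNs += System.nanoTime() - before;
            checksum += result;   // prevent dead-code elimination
        }
        System.out.println("avg " + (totalNs / 1e6 / runs)
            + " ms (checksum " + checksum + ")");
    }

    // Stand-in workload; in the real benchmark this would be the
    // operation that is actually being measured
    private static double work(int n)
    {
        double sum = 0;
        for (int i = 1; i <= n; i++)
        {
            sum += Math.sqrt(i);
        }
        return sum;
    }
}

Running with -verbosegc then shows whether the GC kicked in during the measured loop.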

Indeed, the -server flag can make huge differences (at least on Windows), and the -verbosegc flag is important during development (and to make sure that microbenchmark timings are not distorted too much by the GC). The other ones seem very specific, and I have rough ideas what they do, but haven’t yet investigated their actual effects on the performance in specific constellations. (There are many more, the “XX” actually means that they are not stable and subject to change at any time, and looking up - and understanding - what they actually do in the Hotspot JIT source code is hardly possible with reasonable effort.)

Indeed, all this makes profiling very difficult. The results from profiling and sampling in JVisualVM literally seem to be completely unrelated to each other - and if in one case method A seems to use 80% of the time, and in the other case method B seems to be the culprit, it’s hard to do focussed optimizations. However, I’ll give the “Mission Control” from Oracle another try when I have more time… (hm).

I know :smiley: But it may be useful for others. I have CUDA’d all my bottlenecks, and it is now about 35% faster compared to when I only CUDA’d the syrk.

I would call my manual benchmarks milli benchmarks rather than micro benchmarks. I don’t see any point in micro benchmarks, they are completely unrepresentative of real world, but maybe JVM specialists may be interested.

Latest Netbeans (8.1) seems to have changed the profiler a lot, visually at least, but it still gives almost zero time for any methods which have been inlined away. I scratched my head for some time before I realised what was happening. I think the server option does a lot more actual compilation to machine code, and probably a lot more inlining.

The trick seems to be to make code simple enough so the Java optimisations have a chance.

Java performance is rather different compared with when I first started using it in 1999. All the myths have been debunked.

Sure. It’s probably difficult (or even plainly impossible) to retain the information about inlined methods (to show them in such a nice call tree), and still gather the precise timings - when a method like

void foo() {
    barA();
    barB();
}
void barA() { x+= 2; }
void barB() { x+= 3; }

is inlined to become

void foo() {
    x+= 2;
    x+= 3;
}

and eventually be optimized to

void foo() {
    x+= 5;
}

how should the profiler convey this information? (Of course, this example is overly simplistic, but… it’s difficult anyhow.)

[QUOTE=NigelEssence;127525]
Java performance is rather different compared with when I first started using it in 1999. All the myths have been debunked.[/QUOTE]

Certainly. And personally, I’m convinced that (already now, but even more in the future) “interpreted” or just-in-time-compiled languages will achieve a higher overall performance than statically compiled ones. The JVM simply has options that do not exist for statically compiled languages and libraries. This refers to runtime compilation for different target hardware architectures, as well as things like The Black Magic of (Java) Method Dispatch.

(But still we’re largely using C++ at work, „because of the performance“ … and so much precious time is wasted…)

Even more so when performance includes:

  • developer performance
  • library availability performance
  • concurrency support performance

and particularly, for me as a product vendor

  • easy deployment for everybody without having to compile different versions performance

Can you imagine the pain of using different compilation flags for different customers on different platforms? No wonder you often find the lowest common denominator.

I haven’t done any tests, but JavaFX seems pretty fast for 3D graphics.

I once sat at a board meeting where our marketing director suggested I rewrite all the 0.5M lines of Java into C to get better performance. In a few months.

These points are what my side note about the “wasted time” referred to. There are fields where you should use C/C++, no doubt. For everything that is related to things that are not part of the JVM, C/C++ will always be the first (although not necessarily the only) choice. This refers to computer graphics/OpenGL, using SDKs that are close to the drivers of specific hardware (Oculus Rift, Kinect, etc.), and (yes:) also to GPU computing like CUDA. But what bugs me is when great efforts are wasted by writing software in C++, the true reasons are kept back, and the mock reason is that “the performance of C++ is higher than that of Java”. That’s like saying “French is more honest than Spanish”. Ehm. No. These are just languages. It still depends on what you say.
However, I don’t want to pull this into a language bashing thread, so I’ll stop my rant here :wink:

JavaFX, by the way, has very limited support for 3D graphics. Of course, the cards today are so fast that you can throw quite some geometry at them and they still will render it smoothly - but the crucial point is that JavaFX does not support custom shaders (like GLSL shaders). It’s a rather thick abstraction layer around OpenGL, and for certain tasks, it is simply too thick. In this case, one would have to fall back to LWJGL or JOGL (or hope that they’ll extend JavaFX in that way, which is still being discussed)

@NigelEssence I have some difficulties responding to your mails. Your mail domain (server) seems to be offline. I’ll try again later…

I’m sorry, my domain has been stolen; steps are being taken to deal with this theft and fraud. However, I do not know how long it will take - I assume there is nothing urgent? I do have a gmail address which can be used, but let’s hope my domain is retrieved soon. I am in communication with my ISP.

Some people lead sad lives…

It’s not urgent in that sense. I just tried to weave together some of the discussion threads that had been open from several mails.

I wonder how a domain can be “stolen”… but I guess it’s one of those odd things that can literally happen to anybody.

Warning, gallows humor
[spoiler]
Maybe you did not make clear enough that this domain is your property… :stumm:
[/spoiler]

Some follow-up: doing some tests with the actual large data set - with lots of timing information - it turns out that


                cudaMalloc(d_A, nRows * nCols * Sizeof.DOUBLE);
                cudaMalloc(d_C, nCols * nCols * Sizeof.DOUBLE);

is taking a significant time. Each time I call gemm or dsyrk, nCols may be changing. I typically have nRows = 150,000, nCols = 200. Is it possible to reuse the pointers and allocation, so that they are re-allocated if the size INCREASES (and the previous memory freed), but are reused if the size DECREASES (so that the allocated memory is larger than the size of data used in calculations and data transfer)? I will do some tests to check if it works. There may be other ways to do what I want - I have been doing some searching.

I do this kind of thing in my Java code to reduce the creation of large double[][] arrays - it has a small but significant effect in Java, reducing all that garbage collection and allocation. It was a technique used in a lot of old (but still used) Fortran programs, where a large array was created at compile time and split up into the necessary arrays at run time with suitable pointers. In those days, array sizes had to be defined at compile time…

See CUDA Optimization Techniques

*** Edit ***

PS: from the link above, allocating 100 MBytes takes well over a second, as does freeing. In contrast, transferring 100 MBytes takes less than 100 milliseconds.

First of all: Yes, it should be possible to re-use pointers that have been allocated to point to memory regions that are larger than the memory that is actually required.

(I once considered creating some slightly more object-oriented wrappers for JCuda, and of course, something like this (c/sh/w)ould include some sort of „GPUBuffer“ class that does this transparently - similar to an „ArrayList“).
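
Just to illustrate the idea: a very small sketch of what such a reusable buffer might look like (the class and method names here are made up, nothing official):

import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaMalloc;

import jcuda.Pointer;

// Hypothetical helper that keeps a device allocation alive and only
// re-allocates when the requested size exceeds the current capacity
class ReusableDeviceBuffer
{
    private final Pointer pointer = new Pointer();
    private long capacityBytes = 0;

    Pointer ensureCapacity(long requiredBytes)
    {
        if (requiredBytes > capacityBytes)
        {
            if (capacityBytes > 0)
            {
                cudaFree(pointer);
            }
            cudaMalloc(pointer, requiredBytes);
            capacityBytes = requiredBytes;
        }
        return pointer;
    }

    void release()
    {
        if (capacityBytes > 0)
        {
            cudaFree(pointer);
            capacityBytes = 0;
        }
    }
}

The caller would then request nRows * nCols * Sizeof.DOUBLE bytes before each call, and only pay for a cudaMalloc when the required size actually grows.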

For the memory optimization in general, there are obvious cases, e.g. you should usually not write

for (int i=0; i<large; i++) 
{
    Pointer pointer = new Pointer();
    cudaMalloc(pointer, size);
    workWith(pointer);    
    cudaFree(pointer);
}

when

Pointer pointer = new Pointer();
cudaMalloc(pointer, size);
for (int i=0; i<large; i++) 
{
    workWith(pointer);    
}
cudaFree(pointer);

will do the same. And for the case that the „size“ is not constant, it is probably a good idea to consider allocating the maximum required size only once, in the beginning. (Note that I said that „considering“ it is good, not that „doing“ it is good: There may always be cases where this is not appropriate)

However, the time for the allocations should usually not be the bottleneck, and in particular, it should not be nearly as high as you described. Regarding the site that you linked to: note the information that is summarized under “A Note About These Measurements”:

A Note About These Measurements:

Unless otherwise noted, the data shown on this site were measured on a machine (Barracuda10) with […] the following software configuration:

Ubuntu 7.10 (64-bit)
NVIDIA driver version 177.67
**CUDA Toolkit version 2.0**
**CUDA SDK version 2.0 Beta2**

(emphasis by me).

Oh, I remember these days. We were young. CUDA was new. Allocation was slow :smiley: (obviously)

I just scribbled down another test…

import static jcuda.runtime.JCuda.cudaDeviceSynchronize;
import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaMalloc;
import static jcuda.runtime.JCuda.cudaMemcpy;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToHost;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

import java.util.Locale;

import jcuda.Pointer;
import jcuda.runtime.JCuda;

public class AllocationBenchmark
{
    public static void main(String[] args)
    {
        int runs = 20;
        for (int size = 1; size <= 1 << 28; size <<= 1)
        {
            runTest(size, runs);
        }
    }

    private static void runTest(int size, int runs)
    {
        JCuda.setExceptionsEnabled(true);

        Pointer pointer = new Pointer();
        byte data[] = new byte[size];
        long before = 0;
        long after = 0;
        long totalAllocNs = 0;
        long totalFreeNs = 0;
        for (int i = 0; i < runs; i++)
        {
            before = System.nanoTime();
            cudaMalloc(pointer, size);
            cudaDeviceSynchronize();
            after = System.nanoTime();
            totalAllocNs += (after - before);

            cudaMemcpy(pointer, Pointer.to(data), 
                size, cudaMemcpyHostToDevice);
            cudaDeviceSynchronize();
            cudaMemcpy(Pointer.to(data), pointer, 
                size, cudaMemcpyDeviceToHost);
            cudaDeviceSynchronize();

            before = System.nanoTime();
            cudaFree(pointer);
            cudaDeviceSynchronize();
            after = System.nanoTime();
            totalFreeNs += (after - before);
        }

        // Total times over all runs, converted to milliseconds
        double totalAllocMs = totalAllocNs / 1e6;
        double totalFreeMs = totalFreeNs / 1e6;

        System.out.printf(Locale.ENGLISH,
            "Size %14d alloc %12.4f ms  free %12.4f ms%n", 
            size, totalAllocMs, totalFreeMs);
    }
}

and the timings are along the lines of


...
Size       33554432 alloc       5.9246 ms  free      12.7367 ms
Size       67108864 alloc       6.0273 ms  free      15.9438 ms
Size      134217728 alloc       6.7702 ms  free      32.1162 ms
Size      268435456 alloc       8.3200 ms  free      60.5944 ms

(on a GTX 970 with CUDA 7.5)

Although I’m a bit surprised to see that freeing is slower than allocating, the time is far away from 1 second…

[ot]
(I wonder whether the fact that I just wrote this down is an indication that my “benchmarking library” is the sledgehammer that is supposed to be used for cracking a nut, but… I’ll probably continue to fiddle around with this sledgehammer, maybe it will become useful one day)
[/ot]

Well, as we say in the U.K., the proof is in the pudding. Doing what I suggested above improves the speed of my overall algorithm by over 30%, so you can imagine how much time the allocation and freeing was taking before, given that I am doing a lot of processing for each allocation.

My next bottleneck is the rather slow log and exp functions in Java; there are some tricks to halve the number of these, but it is not important enough to code up a CUDA kernel.

It’s so essential to have explicit timing figures for each significant section of the code; as we agree, profilers are a very crude and unreliable tool. The numbers didn’t add up until I timed the sections - as always, the bottlenecks are not always where you expect.

Maybe there is some hidden reason why my allocations/freeing were taking a long time, but it is not an issue any more so I won’t be investigating. As far as I know I am using all the latest libraries.

Well, I didn’t say that it cannot be a bottleneck. Particularly, the 60 ms for a cudaFree are still surprising for me - I mean, it does not really have to do anything there. For cudaMalloc, one could do some handwaving and say “it has to set stuff up, find free memory, allocate internal tracking mechanisms, whatever”. In fact, what’s even more inexplicable for me is that the times so heavily depend on the size of the memory blocks. How is allocating 1000 bytes more effort than allocating 999 bytes? (But admittedly, I don’t have a clue about how the memory management with malloc/free works, and even less about cudaMalloc/cudaFree inside the CUDA black box.)

However, if some restructuring of the memory management does bring a speedup, then you’re obviously right about this pudding thing :wink:

Regarding the log/exp: I’m not sure whether I already mentioned it, but there is a small library at jcuda.org - Utilities for vector operations with JCuda. So if you want to do some element-wise operation on a vector, e.g. compute the log of all elements of a vector, then this might be possible with

VecDouble.log(size, input, output);

The goal of this library was exactly that: avoiding the necessity to write your own kernel for some trivial mathematical bulk operation. It basically contains a bunch of pre-compiled PTX files for all these trivial kernels. One important shortcoming is that it does not support strided operations (e.g. “compute the log of every n-th element of this vector”). Adding these is already on my TODO list, but has not had high priority yet. At the moment, I’m working on “JOCLBlas”, JOCL-based bindings for the BLAS part of clMathLibraries · GitHub - I’m curious to see how their GEMMs perform compared to NVIDIA’s CUBLAS, but there’s still some work to be done before that.

@NigelEssence

Just a short heads-up: I just pushed an initial version of JOCLBLAS to https://github.com/gpu/JOCLBLAS

Simple comparison of the FLOPS for a Sgemm (once with JCublas, and once with JOCLBLAS) does not look very promising in terms of performance…

Benchmark results
[spoiler]


benchmark | cols | rows  | iterations | GFLOPS     | avg.ms  | MB/s HtoD  | MB/s DtoH  | 
    Sgemm |  200 | 40000 |         50 |  756.70776 | 4.22885 | 2845.26855 | 2798.70239 | 
SgemmJOCL |  200 | 40000 |         50 |  669.12274 | 4.78238 | 2745.60156 | 2726.60913 | 
    Sgemm |  250 | 25600 |         50 |  638.10065 | 5.01488 | 2837.22876 | 2794.69263 | 
SgemmJOCL |  250 | 25600 |         50 |  793.85791 | 4.03095 | 2712.33130 | 2690.46387 | 
    Sgemm |  300 | 17777 |         50 | 1323.16248 | 2.41834 | 2826.13184 | 2795.17651 | 
SgemmJOCL |  300 | 17777 |         50 |  781.54242 | 4.09429 | 2694.78687 | 2723.56519 | 
    Sgemm |  350 | 13061 |         50 |  998.84589 | 3.20364 | 2809.91382 | 2783.36011 | 
SgemmJOCL |  350 | 13061 |         50 |  749.62042 | 4.26875 | 2728.29834 | 2743.68384 | 
    Sgemm |  400 | 10000 |         50 | 2221.35034 | 1.44057 | 2791.08643 | 2771.02197 | 
SgemmJOCL |  400 | 10000 |         50 |  725.81610 | 4.40883 | 2675.41724 | 2681.30737 | 
    Sgemm |  450 |  7901 |         50 | 1227.35168 | 2.60716 | 2788.84180 | 2774.18750 | 
SgemmJOCL |  450 |  7901 |         50 |  746.02606 | 4.28927 | 2644.92798 | 2682.72095 | 
    Sgemm |  500 |  6400 |         50 | 1062.76782 | 3.01101 | 2779.85767 | 2730.93726 | 
SgemmJOCL |  500 |  6400 |         50 |  841.04102 | 3.80481 | 2611.94556 | 2625.33936 | 
    Sgemm |  550 |  5289 |         50 | 1750.34875 | 1.82812 | 2802.30005 | 2779.72803 | 
SgemmJOCL |  550 |  5289 |         50 |  792.39508 | 4.03819 | 2622.79761 | 2643.74268 | 
    Sgemm |  600 |  4444 |         50 | 1330.41821 | 2.40502 | 2767.75464 | 2725.53857 | 
SgemmJOCL |  600 |  4444 |         50 |  762.01740 | 4.19896 | 2598.58179 | 2596.15625 | 
    Sgemm |  650 |  3786 |         50 | 2412.81152 | 1.32591 | 2766.19580 | 2752.69019 | 
SgemmJOCL |  650 |  3786 |         50 |  821.62775 | 3.89370 | 2609.78198 | 2597.44043 | 
    Sgemm |  700 |  3265 |         50 | 1846.03345 | 1.73328 | 2893.15894 | 2813.14136 | 
SgemmJOCL |  700 |  3265 |         50 |  926.34888 | 3.45410 | 2606.59619 | 2603.51611 | 
    Sgemm |  750 |  2844 |         50 | 1580.02527 | 2.02497 | 2733.06934 | 2723.69434 | 
SgemmJOCL |  750 |  2844 |         50 |  613.42462 | 5.21580 | 2477.55054 | 2530.30054 | 
    Sgemm |  800 |  2500 |         50 | 2223.13721 | 1.43941 | 2762.05078 | 2747.63965 | 
SgemmJOCL |  800 |  2500 |         50 |  812.36914 | 3.93910 | 2503.89331 | 2546.23364 | 
    Sgemm |  850 |  2214 |         50 | 1768.19531 | 1.80932 | 2766.73242 | 2719.52100 | 
SgemmJOCL |  850 |  2214 |         50 |  862.27576 | 3.71022 | 2616.78467 | 2544.56372 | 
    Sgemm |  900 |  1975 |         50 | 2471.87134 | 1.29436 | 2816.65308 | 2696.80762 | 
SgemmJOCL |  900 |  1975 |         50 |  937.53320 | 3.41268 | 2583.44409 | 2469.92651 | 
    Sgemm |  950 |  1772 |         50 | 1810.81567 | 1.76631 | 2807.92944 | 2666.83936 | 
SgemmJOCL |  950 |  1772 |         50 |  953.76654 | 3.35350 | 2495.50513 | 2473.54785 | 
    Sgemm | 1000 |  1600 |         50 | 1708.23010 | 1.87328 | 2724.58472 | 2677.20776 | 
SgemmJOCL | 1000 |  1600 |         50 |  772.83923 | 4.14058 | 2556.13867 | 2443.28369 | 

[/spoiler]

Of course, JCublas with its heavily-tuned optimizations for NVIDIA achieves a higher performance, but still the difference is larger than I expected. The advantages of the OpenCL-based version, however, could be

  • runs on AMD cards (and even CPUs!) as well
  • somewhat simpler “installation” (the natives are contained in the JARs - but this is on the roadmap for JCuda as well)

Some more details (downloads and a sample), in case you want to give it a try nevertheless, are in the JOCLBLAS announcement thread.