Matrix products

Hi. I just want to compare the results between a big matrix producs done in normal java:

package matric;

public class Main {


    public static void main(String[] args) {
        int [][] a = new int [1000] [1000];
        int [][] b = new int [1000] [1000];

        for (int i=0; i< a.length; i++){
            for (int j=0; j<a**.length; j++){
                    a**[j]= (int)(Math.random()*10);

        for (int i=0; i< b.length;i++){
            for (int j=0; j<b**.length;j++){
                    b**[j]= (int)(Math.random()*10);

        for (int i=0; i< a.length; i++){
            for (int j=0; j<a**.length; j++){
                    System.out.print(a**[j]+" ");


        for (int i=0; i< b.length; i++){
            for (int j=0; j<b**.length; j++){
                    System.out.print(b**[j]+" ");

        int [][] c = new int [a.length] [b[0].length];

        long begin = System.currentTimeMillis();
        for (int i=0; i< a.length; i++){
            for (int j=0; j<b[0].length;j++){
                for (int k=0; k<a[0].length;k++){
                    c**[j] += a**[k] * b [k] [j];

        for (int i=0; i<c.length; i++){
            for (int j=0; j< c[0].length;j++){
                System.out.print(c**[j]+" ");
        System.out.println("Java: "+(System.currentTimeMillis()-begin)/1000+" seconds");

And the same using Jcuda but i’m not capable to start. I use Windows XP Professional SP3 (32 bit) with NetBeans 6.9 and CUDA 3.2 with JCuda 0.3.2a. Any help ?


There are several threads about matrix multiplication. A search may bring up some of them, e.g. this one.

But before you try it, you should definitely provide more information about your approach (first of all: are you going to write an own kernel, use the NVIDIA example, or CUBLAS?), and, if possible, more specific error messages.


Thanks for the reply Marco. Which are the difference between a kernel, use an Nvidia example or CUBLAS ? I just want to do the same code that i’ve posted, and use jcuda but i’m not capable from where i should start. Then i want to compare timing results between java and jcuda. No error message (for now).


When you just want to do a Matrix Multiplication with CUDA, then you can use CUBLAS (i.e. JCublas). It is a library for general matrix operations. So you don’t have to write your own CUDA code, but can simply use the library. Writing an own kernel would be far more difficult - and in any case, the kernel would not be as fast as that of CUBLAS.

The first example from , namely “”, performs a matrix multiplication (called “Sgemm” in CUBLAS). This may serve as a starting point.

Accurately measuring and comparing the speed of Java and JCublas may be tricky, but for a 1000x1000-Matrix, JCublas will be much faster anyhow.


Ok. I’ll try to do something. If i have problems, i’ll post them here. Thanks.

Hi Marco. I’ve saw that is not possible to do a simple m x m matrix multiplication with CUBLAS (and JCublasSample demonstrate this). There are many functions in CUBLAS but not what i want. How can i do ?


The “sgemm” operation of the sample is computing
C = alpha * A * B + beta * C
for Matrices A, B and C. When you set alpha=1 and beta=0, you have
C = A * B
and thus a simple multiplication.

If you want to create your own kernel anyhow, you should have a look at the post that I linked, or at the “” to see how this can be done in general. But writing a good, efficient matrix multiplication for arbitrary matrix sizes is not as trivial as it might look at the first glance. If you only want to use the matrix multiplication (and do not want to learn CUDA programming and play around with a simple matrix multiplication kernel to get familiar with it) you should definitely use JCublas.


Ok, thanks for your help.

A couple of comments.

Your Java code is VERY inefficient. For example, the line

c**[j] += a**[k] * b [k] [j];

jumps around the first index of b, which is very inefficient in terms of memory access. Java 2D arrays are 1D arrays of 1D arrays.

You can also get big improvements by storing local copies, unwrapping the code, etc etc etc.

But more importantly, any comparison should be between CUBLAS and a multi threaded Java implementation. I am sick of all these comparisons saying GPU’s are so wonderful when compared with a single threaded Java version.

In my tests, done a month or two ago, DGEMM (double precision version of SGEMM) in JCUBLAS was roughly the same speed as a quad core Java 8-threaded implementation for size of about 1000 x 1000. I was multiplying a matrix by its transpose, and creating an upper triangular system (of a symmetric matrix).

In short, only worth the effort if you have bigger matrices, or can do a sequence of tasks keeping data in GPU memory. But fun to try out for yourself.

Hello Essence,

You’re probably right with most of your comments. And indeed, Benchmarks tend to be „biased“, depending on what the person executing the benchmark wants to propose :wink:

When you have something like

float A[] = ...
float B[] = ...
float result[] = ...
multiply(A,B, result);

and you have to implement the „multiply“-Method, there are at least two possibilities:

  1. You could start out writing an memory-layout-optimized efficient N-Thread Java implementation of the Coppersmith–Winograd algorithm
  2. You could use JCublas
    (Yes, this statement may also be „biased“, but I think the point is clear :wink: )

But seriously, I think that JCublas should not primarily be considered as a „drop-in-replacement“ for a Java Matrix Library. (It could be, in some cases). It can, in any case, be used when someone has a sequence of complex computations that are already done with CUDA on the GPU, and one step of this sequence is a matrix operation. That’s what CUBLAS mainly was intended for.

In fact, I know that the real world application areas for JCublas are rather limited, compared to a plain Java implementation. And this for several reasons. The main reason may be the lack of platform independence: You need the native libraries for CUBLAS, which implies that you need a CUDA Toolkit installed, which implies that you have an NVIDIA card - it could be tough to explain that to your customer who just spent 1000 bucks for some High-End AMD Radeon card… :o On the other hand: A bunch of Fermi cards will outperform any CPU out there, at least in terms of pure FLOPS.

Apart from that, one could justify this limited applicability (outside the academic world) by mentioning that GPGPU is rather young compared to other technologies. But things like OpenCL will literally bring GPGPU to the masses - in an abstract form - namely as „heterogeneous computing“. One should always keep in mind that the number of processor cores will double every 18 months, and this will cause a „concurrency revolution“, which most likely can not be handled by simply spawning a few hundred Java Threads - except when Oracle anticipates this development, and tailors the JVM to cope with these new achitectures. I’m not sure in how far this is really possible, but I hope that they will quickly tackle this and find a good solution. Things like the upcoming Fork/Join Framework in Java 7 might be a first step.


Just for the record, this is my optimised version of mutiplying a matrix by its transpose. This is considerably faster than any other Java code I found on the web. No doubt there are some moew tweaks, but it’s good enough for me. I didn’t find any Java matrix library which did a decent job of handling symettric matrices.

Of course, I have other code for cholesky etc. So, as Marco says, if I string along this code with a cholesky and then solving multiple rhs’s, it become a good candidate for JCUBLAS - but wait, does CUBLAS handle upper triangular?

As Wirth says, data structure + algorithm = program

Note I pass in the connection pool, and this pool is available for many different tasks. I call this methods thousands of times.

    public static double[] generateXTXThreadedFuture(final double[][] xtest, final double[] yupperrow, final int nrow, final int nruns, final int numTasks, final CompletionService<String> pool) {
        for (int interval = numTasks, end = nrow, size = (int) Math.ceil(nrow * 1.0 / numTasks);
                interval > 0;
                interval--, end -= size) {
            final int to = end;
            final int from = Math.max(0, end - size);
            final Runnable runnable = new Runnable() {

                public void run() {
                    double xymult;
                    double xymult1;
                    double xymult2;
                    double xymult3;
                    int yindex = from * nrow - from * (from - 1) / 2;
                    final double[] tmparray = new double[nruns];
                    for (int i = from; i < to; i++) {
                        System.arraycopy(xtest**, 0, tmparray, 0, nruns);
                        int j;
                        for (j = i; j < nrow - 2; j += 3) {
                            xymult1 = 0;
                            xymult2 = 0;
                            xymult3 = 0;
                            final double[] tmpcol1 = xtest[j];
                            final double[] tmpcol2 = xtest[j + 1];
                            final double[] tmpcol3 = xtest[j + 2];
                            for (int k = 0; k < nruns; k++) {
                                xymult1 += tmparray[k] * tmpcol1[k];
                                xymult2 += tmparray[k] * tmpcol2[k];
                                xymult3 += tmparray[k] * tmpcol3[k];
                            yupperrow[yindex++] = xymult1;
                            yupperrow[yindex++] = xymult2;
                            yupperrow[yindex++] = xymult3;
                        for (; j < nrow; j++) {
                            xymult = 0;
                            final double[] tmpcol = xtest[j];
                            for (int k = 0; k < nruns; k++) {
                                xymult += tmparray[k] * tmpcol[k];
                            yupperrow[yindex++] = xymult;
            String result = null;
            pool.submit(runnable, result);
        try {
            for (int i = 0; i < numTasks; i++) {
                String result = pool.take().get();
        } catch (final InterruptedException e) {
        } catch (final ExecutionException e) {

        return yupperrow;


It’s a little bit hard to grasp this with a single glance - it looks like the inner loop is partially “unrolled” (assuming a matrix size of 3*n) ? I wouldn’t have expected that it could not be optimized by avoiding the allocation of the ‘temp’ array, but presumably, the JIT is optimizing it quite well internally. (And this has to be admitted: It’s slighty ‘tweaked’, but certainly much easier than anything that you could find in the depths of the CUBLAS sgemm implementation…)

I did a lot of benchmarking, trying out various tweaks, and I don;t understand how the JVM works either! This was all done on the client hotspot JVM, 64 bit, Windows.

This is the wrong forum to talk about optimising Java, but thought it would be of interest to those comparing CUDA with Java - make sure you compare optimised with optimised.

The unrolling is common in BLAS, works well in Fortran and C, but it didn’t have much effect in Java, something to do with you can only execute one Java byte code at a time, whereas C/Fortran can pipeline ? I think you will find that for a pure vector vector product, raw C/Fortran optimised is significantly faster than Java optimised, but for most other operations there is less difference.

ps it doesn’t assume a size of 3*n, it cleans up after the unrolled loop. System.arraycopy is very fast. Be nice if Java had an optimised method for vector product, it is such a prevalent operation.

I assume that you evaluated all the possibilities and found the optimal solution for you in this Java Code. In different enviroments, other configurations may have advantages. I think that I already mentioned in another thread that, for example, the same benchmark with float instead of double might bring different results, and of course, everything depends on the matrix size, memory transfer overheads etc. etc…

We can talk about optimizing Java, of course, JCuda and JOCL are solely about performance. Again, they are not „silver bullets“, but there certainly are possible applications. (And alternatives: Aparapi or java-gpu are bringing Java and the GPU closer together).

But even without additional libraries, the HotSpot VM has the nice property of bringing a free improvement with every new version :slight_smile: I assume that Oracle will incorporate the best of both worlds when HotSpot and JRockit are merged, and chances are high that the latter includes improvements in terms of scalability for large numbers of threads.

FYI, here is the Jama matrix multiply code, which has temporary storage and even constructs a new object Bcolj. Colt, from a quick glance, doesn’t seem to do anything clever with matrix multiply.

  public Matrix times (Matrix B) {
      if (B.m != n) {
         throw new IllegalArgumentException("Matrix inner dimensions must agree.");
      Matrix X = new Matrix(m,B.n);
      double[][] C = X.getArray();
      double[] Bcolj = new double[n];
      for (int j = 0; j < B.n; j++) {
         for (int k = 0; k < n; k++) {
            Bcolj[k] = B.A[k][j];
         for (int i = 0; i < m; i++) {
            double[] Arowi = A**;
            double s = 0;
            for (int k = 0; k < n; k++) {
               s += Arowi[k]*Bcolj[k];
            C**[j] = s;
      return X;

This is a good resource:

The links include Volkov’s paper which gives the kernel for matrix multiply - it is surprisingly short and simple, is this really all it is, and is it this which is in CUBLAS2 ?

Is there any reason to call CUBLAS SGEMM rather than this simple kernel ?

I think even I might have a chance to adapt this kernel to the XT * X problem, if I ignore the symmetry.

DGEMM and SGEMM are fundamental because they are used extensively by other methods.


As far as I know, it’s kind of a „company secret“ which kernels are actually used in CUBLAS - maybe it’s possible to find some hints (by analyzing possible acknowledgements or forum threads - people occasionally propose new kernels in the NVIDIA forum, and some of them might have made their way into the CUDA runtime libraries). But in the end, the source code will probably not be available so soon…

One has to emphasize that CUBLAS is a BLAS implementation, and although a SGEMM can be used to emulate a matrix multiplication, it is in fact more powerful. It is not unlikely that it internally simply checks the factors, and if it finds out that the factors (alpha and beta) are 0.0 and 1.0, then forwards the computation to a „simple“ matrix multiplication instead of doing a „real“ SGEMM, but … nobody knows.

In any case, as long as the proposed kernel is not really significantly faster than the original SGEMM for a simple multiplication, there is at least one reason to use CUBLAS: It’s far simpler to call one method than to write and launch an own kernel :wink:

Apart from that: It is also not unlikely that CUBLAS uses different implementations of kernels, which are specifically optimized and tailored for the different hardware models and CUDA compute capabilities (e.g. for Fermi there may be a different kernel than for an „old“ GeForce 8xxxx). But again, nobody knows…