JCublas2 Matrix Multiplication Problem


first off, great library! However I am currently facing issues when multiplying matrices that are quite big (>144x144).

Some infos about my system:

  • Windows 7 64 bit
  • Java 7 64 Bit (1.7.0_03)
  • CUDA v4.1 (64bit) with nvidia driver 295.73 on a GTX580
  • JCUDA *0.4.1 (64bit)

Here is my code: https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/math/cuda/JCUDAMatrixUtils.java

Let me just explain it a bit to you.
I am currently developing a collaborative filtering algorithm based on ml-class.org implementation. This involves some heavy multiplication of large matrices, so I guess CUDA/CUBLAS would be a bit of improvement.
On CPU everything works as expected, but with CUDA on larger matrices I get problems.

I have written a “test-case” in the main method were I am passing my own class (DenseDoubleMatrix, row major format aka a plain double[][]) filled with random doubles. To make it a bit more easier, I passed quadratic matrices.
The whole CUDA stuff works like this: (how I think it should)
[li]transform my matrix into column major format in a single double array
[/li][li]I am going to allocate device memory for input matrix A.
[/li][li]use JCublas2.cublasSetMatrix(…) to “write” it to the device memory.
[/li][li]do the same stuff with the other matrix
[/li][li]allocate device memory for the output matrix
[/li][li]call cublasDgemm with the parameters it needs and synchronize the device
[/li][li]retrieve the result matrix with cublasGetMatrix and unfold it back to my own matrix class

This works fine. So I don’t think there is a major problem, however when I increase the size of the matrix I get sporadic NaN values in the output matrix.
The sad thing is, that no error is thrown.

I execute a multiplication on CPU with the standard 3-loop method and then execute with JCublas.
It is normal that there are some rounding errors within the solution of the CPU and the GPU, so I just took the a difference of both matrices and sum the absolute element values.
Small differences are not a big deal for me, however NaN’s are serious problems.

I have prepared you a sample output: (// are comments)

// caught the device an error?
no error
// using 2x2 results in difference of 0.0
2 0.0
no error
3 3.3306690738754696E-16
no error
4 2.7755575615628914E-16
no error
// no problems so long..
143 2.63792543364616E-10
no error
144 2.6637181349542516E-10
no error
// BAM not working anymore
145 NaN
no error
146 NaN
no error
147 NaN
no error
148 NaN
no error
149 NaN
no error
150 NaN
no error
151 NaN
no error
152 NaN
no error
153 NaN
no error
154 NaN
no error
155 NaN
// rest NaN errors omitted...

The worst thing is that it is not always breaking at 144x144, sometimes just at 244x244 or even at 312x312.
And as you can see, there are no errors recorded.
Debugging through the matrices reveals that there are only small and arbitrary amounts of elements NaN.

Do I have a hardware failure or am I missing something obvious?

Thanks, if you need additional information I be glad to hand them to you.


Nothing obvious here. I just tried it, I had to convert all occurances of Double/DOUBLE/double to Float/FLOAT/float respectively (because I’m still on an old GeForce 8800 here), and cublasDgemm to cublasSgemm, and at let it run through all sizes between 2 and 660, and finally tested it with 2000x2000, and did not receive a single NaN. I can try to run another test on a newer card with Double support as soon as I have the chance.

If you have a Visual Studio installed, you may want to adopt the simpleCUBLAS sample from NVIDIA to perform a similar test.

According to the output that you posted, it always spill out NaNs after it happened once. Is this true? (Or does it happen that it works correctly later?). A heat problem could explain the observation to some extent (especially if the GPU is overclocked). It might be the case that the card heats up during the computation of the first matrices, and at an unspecified point it is so hot that it causes errors. Just as a test: You mentioned that the error sometimes occurs at matrix sizes of >300. What happens when you run only a single test, with a size of 500x500 (that would “certainly” have caused an error otherwise) ? There are also tools for monitoring the GPU temperature. But of course, these are just guesses, because it might explain the symptoms…


Hi Marco,

these are very good points!

If you have a Visual Studio installed, you may want to adopt the simpleCUBLAS sample from NVIDIA to perform a similar test.

I have access to MSDNAA or what it is called now, I can download it. I try the simpleCUBLAS later this week then. To test this a few days ago I used the precompiled binaries shipping with the sample and it worked correctly.

Is this true? (Or does it happen that it works correctly later?)

This is actually funny, I thought that this will be a persistent state, but actually after a few matrix multiplications resulting in NaN it seems to recover.

no error
2 0.0
no error
3 2.220446049250313E-16
no error
4 6.661338147750939E-16
no error
// fine so far
206 9.311307280768233E-10
no error
207 9.489724561717594E-10
no error
208 9.606537787476555E-10
no error
209 NaN
no error
210 NaN
no error
211 NaN
no error
// NaN still persists
247 NaN
no error
248 NaN
no error
249 NaN
no error
250 NaN
no error
251 NaN
no error
252 NaN
no error
253 NaN
no error
254 NaN
no error
255 NaN
no error
256 NaN
no error
257 2.0115464849368436E-9
no error
258 2.0368560171846184E-9
no error
259 2.0725607896565634E-9
no error
260 2.0876029793726048E-9
no error
// no error until 300

What happens when you run only a single test, with a size of 500x500 (that would “certainly” have caused an error otherwise) ?

A single test seems not to be a problem. However if I run multiple tests (300) I get NaN results between iteration 160 and 245, afterwards it is going to work well.

When running a test from 150 to 3000 quadratic matrix sizes the error sporadically happens between ranges.

A heat problem could explain the observation to some extent (especially if the GPU is overclocked).

Yep, that was what I first thought as well. I am using a EVGA 580 SC (superclocked) it has slightly higher clocks

card / clock / memory / shader
580SC / 797/2025/1594
580 / 772/2004/1544

However I have no temperature problem. It is still running between 40°C and 60°C all the time, this is normal temperature and the fan keeps under 50%. I verified this thoughout 3 tools including EVGA Precision, RMClock and NVIDIA Inspector.

VRAM Usage seems to be increasing correctly according to the allocation. GPU Usage shows that the card remains quite IDLE 1-2%.

Do you think that I can DOS (in terms of Denial of Service) my graphic card?

Okay so I guess the next things to consider are:
[li]use c/c++ code to verify that the problem is not related with java/jni/jcuda
[/li][li]clock my card back to normal clocks
[/li][li]exchange doubles with floats maybe this is a bug in their double implementation

A voltage drop can be also causing the problem, but this seems to be very unlikely bceause I would have seen this on my monitoring tools.

Thanks Marco for your ideas, maybe you still have others. I try to verify the things throughout the week and tell you my results.

OK, I’ll try to run a test with Double values as soon as possible, to see whether I can reproduce the error.

I just replaced floats with doubles and am currently processing matrices from 2x2 to 2000x2000 without problems of NaN.
So I guess this seems to be a double problem.

I try to verify this by C code then.

I just ran a quick test on a GTX 560 Ti with ‘double’, and it also produces NaNs after a while (although later, at sizes of >500). I think it’s unlikely that it is related to JCuda, but you never know… The behavior is really strange, and I can not think of a profound explaination. It will be interesting to see whether the simpleCUBLAS example shows a similar behavior.

I also did some websearches yesterday, but did not find any hints that there may be a bug. The release notes of CUDA 4.0 contained an entry

  • In the previous release of the CUBLAS Library, the cublasDgemm() routine produced incorrect results in some cases when k < 32 and matrix A is transposed. This has been fixed in this release.

( http://developer.download.nvidia.com/compute/cuda/4_0/toolkit/docs/CUDA_Toolkit_Release_Notes.txt )
But this is rather unspecific, and seems not to be related to this error…

Great that you have the same problem^^

I think this is unrelated to JCuda as well, since (as far as I have seen it) just uses JNI to speak with the libraries.

I guess we should transfer this issue to the nvidia forums.

It’s not unlikely that at the NVIDIA forum, the first guess will be that it is related to JCuda :wink: Did you have the chance to adopt the simpleCUBLAS sample, so that the request at the NVIDIA forum may be supported with a compileable and testable sample? (I can try to set up such a sample, but I don’t always have access to a GPU with double support…)

Yeah I’m downloading visual studio right now.

I’ve seen a oracle forum discussion about JNI and double NaN’s

Do you think there is a problem converting from double to jdouble in jni code?

Hey Marco,

I have verified that cublas is working with doubles without problems in c++ with 2x2 to 2000x2000 matrices.

I have uploaded you my visual studio project and binary here:


You find the binary in x64\Release\cublas.exe

Hope you can verify this as well if you have access to the fermi card.

OK, I ran another short test, and can confirm that it does not seem to have this error in the native simpleCUBLAS example. I changed it to use double, and basically converted the ‘main’ into a test run where the matrix size ‘N’ is increased in a loop. I also changed the PointerMode to ‘device’, just to be sure that it does (as far as possible) the same as the JCublas example. (BTW: alpha and beta are not freed in your version)

On the one hand: Thank you for pointing out this bug. On the other hand: Sigh :frowning: I don’t even have the slightest Idea what might be the reason for this, have never before encountered such an error, have only limited possibilities for testing it with ‘double’ values at all, and debugging this will be hard because it does not really occur in a reproducable setup…

I only did another quick test with the JCublas version, and changed the PointerMode from ‘device’ to ‘host’. In this case, it ran up to a size of 1646 without an error (then it ran out of memory). This might be an indication that it might be related to the PointerMode, but I’ll have to invest more time for this and see whether I can find out what’s wrong there… -_-

Yeah, that is bad.

Well I can help you though. How did you compile these .dll’s? (JCublas2-windows-x86_64.dll)
Using the cygwin gcc or mingw?

I only did another quick test with the JCublas version, and changed the PointerMode from ‘device’ to ‘host’. In this case, it ran up to a size of 1646 without an error (then it ran out of memory).

Can you show me what I have to change to verify this? When setting the handle to


I have an problematic frame problem:

> # JRE version: 7.0_03-b05
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (22.1-b02 mixed mode windows-amd64 compressed oops)
> # Problematic frame:
> # C  [cublas64_41_28.dll+0x8eff9]

> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> j  jcuda.jcublas.JCublas2.cublasDgemmNative(Ljcuda/jcublas/cublasHandle;IIIIILjcuda/Pointer;Ljcuda/Pointer;ILjcuda/Pointer;ILjcuda/Pointer;Ljcuda/Pointer;I)I+0
> j  jcuda.jcublas.JCublas2.cublasDgemm(Ljcuda/jcublas/cublasHandle;IIIIILjcuda/Pointer;Ljcuda/Pointer;ILjcuda/Pointer;ILjcuda/Pointer;Ljcuda/Pointer;I)I+24
> j  de.jungblut.math.cuda.JCUDAMatrixUtils.multiply(Lde/jungblut/math/DenseDoubleMatrix;Lde/jungblut/math/DenseDoubleMatrix;)Lde/jungblut/math/DenseDoubleMatrix;+174
> j  de.jungblut.math.cuda.JCUDAMatrixUtils.main([Ljava/lang/String;)V+48
> v  ~StubRoutines::call_stub

> (BTW: alpha and beta are not freed in your version)

Oops, sorry. Hope to not leaked your memory with 4000 doubles :p

The DLLs are compiled with Visual Studio. I’ll have to check whether there are compile options that may influence this behavior, but as long as I don’t have an idea where the error might come from, it’s just helpless guessing…

When switching to HOST pointer mode, alpha and beta have to be host pointers:

Pointer alpha = Pointer.to(new double[]{ 1.0 });
Pointer beta = Pointer.to(new double[]{ 1.0 });
cublasDgemm(...alpha, ... beta...);

But even if it really worked (for you as well) this would not explain the behavior.

I think I’ll have to take a vacation in order to have enough time to go though all the options. The first thing would be to see where the errors occur in the matrix. During my tests, I did not compare the result to the host version, but instead, only checked for ‘NaN’ values in the matrix (which is much faster, of course) and it seemed that when the error occurred, it always was at position (0,0). Another thing I wanted to test is whether the memory really reached the GPU properly (by copying it back and comparing it to the input), but I never have seen errors with memory copies, so it’s unlikely that there is a problem (and it would not explain why it only happens with ‘double’ values).

When switching to HOST pointer mode, alpha and beta have to be host pointers:

Yep that fixes it, thanks for the explanation :wink:

That is a really strange problem, if you need help to verify a bit, tell me.

Did it also work for you with Host pointers?

Yes, it worked.

OK, it’s becoming increasingly confusing. I just tried to create a minimal sample to reproduce the error, in order to file a bug report, and in this example, the error does not seem to occur any more:

package jcuda.bugs;

import static jcuda.jcublas.JCublas2.cublasCreate;
import static jcuda.jcublas.JCublas2.cublasDestroy;
import static jcuda.jcublas.JCublas2.cublasDgemm;
import static jcuda.jcublas.JCublas2.cublasGetVector;
import static jcuda.jcublas.JCublas2.cublasSetPointerMode;
import static jcuda.jcublas.JCublas2.cublasSetVector;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_N;
import static jcuda.jcublas.cublasPointerMode.CUBLAS_POINTER_MODE_DEVICE;
import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaMalloc;
import static jcuda.runtime.JCuda.cudaMemcpy;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

import java.util.Random;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas2;
import jcuda.jcublas.cublasHandle;
import jcuda.runtime.JCuda;

public class Bug1_JCublasDgemmNaN {

	static boolean runTest(int N)
	    double h_A[];
	    double h_B[];
	    double h_C[];
	    Pointer d_A = new Pointer();
	    Pointer d_B = new Pointer();
	    Pointer d_C = new Pointer();
	    Pointer alpha = new Pointer();
	    Pointer beta = new Pointer();
	    int n2 = N * N;
	    int i;
	    cublasHandle handle = new cublasHandle();

	    System.out.printf("CUBLAS test %d running...
", N);

	    h_A = new double[n2];
	    h_B = new double[n2];
	    h_C = new double[n2];
	    Random random = new Random(0);
	    for (i = 0; i < n2; i++) {
	        h_A** = random.nextDouble();
	        h_B** = random.nextDouble();
	        h_C** = random.nextDouble();

	    cudaMalloc(d_A, n2 * Sizeof.DOUBLE);
	    cudaMalloc(d_B, n2 * Sizeof.DOUBLE);
	    cudaMalloc(d_C, n2 * Sizeof.DOUBLE);
	    cublasSetVector(n2, Sizeof.DOUBLE, Pointer.to(h_A), 1, d_A, 1);
	    cublasSetVector(n2, Sizeof.DOUBLE, Pointer.to(h_B), 1, d_B, 1);
	    cublasSetVector(n2, Sizeof.DOUBLE, Pointer.to(h_C), 1, d_C, 1);

		cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

		cudaMalloc(alpha, 1 * Sizeof.DOUBLE);
		cudaMemcpy(alpha, Pointer.to(new double[]{1.0}), 1*Sizeof.DOUBLE, cudaMemcpyHostToDevice);

		cudaMalloc(beta, 1 * Sizeof.DOUBLE);
		cudaMemcpy(beta, Pointer.to(new double[]{0.0}), 1*Sizeof.DOUBLE, cudaMemcpyHostToDevice);

	    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
	    h_C = new double[n2];
	    cublasGetVector(n2, Sizeof.DOUBLE, d_C, 1, Pointer.to(h_C), 1);

		boolean result = true;
		for (i=0; i<N*N; i++)
			if (Double.isNaN(h_C**))
				System.out.printf("Error for size %d at %d
", N, i);
				result = false;


	    return result;

	public static void main(String args[])
		for (int i=2; i<2000; i++)
			boolean result = runTest(i);
			if (result != true)


The 1:1 corresponding C implementation, just for reference:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <cmath>
#include <cuda_runtime.h>
#include <cublas_v2.h>

static int runTest(int N)
    cublasStatus_t status;
    double* h_A;
    double* h_B;
    double* h_C;
    double* d_A = 0;
    double* d_B = 0;
    double* d_C = 0;
    double *alpha;
    double *beta;
    int n2 = N * N;
    int i;
    cublasHandle_t handle;

    printf("CUBLAS test %d running...
", N);


    h_A = (double*)malloc(n2 * sizeof(h_A[0]));
    h_B = (double*)malloc(n2 * sizeof(h_B[0]));
    h_C = (double*)malloc(n2 * sizeof(h_C[0]));
    for (i = 0; i < n2; i++) {
        h_A** = rand() / (double)RAND_MAX;
        h_B** = rand() / (double)RAND_MAX;
        h_C** = rand() / (double)RAND_MAX;

    cudaMalloc((void**)&d_A, n2 * sizeof(d_A[0]));
    cudaMalloc((void**)&d_B, n2 * sizeof(d_B[0]));
    cudaMalloc((void**)&d_C, n2 * sizeof(d_C[0]));
    cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);
    cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);
    cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);

	cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

	double alphaValue = 1.0;
	cudaMalloc((void**)&alpha, 1 * sizeof(double));
	cudaMemcpy(alpha, &alphaValue, 1*sizeof(double), cudaMemcpyHostToDevice);

	double betaValue = 0.0;
	cudaMalloc((void**)&beta, 1 * sizeof(double));
	cudaMemcpy(beta, &betaValue, 1*sizeof(double), cudaMemcpyHostToDevice);

    status = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf (stderr, "!!!! kernel execution error.
        return EXIT_FAILURE;
    h_C = (double*)malloc(n2 * sizeof(h_C[0]));
    cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);

	int result = EXIT_SUCCESS;
	for (i=0; i<N*N; i++)
		if (h_C** != h_C**)
			printf("Error for size %d at %d
", N, i);
			result = EXIT_FAILURE;

    /* Memory clean up */

    return result;

int main(int argc, char** argv)
	for (int i=2; i<2000; i++)
		int result = runTest(i);
		if (result != EXIT_SUCCESS)

Can you confirm that the Java example works without producing NaNs? (This would at least give a hint how the bug may be tracked - namely by carefully looking for any possible difference to the original program where the problem sitll occurred…)