Matrix multiplication results differ between CPU and GPU

Hello, I’m using JCuda and JCublas (version 9.2) for matrix operations.
The Jama library computes on the CPU,
the JCublas library computes on the GPU.

In theory, the two matrix multiplication results should be the same,
but they are different.

Do you know why they differ,
and how to fix it (i.e. make the results the same)?

Thank you,

Here is my test code (I used a 3x3 matrix):

import Jama.Matrix;
import jcuda.*;
import jcuda.jcublas.*;

public class MatrixMultiplicationTest {

    public static void main(String[] args) {

        double alpha = 1.0;
        double beta = 0.0;

        double[][] dummy = { { 9, 8, 7 },
                             { 6, 5, 4 },
                             { 3, 2, 1 } };

        // Identity matrix
        double[][] e = { { 1, 0, 0 },
                         { 0, 1, 0 },
                         { 0, 0, 1 } };

        int row = dummy.length;
        int col = dummy[0].length;

        // h_C starts as the identity matrix, in column-major order
        double h_C[] = new double[row * col];
        h_C[0] = 1;
        h_C[1] = 0;
        h_C[2] = 0;
        h_C[3] = 0;
        h_C[4] = 1;
        h_C[5] = 0;
        h_C[6] = 0;
        h_C[7] = 0;
        h_C[8] = 1;

        /* Initialize JCublas */
        JCublas.cublasInit();

        // "dummy" in column-major order
        double h_M[] = new double[row * col];
        h_M[0] = 9;
        h_M[1] = 6;
        h_M[2] = 3;
        h_M[3] = 8;
        h_M[4] = 5;
        h_M[5] = 2;
        h_M[6] = 7;
        h_M[7] = 4;
        h_M[8] = 1;

        /* Create pointers for the device matrices */
        Pointer d_M = new Pointer();
        Pointer d_C = new Pointer();

        /* Allocate device memory for the matrices */
        JCublas.cublasAlloc(row * col, Sizeof.DOUBLE, d_M);
        JCublas.cublasAlloc(row * col, Sizeof.DOUBLE, d_C);

        /* Initialize the device matrices with the host matrices */
        JCublas.cublasSetVector(row * col, Sizeof.DOUBLE, Pointer.to(h_M), 1, d_M, 1);
        JCublas.cublasSetVector(row * col, Sizeof.DOUBLE, Pointer.to(h_C), 1, d_C, 1);

        Matrix matrixM = new Matrix(dummy);
        Matrix matrixC = new Matrix(e);

        int count = 0;
        while (count < 34) {

            // Note: d_C is used both as an input and as the output here
            JCublas.cublasDgemm('n', 'n', 3, 3, 3, alpha, d_C, 3, d_M, 3, beta, d_C, 3);

            matrixC = matrixC.times(matrixM);
            count++;
        }

        /* Read the result back */
        JCublas.cublasGetVector(row * col, Sizeof.DOUBLE, d_C, 1, Pointer.to(h_C), 1);

        double[] x = matrixC.getColumnPackedCopy();

        // Print the element-wise differences between the CPU and GPU results
        System.out.println(x[0] - h_C[0] + " ");
        System.out.println(x[3] - h_C[3] + " ");
        System.out.println(x[6] - h_C[6] + " ");

        System.out.println(x[1] - h_C[1] + " ");
        System.out.println(x[4] - h_C[4] + " ");
        System.out.println(x[7] - h_C[7] + " ");

        System.out.println(x[2] - h_C[2] + " ");
        System.out.println(x[5] - h_C[5] + " ");
        System.out.println(x[8] - h_C[8] + " ");

        /* Memory clean up */
        JCublas.cublasFree(d_M);
        JCublas.cublasFree(d_C);

        /* Shutdown */
        JCublas.cublasShutdown();
    }
}
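For reference, here is a minimal sketch of what this update computes on the CPU side: a plain-Java column-major DGEMM (C = alpha·A·B + beta·C). The class and method names are just for illustration; note that it writes to a separate output array instead of reusing an input array in place.

```java
import java.util.Arrays;

public class SimpleDgemm {

    // Column-major GEMM: out = alpha * A * B + beta * out
    // A is m x k (leading dimension lda), B is k x n (ldb), out is m x n (ldc)
    static void dgemm(int m, int n, int k, double alpha,
                      double[] A, int lda, double[] B, int ldb,
                      double beta, double[] out, int ldc) {
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < m; i++) {
                double sum = 0;
                for (int p = 0; p < k; p++) {
                    sum += A[i + p * lda] * B[p + j * ldb];
                }
                out[i + j * ldc] = alpha * sum + beta * out[i + j * ldc];
            }
        }
    }

    public static void main(String[] args) {
        double[] M = { 9, 6, 3, 8, 5, 2, 7, 4, 1 }; // "dummy", column-major
        double[] C = { 1, 0, 0, 0, 1, 0, 0, 0, 1 }; // identity, column-major
        double[] out = new double[9];
        dgemm(3, 3, 3, 1.0, C, 3, M, 3, 0.0, out, 3);
        // Identity times M reproduces M itself
        System.out.println(Arrays.toString(out));
    }
}
```

Running one iteration like this against the cuBLAS result is a quick way to check that both sides agree on the matrix layout and operand order before chasing numerical differences.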

JAMA: JAMA: Java Matrix Package
I referred to: Floating Point and IEEE 754 :: CUDA Toolkit Documentation

Is it only a small „machine epsilon error“, like 1e-40 or so, or are the results completely wrong?

(That is: What is the output of the program in your case?)


I just tried it out (with the latest version 11.4.1), and added an output

System.out.println(Arrays.toString(x));
System.out.println(Arrays.toString(h_C));

and it printed

0.0 
0.0 
0.0 
0.0 
0.0 
0.0 
0.0 
0.0 
0.0 
[6.428680341468594E40, 4.1251216960126337E40, 1.8215630505566785E40, 5.419829662200302E40, 3.4777677129188016E40, 1.5357057636373047E40, 4.41097898293201E40, 2.830413729824969E40, 1.2498484767179306E40]
[6.428680341468594E40, 4.1251216960126337E40, 1.8215630505566785E40, 5.419829662200302E40, 3.4777677129188016E40, 1.5357057636373047E40, 4.41097898293201E40, 2.830413729824969E40, 1.2498484767179306E40]

An aside: The JCublas class is the „legacy CUBLAS API“, as described at cuBLAS :: CUDA Toolkit Documentation. You might consider using JCublas2 instead. An example of an SGEMM with the CUBLAS (v2) API is at jcuda-samples/JCublas2Sample.java at master · jcuda/jcuda-samples · GitHub


Thanks for your answer! In my case the CPU and GPU values were different.
Here is the differing result (I attached a link).

And now I’m guessing that this is a floating point problem, specifically that floating point addition is not associative.
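The non-associativity guess can be demonstrated in a few lines of plain Java; the values below are just an illustration of how a different summation order changes the rounded result:

```java
public class FpAssociativity {
    public static void main(String[] args) {
        // Summing the same three values in a different order
        // gives different double results:
        double a = 1e16, b = -1e16, c = 1.0;
        double left  = (a + b) + c;  // 0.0 + 1.0          -> 1.0
        double right = a + (b + c);  // b + c rounds to b  -> 0.0
        System.out.println(left);          // 1.0
        System.out.println(right);         // 0.0
        System.out.println(left == right); // false
    }
}
```

A GPU GEMM typically accumulates partial sums in a different order (and possibly with different fused multiply-add behavior) than a sequential CPU loop, so last-bit differences like this are expected.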

Additional question: Is JCublas2 more accurate than JCublas?
If it is, why?

I think that there should not be any difference in accuracy between CUBLAS and CUBLAS v2. It is only a slightly different API, and CUBLAS v2 (i.e. JCublas2) offers more functionality. If you only need DGEMM, this may not matter.


As Robert Crovella pointed out on Stack Overflow, there was a small issue with the original code, but that has been fixed now.

But as he also mentioned: such small differences are „normal“, and something that one has to cope with when doing floating point computations. Based on the image that you posted there, the error (i.e. the difference between the values) seems to be roughly on the order of 0.000000000000000000000000000000000000000000000000000000001. This is just an epsilon error, and nearly impossible to avoid in general.
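A common way to cope with such epsilon errors is to compare the results with a relative tolerance instead of exact equality. A minimal sketch, where `approxEquals` and the `1e-12` tolerance are illustrative choices, not part of any library:

```java
public class ToleranceCompare {

    // Returns true when x and y differ by at most relTol,
    // relative to the larger magnitude of the two values.
    static boolean approxEquals(double x, double y, double relTol) {
        double diff = Math.abs(x - y);
        double scale = Math.max(Math.abs(x), Math.abs(y));
        return diff <= relTol * scale;
    }

    public static void main(String[] args) {
        double cpu = 6.428680341468594E40;          // value from the output above
        double gpu = cpu + Math.ulp(cpu);           // hypothetical one-ulp discrepancy
        System.out.println(cpu == gpu);                    // false
        System.out.println(approxEquals(cpu, gpu, 1e-12)); // true
    }
}
```

For values around 1e40, even a one-ulp difference is an absolute error of roughly 1e24, so an absolute threshold is useless here; a relative one scales with the magnitude of the results.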

(It might even depend on the GPU that you are using: It might be that your GPU does not even support double computations, and the computations have to be emulated based on float…)
