[JCUDA, JCUBLAS] How to synchronize the 'associative rule' of the GPU to the CPU's? (Or any other solutions)

Hello, this is my second question; I want to share this problem with all JCuda users.

So the question is: How can I synchronize the 'associative rule' of the GPU with the CPU's?

As I understand it, floating-point arithmetic does not guarantee associativity.
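
For example, even a tiny Java snippet that has nothing to do with JCuda shows this (the concrete values 0.1, 0.2, 0.3 are just an illustration):

public class AssociativityDemo {
    public static void main(String[] args) {
        double left  = (0.1 + 0.2) + 0.3;  // typically 0.6000000000000001
        double right = 0.1 + (0.2 + 0.3);  // typically 0.6
        System.out.println(left == right); // false: the grouping changes the result
        System.out.println(left - right);  // a tiny, but non-zero, difference
    }
}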

So I ran a simple experiment using the JCublas dot product.
Here is the test code (tested with JCuda & JCublas 11.4.1, CUDA 11.4.1, on an RTX 2070 Super):

import jcuda.*;
import jcuda.jcublas.*;

public class JCudaTest {

    public static void main(String[] args) {

        int total_len = 10000;

        double[] a = new double[total_len];
        double[] b = new double[total_len];

        // Fill the test vectors: a[i] = i^2, b[i] = i^3
        for (int i = 0; i < total_len; i++) {
            double init = i;
            a[i] = init * init;
            b[i] = init * init * init;
        }

        // Reference dot product, accumulated sequentially on the CPU
        double c = 0.0;
        for (int i = 0; i < a.length; i++) {
            c += a[i] * b[i];
        }

        Pointer d_A = new Pointer();
        Pointer d_B = new Pointer();

        /* Initialize JCublas */
        JCublas.cublasInit();

        /* The host vectors */
        double[] h_A = a;
        double[] h_B = b;

        /* Allocate device memory for the vectors */
        JCublas.cublasAlloc(h_A.length, Sizeof.DOUBLE, d_A);
        JCublas.cublasAlloc(h_B.length, Sizeof.DOUBLE, d_B);

        /* Copy the host vectors to the device */
        JCublas.cublasSetVector(h_A.length, Sizeof.DOUBLE, Pointer.to(h_A), 1, d_A, 1);
        JCublas.cublasSetVector(h_B.length, Sizeof.DOUBLE, Pointer.to(h_B), 1, d_B, 1);

        /* Perform the dot product using JCublas */
        double d = JCublas.cublasDdot(h_A.length, d_A, 1, d_B, 1);

        /* Memory clean up */
        JCublas.cublasFree(d_A);
        JCublas.cublasFree(d_B);

        /* Shutdown */
        JCublas.cublasShutdown();

        // Difference between the CPU and the GPU result
        System.out.println(c - d);
    }
}

The CPU result and the GPU result were different.
[screenshot of the program output]

Here is an image illustrating the different association orders on the CPU and the GPU:

Of course, if the vector length is only 4, we can make the GPU follow the same association order as the CPU, for example by allocating more space for the GPU's vector operation.
But if the vector length is very large (over 1,000,000), how can this be solved?

Thank you for reading, and if you have any solutions to this problem, please let me know! :slight_smile:

If you are computing something with floating point, you know the result. If you compute it two times, you don’t.

  1. It is, generally speaking, a bad idea to compute something twice, especially with floating point. Single source of truth - Wikipedia is best practice; avoid redundancy at all times.
  2. The error is negligible. I don’t see how this potentially impacts anything in the universe but I am just an engineer. I don’t know what you math-guys are doing :smiley:
  3. You could calculate your numbers with integers instead: instead of 1.2 + 1.2 you do 12 + 12 and then divide by 10 (see the sketch after this list).
  4. Have a look at fixed point arithmetic and arbitrary-precision arithmetic if CUDA supports any of that.
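
Here is a minimal host-side sketch of the scaled-integer idea from point 3 (plain Java, an arbitrary scale factor of 10, no overflow handling; it only illustrates the principle and is not a JCublas feature):

public class FixedPointDotSketch {
    public static void main(String[] args) {
        long scale = 10;               // one decimal digit of precision
        long[] a = { 12, 25, 37 };     // represents 1.2, 2.5, 3.7
        long[] b = { 40, 11,  9 };     // represents 4.0, 1.1, 0.9

        long acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc += a[i] * b[i];        // exact integer arithmetic, order-independent
        }

        // Each product carries a factor of scale * scale, so divide it out at the end
        System.out.println(acc / (double) (scale * scale)); // 10.88
    }
}

Because integer addition is associative, the result does not depend on the summation order, but you trade this for a fixed precision and a limited value range.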

Floating Point and IEEE 754 :: CUDA Toolkit Documentation (nvidia.com)


As TMII already said (and as it was already mentioned in the other thread): The exact result depends on the algorithm, and usually such small inaccuracies are not a problem. Specifically, the dot product computation on the GPU usually involves a "scan", where sub-results are computed by the different GPU cores and then accumulated.

Some parts of this may be implemented similarly to what is described in Chapter 39. Parallel Prefix Sum (Scan) with CUDA - although I'm mainly pointing to that link to say: it's complicated, and the fact that floating-point computations are not perfectly associative will always cause small differences in the result.

Usually, these small differences are negligible, though. (And to point that out: the result that is computed with the CPU is also not "correct". It is just "wrong in a different way". That doesn't tell you much…)
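
Just to illustrate this on the Java side: the following is a minimal sketch that only mimics the association order of a pairwise, GPU-style reduction (it is not the actual CUBLAS implementation) and compares it to the plain sequential loop:

import java.util.Random;

public class ReductionOrderSketch {
    public static void main(String[] args) {
        int n = 1 << 20;
        double[] x = new double[n];
        Random random = new Random(0);
        for (int i = 0; i < n; i++) {
            x[i] = random.nextDouble();
        }

        // Sequential left-to-right accumulation, like the CPU loop in the original post
        double sequential = 0.0;
        for (int i = 0; i < n; i++) {
            sequential += x[i];
        }

        // Pairwise (tree-shaped) reduction, roughly the order that a parallel
        // reduction on the GPU might use
        double[] work = x.clone();
        for (int length = n; length > 1; length = (length + 1) / 2) {
            int half = (length + 1) / 2;
            for (int i = 0; i < length / 2; i++) {
                work[i] += work[i + half];
            }
        }
        double pairwise = work[0];

        // Both are valid sums of the same numbers, but they usually differ slightly
        System.out.println(sequential - pairwise);
    }
}

Both results are equally "correct" sums of the same input; the difference only reflects the different grouping of the additions.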


Yes, I see now that the CPU result is also not exact… I really appreciate your kind and detailed explanation. Thanks again!

This week I reviewed JCublas2SgemmExSample.java from your GitHub 'jcuda-samples' repository.

I have a question about cublasGemmAlgo:
What is it?
Is it about the algorithm used for the multiplication and addition?

And does your library JCublas support fixed point (e.g., a long or integer method such as cublasIntgemm or cublasLonggemm)?

Thank you for the comment!

  1. The test values are all integers (no fractional 0.xxxx part), but, you know… JCublas supports float and double.
  2. Yeah, I'm continuously watching it… lol

JCublas offers only the functions that are also offered by CUBLAS, and this means that it only supports float and double. (I considered trying to implement support for half, but that's a different story.)

Regarding cublasGemmAlgo: This is the same as described in the cuBLAS :: CUDA Toolkit Documentation. I assume that there are, for example, optimized GEMM versions for the case that a matrix is square, or that alpha or beta have special values (like 0.0 or 1.0). But the exact differences between the algorithms are not documented. On newer GPUs, one should probably just use CUBLAS_GEMM_DEFAULT.
