Problem with Inconsistent Calculation

Hi,

I’ve written a simple matrix multiplication kernel:

 
```
#include <stdio.h>

__global__ void KernelMul(float *A, float *B, float *C, int ar, int ac, int br, int bc)
{
	float Cvalue = 0;
	int row = blockIdx.x * blockDim.x + threadIdx.x;
	int col = blockIdx.y * blockDim.y + threadIdx.y;
	if(row > ar || col > ac) return;
	for (int e = 0; e < ac; ++e) {
		Cvalue += A[row * ac + e] * B[e * bc + col];
		printf("%f\n", Cvalue);
	}
	C[row * bc + col] = Cvalue;
}
```

and the Java Code using this is

```
public static float[] mul(float[] hostInputA, float[] hostInputB, int ar, int ac, int br, int bc) {
		JCudaDriver.setExceptionsEnabled(true);
		
		String ptxFileName = "src\\Matrix\\MatrixMul.ptx";

        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load the ptx file.
        CUmodule module = new CUmodule();
        cuModuleLoad(module, ptxFileName);

        // Obtain a function pointer to the "KernelMul" function.
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "KernelMul");
		
		CUdeviceptr deviceInputA = new CUdeviceptr();
        cuMemAlloc(deviceInputA, hostInputA.length * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceInputA, Pointer.to(hostInputA),
        		hostInputA.length * Sizeof.FLOAT);
        CUdeviceptr deviceInputB = new CUdeviceptr();
        cuMemAlloc(deviceInputB, hostInputB.length * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceInputB, Pointer.to(hostInputB),
        		hostInputB.length * Sizeof.FLOAT);
        
        /*CUdeviceptr deviceInputar = new CUdeviceptr();
        cuMemAlloc(deviceInputar, Sizeof.INT);
        cuMemcpyHtoD(deviceInputar, Pointer.to(new int[]{ar}), Sizeof.INT);
        CUdeviceptr deviceInputac = new CUdeviceptr();
        cuMemAlloc(deviceInputac, Sizeof.INT);
        cuMemcpyHtoD(deviceInputac, Pointer.to(new int[]{ac}), Sizeof.INT);
        CUdeviceptr deviceInputbr = new CUdeviceptr();
        cuMemAlloc(deviceInputbr, Sizeof.INT);
        cuMemcpyHtoD(deviceInputbr, Pointer.to(new int[]{br}), Sizeof.INT);
        CUdeviceptr deviceInputbc = new CUdeviceptr();
        cuMemAlloc(deviceInputbc, Sizeof.INT);
        cuMemcpyHtoD(deviceInputbc, Pointer.to(new int[]{bc}), Sizeof.INT);*/

        // Allocate device output memory
        CUdeviceptr deviceOutput = new CUdeviceptr();
        cuMemAlloc(deviceOutput, hostInputA.length * Sizeof.FLOAT);
        
        // Set up the kernel parameters: A pointer to an array
        // of pointers which point to the actual values.
        Pointer kernelParameters = Pointer.to(
            Pointer.to(deviceInputA),
            Pointer.to(deviceInputB),
            Pointer.to(deviceOutput),
            Pointer.to(new int[]{ar}),
            Pointer.to(new int[]{ac}),
            Pointer.to(new int[]{br}),
            Pointer.to(new int[]{bc})
        );
        
        // Call the kernel function.
        int blockSizeX = 16;
        int blockSizeY = 16;
        int gridSizeX = (bc + blockSizeX - 1) / blockSizeX; // the +blockSizeX-1 makes the integer division round up
        int gridSizeY = (ar + blockSizeY - 1) / blockSizeY;
        //System.out.println(gridSizeX);
        //System.out.println(gridSizeY);
        JCudaDriver.cuCtxSetLimit(CUlimit.CU_LIMIT_PRINTF_FIFO_SIZE, 4096);
        cuLaunchKernel(function,
            gridSizeX,  gridSizeY, 1,      // Grid dimension
            blockSizeX, blockSizeY, 1,      // Block dimension
            0, null,               // Shared memory size and stream
            kernelParameters, null // Kernel- and extra parameters
        );
        int k = cuCtxSynchronize();
        System.out.println(CUresult.stringFor(k)); // k is a driver API result code
        
        /*cudaError_t cudaerr = cudaDeviceSynchronize();
        if (cudaerr != CUDA_SUCCESS)
            printf("kernel launch failed with error \"%s\".\n",
                   cudaGetErrorString(cudaerr));*/
        
        // Allocate host output memory and copy the device output
        // to the host.
        float hostOutput[] = new float[hostInputA.length];
        cuMemcpyDtoH(Pointer.to(hostOutput), deviceOutput,
        		hostInputA.length * Sizeof.FLOAT);
        
        cuMemFree(deviceInputA);
        cuMemFree(deviceInputB);
        cuMemFree(deviceOutput);
        
        return hostOutput;
	}
```

When multiplying the 4 x 4 matrix with all 1s, the result is either correct, i.e. ((4,4,4,4),(4,4,4,4),(4,4,4,4),(4,4,4,4)), or a wrong one like ((4,4,4,4),(4,4,4,4),(3,4,4,4),(4,4,4,4)).
It seems to be random whether it miscalculates or not, so how could I debug a problem like this?
Also, I have tested a few of the samples, and they seem to work.

I hope someone can help.

Nachtschicht :smiley:

That’s basically caused by a small “typo”: The line
`if(row > ar || col > ac) return;`
should be
`if(row >= ar || col >= ac) return;`
With ar = ac = 4, a thread with row == 4 or col == 4 passes the original test, and then reads from A and B (and writes to C) past the end of the arrays - which is exactly the kind of bug that only shows up sometimes.

In general, detecting bugs like this one is a bit inconvenient (compared to a fully integrated debugger in an IDE that natively supports CUDA). But at least, when running the program with `cuda-memcheck java -cp bin;jcuda.jar MatMul`, as described on jcuda.org - Debugging, the result will indicate that there is something wrong in the kernel:


Exception in thread "main" jcuda.CudaException: CUDA_ERROR_LAUNCH_FAILED
        at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:312)
        at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:1907)
        at MatMul.mul(MatMul.java:114)
        at MatMul.main(MatMul.java:36)
========= Invalid __global__ read of size 4
=========     at 0x000000d0 in KernelMul
=========     by thread (4,2,0) in block (0,0,0)
=========     Address 0x00250040 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x13b) [0xb06b]

When the PTX file is compiled with `-lineinfo`, it will also print the line where the invalid read happened, as described on the site linked above.
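(Such a compilation call could roughly look like `nvcc -ptx -lineinfo src\Matrix\MatrixMul.cu -o src\Matrix\MatrixMul.ptx` - assuming the kernel source file is named MatrixMul.cu, analogously to the PTX path used in the code above.)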

BTW: For a “real” application, you would not code a matrix multiplication on your own, but use CUBLAS / JCublas instead. But I assume that this is only intended for basic tests and practicing, and a matrix multiplication (ranging from a first, basic version like the current one to a more “sophisticated” one that uses shared memory) is probably a good starting point for this.
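Just to give an idea of what that would look like: a rough, untested sketch using the simple JCublas API, for the special case of square n x n matrices given as row-major float arrays (the class name here is made up):

```
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas;

public class JCublasMulSketch
{
    // Computes C = A * B for square n x n matrices, given as
    // row-major float arrays, using the simple JCublas API
    public static float[] mul(float[] A, float[] B, int n)
    {
        JCublas.cublasInit();

        // Allocate device memory and copy the input matrices
        Pointer dA = new Pointer();
        Pointer dB = new Pointer();
        Pointer dC = new Pointer();
        JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dA);
        JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dB);
        JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dC);
        JCublas.cublasSetVector(n * n, Sizeof.FLOAT, Pointer.to(A), 1, dA, 1);
        JCublas.cublasSetVector(n * n, Sizeof.FLOAT, Pointer.to(B), 1, dB, 1);

        // CUBLAS expects column-major data. Passing the operands in the
        // order (B, A) computes C^T = B^T * A^T in column-major terms,
        // which is exactly the row-major C = A * B.
        JCublas.cublasSgemm('n', 'n', n, n, n, 1.0f, dB, n, dA, n, 0.0f, dC, n);

        // Copy the result back and clean up
        float[] C = new float[n * n];
        JCublas.cublasGetVector(n * n, Sizeof.FLOAT, dC, 1, Pointer.to(C), 1);
        JCublas.cublasFree(dA);
        JCublas.cublasFree(dB);
        JCublas.cublasFree(dC);
        JCublas.cublasShutdown();
        return C;
    }
}
```

The same operand-swapping trick works for non-square matrices as well, with the appropriate m, n, k and leading dimensions.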

Yeah, I was working a bit late. Now I’m wondering: are you German, or is “Nachtschicht” a normal word to use in English?

Actually, I want to use these calculations for neural networks, and I am not sure whether enough functions are provided to add, multiply, multiply by a scalar (x*M with x in R), multiply element-wise (M1*M1, M2*M2), and apply a sigmoid function to each element (e.g. 1/(1+e^(-x))). Also, all of these should expect the same input type: float arrays, or a Matrix type, or some other equivalent type.

Cheers.

I’m German.

Applying these “bulk” operations like add/multiply/scale to vectors (or analogously, matrices) is in fact rather different from a matrix multiplication.

For the matrix multiplication, there are many tricks in order to make it fast - primarily the use of shared memory, and, related to that, applying the algorithm in a “block-wise” fashion with appropriate block sizes in order to maximize occupancy and all that. (When you have the chance: Take a look at the CUBLAS source code, and you’ll see that this is far from trivial.) So if you want real (fast) BLAS Level 3 operations like the matrix multiplication, you should probably use CUBLAS/JCublas for these.

Some of the “bulk” operations are covered by the BLAS Level 1 operations. But the implementation of these is rather straightforward: You simply launch a kernel with the desired grid and block size, and there’s not much that you can do wrong. Therefore, the jcuda-vec utilities might come in handy here. The BenchmarkVecDoubleVsOwnKernel sample on GitHub coincidentally shows how you could implement the logistic function with jcuda-vec (it is a tad slower than a dedicated kernel, but more convenient). A sketch of such a dedicated kernel launch follows below.
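For the “dedicated kernel” route, the host side is short as well. A minimal sketch, assuming that an element-wise kernel with the signature `__global__ void sigmoid(float *x, float *y, int n)` has been compiled into a file called sigmoid.ptx (both names are made up here), and that the context setup has already been done as in the mul method above:

```
import static jcuda.driver.JCudaDriver.*;

import jcuda.Pointer;
import jcuda.driver.*;

public class SigmoidSketch
{
    // Applies the (hypothetical) element-wise "sigmoid" kernel to the
    // n elements in deviceX, writing the results to deviceY. Assumes
    // that cuInit and the context creation have already been done.
    public static void sigmoid(CUdeviceptr deviceX, CUdeviceptr deviceY, int n)
    {
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "sigmoid.ptx");
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "sigmoid");

        Pointer kernelParameters = Pointer.to(
            Pointer.to(deviceX),
            Pointer.to(deviceY),
            Pointer.to(new int[]{n})
        );

        // One thread per element; the grid size is rounded up so that
        // all n elements are covered when n is not a multiple of 256
        int blockSize = 256;
        int gridSize = (n + blockSize - 1) / blockSize;
        cuLaunchKernel(function,
            gridSize, 1, 1,     // Grid dimension
            blockSize, 1, 1,    // Block dimension
            0, null,            // Shared memory size and stream
            kernelParameters, null);
        cuCtxSynchronize();
    }
}
```

All the other element-wise operations (add, scale, multiply element-wise) follow exactly the same launch pattern, with only the kernel itself differing.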