Matrix Multiplication with KernelLauncher

I tried to alter the KernelLauncherSample to multiply matrices, but the output is not what I expected:
Preparing the KernelLauncher…
GPU Device used = GeForce 9300M GS
Creating input data…
Initializing device memory…
Calling the kernel…
Obtaining results…
Result[0][0] = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Result[1][0] = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Result[2][0] = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Result[3][0] = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Result[4][0] = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Result[5][0] = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Result[6][0] = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Result[7][0] = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
BUILD SUCCESSFUL (total time: 1 second)

Source code in:
http://hotfile.com/dl/54876973/f068132/KernelLauncherSample.java.html

Hello,

I have moved this to its own thread, since it seemed not to be related to the one where it was posted originally.

I’m currently not at a CUDA-capable PC, but if necessary, I can test this in more detail on Sunday or Monday. For now, I had a short look at the source code… It looks strange… The Pointer#to method creates a new Pointer, which in case of a function call like

            JCuda.cudaMalloc(Pointer.to(deviceX[i]), Sizeof.FLOAT * TILE_WIDTH);

will be thrown away directly after the function call.

Additionally, it is not completely clear to me what you intended to do, since you are only using matrices of size TILE_WIDTH, and somehow attempted to create … Pointers for each row? :confused:

The kernel expects memory to be given as 1D arrays. If you have a Matrix like

float matrixA[][] = new float[TILE_WIDTH][TILE_WIDTH];

then you could probably (!) copy it to device memory and pass it to the kernel like this (untested) :

// Allocate memory on the device, for the whole matrix
Pointer deviceMatrixA = new Pointer();
JCuda.cudaMalloc(deviceMatrixA, Sizeof.FLOAT * TILE_WIDTH * TILE_WIDTH);

// Copy memory from host to device
for (i = 0; i < TILE_WIDTH; i++)
{
    // Copy a single row from the host matrix[][] into the
    // device memory. The Pointer which describes the 
    // appropriate location in device memory has an offset
    // of TILE_WIDTH*i float elements from the beginning
    // of the matrix on the device
    Pointer deviceRow = deviceMatrixA.withByteOffset(Sizeof.FLOAT *  TILE_WIDTH * i);
    JCuda.cudaMemcpy(deviceRow, Pointer.to(matrixA[i]), Sizeof.FLOAT * TILE_WIDTH, cudaMemcpyKind.cudaMemcpyHostToDevice);
}

kernelLauncher.call(Pointer.to(deviceMatrixA), ....);

Hope that helps.

bye

Yeah, that was the idea… creating a pointer to each row and then copying it to device memory, since we cannot do Pointer.to(int[][]).

The Kernel expects the data as a float* anyhow (and not as a float**, which would roughly be a “2D array”). So, as far as I understood your intention, you only have to copy each row from the host array[][] into the right place of the device[] array.

I tried the solution that you gave and I think it worked. Why “think”? Well, because I tried to multiply the example on this page http://www.intmath.com/Matrices-determinants/4_Multiplying-matrices.php which states that
| 8  9 |   | -2  3 |   |  20  24 |
| 5 -1 | x |  4  0 | = | -14  15 |

I used a kernel function from the SDK examples (matrixMulDrv) and, curiously, I get slightly different values in a C program and in a Java program:
C results:
Initial Values… :
M [0][0] = 8.0e+00 M [0][1] = 9.0e+00
M [1][0] = 5.0e+00 M [1][1] = -1.0e+00

N [0][0] = -2.0e+00 N [0][1] = 3.0e+00
N [1][0] = 4.0e+00 N [1][1] = 0.0e+00
GPU Processing time: 0.123000 (ms)

Final Result… :
P [0][0] = 3.555865 P [0][1] = 3.497642
P [1][0] = 3.497642 P [1][1] = 4.456623

Java results:
Preparing the KernelLauncher…
GPU Device used = GeForce 9300M GS
Creating input data…
Initializing device memory…
Calling the kernel…
Obtaining results…
Row[0] = [3.5558653, 3.497642]
Row[1] = [4.4566226, 3.8795402]

C program:
http://hotfile.com/dl/55067381/c9c72fd/functions_Kernel.cu.html
http://hotfile.com/dl/55067434/62156f1/multiplicationMatrix.cu.html

Java program:
http://hotfile.com/dl/55067533/3f280e2/KernelLauncherSample.java.html
(cubin file compiled from the file functions_kernel.cu above)

OK, if this problem persists, I’ll have closer look at it on Sunday or Monday when I’m back at my development-PC.

tks

Hello

As far as I can see, the Kernel is not intended for multiplying general matrices. It is only intended for (and working with) matrices that have a size that is a multiple of the BLOCK_SIZE. So introducing a TILE_SIZE and trying to multiply 2x2 matrices will most likely not work. As the comment in the NVIDIA sample file says: For general matrix multiplication (with high performance) you should consider using the CUBLAS library. Or JCublas, in this case :wink:
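For the record, here is a rough (untested) sketch of how such a multiplication could look with JCublas; the matrix size n and the host arrays hostA, hostB and hostC are just placeholders, and CUBLAS expects the data in column-major order:

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas;
...
// Initialize JCublas
JCublas.cublasInit();

// Allocate device memory for the three n*n matrices
Pointer dA = new Pointer();
Pointer dB = new Pointer();
Pointer dC = new Pointer();
JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dA);
JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dB);
JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dC);

// Copy the input matrices (column-major float[] arrays) to the device
JCublas.cublasSetVector(n * n, Sizeof.FLOAT, Pointer.to(hostA), 1, dA, 1);
JCublas.cublasSetVector(n * n, Sizeof.FLOAT, Pointer.to(hostB), 1, dB, 1);

// C = 1.0 * A * B + 0.0 * C
JCublas.cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

// Copy the result back and clean up
JCublas.cublasGetVector(n * n, Sizeof.FLOAT, dC, 1, Pointer.to(hostC), 1);
JCublas.cublasFree(dA);
JCublas.cublasFree(dB);
JCublas.cublasFree(dC);
JCublas.cublasShutdown();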

Additionally, the setup for the kernel was not correct. In the original NVIDIA sample, the kernel is set up as

cuFuncSetBlockShape( matrixMul, BLOCK_SIZE, BLOCK_SIZE, 1 );
cuFuncSetSharedSize( matrixMul, 2*BLOCK_SIZE*BLOCK_SIZE*sizeof(float) );
cuLaunchGrid( matrixMul, WC / BLOCK_SIZE, HC / BLOCK_SIZE );

You have to specify the appropriate block size, shared memory size and grid size in the kernel launcher as well:

kernelLauncher.setBlockSize(BLOCK_SIZE, BLOCK_SIZE, 1);
kernelLauncher.setSharedMemSize(2*BLOCK_SIZE*BLOCK_SIZE*Sizeof.FLOAT);
kernelLauncher.setGridSize(WC / BLOCK_SIZE, HC / BLOCK_SIZE);

I have created an example summarizing these changes, and some minor cleanups and generalizations.

This example assumes the original „matrixMul_kernel.cu“ from the NVIDIA sample to be present. If you wish, you may replace this with your pre-compiled CUBIN file.

import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.*;

import java.util.*;

import jcuda.*;
import jcuda.driver.JCudaDriver;

public class KernelLauncherMatrixMult
{
    private static Random random = new Random(0);
    
    public static void main(String args[])
    {
        // Thread block size
        final int BLOCK_SIZE = 16;

        // Matrix dimensions
        // (chosen as multiples of the thread block size for simplicity)
        final int WA = (3 * BLOCK_SIZE); // Matrix A width
        final int HA = (5 * BLOCK_SIZE); // Matrix A height
        final int WB = (8 * BLOCK_SIZE); // Matrix B width
        final int HB = WA;  // Matrix B height
        final int WC = WB;  // Matrix C width 
        final int HC = HA;  // Matrix C height
        
        JCudaDriver.setExceptionsEnabled(true);

        // Prepare the kernel
        System.out.println("Preparing the KernelLauncher...");
        final boolean forceRebuild = false;
        KernelLauncher kernelLauncher = 
            KernelLauncher.create("matrixMul_kernel.cu", "matrixMul", forceRebuild);
        
        // Create the input data
        System.out.println("Creating input data...");
        float matrixA[][] = createRandomFloatArray(WA, HA);
        float matrixB[][] = createRandomFloatArray(WB, HB);
        float matrixC[][] = new float[HC][WC];

        // Allocate the device memory
        Pointer deviceMatrixA = new Pointer();
        cudaMalloc(deviceMatrixA, Sizeof.FLOAT * HA * WA);
        Pointer deviceMatrixB = new Pointer();
        cudaMalloc(deviceMatrixB, Sizeof.FLOAT * HB * WB);
        Pointer deviceMatrixC = new Pointer();
        cudaMalloc(deviceMatrixC, Sizeof.FLOAT * HC * WC);

        // Copy memory from host to device
        copyToDevice(deviceMatrixA, matrixA);
        copyToDevice(deviceMatrixB, matrixB);
                
        // Call the kernel
        System.out.println("Calling the kernel...");
        kernelLauncher.setBlockSize(BLOCK_SIZE, BLOCK_SIZE, 1);
        kernelLauncher.setGridSize(WC / BLOCK_SIZE, HC / BLOCK_SIZE);
        kernelLauncher.setSharedMemSize(2*BLOCK_SIZE*BLOCK_SIZE*Sizeof.FLOAT);
        kernelLauncher.call(deviceMatrixC, deviceMatrixA, deviceMatrixB, WA, WB);

        // Copy the result from the device to the host
        copyFromDevice(matrixC, deviceMatrixC);
        
        // Compute the reference solution and compare the results
        float matrixCref[] = new float[WC*HC];
        matrixMulHost(matrixCref, to1D(matrixA), to1D(matrixB), HA, WA, WB);
        boolean passed = equalNorm1D(to1D(matrixC), matrixCref);
        System.out.println(passed ? "PASSED" : "FAILED");

        // Only enable this flag for small matrices!
        final boolean printResults = false;
        if (printResults)
        {
            // Print the computation and the results
            System.out.println("matrixA:");
            System.out.println(toString2D(matrixA));
    
            System.out.println("matrixB:");
            System.out.println(toString2D(matrixB));
    
            System.out.println("matrixC:");
            System.out.println(toString2D(matrixC));
    
            System.out.println("matrixC ref:");
            System.out.println(toString2D(matrixCref, WC));
        }

        // Clean up
        cudaFree(deviceMatrixA);
        cudaFree(deviceMatrixB);
        cudaFree(deviceMatrixC);
    }
    
    /**
     * Copies the data from the given array to the given device pointer.
     * 
     * @param device The pointer to copy to 
     * @param array The array to copy from
     */
    private static void copyToDevice(Pointer device, float array[][])
    {
        int rowSizeBytes = Sizeof.FLOAT * array[0].length;
        for (int i = 0; i < array.length; i++)
        {
            Pointer deviceRow = device.withByteOffset(rowSizeBytes * i);
            cudaMemcpy(deviceRow, Pointer.to(array[i]), 
                rowSizeBytes, cudaMemcpyHostToDevice);
        }
    }

    /**
     * Copies the data from the given device pointer to the given array.
     * 
     * @param array The array to copy to
     * @param device The pointer to copy from 
     */
    private static void copyFromDevice(float array[][], Pointer device)
    {
        int rowSizeBytes = Sizeof.FLOAT * array[0].length;
        for (int i = 0; i < array.length; i++)
        {
            Pointer deviceRow = device.withByteOffset(rowSizeBytes * i);
            cudaMemcpy(Pointer.to(array[i]), deviceRow, 
                rowSizeBytes, cudaMemcpyDeviceToHost);
        }
    }
    
    /**
     * Matrix multiplication for computing the reference
     * 
     * @param C The result matrix
     * @param A Matrix A
     * @param B Matrix B
     * @param hA Height of A
     * @param wA Width of A
     * @param wB Width of B
     */
    static void matrixMulHost(float C[], float A[], float B[], int hA, int wA, int wB)
    {
        for (int i = 0; i < hA; i++)
        {
            for (int j = 0; j < wB; j++)
            {
                double sum = 0;
                for (int k = 0; k < wA; k++)
                {
                    double a = A[i * wA + k];
                    double b = B[k * wB + j];
                    sum += a * b;
                }
                C[i * wB + j] = (float)sum;
            }
        }
    }

    
    //=== Some utility functions =============================================
    public static boolean equalNorm1D(float a[], float b[])
    {
        if (a.length != b.length)
        {
            return false;
        }
        float errorNorm = 0;
        float refNorm = 0;
        for (int i = 0; i < a.length; i++) 
        {
            float diff = a[i] - b[i];
            errorNorm += diff * diff;
            refNorm += a[i] * a[i];
        }
        errorNorm = (float)Math.sqrt(errorNorm);
        refNorm = (float)Math.sqrt(refNorm);
        return (errorNorm / refNorm < 1e-6f);
    }
    private static float[][] createRandomFloatArray(int w, int h)
    {
        float result[][] = new float[h][w];
        for (int i=0; i<h; i++)
        {
            for (int j=0; j<w; j++)
            {
                result[i][j] = random.nextFloat();
            }
        }
        return result;
    }
    private static float[] to1D(float array[][])
    {
        float result[] = new float[array.length*array[0].length];
        int index = 0;
        for (int i=0; i<array.length; i++)
        {
            for (int j=0; j<array[i].length; j++)
            {
                result[index++] = array[i][j];
            }
        }
        return result;
    }
    public static String toString2D(float a[][])
    {
        StringBuilder sb = new StringBuilder();
        for (int i=0; i<a.length; i++)
        {
            sb.append(toString1D(a[i]));
            sb.append("\n");
        }
        return sb.toString();
    }
    public static String toString1D(float a[])
    {
        StringBuilder sb = new StringBuilder();
        for (int i=0; i<a.length; i++)
        {
            sb.append(String.format(Locale.ENGLISH, "%6.3f  ", a[i]));
        }
        return sb.toString();
    }
    public static String toString2D(float[] a, int columns)
    {
        StringBuilder sb = new StringBuilder();
        for (int i=0; i<a.length; i++)
        {
            if (i>0 && i % columns == 0)
            {
                sb.append("
");
            }
            sb.append(String.format(Locale.ENGLISH, "%6.3f  ", a**));
        }
        return sb.toString();
    }
    
}

thanks :slight_smile:

There is one thing that I never understood, which is how I should set up the block dimensions, grid dimensions and number of threads to fit my problem. How should I think about it? For instance, I did a simple example in C for reducing a vector (adding up its elements), and if I called the kernel function like kernelFunc<<<1, 1, 1>>>(args) it wouldn’t work… but if I changed the number of blocks to (at least) the size of the vector to be reduced, it worked…

That heavily depends on your kernel. Namely, on how the indices are computed from the built-in variables, threadIdx, blockDim, blockIdx and gridDim.

You probably know this image:

In this image, the first grid has 3x2 blocks. Each block has 5x3 threads. So there are (3*5) x (2*3) = 15x6 threads in the grid. This could be used for computing something on a matrix of size 15x6. Inside the kernel function, you can access each element of the matrix by computing its index like this:


int indexX = threadIdx.x + blockDim.x*blockIdx.x;
int indexY = threadIdx.y + blockDim.y*blockIdx.y;
float value = matrix[indexX+indexY*matrixWidth];

As an example: If you want to multiply all elements of a vector with 2, then you could use


__global__ void mul(float *vector, int size)
{
    for (int i=0; i<size; i++) vector[i] *= 2;
}

and launch this with
mul<<<1,1>>>(vector, size);
That is, with a single thread doing all the work - and this would be horribly slow, of course :slight_smile:

It would be slightly better to use a kernel like that:


__global__ void mul(float *vector, int size)
{
    int index = threadIdx.x;
    vector[index] *= 2;
}

which could be called with
mul<<<1,size>>>(vector, size);
In this case, at least all threads of a single block would do the multiplication in parallel. But you can only have a limited number of threads per block (512 on current devices, I think), so you could not use it for larger vectors.

The most general solution would be to write


__global__ void mul(float *vector, int size)
{
    int index = threadIdx.x + blockDim.x*blockIdx.x;
    if (index < size)
    {
        vector[index] *= 2;
    }
}

and call it with


int numThreadsPerBlock = 256;
int numBlocksPerGrid = (size + numThreadsPerBlock - 1) / numThreadsPerBlock; // size/numThreadsPerBlock, rounded up
mul<<<numBlocksPerGrid,numThreadsPerBlock>>>(vector, size);

In this case, 256 threads are running in each block. The first block is computing the result for the first elements 0-255, the second one for the elements 256-511 …
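With the KernelLauncher from this thread, the same launch configuration could (untested) look roughly like this; the file name „mul_kernel.cu“ and the kernel name „mul“ are only placeholders for your own kernel:

int size = 100000;
Pointer deviceVector = new Pointer();
cudaMalloc(deviceVector, Sizeof.FLOAT * size);

int numThreadsPerBlock = 256;
// size / numThreadsPerBlock, rounded up
int numBlocksPerGrid = (size + numThreadsPerBlock - 1) / numThreadsPerBlock;

KernelLauncher launcher = KernelLauncher.create("mul_kernel.cu", "mul", false);
launcher.setBlockSize(numThreadsPerBlock, 1, 1);
launcher.setGridSize(numBlocksPerGrid, 1);
launcher.call(deviceVector, size);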

For more complex algorithms, you additionally have to consider using shared memory and take care of memory coalescing. The NVIDIA sample „reduction“ contains a document „reduction.pdf“ which is a step-by-step description of these optimizations, but … I also don’t really understand all of that :o

That simply was a huge help.

And yes, some of the optimizations that I had the „pleasure“ to analyse are more or less unclear, to say the least. Now I’m trying to understand how to use the cuBLAS lib… ’cause if there’s already a way to multiply matrices (for instance), why not use it? :slight_smile:

Have you ever used, or do you know, the high-level “thrust” library? http://code.google.com/p/thrust/

It seems a good way to ease programmers work…

I have not used it myself, but I know that there is a strong demand for more… high-level approaches to parallel programming. Unfortunately, there is not (yet) a high level library for CUDA on Java. When I started with JCuda, I quickly realized that with „pure“ low-level CUDA, some programming tasks are tedious, repetitive and simply inconvenient. I intended to write a nice, Object-Oriented abstraction layer. And… admittedly… the „KernelLauncher“ mainly consists of a few fragments of code that resulted from my first tests in this direction. Then OpenCL came up, and I also intended to create an OO-layer on top of JOCL, to simplify the development, but Michael Bien (JOCL) and Olivier Chafik (JavaCL) had been faster :wink:

So I postponed the development of an OO layer for JCuda, but I did not completely drop this idea. In fact, I recently thought about a more general approach: Creating an OO-library which offers some parallel primitives (similar to the functions offered by thrust: Sorting, Scans/Reductions, Permutations etc.), and whose implementations may be exchanged arbitrarily (using Java, CUDA or OpenCL). I read a little bit about vector machine architectures and primitives and languages for parallel programming, but this probably is nothing that I can do solely in my spare time - so most likely, someone else will be faster again :wink:

Can you help me here?
If I do:
retCublas = cublasGetMatrix(MATRIX_SIZE, MATRIX_SIZE, sizeof(int), &dResult, MATRIX_SIZE * MATRIX_SIZE, result, MATRIX_SIZE * MATRIX_SIZE);
I get “cublasGetMatrix = MAPPING_ERROR”! But if I do:
retCublas = cublasGetMatrix(MATRIX_SIZE, MATRIX_SIZE, sizeof(int), dResult, MATRIX_SIZE * MATRIX_SIZE, result, MATRIX_SIZE * MATRIX_SIZE);
I get “cublasGetMatrix = cudaSuccess”… however, I simply cannot access the result matrix.

If I try to work around the problem without cuBLAS, I still cannot work with the result matrix:
//retCuda = cudaMemcpy2D(result, MATRIX_SIZE*sizeof(int), dResult, MATRIX_SIZE, MATRIX_SIZE*sizeof(int), MATRIX_SIZE*sizeof(int), cudaMemcpyDeviceToHost);
//retCuda = cudaMemcpy((void*)&result, (void*)&dResult, MATRIX_SIZE * MATRIX_SIZE, cudaMemcpyDeviceToHost);

Because I get the error “cudaMemcpy Error = cudaErrorInvalidValue”

all code in here: http://hotfile.com/dl/56659454/1eaf471/multiplicationMatrixCublas.cpp.html

Allocating a 1D array and accessing it in this form as a 2D array will most likely not work in general. Apart from that, you seem to be mixing and casting int* and float* arbitrarily. You should have a look at the “simpleCUBLAS” sample from NVIDIA, which does exactly what you intended to do here: A SGEMM using CUBLAS. A simplified version (without error checks) and a comparison of the CUBLAS program to the JCublas version is also shown on the JCublas main site.

Yes, I am mixing casts of int* and float* on purpose, because of the CUBLAS function headers. I want to multiply int matrices, not float matrices… and I already looked at simpleCUBLAS and saw that the example there works with float*, not int*.

The other thing is, I’m not allocating a 1D array… I’m using
cublasAlloc (MATRIX_SIZE * MATRIX_SIZE, sizeof(int), (void**)&dM);

which accepts “cublasAlloc (int n, int elemSize, void **devicePtr)”… so I don’t see how it’s a 1D vector…

Um … -_- that’s most likely not going to work. Casting an int* pointer to a float* will not convert the individual array elements. It will only cast the pointer itself. In CUBLAS, the elements will be interpreted as float values, and it will perform floating-point operations on these “bits” that should actually be ints. The results will be a lot of meaningless numbers and many "NaN"s.

As an example: When you do something like


int i = 0x00001122; // int value, decimal: 4386
int *ip = &i; // Pointer to int
float *fp = (float*)ip; // Cast pointer to float
float f = *fp; // Obtain float value from pointer

then the result will not be the float value 4386.0f, but the float value of the bit pattern that corresponds to the Hex value “0x00001122”, namely 6.14e-42 …
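The same effect can be reproduced in plain Java with Float#intBitsToFloat, which reinterprets the raw bits just like the pointer cast above:

// Reinterpret the raw bits of an int as a float, analogous to the C pointer cast
int i = 0x00001122;                  // decimal 4386
float f = Float.intBitsToFloat(i);   // approximately 6.14e-42, not 4386.0f
System.out.println(f);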

So to multiply int matrices, you have the alternatives of converting them to float matrices (element by element), or going back to your own matrix multiplication kernel which may multiply int matrices…
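The first alternative is trivial; a minimal sketch (the helper name toFloatArray is only illustrative) could look like this:

// Convert a flattened int matrix to float, element by element,
// so that CUBLAS can operate on it
private static float[] toFloatArray(int intData[])
{
    float result[] = new float[intData.length];
    for (int i = 0; i < intData.length; i++)
    {
        result[i] = intData[i];
    }
    return result;
}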

[QUOTE=wepawetmose]
Other thing is i’m not allocating 1D array… i’m using
cublasAlloc (MATRIX_SIZE * MATRIX_SIZE, sizeof(int), (void**)&dM);
which accepts “cublasAlloc (int n, int elemSize, void **devicePtr)”… so I don’t see how it’s in 1D vector…[/QUOTE]

The pointer that is passed to cublasAlloc is a pointer to the place where the pointer to the allocated elements should be stored - that is, a pointer to a pointer. cublasAlloc will internally create a new pointer to the allocated elements, and this pointer is stored at the location described by the given pointer.
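For comparison, in JCuda/JCublas this pointer-to-a-pointer is hidden behind the Pointer object; a minimal sketch (using Sizeof.FLOAT, since CUBLAS operates on floats):

// cublasAlloc writes the device address into the given Pointer,
// which is then passed to the other CUBLAS calls
Pointer dM = new Pointer();
JCublas.cublasAlloc(MATRIX_SIZE * MATRIX_SIZE, Sizeof.FLOAT, dM);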

Hi there… I’ve just tried to write the simplest kernel function in C to reduce (sum up) a vector… honestly, just to start making good kernels. To get there, I have to start somewhere… :stuck_out_tongue:

/// Kernel function
extern "C" __global__ void reductionVector(float* d_vector, float* oVector, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ float result;
    result = 0;
    if (i < size)
        result += d_vector[i];

    *oVector = result;
}

I know that this code is a mess… but first I want it functional (the logic) :stuck_out_tongue: The outcome is „3.31904e-39“… and it doesn’t matter whether the vector has 100 or 1,000 elements. Obviously, something is wrong with this first kernel version. Do you know why? I’ve tried a little of everything just to make it work :stuck_out_tongue:

Hello

The __shared__ memory is shared among all threads of a block. So if you had multiple blocks, you would also have multiple ‚result‘ values.

But even if you only have a single block: You can not write to the same shared memory from multiple threads. In fact, two or more threads will try to write data into the ‚result‘ at the same time, which will result in a huge mess :slight_smile: I think this could be done by using „Atomic functions“ (Section B.11 of the Programming Guide) : They make sure that there is always a consistent state written to the respective memory location. But admittedly, I have not yet used these functions, mainly because they have a severe drawback: If multiple threads try to access the same memory location, then the atomic functions will force them to do the access sequentially - so in your example, the threads would add up the elements one after another - no parallelism at all -_-

There is a whitepaper on reduction (PDF file) on the NVIDIA samples website. For me this was one of the best resources to (at least partially :o ) understand the difficulties and „magic tweaks“ of CUDA programming, although I’m only a beginner as well. The whitepaper shows 7 (seven!) possible implementations, starting with a „naive“ implementation, and successively optimizing it… admittedly, I was not able to follow all these steps, but nevertheless, you might also find it interesting.

bye
Marco