Matrix Multiply: invalid value


I am trying to multiply two matrices with the code (and matrixMul.h, matrixMul_kernel.h)

matrix1 = 4x4 dimension
matrix2 = 4x1 dimension

but it gives me an error saying that a value is invalid (cudaErrorInvalidValue). When I change the dimensions, I don't get the error.

What do you think about that?



You’re referring to the example in this thread? Note that the matrix multiplication example kernel that is used there uses a fixed BLOCK_SIZE, and the sizes of the matrices have to be a multiple of this block size. A 4x4 or 4x1 matrix does not satisfy this unless BLOCK_SIZE divides 4 and 1.
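One way to work around the fixed block size is to pad the matrices with zeros up to the next multiple of BLOCK_SIZE; the extra zeros do not change the part of the product you are interested in. A minimal sketch in plain Java, assuming a block size of 16 (the real value is whatever matrixMul.h defines) and row-major storage. The class and method names are made up for this example:

```java
// Hypothetical helper, not part of the matrixMul sample.
final class BlockPadding {
    // Assumed value; check the BLOCK_SIZE constant in matrixMul.h.
    static final int BLOCK_SIZE = 16;

    /** Rounds size up to the next multiple of blockSize. */
    static int roundUp(int size, int blockSize) {
        return ((size + blockSize - 1) / blockSize) * blockSize;
    }

    /** Copies a rows x cols row-major matrix into a zero-filled paddedRows x paddedCols matrix. */
    static float[] pad(float[] m, int rows, int cols, int paddedRows, int paddedCols) {
        float[] out = new float[paddedRows * paddedCols];
        for (int r = 0; r < rows; r++) {
            System.arraycopy(m, r * cols, out, r * paddedCols, cols);
        }
        return out;
    }

    public static void main(String[] args) {
        int rows = 4, cols = 1;
        int paddedRows = roundUp(rows, BLOCK_SIZE);
        int paddedCols = roundUp(cols, BLOCK_SIZE);
        float[] padded = pad(new float[] {1, 2, 3, 4}, rows, cols, paddedRows, paddedCols);
        System.out.println(paddedRows + "x" + paddedCols + " -> " + padded.length + " floats");
    }
}
```

After the multiplication, only the top-left rows x cols part of the padded result has to be copied out again.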

In any case, you should not assume any speedup for such small matrices. CUDA will be faster for “larger” matrices, maybe of size 128x128 (or maybe only for even larger ones). For small matrices, the CUDA version will most likely be much slower than even the most trivial Java implementation.

Apart from that: If you only want to do a plain matrix multiplication, you could consider using JCublas, which could have several advantages:

  • It will be much simpler to use than a “manual” MatrixMult using the Driver API
  • It allows matrices of arbitrary size (no fixed BLOCK_SIZE)
  • It directly delegates to CUBLAS, and one can assume that there is less overhead for the invocation - so it would probably also be faster for “relatively small” matrices.
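Whichever GPU variant you choose, it helps to have a plain Java reference to compare the results against. The following is only a sketch of such a reference; it uses column-major storage because that is the layout CUBLAS expects, and the class name is made up for this example:

```java
import java.util.Arrays;

// Hypothetical CPU reference for checking GPU results; not part of JCublas.
final class ColumnMajorMatMul {
    /**
     * Computes C (m x n) = A (m x k) * B (k x n).
     * All matrices are column-major: element (row, col) is at index row + col * rows.
     */
    static float[] multiply(float[] a, float[] b, int m, int k, int n) {
        float[] c = new float[m * n];
        for (int col = 0; col < n; col++) {
            for (int row = 0; row < m; row++) {
                float sum = 0;
                for (int i = 0; i < k; i++) {
                    sum += a[row + i * m] * b[i + col * k];
                }
                c[row + col * m] = sum;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // 4x4 matrix with 2 on the diagonal, multiplied with a 4x1 vector
        float[] a = new float[16];
        for (int i = 0; i < 4; i++) {
            a[i + i * 4] = 2;
        }
        float[] b = {1, 2, 3, 4};
        System.out.println(Arrays.toString(multiply(a, b, 4, 4, 1))); // [2.0, 4.0, 6.0, 8.0]
    }
}
```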


Hi Marco,

I am trying to test some code that takes too much time with JCuda. My code is very simple: the kernel takes two arrays and adds them. When I test it, plain Java reports "CPU MicroSecond : 19", but JCuda reports "GPU MicroSecond : 17663".
Something must be wrong, but I don't understand what. Maybe you can help me.

Thanks for your answers.

import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.*;

import java.util.concurrent.TimeUnit;

import jcuda.*;
import jcuda.utils.KernelLauncher;

public class Add {

    static int n = 512;

    // these are my arrays on host
    static float A[] = new float[n];
    static float B[] = new float[n];
    static float C[] = new float[n];

    public static void main(String[] args) {
        System.out.println("initialize the arrays ");
        for (int i = 0; i < n; i++) {
            // placeholder values; the original loop body is missing from the post
            A[i] = i;
            B[i] = i;
        }
        System.out.println("Performing with Java...");
        add(A, B, C, n);

        System.out.println(" Performing with JCUDA  ");
        with_jcuda();
    }

    public static void add(float[] A, float B[], float[] C, int n) {
        long startTime = System.nanoTime();

        for (int i = 0; i < n; i++) {
            C[i] = A[i] + B[i];
        }
        // I calculate the time
        long estimatedTime = System.nanoTime() - startTime;
        System.out.println("CPU Nanoto MicroSecond  : " + TimeUnit.MICROSECONDS.convert(estimatedTime, TimeUnit.NANOSECONDS));
    }

    public static void with_jcuda() {
        int nn = n * n;

        // My pointers on device
        Pointer d_A = new Pointer();
        Pointer d_B = new Pointer();
        Pointer d_C = new Pointer();

        // Allocate memory on the device
        cudaMalloc(d_A, Sizeof.FLOAT * n);
        cudaMalloc(d_B, Sizeof.FLOAT * n);
        cudaMalloc(d_C, Sizeof.FLOAT * n);

        // Copy values
        cudaMemcpy(d_A, Pointer.to(A),
            n * Sizeof.FLOAT, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, Pointer.to(B),
            n * Sizeof.FLOAT, cudaMemcpyHostToDevice);
        cudaMemcpy(d_C, Pointer.to(C),
            n * Sizeof.FLOAT, cudaMemcpyHostToDevice);

        final boolean forceRebuild = false;
        KernelLauncher kernelLauncher =
            KernelLauncher.create("", "add", forceRebuild);
        System.out.println("Calling the kernel...");
        kernelLauncher.setBlockSize(16, 16, 1);
        kernelLauncher.setGridSize(n / 16, 1);
        kernelLauncher.call(d_A, d_B, d_C, n);

        // Copy results
        cudaMemcpy(Pointer.to(C), d_C, Sizeof.FLOAT * n, cudaMemcpyDeviceToHost);
    }
}



and this is my file

extern "C"
__global__ void add( float* A, float* B, float* C, int n)
{
    int tid;
    // ...
}

Hi again,

I solved a few problems, but it still takes too long.

I changed


KernelLauncher kernelLauncher =
    KernelLauncher.create("", "add", forceRebuild);

    System.out.println("Calling the kernel...");
    dim3 idyThreads = new dim3(BLOCKSIZE, 1, 1);
    dim3 idyBlocks = new dim3(n / BLOCKSIZE, 1, 1);
    kernelLauncher.setup(idyBlocks, idyThreads);
    kernelLauncher.call(d_A,d_B,d_C,n);

and here is my .cu file

extern "C"
__global__ void add( float* A, float* B, float* C, int n)
{
    int tid=blockDim.x*blockIdx.x+threadIdx.x;
    C[tid] = A[tid] + B[tid];
}


I looked at the NVIDIA Programming Guide, version 2.3.1, chapter 3.
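For reference, the effect of the change can be checked with a little arithmetic. The first version used 16x16 threads per block and n/16 blocks, which for n = 512 starts far more threads than there are array elements, while the kernel index only uses the x dimension. A minimal sketch, assuming BLOCKSIZE is 256 (the post does not say which value was used); the class and method names are made up for this example:

```java
// Hypothetical helper for checking launch configurations.
final class LaunchConfig {
    /** Number of blocks needed to cover n elements, rounding up. */
    static int ceilDiv(int n, int blockSize) {
        return (n + blockSize - 1) / blockSize;
    }

    /** Total number of threads started for a block size and a 2D grid size. */
    static long totalThreads(int blockX, int blockY, int blockZ, int gridX, int gridY) {
        return (long) blockX * blockY * blockZ * gridX * gridY;
    }

    public static void main(String[] args) {
        int n = 512;
        int blockSize = 256; // assumed value of BLOCKSIZE

        // First version: setBlockSize(16, 16, 1), setGridSize(n/16, 1)
        System.out.println("2D blocks: " + totalThreads(16, 16, 1, n / 16, 1) + " threads");

        // Second version: one-dimensional blocks and grid
        int blocks = ceilDiv(n, blockSize);
        System.out.println("1D blocks: " + totalThreads(blockSize, 1, 1, blocks, 1) + " threads");
    }
}
```

Note that n / BLOCKSIZE only covers all elements when n is a multiple of BLOCKSIZE. With the rounded-up grid size from ceilDiv, the kernel would additionally need a check like if (tid < n) before accessing the arrays.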


Measuring the speedup for such a task may be difficult. I don’t see where and how you are measuring the time for CUDA, but there are several aspects to consider:

  • You should not measure the time for the setup. Creating and initializing the kernel (using the KernelLauncher) should not be taken into account when comparing the times. CUDA should not be considered as a “drop-in-replacement” for arbitrary, trivial operations.

  • For such small benchmarks, there are at least (!) two times which should be taken for the CUDA part:

  1. The time which is required including memory transfers
  2. The time which is required without memory transfers
    The memory transfers may easily become the bottleneck. That means: Copying the input data from the host to the device, and copying the results back from the device to the host, may take longer than the actual execution of the kernel. This is especially true for memory-bound kernels, and…
  • … the kernel you are executing is heavily memory-bound. That means that there is a lot of data which has to be read (namely, the two input arrays), and a lot of data to be written (namely, the output array), and the actual computation that is done with CUDA is fairly trivial: Just one addition, ‘+’. CUDA performs best when there is a lot of computation (arithmetic, trigonometry…) to be performed on a relatively small amount of data.

So for the vector addition, you will most likely not see any speedup at all. Copying the input arrays to the device and copying the result back takes much longer than simply adding the two arrays directly in Java.
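To see where the time goes, each phase can be timed separately, so that the setup and the memory transfers do not get mixed into the kernel time. The following is only a sketch in plain Java: the class name is made up, and the work inside each phase is a stand-in for the corresponding JCuda calls:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that accumulates the time spent in named phases.
final class PhaseTimer {
    private final Map<String, Long> nanos = new LinkedHashMap<>();

    /** Runs the given work and adds its duration to the given phase. */
    void time(String phase, Runnable work) {
        long start = System.nanoTime();
        work.run();
        nanos.merge(phase, System.nanoTime() - start, Long::sum);
    }

    /** Accumulated time of a phase in microseconds (0 if the phase never ran). */
    long micros(String phase) {
        return nanos.getOrDefault(phase, 0L) / 1000;
    }

    public static void main(String[] args) {
        int n = 512;
        float[] a = new float[n], b = new float[n], c = new float[n];
        float[] staged = new float[n];
        PhaseTimer timer = new PhaseTimer();

        // Stand-ins: replace the bodies with the kernel creation, the
        // host-to-device copies, the kernel call and the device-to-host copy.
        timer.time("setup", () -> { });
        timer.time("copy-in", () -> System.arraycopy(a, 0, staged, 0, n));
        timer.time("compute", () -> {
            for (int i = 0; i < n; i++) {
                c[i] = a[i] + b[i];
            }
        });
        timer.time("copy-out", () -> System.arraycopy(c, 0, staged, 0, n));

        for (String phase : new String[] {"setup", "copy-in", "compute", "copy-out"}) {
            System.out.println(phase + ": " + timer.micros(phase) + " us");
        }
    }
}
```

Comparing "compute" alone against the sum of "copy-in", "compute" and "copy-out" gives the two times mentioned in the list above.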

For the matrix multiplication, you will most likely see a speedup, but mainly for larger matrices. The explanation for this, slightly simplified to make the idea clear: Assume you want to perform a multiplication of matrices A*B=C. The number of floating point multiplications for multiplying two n x n matrices is O(n^3), while the number of values to copy is only O(n^2).

  • For matrices with size 10x10, you have to copy 300 float values (100 for each of A, B and C), and perform 10^3 = 1000 multiplications. So for each value that you copy, you make 1000/300 ≈ 3.3 multiplications.
  • For matrices with size 100x100, you have to copy 30000 float values, and perform 100^3 = 1000000 multiplications. So for each value that you copy, you make 1000000/30000 ≈ 33 multiplications. In this case, CUDA will perform better, because the time that is required for copying the data becomes small compared to the time that can be saved by making the 1 million multiplications faster.
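The general relationship is easy to compute: for n x n matrices, 3*n^2 values are copied (A, B and C) and n^3 multiplications are performed, so there are n/3 multiplications per copied value. A small sketch (the class and method names are made up for this example):

```java
// Hypothetical helper: multiplications per copied value for an n x n matrix product.
final class ArithmeticIntensity {
    /** Values to copy: A and B to the device, C back to the host. */
    static long copiedValues(long n) {
        return 3 * n * n;
    }

    /** Multiplications of the straightforward matrix product. */
    static long multiplications(long n) {
        return n * n * n;
    }

    static double multiplicationsPerValue(long n) {
        return (double) multiplications(n) / copiedValues(n);
    }

    public static void main(String[] args) {
        for (long n : new long[] {10, 100, 1000}) {
            System.out.printf("n=%d: copy %d values, %d multiplications, %.1f per value%n",
                n, copiedValues(n), multiplications(n), multiplicationsPerValue(n));
        }
    }
}
```

The ratio grows linearly with n, which is why the copying overhead matters less and less for larger matrices.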

I can try to set up a simple benchmark for vector addition (or a similar operation) which makes this point clearer.


Thanks for your reply. I thought the same, so I measured the time like this: I did all the transfers first, and then I measured only the time that the kernel function call takes.

long startTime = System.nanoTime();

// just call the kernel function
kernelLauncher.call(…);

long estimatedTime = System.nanoTime() - startTime;
System.out.println(TimeUnit.MICROSECONDS.convert(estimatedTime, TimeUnit.NANOSECONDS));
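A single measurement like this can be misleading for very short operations: on the Java side, the first run includes JIT compilation, and the first call into a native library usually includes one-time initialization. A common approach is to do a few warm-up runs and then average over several repetitions. A minimal sketch in plain Java (the class and method names are made up; the Runnable is a stand-in for the kernel call):

```java
// Hypothetical helper for averaging a timing over several runs.
final class RepeatedTiming {
    /** Runs the work 'warmups' times untimed, then returns the average microseconds over 'runs' runs. */
    static double averageMicros(Runnable work, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            work.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            work.run();
        }
        return (System.nanoTime() - start) / 1000.0 / runs;
    }

    public static void main(String[] args) {
        int n = 512;
        float[] a = new float[n], b = new float[n], c = new float[n];
        double micros = averageMicros(() -> {
            for (int i = 0; i < n; i++) {
                c[i] = a[i] + b[i];
            }
        }, 100, 1000);
        System.out.println("Average: " + micros + " us");
    }
}
```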



OK, I can try to set up a benchmark/comparison, but don’t expect any speedup for a vector addition…

I have created some “SortOfABenchmark” - it’s not really a reliable, expressive Benchmark, just a very biased and artificial example to point this out. It performs two sorts of operations on all elements of two vectors:

  • A simple vector addition: result = a+b
  • Some useless, complex computation: result = sin(a)*sin(a)+cos(a)*cos(a)+sin(b)*sin(b)+cos(b)*cos(b)
    Both operations are applied to vectors of different sizes, from about 260000 to 8.3 million elements.
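A side effect of this particular "complex" computation is that the result is easy to verify: since sin(x)^2 + cos(x)^2 = 1, every element of the result should be (close to) 2. A sketch of such a check in plain Java; the class name and the tolerance are assumptions for this example:

```java
// Hypothetical verification helper for the "complex" computation.
final class ComplexCheck {
    /** The per-element computation, done in double precision and cast to float. */
    static float complex(float a, float b) {
        return (float) (Math.sin(a) * Math.sin(a) + Math.cos(a) * Math.cos(a)
                      + Math.sin(b) * Math.sin(b) + Math.cos(b) * Math.cos(b));
    }

    /** Returns whether all elements are within the tolerance of the expected value 2. */
    static boolean allCloseToTwo(float[] result, float tolerance) {
        for (float r : result) {
            if (Math.abs(r - 2.0f) > tolerance) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        float[] result = new float[1024];
        for (int i = 0; i < result.length; i++) {
            result[i] = complex(i, i);
        }
        // 1e-4f is an assumed tolerance; single-precision GPU results may need a looser one
        System.out.println("All close to 2: " + allCloseToTwo(result, 1e-4f));
    }
}
```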

The result for the largest vector is the following on my machine:

Running with 8388608 elements...
                    [ms] Duration:    Average:
            java-simple:        29          29
           jcuda-simple:         2           2
   jcuda-simple-withMem:        52          52
           java-complex:     42317       42317
          jcuda-complex:       111         111
  jcuda-complex-withMem:       163         163

This roughly means that the simple vector addition with plain Java (29 ms) is about twice as fast as when it is done with JCuda including the memory transfer time (52 ms), although the kernel alone only takes 2 ms. However, the complex operation takes 42 seconds with Java, whereas with JCuda it takes 163 milliseconds.

Again: This is far from an objective or realistic result, and should be taken with the appropriate grain of salt, but shows that in general the main advantages of JCuda show up when there’s a lot of computing work to be done.
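Expressed as speedup factors, the numbers from the table above look like this. The sketch only divides the measured times; the class name is made up for this example:

```java
// Hypothetical helper that turns the measured times into speedup factors.
final class Speedup {
    /** Speedup of the candidate over the baseline; values below 1 mean a slowdown. */
    static double speedup(double baselineMs, double candidateMs) {
        return baselineMs / candidateMs;
    }

    public static void main(String[] args) {
        // Times in ms for 8388608 elements, taken from the table
        System.out.printf("simple, kernel only:   %.1f%n", speedup(29, 2));
        System.out.printf("simple, with memory:   %.2f%n", speedup(29, 52));
        System.out.printf("complex, kernel only:  %.1f%n", speedup(42317, 111));
        System.out.printf("complex, with memory:  %.1f%n", speedup(42317, 163));
    }
}
```

For the simple addition, the factor including memory transfers is below 1, so JCuda is slower; for the complex computation it is roughly 260.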

import static jcuda.driver.JCudaDriver.*;
import jcuda.*;
import jcuda.driver.*;
import jcuda.utils.*;

// Lines marked "(reconstructed)" were missing from the post and are filled in from context.
public class SortOfABenchmark
{
    // A kernel performing a simple addition
    private static final String sourceCodeSimple =
        "extern \"C\"" + "\n" +
        "__global__ void compute(float *result, float *a, float *b)" + "\n" +
        "{" + "\n" +
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;" + "\n" +
        "    result[i] = a[i] + b[i];" + "\n" +
        "}";

    // A kernel performing a fairly useless but complex computation
    private static final String sourceCodeComplex =
        "extern \"C\"" + "\n" +
        "__global__ void compute(float *result, float *a, float *b)" + "\n" +
        "{" + "\n" +
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;" + "\n" +
        "    result[i] = " + "\n" +
        "        sin(a[i])*sin(a[i])+cos(a[i])*cos(a[i]) + " + "\n" +
        "        sin(b[i])*sin(b[i])+cos(b[i])*cos(b[i]);" + "\n" +
        "}";

    private static KernelLauncher kernelSimple;
    private static KernelLauncher kernelComplex;

    public static void main(String args[])
    {
        // Prepare the KernelLaunchers for the simple and the complex kernel
        System.out.println("Preparing the KernelLaunchers...");
        kernelSimple = KernelLauncher.compile(sourceCodeSimple, "compute");
        kernelComplex = KernelLauncher.compile(sourceCodeComplex, "compute");
        System.out.println("Preparing the KernelLaunchers... DONE");

        // Run the test with different input sizes
        for (int blocks = 1024; blocks <= 32768; blocks*=2)
        {
            int size = 256 * blocks;
            System.out.println("Running with "+size+" elements...");

            float result[] = new float[size];
            float a[] = new float[size];
            float b[] = new float[size];
            for (int i=0; i<size; i++)
            {
                a[i] = i;
                b[i] = i;
            }

            // Run the simple computation with Java
            Timer.startTimer("java-simple");   // (reconstructed)
            javaSimple(a, b, result);          // (reconstructed)
            Timer.stopTimer("java-simple");    // (reconstructed)

            // Run the simple computation with JCuda
            jcuda("jcuda-simple", kernelSimple, a, b, result);

            // Run the complex computation with Java
            Timer.startTimer("java-complex");  // (reconstructed)
            javaComplex(a, b, result);         // (reconstructed)
            Timer.stopTimer("java-complex");   // (reconstructed)

            // Run the complex computation with JCuda
            jcuda("jcuda-complex", kernelComplex, a, b, result);

            // Print the timing table (reconstructed; assumes Timer.prettyPrint())
            Timer.prettyPrint();
        }
    }

    private static void javaSimple(float a[], float b[], float result[])
    {
        for (int i=0; i<a.length; i++)
        {
            result[i] = a[i]+b[i];
        }
    }

    private static void javaComplex(float a[], float b[], float result[])
    {
        for (int i=0; i<a.length; i++)
        {
            result[i] = (float)(Math.sin(a[i])*Math.sin(a[i])+Math.cos(a[i])*Math.cos(a[i]) +
                Math.sin(b[i])*Math.sin(b[i])+Math.cos(b[i])*Math.cos(b[i]));
        }
    }

    private static void jcuda(String name, KernelLauncher kernelLauncher, float a[], float b[], float result[])
    {
        // Time everything including the memory transfers
        // (reconstructed; the exact placement in the original is unknown)
        Timer.startTimer(name+"-withMem");

        // Allocate the device memory and copy the input
        // data to the device
        int size = a.length;
        CUdeviceptr dResult = new CUdeviceptr();
        cuMemAlloc(dResult, size * Sizeof.FLOAT);
        CUdeviceptr dA = new CUdeviceptr();
        cuMemAlloc(dA, size * Sizeof.FLOAT);
        cuMemcpyHtoD(dA, Pointer.to(a), size * Sizeof.FLOAT);
        CUdeviceptr dB = new CUdeviceptr();
        cuMemAlloc(dB, size * Sizeof.FLOAT);
        cuMemcpyHtoD(dB, Pointer.to(b), size * Sizeof.FLOAT);

        // Call the kernel
        int gridSize = size / 256;
        kernelLauncher.setGridSize(gridSize, 1);
        kernelLauncher.setBlockSize(256, 1, 1);
        Timer.startTimer(name);
        kernelLauncher.call(dResult, dA, dB);
        Timer.stopTimer(name);                 // (reconstructed)

        // Copy the result from the device to the host
        cuMemcpyDtoH(Pointer.to(result), dResult, size * Sizeof.FLOAT);
        Timer.stopTimer(name+"-withMem");      // (reconstructed)

        // Clean up (reconstructed)
        cuMemFree(dA);
        cuMemFree(dB);
        cuMemFree(dResult);
    }
}

Thanks for your help;

Best Regards