cuLaunchKernel throws CUDA_ERROR_INVALID_VALUE

I am trying to do some matrix multiplication with JCuda, but I am stuck on this error:

Exception in thread "main" jcuda.CudaException: CUDA_ERROR_INVALID_VALUE
	at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:359)
	at jcuda.driver.JCudaDriver.cuLaunchKernel(JCudaDriver.java:17119)
	at oroarmor.jcuda.JCudaMatrixMultiply.main(JCudaMatrixMultiply.java:77)

My main code:

public class JCudaMatrixMultiply {
	public static void main(String[] args) {
		// Enable exceptions and omit all subsequent error checks
		JCudaDriver.setExceptionsEnabled(true);

		// Initialize the driver and create a context for the first device.
		cuInit(0);
		CUdevice device = new CUdevice();
		cuDeviceGet(device, 0);
		CUcontext context = new CUcontext();
		cuCtxCreate(context, 0, device);

		// Create the PTX file by calling the NVCC
		String ptxFileName = JCudaSamplesUtils.preparePtxFile("src/main/resources/kernels/matrixMultiply.cu");

		// Load the ptx file.
		CUmodule module = new CUmodule();
		cuModuleLoad(module, ptxFileName);

		// Obtain a function pointer to the "multiply" function.
		CUfunction function = new CUfunction();
		cuModuleGetFunction(function, module, "multiply");

		int n = 1 << 10;
		int mSize = n * n;

		System.out.println(mSize);

		int[] hostA = new int[mSize];
		int[] hostB = new int[mSize];

		for (int i = 0; i < mSize; i++) {
			hostA[i] = 1;
			hostB[i] = 1;
		}

		CUdeviceptr deviceAPtr = new CUdeviceptr();
		cuMemAlloc(deviceAPtr, mSize * Sizeof.INT);
		cuMemcpyHtoD(deviceAPtr, Pointer.to(hostA), mSize * Sizeof.INT);

		CUdeviceptr deviceBPtr = new CUdeviceptr();
		cuMemAlloc(deviceBPtr, mSize * Sizeof.INT);
		cuMemcpyHtoD(deviceBPtr, Pointer.to(hostB), mSize * Sizeof.INT);

		CUdeviceptr deviceO = new CUdeviceptr();
		cuMemAlloc(deviceO, mSize * Sizeof.INT);

		Pointer kernelParameters = Pointer.to(Pointer.to(new int[] { n }), Pointer.to(deviceAPtr),
				Pointer.to(deviceBPtr), Pointer.to(deviceO));

//		int blockSizeX = 512;
//		int gridSizeX = (int) Math.ceil((double) n / blockSizeX);

		cuLaunchKernel(
				function, 2, 2, 1, // Grid dimension
				512, 512, 1, // Block dimension
				0, null, // Shared memory size and stream
				kernelParameters, null // Kernel- and extra parameters
		);
		cuCtxSynchronize();

		int[] hostO = new int[mSize];

		for (int i = 0; i < mSize; i++) {
			hostO[i] = -1;
		}
		cuMemcpyDtoH(Pointer.to(hostO), deviceO, mSize * Sizeof.INT);

		for (int i = 0; i < n / (1 << 7); i++) {
			System.out.print("[ ");
			for (int j = 0; j < n / (1 << 7); j++) {
				System.out.print(hostO[i * n + j] + ", ");
			}
			System.out.print(" ] [ ");
			for (int j = 0; j < n / (1 << 7); j++) {
				int sum = 0;
				for (int k = 0; k < n; k++) {
					sum += hostA[i * n + k] * hostB[j + k * n];
				}

				System.out.print(sum + ", ");
			}
			System.out.println(" ]");
		}

		System.out.println("finished");
	}

}

My kernel:

extern "C"
__global__ void multiply(int n, int *a, int *b, int *sum)
{

    int i = blockIdx.x * blockDim.x + threadIdx.x; //output x
    int j = blockIdx.y * blockDim.y + threadIdx.y; //output y

	int product = 0;
	
    if (i < n && j < n)
    {
    	for(int k = 0; k < n; k++){
        	product += a[i + k * n] * b[j * n + k];
        }
    }
	sum[i+j*n] = 3;
}

JCudaSamplesUtils.preparePtxFile(fileName) is taken from the JCuda sample code. What am I doing wrong, and how can I fix it? Thank you!

There is a limit on the maximum number of threads that one block can contain. You can run the JCudaDeviceQueryV8 sample (or the native NVIDIA CUDA deviceQuery executable) to obtain this information. It prints a lot of details, but the first few rows are the most relevant here. For my GTX 970, it prints:

Found 1 devices
Device 0: GeForce GTX 970 with Compute Capability 5.2
    Maximum number of threads per block                  : 1024
    Maximum x-dimension of a block                       : 1024
    Maximum y-dimension of a block                       : 1024
    Maximum z-dimension of a block                       : 64
    Maximum x-dimension of a grid                        : 2147483647
    Maximum y-dimension of a grid                        : 65535
    Maximum z-dimension of a grid                        : 65535
    Maximum shared memory per thread block in bytes      : 49152
    ...

Note that the maximum dimension of a block in the x- or y-direction is 1024. But the first entry, “Maximum number of threads per block”, is also 1024, and this limit applies to the product of all three block dimensions.

So for a 2D kernel, this basically means that you could use block sizes like the following:

 1 * 1024
 2 * 512
 4 * 256
 8 * 128
16 * 64
32 * 32

But not 512 * 512, as in your case: that would be 262144 threads per block, far beyond the limit, which is why cuLaunchKernel rejects the launch with CUDA_ERROR_INVALID_VALUE.

So if you assume a maximum block size of 32 * 32, you could compute appropriate block and grid sizes like this:

int blockSize = 32;
int gridSize = (int) Math.ceil((double) n / blockSize);
cuLaunchKernel(
    function, gridSize, gridSize, 1, // Grid dimension
    blockSize, blockSize, 1, // Block dimension
    0, null, // Shared memory size and stream
    kernelParameters, null // Kernel- and extra parameters
);

(The grid size can also be computed with

int gridSize = (n + blockSize - 1) / blockSize;

but the ceil may be more readable here…)


In a “real” application, you could/should determine the device limits at runtime: query cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK to obtain the maximum number of threads per block. For a 2D kernel, the block size in each dimension could then be the square root of this value (which will usually be a power of 2); see the sketch below. But I assume that this program is mainly meant to get you started with CUDA. If you really only wanted a matrix multiplication, you’d likely use JCublas instead (even though it only operates on float, and not int, as in your case).
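For illustration, a minimal sketch of that runtime query, reusing the device and n from the code above (Integer.highestOneBit is just one way to round down to a power of two):

int[] maxThreadsPerBlock = { 0 };
cuDeviceGetAttribute(maxThreadsPerBlock,
		CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, device);

// For a square 2D block: the square root of the limit, rounded down
// to a power of two (1024 -> 32)
int blockSize = Integer.highestOneBit((int) Math.sqrt(maxThreadsPerBlock[0]));
int gridSize = (n + blockSize - 1) / blockSize;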
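And if you do switch to JCublas, here is a minimal sketch of what that could look like, following the structure of the JCublas samples (the class name is just for illustration, and the data is float instead of int):

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas;

public class JCublasMatrixMultiply {
	public static void main(String[] args) {
		int n = 1 << 10;
		float[] hostA = new float[n * n];
		float[] hostB = new float[n * n];
		float[] hostC = new float[n * n];
		java.util.Arrays.fill(hostA, 1.0f);
		java.util.Arrays.fill(hostB, 1.0f);

		JCublas.cublasInit();

		// Allocate device memory and copy the input matrices
		Pointer dA = new Pointer();
		Pointer dB = new Pointer();
		Pointer dC = new Pointer();
		JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dA);
		JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dB);
		JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dC);
		JCublas.cublasSetVector(n * n, Sizeof.FLOAT, Pointer.to(hostA), 1, dA, 1);
		JCublas.cublasSetVector(n * n, Sizeof.FLOAT, Pointer.to(hostB), 1, dB, 1);

		// C = 1.0 * A * B + 0.0 * C (CUBLAS uses column-major storage)
		JCublas.cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

		JCublas.cublasGetVector(n * n, Sizeof.FLOAT, dC, 1, Pointer.to(hostC), 1);
		System.out.println(hostC[0]); // with all-ones inputs, every entry should be 1024.0

		JCublas.cublasFree(dA);
		JCublas.cublasFree(dB);
		JCublas.cublasFree(dC);
		JCublas.cublasShutdown();
	}
}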