About a better use of resources

Hi again!

I’ve written a program that works with one-dimensional arrays.
Normally I need about 500 threads, so when I want to execute a task I just call cuLaunchKernel() as follows:

cuLaunchKernel(function, 1, 1, 1, numThreads, 1, 1, 0, null, args, null);

But I wondered: if I needed 1 million threads, would it still be a good idea to do the same?

Or would the performance be better if I used the “gridDimX” argument? If so, I don’t have a clear idea of how to do it.

I hope you understand.

Of course, the number of threads used per block is one important tuning parameter. There are some tricky/sophisticated methods for determining the “optimal” number of threads, but that depends on many factors and is hard (for me) to describe in general.

However, the usual pattern is

  • the problem size is given
  • the block size is chosen (e.g. 16, 32, 64, 128, 256 or 512)
  • the grid size (i.e. the number of required blocks) is computed as follows
int blockSizeX = 256;
int gridSizeX = (int)Math.ceil((double)problemSize / blockSizeX);
cuLaunchKernel(function,
    gridSizeX,  1, 1,      // Grid dimension
    blockSizeX, 1, 1,      // Block dimension
    0, null,               // Shared memory size and stream
    kernelParameters, null // Kernel- and extra parameters
);
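
As a small illustration of the grid-size computation above, here is a minimal Java sketch for the case of 1 million threads. The class and method names (`GridSizeSketch`, `blocksFor`) are made up for this example and are not part of JCuda; the integer form `(n + b - 1) / b` is used here as an equivalent of the `Math.ceil` expression for positive sizes.

```java
/** Hypothetical helper, not part of JCuda: how many blocks cover a 1D problem. */
final class GridSizeSketch {

    /** Ceiling division with integers only; assumes problemSize >= 0 and blockSize > 0. */
    static int blocksFor(int problemSize, int blockSize) {
        return (problemSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        int blockSizeX = 256;
        int problemSize = 1_000_000;

        // The floating-point form used in the post
        int viaCeil = (int) Math.ceil((double) problemSize / blockSizeX);
        // The integer-only form
        int viaInts = blocksFor(problemSize, blockSizeX);

        System.out.println(viaCeil + " " + viaInts); // prints "3907 3907"
    }
}
```

So 1 million threads with a block size of 256 would be launched as 3907 blocks, of which the last one is only partially used.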

It is important to note that in the kernel itself, you have to check whether the current thread is inside the bounds of the problem:

extern "C"
__global__ void kernelFunction(int problemSize, ...)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < problemSize)
    {
        // Do the work here
    }
}
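
To see why the bounds check matters, the index computation can be emulated on the CPU. The following Java sketch is only an illustration under that assumption: the class `BoundsCheckSketch` and its method are hypothetical, and the two loops merely stand in for the blocks and threads that the GPU would run in parallel.

```java
/** CPU-side emulation of the 1D thread index computation; for illustration only. */
final class BoundsCheckSketch {

    /** Counts how many emulated threads pass the bounds check of the kernel. */
    static int countActiveThreads(int problemSize, int blockDimX) {
        int gridDimX = (problemSize + blockDimX - 1) / blockDimX;
        int active = 0;
        for (int blockIdxX = 0; blockIdxX < gridDimX; blockIdxX++) {
            for (int threadIdxX = 0; threadIdxX < blockDimX; threadIdxX++) {
                int threadID = blockIdxX * blockDimX + threadIdxX;
                if (threadID < problemSize) {
                    active++; // this thread would do the work
                }
                // all other threads fall outside the problem and must do nothing
            }
        }
        return active;
    }

    public static void main(String[] args) {
        int problemSize = 1000;
        int blockDimX = 256;
        int launched = ((problemSize + blockDimX - 1) / blockDimX) * blockDimX;
        System.out.println("launched=" + launched
            + " active=" + countActiveThreads(problemSize, blockDimX));
        // prints "launched=1024 active=1000"
    }
}
```

Without the check, the 24 surplus threads of the last block would read and write past the end of the arrays.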

What if we need to address a two-dimensional array?

I’ve tried to modify “jCudaVectorAdd”, which originally uses a one-dimensional array, into this:

int blockSizeX = 256; int blockSizeY = 256;
int gridSizeX = (int)Math.ceil((double)numElements / blockSizeX);
int gridSizeY = (int)Math.ceil((double)numElements / blockSizeY);
cuLaunchKernel(function,
      gridSizeX,  gridSizeY,  1, // Grid dimension
      blockSizeX, blockSizeY, 1, // Block dimension
      0, null,                   // Shared memory size and stream
      kernelParameters, null     // Kernel- and extra parameters
);

and the .cu file into this:

__global__ void add(int n, float **a, float **b, float **sum)
{
    int h = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    if ((h<n) && (i<n))
    {
        sum[h][i] = a[h][i] + b[h][i];
    }
}


But still, it seems I’m not lucky enough to get this working >.<

The error was:
Exception in thread “main” jcuda.CudaException: CUDA_ERROR_INVALID_VALUE

Did I miss something?

It’s not so straightforward to use “real” 2D arrays in CUDA. You may want to have a look at the “JCudaDriverSample.java” from the website, which uses 2D arrays: The pointers to the array have to be allocated and copied separately.
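
One more thing that may be worth checking in the launch configuration above: a block of 256 x 256 threads asks for 65536 threads per block, while CUDA devices limit the total number of threads per block (1024 on compute capability 2.0 and later, 512 on older devices). A launch that exceeds this limit is rejected, so this is a likely source of the CUDA_ERROR_INVALID_VALUE, independent of the 2D pointer issue. The following Java sketch is only an illustration: the class name, the method and the hard-coded limit are assumptions, and the real limit should be queried from the device (in the driver API this is the CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK attribute).

```java
/** Hypothetical sanity check for a 2D block size; not part of JCuda. */
final class BlockLimitSketch {

    /** Assumed limit for illustration; query the device for the real value. */
    static final int ASSUMED_MAX_THREADS_PER_BLOCK = 1024;

    /** Returns true if a block of blockSizeX * blockSizeY threads stays within the assumed limit. */
    static boolean fits(int blockSizeX, int blockSizeY) {
        return (long) blockSizeX * blockSizeY <= ASSUMED_MAX_THREADS_PER_BLOCK;
    }

    public static void main(String[] args) {
        System.out.println(fits(256, 256)); // prints "false" (65536 threads)
        System.out.println(fits(16, 16));   // prints "true"  (256 threads)
    }
}
```

A 16 x 16 block keeps the same total of 256 threads per block that the 1D example used.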

In general, it’s much easier to use 1D arrays and treat them as if they were 2D arrays. For a ‘pseudo-2D-array’ with size [sizeX*sizeY], one can access the array elements like this
float element = array[x + y * sizeX];
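
The indexing scheme above can be tried out on the host first. This is a minimal Java sketch with hypothetical names (`Flat2DSketch`, `index`); it only demonstrates the mapping between (x, y) coordinates and the position in the 1D array.

```java
/** Illustration of treating a 1D array as a 2D array; names are hypothetical. */
final class Flat2DSketch {

    /** Maps 2D coordinates to the index in the flat array (row after row). */
    static int index(int x, int y, int sizeX) {
        return x + y * sizeX;
    }

    public static void main(String[] args) {
        int sizeX = 4;
        int sizeY = 3;
        float[] array = new float[sizeX * sizeY];

        // Fill the pseudo-2D array so that each element encodes its coordinates
        for (int y = 0; y < sizeY; y++) {
            for (int x = 0; x < sizeX; x++) {
                array[index(x, y, sizeX)] = 10 * y + x;
            }
        }

        float element = array[index(2, 1, sizeX)];
        System.out.println(element); // prints "12.0"

        // The inverse mapping recovers the coordinates from a flat index
        int i = index(2, 1, sizeX);
        System.out.println((i % sizeX) + "," + (i / sizeX)); // prints "2,1"
    }
}
```

The same index expression works unchanged inside a kernel, which is why a single flat device buffer is usually sufficient.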

wow …
ok, thanks again marco

But in C++ this is done “simply” with the “dim3” class.
Isn’t “dim3” available in JCuda?

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}

There is a class “jcuda.runtime.dim3”, but I assume that you are referring to the syntax of
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
right? In this case, you might want to have a look at the first section of the tutorial at http://jcuda.org/tutorial/TutorialIndex.html#Introduction about the difference between the Runtime API and the Driver API. Additionally, the “KernelLauncher” class from the Utilities package at http://jcuda.org/utilities/utilities.html might be interesting for you.
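
To make the relation between the two styles more concrete: with the driver API there is no `<<<...>>>` syntax, and the grid and block dimensions are passed to cuLaunchKernel as six plain integers. The following Java sketch shows that mapping. The nested `Dim3` class is a hypothetical stand-in written for this example (it is not `jcuda.runtime.dim3`), and N = 16 is an assumed value.

```java
import java.util.Arrays;

/** Illustration of how a dim3-style configuration maps to six launch integers. */
final class LaunchConfigSketch {

    /** Hypothetical dim3-like triple; components that are not given default to 1. */
    static final class Dim3 {
        final int x, y, z;
        Dim3(int x) { this(x, 1, 1); }
        Dim3(int x, int y) { this(x, y, 1); }
        Dim3(int x, int y, int z) { this.x = x; this.y = y; this.z = z; }
    }

    /** Returns gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ. */
    static int[] launchDimensions(Dim3 grid, Dim3 block) {
        return new int[] { grid.x, grid.y, grid.z, block.x, block.y, block.z };
    }

    public static void main(String[] args) {
        int n = 16; // assumed matrix size for this example
        Dim3 numBlocks = new Dim3(1);
        Dim3 threadsPerBlock = new Dim3(n, n);
        int[] dims = launchDimensions(numBlocks, threadsPerBlock);
        System.out.println(Arrays.toString(dims)); // prints "[1, 1, 1, 16, 16, 1]"
    }
}
```

These six values correspond, in this order, to the grid and block dimension arguments of cuLaunchKernel.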

ya …
the utilities package is so “real”

thank you marco

aw …
the example file throws:
Exception in thread “main” jcuda.CudaException: CUDA_ERROR_INVALID_SOURCE

Is it not finished yet?

The current version of the KernelLauncher still has a drawback: it uses CUBIN files. This has been discussed in this thread: http://forum.byte-welt.de/showthread.php?t=3347&page=2 . Admittedly, I’m a bit shocked that this is now already 8 months old :frowning: But finally: yesterday I updated the KernelLauncher to use PTX files, which should solve this problem. It will be uploaded together with the updated version of JCuda for CUDA 4.1RC2, and I hope that I can do this on Monday.

I’m looking forward to it :slight_smile:

Nice to see support for PTX. I had created a clone of the KernelLauncher that overrides the load method to use PTX code.

The new version of the KernelLauncher in the Utilities has been uploaded at http://jcuda.org/utilities/utilities.html
Hope that everything works as expected.

Thanks Marco

Could you give me an example of how to use a two- or three-dimensional thread block?
That is possible now with this kernel launcher, isn’t it?

It was already possible before. The problem you mentioned was due to a “missing” compiler argument for the creation of the CUBIN file, which should no longer be necessary, because the latest version uses PTX files. However, here is an example of using the KernelLauncher to add two 2D matrices using a 2D grid configuration:

/*
 * JCudaUtils - Utilities for JCuda
 * http://www.jcuda.org
 *
 * Copyright (c) 2010-2012 Marco Hutter - http://www.jcuda.org
 */
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemFree;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoH;
import static jcuda.driver.JCudaDriver.cuMemcpyHtoD;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.JCudaDriver;
import jcuda.utils.KernelLauncher;
import jcuda.utils.Print;

/**
 * A sample demonstrating how the KernelLauncher class may
 * be used to compile inlined source code and execute a
 * kernel function from the source code. This example
 * adds two matrices using a 2D grid configuration.
 */
public class KernelLauncher2DSample
{
    public static void main(String args[])
    {

        String sourceCode =
            "extern \"C\"" + "\n" +
            "__global__ void add(float *a, float *b, float *result, int sizeX, int sizeY)" + "\n" +
            "{" + "\n" +
            "    int x = blockIdx.x*blockDim.x + threadIdx.x;" + "\n" +
            "    int y = blockIdx.y*blockDim.y + threadIdx.y;" + "\n" +
            "    if (x < sizeX && y < sizeY)" + "\n" +
            "    {" + "\n" +
            "        int index = x+y*sizeX;" + "\n" +
            "        result[index] = a[index] + b[index];" + "\n" +
            "    }" + "\n" +
            "}" + "\n";

		// Prepare the kernel
		System.out.println("Preparing the KernelLauncher...");
		KernelLauncher kernelLauncher = 
				KernelLauncher.compile(sourceCode, "add");

		// Create the input data
		System.out.println("Creating input data...");
		int sizeX = 10;
		int sizeY = 5;
		float a[] = new float[sizeX * sizeY];
		float b[] = new float[sizeX * sizeY];
		float result[] = new float[sizeX * sizeY];
		for (int i = 0; i < sizeX * sizeY; i++) {
			a[i] = i;
			b[i] = i;
		}

		// Allocate the device memory and copy the input
		// data to the device
		System.out.println("Initializing device memory...");
		CUdeviceptr dA = new CUdeviceptr();
		cuMemAlloc(dA, sizeX * sizeY * Sizeof.FLOAT);
		cuMemcpyHtoD(dA, Pointer.to(a), sizeX * sizeY * Sizeof.FLOAT);
		CUdeviceptr dB = new CUdeviceptr();
		cuMemAlloc(dB, sizeX * sizeY * Sizeof.FLOAT);
		cuMemcpyHtoD(dB, Pointer.to(b), sizeX * sizeY * Sizeof.FLOAT);
		CUdeviceptr dResult = new CUdeviceptr();
		cuMemAlloc(dResult, sizeX * sizeY * Sizeof.FLOAT);

		// Set up the block- and grid sizes and call the kernel
		System.out.println("Calling the kernel...");
		int blockSizeX = 16;
		int blockSizeY = 16;
		kernelLauncher.setBlockSize(blockSizeX, blockSizeY, 1);
		int gridSizeX = (int) Math.ceil((double) sizeX / blockSizeX);
		int gridSizeY = (int) Math.ceil((double) sizeY / blockSizeY);
		kernelLauncher.setGridSize(gridSizeX, gridSizeY, 1);
		kernelLauncher.call(dA, dB, dResult, sizeX, sizeY);

		// Copy the result from the device to the host
		System.out.println("Obtaining results...");
		cuMemcpyDtoH(Pointer.to(result), dResult, sizeX * sizeY * Sizeof.FLOAT);
		System.out.println("Result:\n" + Print.toString2D(result, sizeX, "%6.1f"));

		// Clean up