Of course, the number of threads used per block is one important tuning parameter. There are some tricky/sophisticated methods for determining the “optimal” number of threads, but that depends on many factors and is hard (for me) to describe in general.
However, the usual pattern is:
- the problem size is given
- the block size is chosen (e.g. 16, 32, 64, 128, 256 or 512)
- the grid size (i.e. the number of required blocks) is computed as follows:
int blockSizeX = 256;
int gridSizeX = (int)Math.ceil((double)problemSize / blockSizeX);
cuLaunchKernel(function,
    gridSizeX, 1, 1,       // Grid dimension
    blockSizeX, 1, 1,      // Block dimension
    0, null,               // Shared memory size and stream
    kernelParameters, null // Kernel- and extra parameters
);
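As a side note, the same grid size can also be computed with integer arithmetic alone, which avoids the floating-point cast. This is just a sketch of the common idiom; the `problemSize` value here is an arbitrary example:

```java
public class GridSizeSketch
{
    public static void main(String args[])
    {
        int problemSize = 1000; // example problem size
        int blockSizeX = 256;

        // Integer ceiling division, equivalent to
        // (int)Math.ceil((double)problemSize / blockSizeX)
        int gridSizeX = (problemSize + blockSizeX - 1) / blockSizeX;

        // 4 blocks of 256 threads cover 1024 >= 1000 elements
        System.out.println(gridSizeX);
    }
}
```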
It is important to note that in the kernel itself, you have to check whether the current thread is inside the bounds of the problem:
extern "C"
__global__ void kernelFunction(int problemSize, ...)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < problemSize)
    {
        // Do the work here
    }
}
It’s not so straightforward to use “real” 2D arrays in CUDA. You may want to have a look at the “JCudaDriverSample.java” from the website, which uses 2D arrays: The pointers to the array have to be allocated and copied separately.
In general, it’s much easier to use 1D arrays and treat them as if they were 2D arrays. For a ‘pseudo-2D-array’ with size [sizeX*sizeY], one can access the array elements like this
float element = array[x + y * sizeX];
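To illustrate this indexing on the host side, here is a small self-contained sketch (the sizes and values are chosen arbitrarily for the example):

```java
public class Pseudo2DArraySketch
{
    public static void main(String args[])
    {
        int sizeX = 4;
        int sizeY = 3;
        float array[] = new float[sizeX * sizeY];

        // Fill the array: element (x,y) is stored at index x + y * sizeX,
        // i.e. the rows are laid out one after another in memory
        for (int y = 0; y < sizeY; y++)
        {
            for (int x = 0; x < sizeX; x++)
            {
                array[x + y * sizeX] = 10 * y + x;
            }
        }

        // Read back the element at x=2, y=1
        float element = array[2 + 1 * sizeX];
        System.out.println(element); // 12.0
    }
}
```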
There is a class “jcuda.runtime.dim3”, but I assume that you are referring to the syntax of
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
right? In this case, you might want to have a look at the first section of the tutorial at http://jcuda.org/tutorial/TutorialIndex.html#Introduction about the difference between the Runtime API and the Driver API. Additionally, the “KernelLauncher” class from the Utilities package at http://jcuda.org/utilities/utilities.html might be interesting for you.
The current version of the KernelLauncher still has a drawback: It uses CUBIN files. This has been discussed in this thread: http://forum.byte-welt.de/showthread.php?t=3347&page=2 . Admittedly, I’m afraid that this is now already 8 months old. But finally: Yesterday I updated the KernelLauncher to use PTX files, which should solve this problem. It will be uploaded together with the updated version of JCuda for CUDA 4.1RC2, and I hope that I can do this on Monday.
It was already possible before. The problem you mentioned was due to a “missing” compiler argument for the creation of the CUBIN file - which should now no longer be necessary, because the latest version is using PTX files. However, here is an example of using the KernelLauncher to add two 2D matrices using a 2D grid configuration:
/*
* JCudaUtils - Utilities for JCuda
* http://www.jcuda.org
*
* Copyright (c) 2010-2012 Marco Hutter - http://www.jcuda.org
*/
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemFree;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoH;
import static jcuda.driver.JCudaDriver.cuMemcpyHtoD;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.JCudaDriver;
import jcuda.utils.KernelLauncher;
import jcuda.utils.Print;
/**
* A sample demonstrating how the KernelLauncher class may
* be used to compile inlined source code and execute a
* kernel function from the source code. This example
* adds two matrices using a 2D grid configuration.
*/
public class KernelLauncher2DSample
{
    public static void main(String args[])
    {
        JCudaDriver.setExceptionsEnabled(true);

        String sourceCode =
            "extern \"C\"" + "\n" +
            "__global__ void add(float *a, float *b, float *result, int sizeX, int sizeY)" + "\n" +
            "{" + "\n" +
            "    int x = blockIdx.x*blockDim.x + threadIdx.x;" + "\n" +
            "    int y = blockIdx.y*blockDim.y + threadIdx.y;" + "\n" +
            "    if (x < sizeX && y < sizeY)" + "\n" +
            "    {" + "\n" +
            "        int index = x+y*sizeX;" + "\n" +
            "        result[index] = a[index] + b[index];" + "\n" +
            "    }" + "\n" +
            "}";

        // Prepare the kernel
        System.out.println("Preparing the KernelLauncher...");
        KernelLauncher kernelLauncher =
            KernelLauncher.compile(sourceCode, "add");

        // Create the input data
        System.out.println("Creating input data...");
        int sizeX = 10;
        int sizeY = 5;
        float a[] = new float[sizeX * sizeY];
        float b[] = new float[sizeX * sizeY];
        float result[] = new float[sizeX * sizeY];
        for (int i = 0; i < sizeX * sizeY; i++)
        {
            a[i] = i;
            b[i] = i;
        }

        // Allocate the device memory and copy the input
        // data to the device
        System.out.println("Initializing device memory...");
        CUdeviceptr dA = new CUdeviceptr();
        cuMemAlloc(dA, sizeX * sizeY * Sizeof.FLOAT);
        cuMemcpyHtoD(dA, Pointer.to(a), sizeX * sizeY * Sizeof.FLOAT);
        CUdeviceptr dB = new CUdeviceptr();
        cuMemAlloc(dB, sizeX * sizeY * Sizeof.FLOAT);
        cuMemcpyHtoD(dB, Pointer.to(b), sizeX * sizeY * Sizeof.FLOAT);
        CUdeviceptr dResult = new CUdeviceptr();
        cuMemAlloc(dResult, sizeX * sizeY * Sizeof.FLOAT);

        // Set up the block- and grid sizes and call the kernel
        System.out.println("Calling the kernel...");
        int blockSizeX = 16;
        int blockSizeY = 16;
        kernelLauncher.setBlockSize(blockSizeX, blockSizeY, 1);
        int gridSizeX = (int) Math.ceil((double) sizeX / blockSizeX);
        int gridSizeY = (int) Math.ceil((double) sizeY / blockSizeY);
        kernelLauncher.setGridSize(gridSizeX, gridSizeY, 1);
        kernelLauncher.call(dA, dB, dResult, sizeX, sizeY);

        // Copy the result from the device to the host
        System.out.println("Obtaining results...");
        cuMemcpyDtoH(Pointer.to(result), dResult, sizeX * sizeY * Sizeof.FLOAT);
        System.out.println("Result:\n" +
            Print.toString2D(result, sizeX, "%6.1f"));

        // Clean up
        cuMemFree(dA);
        cuMemFree(dB);
        cuMemFree(dResult);
    }
}