CUDA non-void kernel function



I don't speak German. OK, just a little :) I have forgotten my German…

The tutorial is nice, but what if the kernel is not a void function and returns a double, float, etc.? How can I get the result of the kernel in Java code?

I guess this is a very simple question to ask, but I am not sure why it is not covered in the tutorial.

I would be grateful if you could help.


extern "C"
__global__ int add(int n, float *a, float *b, float *sum)
{
    int sum=0;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sum = a[i] + b[i]+sum;
    return sum;
}


In CUDA, kernel functions are always void.

If you want to return a single value from a kernel, you have to write it into a “result” pointer that is passed to the kernel. Sketched here, as an example:

extern "C"
__global__ void computeSomething(..., int *result)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // ...
    if (index == 0) result[0] = 12345;
}

On the host side, there is some boilerplate code involved: you have to allocate (device) memory for the result and, after the kernel has run, copy it back to the host.

// Allocate memory for the result (a single int value)
CUdeviceptr resultPointer = new CUdeviceptr();
cuMemAlloc(resultPointer, 1 * Sizeof.INT);

// Set up the kernel parameters. The last parameter
// is the "result" pointer
Pointer kernelParameters = Pointer.to(
    ..., Pointer.to(resultPointer));

// Call the kernel function. This will write
// a single value into the "result" pointer
cuLaunchKernel(function, gx, 1, 1, bx, 1, 1,
    0, null, kernelParameters, null);

// Copy the value from the result pointer to a host array
int resultArray[] = new int[1];
cuMemcpyDtoH(Pointer.to(resultArray), resultPointer, 1 * Sizeof.INT);

This is cumbersome, indeed. But note that this is not related to JCuda. In CUDA-C, you basically have to do the same thing when using the driver API.

Also note that you have to be very careful about what you are doing with this (single!) result variable in the kernel. You have to keep in mind that the kernel is executed by thousands of threads at once. You may have noticed that I wrote this in the kernel:

if (index == 0) result[0] = 12345;

If you just wrote

result[0] = someResult;

then all the threads would write to the same memory location, at the same time, which would basically produce unpredictable results…
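
To make the difference tangible without a GPU, here is a rough CPU analogy in plain Java. It is only a sketch: parallel streams stand in for CUDA threads, and the class and method names are made up for illustration. The guarded version has exactly one writer, so its outcome is fixed. The unguarded version lets every "thread" write its own value, so the outcome depends on which write happens to land last.

```java
import java.util.stream.IntStream;

// Hypothetical CPU analogy, not CUDA and not part of JCuda.
final class SingleResultDemo {

    // Mirrors "if (index == 0) result[0] = 12345;": only one writer
    static int guardedWrite(int threads) {
        int[] result = new int[1];
        IntStream.range(0, threads).parallel().forEach(index -> {
            if (index == 0) {
                result[0] = 12345;
            }
        });
        return result[0];
    }

    // Mirrors "result[0] = someResult;": every "thread" writes its own value
    static int unguardedWrite(int threads) {
        int[] result = new int[1];
        IntStream.range(0, threads).parallel().forEach(index -> {
            result[0] = index + 1; // data race: whichever write lands last wins
        });
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println("guarded:   " + guardedWrite(10000));
        // Some value between 1 and 10000; may differ from run to run
        System.out.println("unguarded: " + unguardedWrite(10000));
    }
}
```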

(EDIT: BTW: Everything fine - the default language for the project support forums here is English :) )


You are a legend. Thank you, Marco ;)


I am using float and am not sure whether the number of bits or bytes is required. Do I need to divide by 8?
I tried, but the results are always 0.

Another thing: what are bx and gx? How do I set them?


You need the number of bytes. For a float, that is Sizeof.FLOAT (4 bytes), so there is no need to divide by 8.

(You could also use Float.BYTES, but this was introduced in Java 1.8, which did not exist when JCuda was started)
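
To make the byte count concrete, here is a small sketch in plain Java (no JCuda needed). The class and method names are made up for illustration, and Float.BYTES plays the role of Sizeof.FLOAT, assuming Java 8 or newer. Dividing by 8 only comes into play when you start from a size in bits, such as Float.SIZE.

```java
// Hypothetical helper, not part of JCuda: computes the byte count that
// calls like cuMemAlloc or cuMemcpyDtoH expect for float data.
final class ByteSizes {

    // Number of bytes needed to store 'count' float values
    static long floatBytes(int count) {
        return (long) count * Float.BYTES;
    }

    // Same size, derived from the size in bits
    static int floatBytesFromBits() {
        return Float.SIZE / 8; // 32 bits / 8 = 4 bytes
    }

    public static void main(String[] args) {
        System.out.println(floatBytes(1));        // 4
        System.out.println(floatBytes(1000));     // 4000
        System.out.println(floatBytesFromBits()); // 4
    }
}
```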

gx and bx in the example are just the first element of the grid dimensions and block dimensions. You can read more about that in the documentation about the thread hierarchy.
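
As a rough sketch of how these two values are commonly chosen for a one-dimensional problem: pick a block size bx, then compute gx so that gx * bx covers all n elements. The block size of 256, the element count, and the names below are assumptions for illustration, not values from the example above.

```java
// Hypothetical helper, not part of JCuda.
final class LaunchConfig {

    // Smallest number of blocks such that gridSize * blockSize >= n
    static int gridSizeFor(int n, int blockSize) {
        return (n + blockSize - 1) / blockSize; // integer ceiling division
    }

    public static void main(String[] args) {
        int n = 1000;  // assumed number of elements
        int bx = 256;  // assumed threads per block
        int gx = gridSizeFor(n, bx);
        System.out.println("gx = " + gx + ", bx = " + bx); // gx = 4, bx = 256
    }
}
```

Because gx * bx can be larger than n (here 1024 threads for 1000 elements), a kernel launched this way usually checks its index against n before touching the arrays.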

(If you are not aware of that, once more the warning: Be careful that this “single variable” that you want to return from the kernel actually makes sense. What is the value that you want to compute in the kernel?)