In CUDA, kernel functions are always void.
If you want to return a single value from a kernel, you have to write it into a "result" pointer that is passed to the kernel. Sketched here, as an example:
extern "C"
__global__ void computeSomething(..., int *result)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // ...
    if (index == 0) result[0] = 12345;
}
On the host side, there is some boilerplate code involved: you have to allocate device memory for the result, and afterwards copy it back to the host.
// Allocate memory for the result (a single int value)
CUdeviceptr resultPointer = new CUdeviceptr();
cuMemAlloc(resultPointer, 1 * Sizeof.INT);

// Set up the kernel parameters. The last parameter
// is the "result" pointer
Pointer kernelParameters = Pointer.to(
    ...
    Pointer.to(resultPointer)
);

// Call the kernel function. This will write a single
// value into the "result" pointer. (The unused grid
// and block dimensions must be 1, not 0)
cuLaunchKernel(function, gx, 1, 1, bx, 1, 1,
    0, null, kernelParameters, null);
cuCtxSynchronize();

// Copy the value from the result pointer to a host array
int resultArray[] = new int[1];
cuMemcpyDtoH(Pointer.to(resultArray), resultPointer, 1 * Sizeof.INT);

// Free the device memory when it is no longer needed
cuMemFree(resultPointer);
This is cumbersome, indeed. But note that this is not related to JCuda. In CUDA-C, you basically have to do the same thing when using the driver API.
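For comparison, this is roughly what the same steps look like with the driver API in plain C. This is only a sketch: error checks are omitted, and function, gx and bx are assumed to be set up as in the example above:

```
CUdeviceptr resultPointer;
cuMemAlloc(&resultPointer, 1 * sizeof(int));

/* The last kernel parameter is the address of the device pointer */
void *kernelParameters[] = { /* ..., */ &resultPointer };

cuLaunchKernel(function, gx, 1, 1, bx, 1, 1,
    0, NULL, kernelParameters, NULL);
cuCtxSynchronize();

/* Copy the single int value back to the host, then free it */
int result;
cuMemcpyDtoH(&result, resultPointer, 1 * sizeof(int));
cuMemFree(resultPointer);
```

So the structure of the JCuda code is a direct mapping of what one writes in C anyhow.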
Also note that you have to be very careful about what you are doing with this (single!) result variable in the kernel. You have to keep in mind that the kernel is executed by thousands of threads at once. You may have noticed that I wrote this in the kernel:
if (index == 0) result[0] = 12345;
If you just wrote
result[0] = someResult;
then all the threads would write to the same memory location at the same time, which would basically produce an unpredictable result.
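If each thread is actually supposed to contribute to the result (for example, counting how many elements satisfy some condition), one common pattern is to use an atomic operation instead of a plain write. A minimal sketch (the kernel name, the input data and the condition are only placeholders):

```
extern "C"
__global__ void countMatches(int *data, int n, int *result)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n && data[index] > 0)
    {
        // atomicAdd serializes the concurrent updates, so the
        // final count is correct regardless of the thread ordering
        atomicAdd(result, 1);
    }
}
```

(The result memory should be initialized to 0, e.g. with cuMemsetD32, before the kernel is launched. For large inputs, a proper reduction is usually faster than having every thread hit the same atomic counter, but that's a topic of its own.)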
(EDIT: BTW: Everything's fine - the default language for the project support forums here is English.)