The cudadevrt library seems to be relevant when using dynamic parallelism (as shown in the PDF that you linked to, and the corresponding example in the NVRTC 12.3 documentation).
In any case: When the actual library (and its path) are required for the runtime compilation, then this approach might make only very limited sense.
I experimented a bit, and the results are shown below. But two important disclaimers:
- I have not understood the NVRTC/JIT/dynamic parallelism in all detail
- Important: The example below uses the highly preliminary (and brittle…) JITOptions class. Nobody should rely on that for now.
The following kernel, stored in a file simpleDynamicParallelismKernel.cu that is read by the main program, involves some very basic dynamic parallelism:
extern "C"
__global__ void childKernel(unsigned int parentThreadIndex, float* data)
{
    data[threadIdx.x] = parentThreadIndex + 0.1f * threadIdx.x;
}

extern "C"
__global__ void parentKernel(unsigned int size, float *data)
{
    childKernel<<<1, 8>>>(threadIdx.x, data + threadIdx.x * 8);
    cudaDeviceSynchronize();
    __syncthreads();
}
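To make the indexing explicit: parent thread i passes data + i * 8 to its child grid, and child thread j of that grid then writes parentThreadIndex + 0.1f * j, so element i * 8 + j of the output should end up as i + 0.1f * j. The following plain-Java sketch (mirroring the reference computation that the test program uses further below) shows the values that the kernel is expected to produce:

```java
import java.util.Arrays;

public class ExpectedKernelOutput
{
    // Matching the <<<1, 8>>> child launch and the 8 parent threads
    private static final int NUM_PARENT_THREADS = 8;
    private static final int NUM_CHILD_THREADS = 8;

    /**
     * Computes the values that the dynamic parallelism kernel is
     * expected to write: element i * 8 + j becomes i + 0.1f * j,
     * where i is the parent thread index and j the child thread index
     */
    public static float[] computeExpectedValues()
    {
        float result[] = new float[NUM_PARENT_THREADS * NUM_CHILD_THREADS];
        for (int i = 0; i < NUM_PARENT_THREADS; i++)
        {
            for (int j = 0; j < NUM_CHILD_THREADS; j++)
            {
                result[i * NUM_CHILD_THREADS + j] = i + 0.1f * j;
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        System.out.println(Arrays.toString(computeExpectedValues()));
    }
}
```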
And the following program compiles and executes it on my machine:
package jcuda.jnvrtc.test;
import static jcuda.driver.CUjitInputType.CU_JIT_INPUT_LIBRARY;
import static jcuda.driver.CUjitInputType.CU_JIT_INPUT_PTX;
import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuCtxSynchronize;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuLinkAddData;
import static jcuda.driver.JCudaDriver.cuLinkAddFile;
import static jcuda.driver.JCudaDriver.cuLinkComplete;
import static jcuda.driver.JCudaDriver.cuLinkCreate;
import static jcuda.driver.JCudaDriver.cuLinkDestroy;
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemFree;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoH;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoadDataEx;
import static jcuda.nvrtc.JNvrtc.nvrtcCompileProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcCreateProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcDestroyProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcGetPTX;
import static jcuda.nvrtc.JNvrtc.nvrtcGetProgramLog;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUlinkState;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;
import jcuda.driver.JITOptions;
import jcuda.nvrtc.JNvrtc;
import jcuda.nvrtc.nvrtcProgram;
/**
* A test for using the NVRTC (NVIDIA Runtime Compiler) API
* to compile CUDA kernel code that uses dynamic parallelism
*/
public class JNvrtcDynamicParallelismTest
{
    /**
     * Entry point of this sample
     *
     * @param args Not used
     * @throws IOException
     */
    public static void main(String[] args) throws IOException
    {
        // XXX: This library name has to be given for runtime compilation
        // that involves dynamic parallelism
        String libraryPath = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2\\lib\\x64\\";
        String cudadevrtLibraryName = libraryPath + "cudadevrt.lib";

        // Enable exceptions and omit all subsequent error checks
        JCudaDriver.setExceptionsEnabled(true);
        JNvrtc.setExceptionsEnabled(true);

        String programSourceCode = new String(Files.readAllBytes(
            Paths.get("simpleDynamicParallelismKernel.cu")));

        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Use the NVRTC to create a program by compiling the source code
        nvrtcProgram program = new nvrtcProgram();
        nvrtcCreateProgram(
            program, programSourceCode, null, 0, null, null);
        String options[] =
        {
            "--gpu-architecture=compute_52"
        };
        nvrtcCompileProgram(program, options.length, options);

        // Print the compilation log (in case there are any warnings)
        String programLog[] = new String[1];
        nvrtcGetProgramLog(program, programLog);
        System.out.println("Program compilation log:\n" + programLog[0]);

        // Obtain the PTX ("CUDA Assembler") code of the compiled program
        String[] ptx = new String[1];
        nvrtcGetPTX(program, ptx);
        nvrtcDestroyProgram(program);
        byte[] ptxData = ptx[0].getBytes();

        // Create the CUDA module from the PTX, using the JIT linker.
        // WARNING: The "JITOptions" class is preliminary
        JITOptions jitOptions = new JITOptions();
        CUlinkState state = new CUlinkState();
        cuLinkCreate(jitOptions, state);
        cuLinkAddFile(state, CU_JIT_INPUT_LIBRARY,
            cudadevrtLibraryName, jitOptions);
        cuLinkAddData(state, CU_JIT_INPUT_PTX,
            Pointer.to(ptxData), ptxData.length, "input.ptx", jitOptions);
        long size[] = { 0 };
        Pointer image = new Pointer();
        cuLinkComplete(state, image, size);
        CUmodule module = new CUmodule();
        cuModuleLoadDataEx(module, image,
            0, new int[0], Pointer.to(new int[0]));
        cuLinkDestroy(state);
        // Obtain the function pointer to the "parentKernel" function
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "parentKernel");

        // Define the nesting structure (note that the number
        // of child threads must match the value that is used
        // in the kernel)
        int numParentThreads = 8;
        int numChildThreads = 8;

        // Allocate the device data that will be filled by the kernel
        int numElements = numParentThreads * numChildThreads;
        CUdeviceptr deviceData = new CUdeviceptr();
        cuMemAlloc(deviceData, numElements * Sizeof.FLOAT);

        // Set up the kernel parameters: A pointer to an array
        // of pointers which point to the actual values.
        Pointer kernelParameters = Pointer.to(
            Pointer.to(new int[] { numElements }),
            Pointer.to(deviceData)
        );

        // Call the kernel function. The grid only has to cover the
        // parent threads (each parent thread launches its own children)
        int blockSizeX = numParentThreads;
        int gridSizeX = (numParentThreads + blockSizeX - 1) / blockSizeX;
        cuLaunchKernel(function,
            gridSizeX, 1, 1, // Grid dimension
            blockSizeX, 1, 1, // Block dimension
            0, null, // Shared memory size and stream
            kernelParameters, null // Kernel- and extra parameters
        );
        cuCtxSynchronize();
        // Copy the device data to the host (the host array is filled
        // with dummy values first, which will be overwritten)
        float hostData[] = new float[numElements];
        for (int i = 0; i < numElements; i++)
        {
            hostData[i] = i;
        }
        cuMemcpyDtoH(Pointer.to(hostData),
            deviceData, numElements * Sizeof.FLOAT);

        // Compare the host data with the expected values
        float hostDataRef[] = new float[numElements];
        for (int i = 0; i < numParentThreads; i++)
        {
            for (int j = 0; j < numChildThreads; j++)
            {
                hostDataRef[i * numChildThreads + j] = i + 0.1f * j;
            }
        }
        System.out.println("Result: " + Arrays.toString(hostData));
        boolean passed = Arrays.equals(hostData, hostDataRef);
        System.out.println(passed ? "PASSED" : "FAILED");

        // Clean up.
        cuMemFree(deviceData);
    }
}
To emphasize this again:
- The JITOptions that are necessary for setting up the JIT linker pass (specifically, for adding the cudadevrt library by calling cuLinkAddFile) are very brittle and may show unexpected behavior (this is also mentioned in the JavaDocs; this was one of the cases where the CUDA C API could not sensibly be mapped to Java…)
- The full, absolute library path is contained in the source code…
Given these severe limitations, the path of using CUBIN files could in fact be more sensible. I know that this may raise other questions. Mainly: CUBIN files are specific to one device architecture. So if the goal is to support multiple architectures, it might be necessary to create multiple CUBIN files and load the right one for the device at hand, with some if (deviceIs(VERSION_5_2)) load("kernel_5_2.cubin"); calls. But you mentioned that you are "trying to make kernels that work with C++ and Java with minimal changes", and this is not related to the CUBIN-vs-PTX issue that you mentioned before, right?
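To sketch what such an architecture-based selection could look like: the compute capability of the current device can be queried with cuDeviceGetAttribute from the JCuda driver API, and a matching CUBIN file name can be derived from it. This is only a sketch under assumptions: the kernel_X_Y.cubin naming scheme is made up here, and the corresponding files would have to be created beforehand (e.g. with nvcc) for each architecture that should be supported:

```java
public class CubinSelection
{
    /**
     * Returns the (hypothetical) name of the CUBIN file for the given
     * compute capability, e.g. "kernel_5_2.cubin" for major=5, minor=2.
     *
     * The major/minor version of the current device could be obtained
     * via the JCuda driver API, roughly like this:
     *
     *   int major[] = { 0 };
     *   int minor[] = { 0 };
     *   cuDeviceGetAttribute(major,
     *       CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, device);
     *   cuDeviceGetAttribute(minor,
     *       CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, device);
     *
     * and the resulting file name could then be passed to cuModuleLoad.
     */
    public static String cubinFileNameFor(int major, int minor)
    {
        return "kernel_" + major + "_" + minor + ".cubin";
    }

    public static void main(String[] args)
    {
        // For a device with compute capability 5.2, this selects
        // the (hypothetical) file name "kernel_5_2.cubin"
        System.out.println(cubinFileNameFor(5, 2));
    }
}
```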