Dynamic parallelism problem

I was using jcuda without problems, compiling code with nvrtcCompileProgram() and all worked perfectly.
CUDA project: https://git.elphel.com/Elphel/tile_processor_gpu/tree/master/src (it has a C++ top file for debugging and 2 other shared between C++ and jcuda)
Java code: https://git.elphel.com/Elphel/imagej-elphel/blob/lwir-distort/src/main/java/com/elphel/imagej/gpu/GPUTileProcessor.java#L1107

When I added CDP code (not yet in repo, it compiles/runs with in NSight/Eclipse), jcuda reported:
default_program(2275): error: identifier „cudaDeviceSynchronize“ is undefined

I then tried to compile into cubin:
/usr/local/cuda/bin/nvcc -I/usr/local/cuda/samples/common/inc/ -dlink -rdc=true -arch=sm_75 -cubin -c TileProcessor.cuh -o TileProcessor.cubin

It went ok - just got 2 legit warnings about unused variables. But when I tried loading it with

I’ve got Failed to initialize GPU class
jcuda.CudaException: CUDA_ERROR_INVALID_IMAGE

(with both cubin and ptx).

Maybe I’m doing something wrong with compilation - I never compiled code for jcuda but with nvrtcCompileProgram()?

nvrtcCompileProgram(program, 2, new String[] {"-arch=compute_75", „-rdc=true“});
Program compilation log:
default_program(2424): warning: variable „ports_rgb“ was declared but never referenced
default_program(2425): warning: variable „max_diff“ was declared but never referenced
Failed to initialize GPU class
jcuda.CudaException: CUDA_ERROR_INVALID_PTX

Hello,

I haven’t worked extensively with dynamic parallelism yet. There has been a question about how to use dynamic parallelism in general quite a while ago, and I remember that I ~„got something basically running“ - that’s what I posted at Dynamic parallelism

The compilation/linking procedure was a bit complicated, as far as I remember, but it has been 5 years ago, so I don’t remember any details. You might want to try out the example that I posted there (although things might have changed with the later CUDA versions).

I’ll also try it out again tomorrow, and have a closer look at the repos that you linked to.

bye
Marco

Marco, thank you, I’ll try just your example. So far I only tried to insert your CDP code into otherwise working kernel file, compiled by nvrtcCompileProgram() with the same options as above (without CDP same options work) and got CUDA_ERROR_INVALID_PTX again.

Andrey

Copied your example files and called them from myprogram:

Mar 29, 2020 9:49:50 PM com.elphel.imagej.gpu.JCudaSamplesUtils invokeNvcc
INFO: Creating cubin file for src/main/resources/kernels/JCudaDynamicParallelismKernel.cu
Mar 29, 2020 9:49:50 PM com.elphel.imagej.gpu.JCudaSamplesUtils invokeNvcc
INFO: Executing
Failed to initialize GPU class
jcuda.CudaException: Could not create cubin file
at com.elphel.imagej.gpu.JCudaSamplesUtils.invokeNvcc(JCudaSamplesUtils.java:176)
at com.elphel.imagej.gpu.JCudaSamplesUtils.prepareDefaultCubinFile(JCudaSamplesUtils.java:77)
at com.elphel.imagej.gpu.GPUTileProcessor.(GPUTileProcessor.java:271)
at com.elphel.imagej.correction.Eyesis_Correction.getPairImages2Gpu(Eyesis_Correction.java:5797)

That was a wrong path to nvcc, now:
Mar 29, 2020 10:13:49 PM com.elphel.imagej.gpu.JCudaSamplesUtils invokeNvcc
INFO: Creating cubin file for src/main/resources/kernels/JCudaDynamicParallelismKernel.cu
Mar 29, 2020 10:13:49 PM com.elphel.imagej.gpu.JCudaSamplesUtils invokeNvcc
INFO: Executing
Mar 29, 2020 10:13:50 PM com.elphel.imagej.gpu.JCudaSamplesUtils invokeNvcc
INFO: Finished creating cubin file

So the last output looks as if it was basically working, right?

Some side note: The approach of calling the NVCC from a Java program in order to create the CUBIN/PTX is used in the samples for convenience, but also because it was (to some extent) necessary in earlier versions of CUDA: There was no option of compiling a program at runtime (for example, a program that is read from a file or contained in a plain Java String).

But now, CUDA supports runtime compilation via the NVRTC ( https://docs.nvidia.com/cuda/nvrtc/index.html ). A basic sample showing this is in https://github.com/jcuda/jcuda-samples/tree/master/JCudaSamples/src/main/java/jcuda/nvrtc/samples , namely the JNvrtcVectorAdd.java example.

(As mentioned above: The „dynamic parallelism“ example that I linked to has been created ~5 years ago, and I think that at that point, CUDA did not yet have the NVRTC…)

I have not yet tried how well the NVRTC and dynamic parallelism work together. And the first thing that I would try would be to update the example so that it does use the NVRTC. That could be a nice thing to add to the jcuda-samples, because it could be another sample for the NVRTC in general, and for dynamic parallelism in particular.

I can try to tackle this soon. (It should not take much time if it „just works“ - otherwise, I’d have to see how much time I can allocate for further experiments). If I manage to do this within the next days, I’d drop a note here.

Marco,

Yes, after I fixed the nvcc path it worked, strangely for me even without „-rdc=true“ option (I had to add it to Eclipse NSight to make it compile there). And only with cubin, not with ptx. I’ll try to see if I can generate cubin from my code, but it will be really nice to use NVRTC that I did not have any problems before trying CDP, and CDP seems to be very convenient to use. I’m trying to make kernels that work with C++ and java with minimal changes, and my image processing needs multiple passes, where next pass depends on the results of the previous. Without CDP number of kernel calls from the CPU grows making code less portable, java and cpp code has to be duplicated.

Andrey

Can it be the need for cudadevrt?

if (argc < 2) {
std::cout << "Usage: dynamic-parallelism <path to cudadevrt library>\n\n"
<< "library name itself, e.g., Z:\\path\\to\\cudadevrt.lib on \n"
<< "Windows and /path/to/libcudadevrt.a on Linux and Mac OS X.\n";
exit(1);
----
// Load the generated PTX and get a handle to the parent kernel.
CUdevice cuDevice;
CUcontext context;
CUmodule module;
CUfunction kernel;
CUDA_SAFE_CALL(cuInit(0));
CUDA_SAFE_CALL(cuDeviceGet(&cuDevice, 0));
CUDA_SAFE_CALL(cuCtxCreate(&context, 0, cuDevice));
0, 0, 0));
(void *)ptx, ptxSize, "dynamic_parallelism.ptx",
0, 0, 0));


The cudadevrt library seems to be relevant when using dynamic parallelism (as shown in the PDF that you linked to, and the corresponding example at https://docs.nvidia.com/cuda/nvrtc/index.html#example-dynamic-parallelism ).

In any case: When the actual library (and its path) are required for the runtime compilation, then this approach might make very limited sense.

I experimented a bit, and the results are shown below. But two important disclaimers:

• I have not understood the NVRTC/JIT/dynamic parallelism in all detail
• Important: The example below uses the highly preliminary (and brittle…) JITOptions class. Nobody should rely on that for now.

The following kernel, referred to as simpleDynamicParallelismKernel.cu from the main program, involves some very basic dynamic parallelism:

extern "C"
__global__ void childKernel(unsigned int parentThreadIndex, float* data)
{
}

extern "C"
__global__ void parentKernel(unsigned int size, float *data)
{
}


And the following program compiles and executes it on my machine:

package jcuda.jnvrtc.test;

import static jcuda.driver.CUjitInputType.CU_JIT_INPUT_LIBRARY;
import static jcuda.driver.CUjitInputType.CU_JIT_INPUT_PTX;
import static jcuda.nvrtc.JNvrtc.nvrtcCompileProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcCreateProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcDestroyProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcGetPTX;
import static jcuda.nvrtc.JNvrtc.nvrtcGetProgramLog;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JITOptions;
import jcuda.nvrtc.JNvrtc;
import jcuda.nvrtc.nvrtcProgram;

/**
* A test for using the NVRTC (NVIDIA Runtime Compiler) API
* to compile CUDA kernel code that uses dynamic parallelism
*/
public class JNvrtcDynamicParallelismTest
{
/**
* Entry point of this sample
*
* @param args Not used
* @throws IOException
*/
public static void main(String[] args) throws IOException
{
// XXX: This library name has to be given for runtime compilation
// that involves dynamic parallelism
String libraryPath = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2\\lib\\x64\\";

// Enable exceptions and omit all subsequent error checks
JNvrtc.setExceptionsEnabled(true);

Paths.get("simpleDynamicParallelismKernel.cu")));

// Initialize the driver and create a context for the first device.
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);

// Use the NVRTC to create a program by compiling the source code
nvrtcProgram program = new nvrtcProgram();
nvrtcCreateProgram(
program, programSourceCode, null, 0, null, null);
String options[] =
{
"--gpu-architecture=compute_52"
};
nvrtcCompileProgram(program, options.length, options);

// Print the compilation log (for the case there are any warnings)
String programLog[] = new String[1];
nvrtcGetProgramLog(program, programLog);
System.out.println("Program compilation log:\n" + programLog[0]);

// Obtain the PTX ("CUDA Assembler") code of the compiled program
String[] ptx = new String[1];
nvrtcGetPTX(program, ptx);
nvrtcDestroyProgram(program);
byte[] ptxData = ptx[0].getBytes();

// Create the CUDA module from the PTX, using the JIT
// WARNING: The "JITOptions" class is preliminary
JITOptions jitOptions = new JITOptions();

Pointer.to(ptxData), ptxData.length, "input.ptx", jitOptions);

long size[] = { 0 };
Pointer image = new Pointer();

CUmodule module = new CUmodule();
0, new int[0], Pointer.to(new int[0]));

// Obtain the function pointer to the "parentKernel" function
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "parentKernel");

// Define the nesting structure (note that the number
// of child threads must match the value that is used
// in the kernel)

// Allocate the device data that will be filled by the kernel
CUdeviceptr deviceData = new CUdeviceptr();
cuMemAlloc(deviceData, numElements * Sizeof.FLOAT);

// Set up the kernel parameters: A pointer to an array
// of pointers which point to the actual values.
Pointer kernelParameters = Pointer.to(
Pointer.to(new int[] { numElements }),
Pointer.to(deviceData)
);

// Call the kernel function.
int gridSizeX = (numElements + numElements - 1) / blockSizeX;
cuLaunchKernel(function,
gridSizeX,  1, 1,      // Grid dimension
blockSizeX, 1, 1,      // Block dimension
0, null,               // Shared memory size and stream
kernelParameters, null // Kernel- and extra parameters
);
cuCtxSynchronize();

// Copy the device data to the host
float hostData[] = new float[numElements];
for(int i = 0; i < numElements; i++)
{
hostData[i] = i;
}
cuMemcpyDtoH(Pointer.to(hostData),
deviceData, numElements * Sizeof.FLOAT);

// Compare the host data with the expected values
float hostDataRef[] = new float[numElements];
for(int i = 0; i < numParentThreads; i++)
{
for (int j=0; j < numChildThreads; j++)
{
hostDataRef[i * numChildThreads + j] = i + 0.1f * j;
}
}
System.out.println("Result: "+Arrays.toString(hostData));
boolean passed = Arrays.equals(hostData, hostDataRef);
System.out.println(passed ? "PASSED" : "FAILED");

// Clean up.
cuMemFree(deviceData);

}
}


To emphasize this again:

• The JITOptions that are necessary for setting up the JIT linker pass (specifically, to add the cudadevrt library by calling cuLinkAddFile) are very brittle and may show unexpected behavior (this is also mentioned in the JavaDocs. This was one of the cases where the CUDA C API could not sensibly be mapped to Java…)
• The full, absolute library path is contained in the source code…

Given these severe limitations, the path of using CUBIN files could in fact be more sensible. I know that this may raise other questions. Mainly: CUBIN files are oddly specific for one device architecture. And if the goal is to support multiple architectures, it might be necessary to create multiple CUBIN files and load them for the device at hand, with some if (deviceIs(VERSION_5_2)) load("kernel_5_2.cubin"); calls. But

You mentioned

I’m trying to make kernels that work with C++ and java with minimal changes,

but this is not related to the CUBIN-vs-PTX issue that you mentioned before, right?

Marco, thank you.
I’ll try the „brittle“ way first, if it will not work - use CUBIN approach. And in any case I’ll post the results how it worked.

So far I used NSight to debug the kernel code. In Java I used compilation from the String, that I composed from several source files and added some java code-generated defines that took care of the differences between C++ (debug) and java (actual work) code. And before CDP it was very convenient - no manual commands - just pressing debug button in NSight, when it worked there - the same in Eclipse Java project. Application settings contained path to the kernel source directory (if not specified it used resource directory with the files distributed with the Java application).

„Minimal changes“ was the reason to use CDP (I did not know about its existence until I really needed it) to have the sequence (including loops) of calling dependent kernel functions embedded in the kernel code and either C++ or Java had to interface just a single kernel and its input/output data. Yes, it is a separate issue, it is just a reason to use CDP that in turn broke our perfectly working so far jcuda-based code.

Andrey

You seem to have a rather sophisticated use-case there. Of course, I’ve created JCuda and am maintaining it, but that’s still different from using it in real applications, so it’s interesting to hear more about that from a more practical perspective.

This also refers to the „infrastructure“ for actual development. For a while, it was technically possible to use some of the NVIDIA toolchain even for JCuda applications, like the profiler and debugger. But the NVIDIA profiler underwent some refactorings, suddenly didn’t work the way it did before (via Java), and I never managed to catch up with the latest state.

Debugging may be even more important. Debugging CUDA can be a pain in the back, and debugging it via JCuda is probably even harder. There is a very basic „Debugging“ section at http://jcuda.org/debugging/Debugging.html , but I have to admit that I never set up NSight for that. I should probably try that out. If you have any hints (or would like to write something like a „cookbook-tutorial“ with steps that are easy to follow and try out (and that could eventually go into one of the READMEs on GitHub)), I’d appreciate that

But again, regarding the „brittle“ approach: As I already mentioned, NVRTC wasn’t there from the beginning, and the JIT capabilities do not seem to be so widely used (and have some caveats, even in plain CUDA). There are certainly things related to the JITOptions that should work but plainly do not work. I’d also appreciate feedback here, but I’m already aware of some of the limitations, and cannot say for sure when I’ll have the time to tackle that part and bring it into a more usable and reliable shape…

Marco, your example after changing library path to Linux environment:
static String LIBRARY_PATH = „/usr/local/cuda/targets/x86_64-linux/lib/“;
// String libraryPath = „C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\“;
and
// „–gpu-architecture=compute_52“
„–gpu-architecture=compute_75“
works!

Program compilation log:

Result: [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7]
PASSED

Andrey

Marco,

Thank you for the support of the very useful software. I’ll be happy to share our experience, but I’m not a real professional in neither Java, no CUDA. Having to deal with many aspects of the system design (mechanical/optical design, hardware/PCB, FPGA/Verilog, Linux kernel, ML/Tensorflow different types of applications, …) is very exciting, but limits the depth of knowledge in each of these areas. Here is a link to a blog with my first CUDA and jcuda experience - https://blog.elphel.com/2018/10/gpu-implementation-of-the-tile-processor/ where I tested the feasibility of the implementation of the Tile Processor (https://patents.google.com/patent/US20190098283A1/en - already approved). Then in 2018 I only implemented some functions, now I’m working on the rest comparing to the current Java implementation that is too slow to be useful (even using all of the available CPU threads). After that will be done, we will need to feed Tensorflow inferred model (separately it is tested in the same ImageJ application) directly from this our CUDA kernel from inside Java program.

The following are the arXiv links to what we are doing with this software (website software did not allow me to use more links - Google Scholar will find them):
arXiv:1911.06975

arXiv:1811.08032

Andrey

I did not yet test the CDP operation (there is yet nothing to test), but now my kernel with CDP loads correctly and non-CDP kernel functions work as before. Now I’ll start actual CDP development and see how it will work.

Thanks for all the information and pointers! That’s certainly a lot to read, although I’ll probably not be able to understand the details and background when it comes to the actual implementation.

(Usually, I consider performance bottlenecks as a challenge to first be tackled in plain Java: The Hotspot JIT usually does a great job. But image processing is basically the source of „GPGPU“ (and therefore, CUDA), so it’s likely worth a try…)

From a first quick glance some of this seems to be related to what NVIDIA offers in their NPP library - at least, I noticed the term „DCT“ appearing in the site that you linked to and the documentation at https://docs.nvidia.com/cuda/npp/group__image__quantization.html .

Marco, I did look at available code and I used DCT-II implementation in Nvidia samples as a reference when starting to write CUDA code. But there is not much available ready to use, and by „DCT“ they usually mean „DCT-II“, as in JPEG, but we needed DCT-IV/DST-IV to implement Complex Lapped Transform (similar to https://en.wikipedia.org/wiki/Modified_discrete_cosine_transform) and described in our blog post: https://blog.elphel.com/2018/01/complex-lapped-transform-bayer/ . For small overlapping windows CLT is ~twice more efficient than regular DFT.

It was not that difficult as the same algorithms we already implemented in both Java and Verilog for FPGA. With java it was loading all 48 threads in a Xeon-based workstation. While debugging CUDA I match intermediate results produced by Java code. Results differ somewhat as double are used in Java (it is faster than float) and float - in CUDA.

Andrey

Marco,

I’ve got another problem with nvrtc related to the size of ptx data. The same kernel builds/runs in NSight, but fails in jcuda+nvrts, failing at
cuLinkAddData(state, CU_JIT_INPUT_PTX, Pointer.to(ptxData), ptxData.length, „input.ptx“, jitOptions);
with CUDA_ERROR_INVALID_PTX. It seems that the error appears when the ptx size exceeds ~500K - I was trying to remove different unrelated kernels from the source, and with ptxData.length=411931 everything is OK, but with ptxData.length=539600 it fails.

Did you work with large PTX ?

Andrey

Hm. That’s a point where debugging becomes difficult. The JITOptions had been supposed to offer the options for obtaining info/error logs, via the CUjit_option.CU_JIT_ERROR_LOG_BUFFER fields. But setting them causes some CUDA_ERROR_INVALID_VALUE in cuLinkAddData. It might only be a trivial bug, but again: The JITOptions had been a first shot to emulate the (otherwise not very Java-friendly) void **optionValues, and they have not been tested extensively.

The CUDA_ERROR_INVALID_PTX is pretty common, and can have different reasons (as indicated by the documentation, which just says that „…indicates that a PTX JIT compilation failed.“ - this isn’t soo helpful here either…).The reasons may, for example, also be that too many registers are used, too much shared memory used, …

So: Are you sure that it is caused by the plain size of the PTX? A quick websearch involving keywords like ~„maximum ptx size“ did not yield any „obvious“ warning. From what you described, there might be a limit at 2^19=524288. Having PTX files of that size seems unusual for me, but I don’t know.

I’m not sure when I’ll be able to have a closer look at the JITOoptions. So possible debugging steps might be to manually compile the ptx, and see whether

ptxas --verbose TheFile.ptx


prints anything helpful. Otherwise, I’d try to create a minimal example in plain C that does nothing except for the boilerplate stuff that leads to the cuLinkAddData call, and see whether it also fails. (This is usually my first test when I want to figure out whether an issue is related to JCuda, or whether it’s a general (possibly undocumented) issue/limitation of CUDA…)

Sorry for the hassle.

An aside: There is also the option of using „fat binaries“ ( https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#fatbinaries ), which are somewhere between CUBIN and PTX in terms of flexibility. But admittedly, that’s something that I haven’t tried out at all, so I’m just leaving this here as a pointer.

Marco,

I also tried to find if anybody had similar problems and could not find anything. Testing with C would require to use nvrtc from it - I did not have problems with nvcc with the same files. And I’m almost sure it is not something jcuda-specific, but rather JIT.

I’ll do more testing, bur it seems that the problem is with the ptx size. The source consists of several global function that were called from the CPU (only recently I started to combine some of them using CDP). And when I remove different unrelated functions - everything works, so it can not be caused by the shared memory/registers. I looked at (and diff-ed) the generated ptx code - it seems OK and ends with the same code for both working (smaller) and non-working (larger) sizes, so it is not truncated. And the error itself happens during linking ( cuLinkAddData), not during the compilation.

I’m thinking of trying to split the file into separate compilation units - the file is already large, so splitting will improve the code. Have you tried separate compilatin with nvrtc in jcuda? Should I anticipate any problems with it?

I’ll first finish debugging of the current code (while it is still in my „cache memory“). I did have a function that was replaced by a more generalized one, and now after debugging the replacement I removed the older one that gave me some 15% of extra room. After that done, I’ll split the working code trying not to harm optimization mentioned in CUDA documentation.

Again, my experience here is very limited (I mean, practical experience, beyond basic tests that I did a while ago). But if it is possible to split the code into multiple, independent PTX files, this sounds like an „obvious“ way to circumvent any size limit that CUDA/JIT might have under the hood: In the best case, it should be possible to do several cuLinkAddData calls, and wrap them up with a cuLinkComplete.

(I wonder whether/how it could be able to resolve dependencies that can not be avoided. But I’ll definitely have to read more at https://docs.nvidia.com/cuda/parallel-thread-execution/index.html and https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html about this particular topic before I can give further hints…)

As for the test with plain CUDA: When you say that you did not have a problem with nvcc, then it indeed
might be a problem with JIT/nvrtc in general - but I’m still not sure whether it might be caused by a bug in JCuda. So I’d probably start with the CUDA Samples\v10.2\6_Advanced\ptxjit (which, by the way, is the only sample that even calls cuLinkAddData…), and see whether it also bails out when receiving the large PTX file. It seems to have one function, ptxJIT, that can easily be called with an arbitrary PTX, for testing. (I wonder where I should get a >500KB PTX file from, but in doubt, I’d create that one with auto-generated „dummy“ functions…). Maybe I can try that in the next few days, unless you find a suitable workaround.

Marco, thank you. I’ll keep you posted on my results, I will first finish debugging of the last kernel itself, using extra room made by disabling unused function, then try to split to several compilation units according to https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/.
I’ll do that splitting only after the code itself will be tested, so I’ll be able to compare performance. Maybe I’ll need to duplicate some low-level functions to include in multiple compilation units. But I do not expect it - anyway I’m sure the kernel code can be improved, but even as is it gives me a huge improvement over the CPU.
I too believe it is possible to reproduce the problem with plain CUDA nvrtc, it is unlikely to be JCuda-specific. And if separate compilation will work - the problem will be solved, and a push to limit compilation unit size is good to improve code quality anyway, I just was lazy to investigate that posiblity, and originally kernel was not that big to bother.

Andrey