Dynamic parallelism problem

Marco, so far it did not work out of the box. I split the code into two compilation units; it does work with nvcc. With nvrtc in JCuda I compiled both to PTX OK (I replaced the includes by merging the code into a single string), then added the library and the PTX files:

cuLinkAddFile(state, CU_JIT_INPUT_LIBRARY, LIBRARY_PATH, jitOptions);
for (int cunit = 0; cunit < ptxDataUnits.length; cunit++) {
    cuLinkAddData(state, CU_JIT_INPUT_PTX, Pointer.to(ptxDataUnits[cunit]),
        ptxDataUnits[cunit].length, "input" + cunit + ".ptx", jitOptions);
}
long size[] = { 0 };
Pointer image = new Pointer();
cuLinkComplete(state, image, size);
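
For reference, each ptxDataUnits element is produced approximately like this (a sketch from memory - the merged source string and the exact compile options, such as the architecture and -rdc, may differ from the actual code):

// Sketch: compiling one merged source string to PTX with JNvrtc
// (uses the static imports from jcuda.nvrtc.JNvrtc; mergedSourceCode
// and the options below are placeholders/approximations)
nvrtcProgram program = new nvrtcProgram();
nvrtcCreateProgram(program, mergedSourceCode, "unit" + cunit + ".cu", 0, null, null);
String compileOptions[] = {
    "--gpu-architecture=compute_61",    // the actual architecture may differ
    "--relocatable-device-code=true"    // needed when the units are linked separately
};
nvrtcCompileProgram(program, compileOptions.length, compileOptions);

String log[] = new String[1];
nvrtcGetProgramLog(program, log);       // compiler warnings/errors, if any
System.out.println(log[0]);

String ptx[] = new String[1];
nvrtcGetPTX(program, ptx);
nvrtcDestroyProgram(program);

byte bytes[] = ptx[0].getBytes();
ptxDataUnits[cunit] = java.util.Arrays.copyOf(bytes, bytes.length + 1); // zero-terminated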

And it failed at cuLinkComplete() with a very informative message:

Failed to initialize GPU class
jcuda.CudaException: CUDA_ERROR_UNKNOWN
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:359)
at jcuda.driver.JCudaDriver.cuLinkComplete(JCudaDriver.java:2768)
at com.elphel.imagej.gpu.GPUTileProcessor.createFunctions(GPUTileProcessor.java:1458)

Andrey

Just to make sure the code modifications did not break something, I combined the same source files into a single compilation unit:

// "*" - Java-generated defines, first index - separately compiled unit
// static String [][] GPU_SRC_FILES = {{"*","dtt8x8.h","dtt8x8.cu"},{"*","dtt8x8.h","TileProcessor.cuh"}};
static String [][] GPU_SRC_FILES = {{"*","dtt8x8.h","dtt8x8.cu","TileProcessor.cuh"}};

And that worked correctly.

It’s hard to guess the possible reason for a CUDA_ERROR_UNKNOWN (and I hope that you’re not wasting too much of your time debugging something that might plainly be a bug in JCuda at this point…).

Is there any chance that I can try this out locally? Again, one of my first tests in such a case is always to try out the same in a native example, but testing the Java version could also be helpful. Is this more than trying to compile/link the current files at https://git.elphel.com/Elphel/tile_processor_gpu/tree/master/src ?

Marco,

It will be difficult to test everything completely, as there are large data files, but it should be enough to test compilation.

The master branch is from before the splitting (I left it for comparison); the separate compilation is in https://git.elphel.com/Elphel/tile_processor_gpu/tree/separate-compilation

The code that uses JCuda to compile/link is here:

The switch between single and split compilation is at line 95 of the same file; the commented-out variant is the separate compilation (causes the error), and the next line is the single PTX (working).

Andrey

The test_tp.cu file has a dependency on /usr/local/cuda/samples/common/inc/ (on my computer).

I’ll put this on my task list, and try to allocate some time to test it. (I cannot promise anything right now, but maybe that’s also a chance to review/improve the stuff around the JITOptions…)

Marco,

Thank you. It is not too urgent for me right now - while converting the code for separate compilation I removed some unneeded parts and reduced the PTX size to below 400K (when compiled together), so there is some room to add the code that I need. Some of the "#pragma unroll" and inline function declarations may be unnecessary; removing them will also reduce the PTX size.

Andrey

Hello,

So I tried this locally. I’ve set up the https://git.elphel.com/Elphel/imagej-elphel project (I had to do some workaround in Eclipse so that it found the tools.jar - it had to be started with a dedicated path name so that it was run with a JDK, not a JRE). Also, because I usually have the latest CUDA version installed (for obvious reasons), I updated to 10.2.0 in the POM (there was a comment indicating that 9.2 was required only for some TensorFlow compatibility).

But I’ve seen that you seem to have switched off the "separate compilation" attempt in the current version. Now, I could try to generate .CU files that are filled with dummy functions so that they are intentionally too large, and then compare the behavior of the Java version and a native version that only compiles+loads these files, but I’m not sure whether this is a sensible way to go.

According to your last comment, it seems like you could work around this by making the PTX smaller. And I think I (have to) consider this now as a "low-priority" issue that I might look into when it becomes a blocker.

BTW: I had a short look to see how JCuda is actually used there. The Eyesis_Correction has a main and can be started, but I don’t know exactly what I can do with that. The Eyesis_Correction.java is a file that has more than 10000 lines of code. I could probably spend years just trying to understand that single file. And that’s not even close to the point where the GPUTileProcessor is used. That’s done from another class, and frankly, method signatures like

public double [][][][][][] clt_aberrations_quad_corr(
        final ImageDttParameters  imgdtt_params,   
        final int                 macro_scale,     
        final int [][]            tile_op,         
        final double [][]         disparity_array, 
        final double [][][]       image_data, 
        final boolean [][]        saturation_imp, 
        final double [][][][]     clt_corr_combo,  
        final double [][][][][]   clt_corr_partial,
        final double [][]         clt_mismatch,    
        final double [][]         disparity_map,   
        final double [][][][]     texture_tiles,   
        final int                 width,
        final double              corr_fat_zero,    
        final boolean             corr_sym,
        final double              corr_offset,
        final double              corr_red,
        final double              corr_blue,
        final double              corr_sigma,
        final boolean             corr_normalize,  
        final double              min_corr,        
        final double              max_corr_sigma,  
        final double              max_corr_radius, 
        final boolean             max_corr_double, 
        final int                 corr_mode, 
        final double              min_shot,        
        final double              scale_shot,      
        final double              diff_sigma,      
        final double              diff_threshold,  
        final boolean             diff_gauss,      
        final double              min_agree,       
        final boolean             dust_remove,     
        final boolean             keep_weights,    
        final GeometryCorrection  geometryCorrection,
        final GeometryCorrection  geometryCorrection_main, 
        final double [][][][][][] clt_kernels, 
        final int                 kernel_step,
        final int                 transform_size,
        final int                 window_type,
        final double [][]         shiftXY, 
        final double              disparity_corr, 
        final double [][][]       fine_corr, 
        final double              corr_magic_scale, 
        final double              shiftX, 
        final double              shiftY, 
        final int                 debug_tileX,
        final int                 debug_tileY,
        final boolean             no_fract_shift,
        final boolean             no_deconvolution,
        final int                 threadsMax,  
        final int                 globalDebugLevel)

are impressive, but definitely not the good kind of impressive…

So on the one hand, I’m curious, and would like to know where the performance bottlenecks are, and how they are solved with JCuda. But it seems like the efforts to zoom into the right part of the code here are prohibitively large.

However, if you encounter any further issues with JCuda, just let me know.

(And if there’s a magic place with these lines…

public static void main(String args[]) {
    runWithCPU("inputImage.png");
    runWithGPU("inputImage.png");
}

that can be used as an entry point for further code browsing (and profiler runs and such), that would be great…)

bye
Marco

Marco,

I’m sorry - the code is far from perfect and it is not yet easy to run. What is doable is to compile+link it with nvrtc: https://git.elphel.com/Elphel/imagej-elphel/blob/lwir-distort/src/main/java/com/elphel/imagej/gpu/GPUTileProcessor.java#L95

/*	static String [][] GPU_SRC_FILES = {
	{"*","dtt8x8.h","dtt8x8.cu"},
	{"*","dtt8x8.h","geometry_correction.h","geometry_correction.cu","TileProcessor.h","TileProcessor.cuh"}};
*/
	static String [][] GPU_SRC_FILES = {{"*","dtt8x8.h","dtt8x8.cu","geometry_correction.h","geometry_correction.cu","TileProcessor.h","TileProcessor.cuh"}};

The commented-out variant uses two compilation units (fails linking); the active one uses just a single unit (links OK). The PTX size is now 396636 bytes, even though I added more code. The "*" means text generated by getTpDefines() in the same file.
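
For completeness, each unit’s source string is assembled roughly like this (a simplified sketch - SRC_DIR and the plain file reading are placeholders; only getTpDefines() is the real method name):

// Sketch: building one compilation unit's source string; "*" is replaced
// by the Java-generated defines. SRC_DIR and the plain file reading are
// simplifications for this sketch.
static String assembleUnitSource(String[] files) throws java.io.IOException {
    StringBuilder sb = new StringBuilder();
    for (String f : files) {
        if (f.equals("*")) {
            sb.append(getTpDefines());   // generated #define block
        } else {
            sb.append(new String(java.nio.file.Files.readAllBytes(
                java.nio.file.Paths.get(SRC_DIR, f))));
        }
        sb.append("\n");
    }
    return sb.toString();
}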

I also noticed that nvrtc does not like unions or anonymous structs (or both):

struct gc {
	float pixelCorrectionWidth; //  =2592;   // virtual camera center is at (pixelCorrectionWidth/2, pixelCorrectionHeight/2)
	float pixelCorrectionHeight; // =1936;
	float line_time;        // duration of one scan line readout (for ERS)
	float focalLength;      // =FOCAL_LENGTH;
	float pixelSize;        // =  PIXEL_SIZE; //um
	float distortionRadius; // =  DISTORTION_RADIUS; // mm - half width of the sensor
#ifndef	NVRTC_BUG
	union {
		struct {
#endif
			float distortionC;      // r^2
			float distortionB;      // r^3
			float distortionA;      // r^4
			float distortionA5;     //r^5
			float distortionA6;     //r^6
			float distortionA7;     //r^7
			float distortionA8;     //r^8
#ifndef	NVRTC_BUG
		};
		float rad_coeff [7];
	};
#endif
	// parameters, common for all sensors
	float    elevation;     // degrees, up - positive;
	float    heading;       // degrees, CW (from top) - positive
...

and it does not have the offsetof() macro.

But in any case, I was able to implement all the needed functionality; the main feature required from JCuda was the dynamic parallelism that you helped me with.

I plan to write a blog post about this project. This technology was developed for the FPGA implementation - we just received a USPTO notification that our patent ("Method for the FPGA-Based Long Range Multi-View Stereo with Differential Image Rectification") will be issued later this month. The multi-threaded Java code was made from the Verilog code of the FPGA, then converted to CUDA (the FPGA we have in the camera has by far insufficient resources).
The raw results are below:

GPU run time =510.485116ms (CPU with 8 threads showing 100% each would be over 60sec)

 - rot/derivs:        0.032561ms
 - tasks setup:       3.998356ms
 - direct conversion: 22.591752ms
 - imclt:             16.254792ms
 - corr2D:            22.388543ms
 - textures:          210.814ms
 - RGBA:              234.404205ms

(Just a short note: The forum software found it suspicious that a new user repeatedly posted similar links, and that triggered a spam filter - I have increased your "Trust Level", hopefully this will prevent further issues with that. I’ll take a closer look at your post ASAP)

Sure, code that is developed in the context of research looks different than the bread-and-butter code in other areas. And you already mentioned that this project is at the intersection of several cutting-edge technologies. I just wanted to mention that it’s hard to give focused advice in these cases, unless the issue is isolated into a small, standalone test.

But I created a main that just creates a new instance of GPUTileProcessor, and if I understood this correctly, then this could already be sufficient to reproduce the actual error, which shows up in src/main/java/com/elphel/imagej/gpu/GPUTileProcessor.java (lwir-distort branch) - I’d just have to switch the /* comments */ on/off accordingly. This might be a starting point for investigating this further.

The numbers that you posted certainly look promising. I’d really be curious about the exact benchmarking code behind this. In CUDA, you always have to take into account the timing for memory copies and computations, and on top of that, the asynchronous nature of CUDA. In Java, the HotSpot JIT may make benchmarking really difficult. But I think that it would be hard (or for me: impossible) to carve out a direct comparison of the CPU-vs-GPU approaches into a dedicated test case.
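
Just to illustrate the point about asynchronicity, here is a minimal sketch (with placeholder names - function, gridX/blockX and kernelParameters are not from your code) of timing a single kernel with CUDA events, so that the measurement covers the actual execution and not only the launch call:

// Sketch: timing a kernel launch with CUDA events via the JCuda driver API
// (uses the usual static imports from jcuda.driver.JCudaDriver)
CUevent start = new CUevent();
CUevent stop = new CUevent();
cuEventCreate(start, 0);
cuEventCreate(stop, 0);

cuEventRecord(start, null);            // null = default stream
cuLaunchKernel(function,
    gridX, 1, 1,                       // grid dimensions (placeholder)
    blockX, 1, 1,                      // block dimensions (placeholder)
    0, null,                           // shared memory bytes, stream
    kernelParameters, null);
cuEventRecord(stop, null);
cuEventSynchronize(stop);              // wait until the kernel has really finished

float elapsedMs[] = { 0.0f };
cuEventElapsedTime(elapsedMs, start, stop);
System.out.println("Kernel time: " + elapsedMs[0] + " ms");

Memory copies (cuMemcpyHtoD/cuMemcpyDtoH) can be bracketed with events in the same way, and on the CPU side the first few runs should be discarded so that the HotSpot JIT has a chance to warm up.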

You said

The multi-threaded Java code was made from the Verilog code of the FPGA

Of course I cannot argue about that on a technical level: I have no idea about the requirements or constraints that may stem from the Verilog source. But I can imagine that the resulting code is not optimized thoroughly (or maybe not at all) for the Java environment. There could be language constructs in Verilog that (very roughly speaking) can easily be translated to Java ("verbatim"), but where the result is not really "idiomatic (efficient) Java". Again, this is only a guess, from quickly skimming over code that is far too complex for me to understand. But I see things like the code block around src/main/java/com/elphel/imagej/gpu/GPUTileProcessor.java (lwir-distort branch), with many short-lived threads, atomics, and new (allocations) in inner loops (which are likely performance-critical), and I think that some of this could be optimized (and probably simplified) if one understood thoroughly what the code is actually doing, and wrote it with the goal of creating an efficient, Java-based implementation.

But on the other hand: It may just not be worth it. Even if the Java code was 10 times faster (~6 seconds instead of 60), it would still be nowhere near the GPU-based approach.

Marco, when I mentioned Verilog as the original source, I meant that the algorithms were initially designed to be massively parallel, so when simulating them in Java it was natural to use many threads in parallel. Moving to the GPU increased the parallelization again, and it seems possible to have it running in real time (30-60 fps) for lower-resolution thermal imagers (640x512) - see a possible application: "Invisible Headlights". Our current code is huge; it covers many aspects of camera calibration, lens matching in multi-view cameras and aberration correction, and there are many one-time functions for research. After we test the functionality using a normal workstation (among the big things will be combining this GPU code with TensorFlow: https://git.elphel.com/Elphel/tensorflow-feed-from-gpu), we’ll try to move the GPU code and re-implement a subset of the CPU code on the Jetson platform to make it wearable. And if all that works, we’ll eventually go back to Verilog for a custom ASIC that will combine the Tile Processor and ML functionality.

The code that you referenced is not final for use, it is just for GPU kernel development, so I add and test each kernel separately. I’ll have to re-write the top Java code anyway to optimize the data transfers between the input from the cameras (now they are just files), CPU memory, the custom GPU code and the TensorFlow network inference. The Tile Processor operates on image tiles (16x16 with an 8x8 pixel pitch), and the full image does not always have to be processed, so on input the GPU kernel receives an array (containing either only the tiles to process, or the full sparse 324x242 grid with 0-s in the unneeded elements). The typical depth-map generation includes multiple passes, and later passes may refine data for only a subset of tiles, so the overall optimization may be tricky.
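
For illustration, packing the sparse grid into a dense task list looks roughly like this (a simplified sketch - the names and the task encoding are placeholders, not the actual kernel interface):

// Sketch: packing a sparse TILESX x TILESY task grid (0 = skip) into a dense
// array of tile tasks before uploading it to the GPU. Names are placeholders.
static final int TILESX = 324;
static final int TILESY = 242;

static int[] packTileTasks(int[] sparseTasks) {
    int[] packed = new int[sparseTasks.length];
    int n = 0;
    for (int ty = 0; ty < TILESY; ty++) {
        for (int tx = 0; tx < TILESX; tx++) {
            int task = sparseTasks[ty * TILESX + tx];
            if (task != 0) {
                packed[n++] = task;     // keep only tiles that need processing
            }
        }
    }
    return java.util.Arrays.copyOf(packed, n);
}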

Related question: is it possible to get the linker logs with jitOptions in JCuda? Maybe the linker is able to complain with something clearer than "unknown error 999"?

If you are OK with loading huge files, you may have a look at the sample images processed with this (GPU+JCuda) software - I uploaded some of them to https://community.elphel.com/files/gpu/

Andrey

I had a short look at the images, but I don’t know what I’m seeing there exactly. Maybe it would be clearer if I had already taken the time to take a closer look at the blog posts that you linked to earlier, but that’s still on my TODO list…

Regarding the parallelization in general: Certain aspects can probably be written generically. But some things like occupancy, vectorization or shared memory (which may have a huge impact on performance) are so specific to CUDA that it’s hard to anticipate them in a generic parallelization approach (although I’ve seen some __shared__ variables in the CUDA code).

As for the parallelization in Java: Some people consider threads (and the low-level synchronization functions wait/notify) as "too low level" to be used in "normal" applications. Now, you’re certainly not developing a "normal" application there, that’s for sure, but, for example, creating and finishing threads can be expensive, so even if the goal is to squeeze out the last bit of performance, a thread pool (aka Executors::newFixedThreadPool) could be beneficial. In many other cases, one can trivially "make things parallel" by just wrapping the execution into a parallel Stream. But I understand that you have very special constraints and requirements. When you say that all this should eventually go back to an ASIC, there may be no point in using an abstract, high-level concurrency framework for this kind of research.
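
Just to illustrate what I mean, here is a small sketch with placeholder names (processTile and the tile count are made up - this is not a rewrite of your code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;

public class TilePoolSketch
{
    // Placeholder for the per-tile work (not the actual Elphel code)
    private static void processTile(int tile)
    {
        // ... heavy per-tile computation ...
    }

    public static void main(String[] args) throws Exception
    {
        int numTiles = 324 * 242;
        int threads = Runtime.getRuntime().availableProcessors();

        // Variant 1: a fixed thread pool, created once and reused across
        // passes, instead of creating and joining new Threads in inner loops
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<Future<?>>();
        for (int tile = 0; tile < numTiles; tile++)
        {
            final int t = tile;
            futures.add(pool.submit(() -> processTile(t)));
        }
        for (Future<?> f : futures)
        {
            f.get(); // wait until all tiles of this pass are done
        }
        pool.shutdown();

        // Variant 2: for the simple cases, a parallel stream
        IntStream.range(0, numTiles).parallel().forEach(t -> processTile(t));
    }
}

The point is that the pool is created once and reused across passes, which avoids the cost of repeatedly creating and joining threads; the parallel stream does essentially the same with far less ceremony.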

(Totally unrelated: I stumbled over the /files/gltf_test directory on the same server, and wondered what glTF was supposed to be used for there…)

Related question: is it possible to get the linker logs with jitOptions in JCuda? Maybe the linker is able to complain with something clearer than "unknown error 999"?

One of the main issues with the JITOptions that I mentioned earlier is that it is not ("reliably") possible to obtain logs. I have to re-read the CUDA docs at this point. The way the void** optionValues are used is not entirely straightforward, and I remember that I introduced the JITOptions class as a last resort to emulate this peculiar void** pointer in Java, but I don’t remember the details right now. I can try to increase the priority of that, and see whether/how I can make the JITOptions more usable.

BTW: I had not addressed this:

I also noticed that nvrtc does not like unions or anonymous struct (or both):

(although, once more, I cannot really say something profound here, unfortunately, but): at least according to a quick search in the NVRTC 12.3 documentation, it seems that structs should at least in principle work…

We normally use x3d for several applications:

  1. Presenting CAD models of our cameras and components:
    https://blog.elphel.com/2015/12/x3d-assemblies-from-any-cad/
    Elphel camera assemblies - ElphelWiki
  2. Output from the software where JCuda is used
    x3d models index (interesting scenes, but older camera 3d-printed prototype)
    x3d models index

The /files/gltf_test directory (broken) was an attempt to convert x3d to glTF for viewing with a stereo adapter on a smartphone.

Yes, I understand that, and this is why I asked if you had tried that at all, even "unreliably". Getting some insight into what the linker "did not like" would be useful, even if I had to reboot the computer after that test.

I already got a significant improvement with the GPU, and the achieved performance is sufficient to implement a real-time system. I’m sure it can be optimized further, as I have very limited experience with CUDA, and it is not likely that I’ll be able to really master it - there are so many other parts of the project that need to be implemented. So now I plan to wrap up the GPU code development (add a monochrome mode and clean up), then forget about it for some time (I’ll likely have to get into it again if some bugs come out) and move on to using the kernels as black boxes.

Thanks for the pointers to the usage of X3D. I think that I now have a much clearer idea what all this is about, even though this presentation (e.g. the map) is on a completely different technical level than the CUDA topic.

The main reason why I asked is because I’ve also been doing some things with glTF, and some of my former colleagues did even more with X3D and STEP. One of the files of the viewer that is triggered from the map even contained a copyright header pointing to my former workplace.

The gltf_test seems to have been a very early attempt: It uses glTF 1.0, which (paved some roads, but) did not get such broad support as glTF 2.0. I think the tool support is now much better than in the 1.0 times, and the project explorer at http://github.khronos.org/glTF-Project-Explorer/ contains some converters for X3D and STEP.


Back to CUDA: I did try obtaining the log output, and did several tests with that. The following is an example that is boiled down to the point where it just shows how even declaring the log buffers in the JITOptions causes an error:

package jcuda.jnvrtc.test;

import static jcuda.driver.CUjitInputType.CU_JIT_INPUT_PTX;
import static jcuda.driver.CUjit_option.CU_JIT_ERROR_LOG_BUFFER;
import static jcuda.driver.CUjit_option.CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
import static jcuda.driver.CUjit_option.CU_JIT_INFO_LOG_BUFFER;
import static jcuda.driver.CUjit_option.CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLinkAddData;
import static jcuda.driver.JCudaDriver.cuLinkComplete;
import static jcuda.driver.JCudaDriver.cuLinkCreate;
import static jcuda.driver.JCudaDriver.cuLinkDestroy;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoadDataEx;
import static jcuda.nvrtc.JNvrtc.nvrtcCompileProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcCreateProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcDestroyProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcGetPTX;
import static jcuda.nvrtc.JNvrtc.nvrtcGetProgramLog;

import java.io.IOException;
import java.util.Arrays;

import jcuda.Pointer;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUfunction;
import jcuda.driver.CUlinkState;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;
import jcuda.driver.JITOptions;
import jcuda.nvrtc.JNvrtc;
import jcuda.nvrtc.nvrtcProgram;

public class JCudaJitLogTest
{
    /**
     * The source code of the program that will be compiled at runtime:
     * A simple vector addition kernel. 
     * 
     * Note: The function should be declared as  
     * extern "C"
     * to make sure that it can be found under the given name.
     */
    private static String programSourceCode = 
        "extern \"C\"" + "\n" +
        "__global__ void add(int n, float *a, float *b, float *sum)" + "\n" +
        "{" + "\n" +
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;" + "\n" +
        "    if (i<n)" + "\n" +
        "    {" + "\n" +
        "        sum[i] = a[i] + b[i];" + "\n" +
        "    }" + "\n" +
        "}" + "\n";
    
    public static void main(String args[]) throws IOException
    {
        JNvrtc.setExceptionsEnabled(true);
        JCudaDriver.setExceptionsEnabled(true);
        
        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUcontext pctx = new CUcontext();
        CUdevice dev = new CUdevice();
        cuDeviceGet(dev, 0);
        cuCtxCreate(pctx, 0, dev);

        // Use the NVRTC to create a program by compiling the source code
        nvrtcProgram program = new nvrtcProgram();
        nvrtcCreateProgram(
            program, programSourceCode, null, 0, null, null);
        nvrtcCompileProgram(program, 0, null);
        
        // Print the compilation log (for the case there are any warnings)
        String programLog[] = new String[1];
        nvrtcGetProgramLog(program, programLog);
        System.out.println("Program compilation log:\n" + programLog[0]);        
        
        // Obtain the PTX ("CUDA Assembler") code of the compiled program
        String[] ptx = new String[1];
        nvrtcGetPTX(program, ptx);
        nvrtcDestroyProgram(program);
        byte bytes[] = ptx[0].getBytes();
        // make bytes zero-terminated
        byte ptxData[] = Arrays.copyOf(bytes, bytes.length + 1); 


        // Create the JITOptions
        JITOptions jitOptions = new JITOptions();
        int logSize = 5000;
        byte[] errorLog = new byte[logSize];
        byte[] infoLog = new byte[logSize];
        
        //*/ 
        // XXX This breaks it:
        jitOptions.putInt(CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES, logSize);
        jitOptions.putBytes(CU_JIT_ERROR_LOG_BUFFER, errorLog);
        jitOptions.putInt(CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES, logSize);
        jitOptions.putBytes(CU_JIT_INFO_LOG_BUFFER, infoLog);
        //*/
        
        System.out.println("Options before cuLinkCreate: " 
            + jitOptions.toFormattedString());
        
        CUlinkState state = new CUlinkState();
        cuLinkCreate(jitOptions, state);

        System.out.println("Options after cuLinkCreate: " 
            + jitOptions.toFormattedString());

        cuLinkAddData(state, CU_JIT_INPUT_PTX, 
            Pointer.to(ptxData), ptxData.length, "input.ptx", jitOptions);
        
        System.out.println("errorLog "+new String(errorLog));
        System.out.println("infoLog "+new String(infoLog));
        
        long size[] = { 0 };
        Pointer image = new Pointer();
        cuLinkComplete(state, image, size);
        
        CUmodule module = new CUmodule();
        cuModuleLoadDataEx(module, image,
            0, new int[0], Pointer.to(new int[0]));
        cuLinkDestroy(state);    
        
        // Obtain the function pointer to the "add" function from the module
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "add");
        
        System.out.println("Function " + function);
    }
    
}

(Just change the first //*/ to /*/ to comment out that block).

There are a few examples in the CUDA samples that use the JIT and cuModuleLoadDataEx, and the output of the program above suggests that I just messed up the JITOptions implementation: It outputs

Options before cuLinkCreate: JITOptions:
    CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES=5000
...
Options after cuLinkCreate: JITOptions:
    CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES=0

indicating that, roughly, "something is overwritten there". But I’ll have to investigate that further.

I found another problem with nvrtc (the same worked fine with nvcc):

ptxDataUnits[0].length=431152
Failed to initialize GPU class
jcuda.CudaException: CUDA_ERROR_INVALID_PTX
	at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:359)
	at jcuda.driver.JCudaDriver.cuLinkAddData(JCudaDriver.java:2755)

So this time the size was smaller than what I had working before.
The problem appeared after I added an argument to the kernel function; I fixed it by replacing 3 unrelated floats with a float array pointer to "make room".

extern "C" __global__ void textures_accumulate(
		int             * woi,                // x, y, width,height
		float          ** gpu_clt,            // [NUM_CAMS] ->[TILESY][TILESX][NUM_COLORS][DTT_SIZE*DTT_SIZE]
		size_t            num_texture_tiles,  // number of texture tiles to process
		int             * gpu_texture_indices,// packed tile + bits (now only (1 << 7)
		struct gc       * gpu_geometry_correction,
		int               colors,             // number of colors (3/1)
		int               is_lwir,            // do not perform shot correction
		float             min_shot,           // 10.0
		float             scale_shot,         // 3.0
		float             diff_sigma,         // pixel value/pixel change
		float             diff_threshold,     // pixel value/pixel change
		float             min_agree,          // minimal number of channels to agree on a point (real number to work with fuzzy averages)
#if 0 
		float             weight0,         // scale for R
		float             weight1,         // scale for B
		float             weight2,         // scale for G
#else
		float             weights[3],         // scale for R,B,G
#endif
		int               dust_remove,        // Do not reduce average weight when only one image differs much from the average
		int               keep_weights,       // return channel weights after A in RGBA (was removed) (should be 0 if gpu_texture_rbg)?
// combining both non-overlap and overlap (each calculated if pointer is not null )
		size_t            texture_rbg_stride, // in floats
		float           * gpu_texture_rbg,    // (number of colors +1 + ?)*16*16 rgba texture tiles
		size_t            texture_stride,     // in floats (now 256*4 = 1024)
		float           * gpu_texture_tiles);  // (number of colors +1 + ?)*16*16 rgba texture tiles

So now it works with nvrtc/JCuda again.

I have spent some time debugging around cuLinkAddData, but it seems like the only call in all the CUDA samples happens in the "Advanced/PTXJIT" example, at

myErr = cuLinkAddData(*lState, CU_JIT_INPUT_PTX, (void *)ptx_source.c_str(),
                      strlen(ptx_source.c_str()) + 1, 0, 0, 0, 0);

All the option parameters are 0. I tried around with some log outputs and such, but when calling the cuLinkAddData function like they do in the sample, it silently kills the application - I mean, entirely, without even an hs_err log being written. It’s hard to get a grip on that. Maybe I’ll invest more time here later, but right now, this is really just trial-and-error in parts of CUDA that are (at the very least) poorly documented, so I might as well shrug it off and say: "Well, sometimes things don’t work as they should"…