Dynamic parallelism

Was there any resolution to this?

cuda - Using kernels with dynamic parallelism in jcuda - Stack Overflow

My setup for compiling this was:

nvcc -m64 -cubin -lcublas_device -lcudadevrt --device-c -I=$(CURR_DIR_INCLUDE) -I=$(COMMON_INCLUDE) -gencode arch=compute_52,code=sm_52 -rdc=true

The variables are just filled in from make for the includes; I'm mainly trying to double-check the flags.

Relevant compilation snippet:
JITOptions jitOptions = new JITOptions();
CUlinkState state = new CUlinkState();
cuLinkCreate(jitOptions, state);
boolean worked = false;
JCudaDriver.cuLinkAddFile(state, CUjitInputType.CU_JIT_INPUT_LIBRARY, "/usr/local/cuda-7.5/lib64/libcudadevrt.a", jitOptions);
cuLinkAddFile(state, CUjitInputType.CU_JIT_INPUT_CUBIN, moduleFileName, jitOptions);

long sz[] = new long[60000];
Pointer image = new Pointer();
cuLinkComplete(state, image, sz);
cuLinkDestroy(state);

I'm getting:

jcuda.CudaException: CUDA_ERROR_NO_BINARY_FOR_GPU

Hello

One comment to this question said: “Just got it to work! nvcc was naming the file something different than what I thought …” - so I considered this as resolved, and did not pursue it further.

In general, as one might guess from the JITOptions class, this part of the API is at the border of what may reasonably be represented in Java with the (simple) Pointer class (I did not have all details of JIT compilation on the radar back when I started JCuda…). Of course, I did some basic tests - that’s why I introduced the JITOptions after all - but will have to do more tests with the JIT, considering the possible linker options, runtime CUBIN and PTX management, and especially the linking against libraries like libcudadevrt (this, admittedly, was not tested at all until now - I’ll put the task to create a test/sample for this on my “todo” list…)

But regarding your actual question: The size parameter in cuLinkComplete does not receive the CUBIN data itself. It receives the size of the generated CUBIN data. Also see the cuLinkComplete docs. So it should usually be a 1-element array.

The actual CUBIN data is contained in the image pointer. And in your example, you don’t seem to do anything with this pointer, but immediately destroy the state, which also deallocates the image.

So the relevant parts for the JIT compilation should roughly look like this:

...

// Create a ("null") Pointer that will point to the CUBIN data after this call
// and an array to receive the CUBIN size (this may be null)
Pointer cubinOut = new Pointer();
long size[] = { -1 };
cuLinkComplete(linkState, cubinOut, size);
System.out.println("Size of CUBIN data: "+size[0]);

// Load the module from the CUBIN data
CUmodule module = new CUmodule();
cuModuleLoadDataEx(module, cubinOut, 0, new int[0], Pointer.to(new int[0]));   

// Afterwards, destroy the state (and with that, the CUBIN data that
// is now no longer needed, as it was loaded as a module)
cuLinkDestroy(linkState);

(Also see this answer on stackoverflow for some general hints about loading JIT-compiled modules)
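For completeness, this is what usually follows the module creation (a rough sketch - the kernel name "myKernel", the parameters and the launch configuration are only placeholders here):

// Obtain the kernel function from the JIT-linked module and launch it
// (assuming a hypothetical kernel __global__ void myKernel(int n, float *data))
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "myKernel");
Pointer kernelParameters = Pointer.to(
    Pointer.to(new int[] { n }),
    Pointer.to(deviceData)
);
cuLaunchKernel(function,
    gridSizeX,  1, 1,      // Grid dimension
    blockSizeX, 1, 1,      // Block dimension
    0, null,               // Shared memory size and stream
    kernelParameters, null // Kernel- and extra parameters
);
cuCtxSynchronize();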

If it does not work, don’t waste too much time fiddling around with it (just drop me a note here). I’ll try to create a test/sample in the next few days.

bye
Marco

Relevant part:
KernelLauncher kernelLauncher = new KernelLauncher();
JITOptions jitOptions = new JITOptions();
CUlinkState state = new CUlinkState();
cuLinkCreate(jitOptions, state);
//tried with AND without this
JCudaDriver.cuLinkAddFile(state, CUjitInputType.CU_JIT_INPUT_LIBRARY, "/usr/local/cuda/lib64/libcudadevrt.a", jitOptions);
cuLinkAddFile(state, CUjitInputType.CU_JIT_INPUT_CUBIN, moduleFileName, jitOptions);

long sz[] = new long[1];
Pointer image = new Pointer();
cuLinkComplete(state, image, sz);
kernelLauncher.initModule(image);
kernelLauncher.initFunction(functionName);
cuLinkDestroy(state);

Compilation line:
nvcc -cubin -rdc=true -I=/home/agibsonccc/code/nd4j/nd4j-jcublas-parent/nd4j-jcublas-7.5/src/main/resources/org/nd4j/nd4j-kernels/src/transforms/include -I=/home/agibsonccc/code/nd4j/nd4j-jcublas-parent/nd4j-jcublas-7.5/src/main/resources/org/nd4j/nd4j-kernels/src/transforms/../common -arch=sm_52 /home/agibsonccc/code/nd4j/nd4j-jcublas-parent/nd4j-jcublas-7.5/src/main/resources/org/nd4j/nd4j-kernels/src/transforms/abs/abs_strided.cu -lcudadevrt -lcudart -o /home/agibsonccc/code/nd4j/nd4j-jcublas-parent/nd4j-jcublas-7.5/src/main/resources/org/nd4j/nd4j-kernels/src/transforms/../../output/abs_strided.cubin

Getting:
CUDA_ERROR_NO_BINARY_FOR_GPU on attempting to add the file

Edit:
With the compilation line below (adding --device-link - I noticed he had that in his comment and figured I’d try that as well):
nvcc -cubin -rdc=true -I=/home/agibsonccc/code/nd4j/nd4j-jcublas-parent/nd4j-jcublas-7.5/src/main/resources/org/nd4j/nd4j-kernels/src/pairwise_transforms/include -I=/home/agibsonccc/code/nd4j/nd4j-jcublas-parent/nd4j-jcublas-7.5/src/main/resources/org/nd4j/nd4j-kernels/src/pairwise_transforms/../common -arch=sm_52 /home/agibsonccc/code/nd4j/nd4j-jcublas-parent/nd4j-jcublas-7.5/src/main/resources/org/nd4j/nd4j-kernels/src/pairwise_transforms/add/add_strided.cu --device-link -lcudadevrt -lcudart -o /home/agibsonccc/code/nd4j/nd4j-jcublas-parent/nd4j-jcublas-7.5/src/main/resources/org/nd4j/nd4j-kernels/src/pairwise_transforms/../../output/add_strided.cubin
nvlink error : Undefined reference to '_Z2opIdET_S0_PS0_' in '/tmp/tmpxft_0000602c_00000000-16_add_strided.o'
nvlink error : Undefined reference to '_Z2opIdET_S0_S0_PS0_' in '/tmp/tmpxft_0000602c_00000000-16_add_strided.o'
nvlink error : Undefined reference to '_Z2opIfET_S0_PS0_' in '/tmp/tmpxft_0000602c_00000000-16_add_strided.o'
nvlink error : Undefined reference to '_Z2opIfET_S0_S0_PS0_' in '/tmp/tmpxft_0000602c_00000000-16_add_strided.o'
Makefile:18: recipe for target 'add' failed
make[1]: *** [add] Error 255
make[1]: Leaving directory '/home/agibsonccc/code/nd4j/nd4j-jcublas-parent/nd4j-jcublas-7.5/src/main/resources/org/nd4j/nd4j-kernels/src/pairwise_transforms'
Makefile:12: recipe for target 'pairwise_transforms' failed
make: *** [pairwise_transforms] Error 2

OK, I did not really understand the last post (did you create an extended version of the KernelLauncher?), or which point of the process each piece of information refers to, but I’ll give it a try ASAP, and see whether I can JIT-link an example without and with an external library.

Let’s take our focus off of the external stuff :). That’s distracting from the actual question. Here’s the code without the kernel launcher.

I think my main confusion may be coming from where you use the JIT compiler options versus what you compile on the command line.

JITOptions jitOptions = new JITOptions();
CUlinkState state = new CUlinkState();
cuLinkCreate(jitOptions, state);
//tried with AND without this
JCudaDriver.cuLinkAddFile(state, CUjitInputType.CU_JIT_INPUT_LIBRARY, "/usr/local/cuda/lib64/libcudadevrt.a", jitOptions);
cuLinkAddFile(state, CUjitInputType.CU_JIT_INPUT_CUBIN, moduleFileName, jitOptions);

long sz[] = new long[1];
Pointer image = new Pointer();
cuLinkComplete(state, image, sz);
CUmodule module = new CUmodule();
cuModuleLoadDataEx(module, image, 0, new int[0], Pointer.to(new int[0]));
cuLinkDestroy(state);

Here are the cuda kernels:

pairwise_transform.h

#include <math.h>

// x[i] and y[i]
template <typename T>
__device__ T op(T d1,T d2,T *params);
template <typename T>
__device__ T op(T d1,T *params);

template <typename T>
__device__ void transform(int n,int xOffset,int yOffset, T *dx, T *dy,int incx,int incy,T *params,T *result,int incz,int blockSize) {

	int totalThreads = gridDim.x * blockDim.x;
	int tid = threadIdx.x;
	int i = blockIdx.x * blockDim.x + tid;

	if (incy == 0) {
		if ((blockIdx.x == 0) && (tid == 0)) {
			for (; i < n; i++) {
				result[i * incz] = op(dx[i * incx],params);
			}

		}
	} else if ((incx == incy) && (incx > 0)) {
		/* equal, positive, increments */
		if (incx == 1) {
			/* both increments equal to 1 */
			for (; i < n; i += totalThreads) {
				result[i * incz] = op(dx[i],dy[i],params);
			}
		} else {
			/* equal, positive, non-unit increments. */
			for (; i < n; i += totalThreads) {
				result[i * incz] = op(dx[i * incx],dy[i * incy],params);
			}
		}
	} else {
		/* unequal or nonpositive increments */
		for (; i < n; i += totalThreads) {
			result[i * incz] = op(dx[i * incx],dy[i * incy],params);
		}
	}
}

add_strided.cu

#include <pairwise_transform.h>

__device__ double op(double d1,double d2,double *params) {
   return d1 + d2;
}
__device__ double op(double d1,double *params) {
   return d1;
}


__device__ float op(float d1,float d2,float *params) {
   return d1 + d2;
}
__device__ float op(float d1,float *params) {
   return d1;
}


__global__ void add_strided_double(int n,int xOffset,int yOffset, double *dx, double *dy,int incx,int incy,double *params,double *result,int incz,int blockSize) {
    transform<double>(n,xOffset,yOffset,dx,dy,incx,incy,params,result,incz,blockSize);
}


__global__ void add_strided_float(int n,int xOffset,int yOffset, float *dx, float *dy,int incx,int incy,float *params,float *result,int incz,int blockSize) {
    transform<float>(n,xOffset,yOffset,dx,dy,incx,incy,params,result,incz,blockSize);
}

I’m compiling it with:

nvcc -O3 -cubin -rdc=true -gencode arch=compute_52,code=sm_52 -dc add_strided.cu -lcudadevrt -lcudart -o add_strided.cubin

An update on this:

The architecture was causing the addLink to fail. Now I’m getting CUDA_ERROR_UNKNOWN, very similar to the other thread.
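In case it helps others who hit CUDA_ERROR_NO_BINARY_FOR_GPU: a small sketch for double-checking which compute capability the device actually reports (standard JCuda driver API calls; assumes a device obtained via cuDeviceGet):

// Query the compute capability, to verify that it matches the
// -arch used for compilation (e.g. 5.0 for -arch=sm_50)
int major[] = { 0 };
int minor[] = { 0 };
cuDeviceGetAttribute(major,
    CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, device);
cuDeviceGetAttribute(minor,
    CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, device);
System.out.println("Device compute capability: " + major[0] + "." + minor[0]);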

Sample line here:
cd tanh && nvcc -O4 -ptx -rdc=true -I=/tmp/nd4j-kernels/src/transforms/include -I=/tmp/nd4j-kernels/src/transforms/../common -arch=sm_50 /tmp/nd4j-kernels/src/transforms/tanh/tanh_strided.cu -lcudadevrt -lcudart -o /tmp/nd4j-kernels/src/transforms/../../output/tanh_strided.ptx

I’m following the steps described above as:

JITOptions jitOptions = new JITOptions();
CUlinkState state = new CUlinkState();
cuLinkCreate(jitOptions, state);
cuLinkAddFile(state, CUjitInputType.CU_JIT_INPUT_PTX, moduleFileName, jitOptions);
// byte[] content = ByteStreams.toByteArray(new FileInputStream(moduleFileName));
long sz[] = new long[1];
Pointer image = new Pointer();
cuLinkComplete(state, image, sz);

// load module…

cuLinkDestroy(state);

A short “progress report”: I tried the JIT linking against a CUBIN, and am also receiving a CUDA_ERROR_UNKNOWN during cuLinkComplete - but only when I try to call CUBLAS functions in the kernel (I did this for a basic test of the kernel API and library linking). But with an empty dummy kernel, it works properly - so I guess there is something wrong with some of the linker settings or the general setup.

BTW: Is the main focus of all this on dynamic parallelism, or on JIT-linking against external libraries? (When trying to create samples/tests, I’d like to keep them short and focussed, if possible…)

Dynamic parallelism.

I’m able to reproduce the problem in cuda c here:

Maybe that helps?

*** Edit ***

So a few more notes. I’ve contacted the author of the stack overflow post. The solution ended up NOT being the device-level linker, but the second approach.

I’ve also learned the problem is on my side with templates (apparently the header only declares the templated op, while the .cu file defines plain non-template overloads, so the instantiations that transform<T> needs are never generated - hence the nvlink errors above). I’ll update one last time when I get it working. Thanks again for the patience!

Thanks for this hint - CUDA C is usually the “ground truth”, and when there is a problem, I usually compare the behavior of JCuda to CUDA C to rule out (or confirm) errors in JCuda. But the JIT part is a bit delicate, and I assume that there still may be bugs…

But… which “second solution” do you refer to?

In any case, I’ll try to assemble some examples, at least one for JIT+PTX and one for linking against a library (like for the CUBLAS Kernel API).

If you have any further hints or insights (or find out that (or why) certain things plainly do not work, even in CUDA C), I’d be happy to hear about that.

Sorry. I’ll clear this up:
I mean this for the second solution. Basically, he had tried 2 things.
JCudaDriver.cuModuleLoadData(module, ByteStreams.toByteArray(KernelTreeExecutor.class.getResource("/rnn.cubin").openStream()));

and the other one was the normal cuLinkAddFile/finalize approach.

I now have it working the same way. It wasn’t via the cuda dynamic linker, though; I ended up using the same technique.

The cuda kernels can be found here:
https://github.com/deeplearning4j/nd4j-kernels

I’m using this snippet for initModule:

/**
 * Initialize the module for this KernelLauncher from the given
 * PTX- or CUBIN module data.
 *
 * @param moduleData The data from the PTX- or CUBIN file
 */
private void initModule(byte moduleData[])
{
    module = new CUmodule();
    cuModuleLoadData(module, moduleData);
}

The module itself is just the byte array loaded from the file that was compiled in the given project.
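For reference, the same can be done without the Guava ByteStreams dependency (a minimal sketch; the file name is just a placeholder):

// Read the compiled CUBIN (or PTX) file into a byte array
// and load it directly as a module
byte moduleData[] = java.nio.file.Files.readAllBytes(
    java.nio.file.Paths.get("add_strided.cubin"));
CUmodule module = new CUmodule();
cuModuleLoadData(module, moduleData);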

Happy to answer other questions when you’re building your sample. I’m sure other users will want to use dynamic parallelism. Due to the way jcuda works, I’m sure you will end up with a disproportionate number of questions about this.

@agibsonccc I started with creating a basic sample that uses dynamic parallelism. It uses a CUBIN file and does not involve any runtime linking, but maybe it’s a start.

JCudaDynamicParallelismKernel.cu (with some compilation instructions … these are subtle, indeed - I also fiddled around with some CUDA_ERROR_NOT_FOUND’s etc…)

#include <stdio.h>

// A simple example of using dynamic parallelism. This kernel can
// be compiled into an object file by calling
//
//     nvcc -dc -arch=sm_52 JCudaDynamicParallelismKernel.cu -o JCudaDynamicParallelismKernel.o
// 
// The resulting object file can be linked into a CUBIN file with
// 
//     nvcc -dlink -arch=sm_52 -cubin JCudaDynamicParallelismKernel.o -o JCudaDynamicParallelismKernel.cubin
// 
// Alternatively, both steps can be taken at once, by calling
// 
//     nvcc -dlink -arch=sm_52 -cubin -c JCudaDynamicParallelismKernel.cu -o JCudaDynamicParallelismKernel.cubin
// 
// The architecture (here, sm_52) must match the architecture of
// the target device. 

extern "C"
__global__ void childKernel(unsigned int parentThreadIndex, float* data)
{
    printf("Parent thread index: %d, child thread index: %d
", 
        parentThreadIndex, threadIdx.x);    
    data[threadIdx.x] = parentThreadIndex + 0.1f * threadIdx.x;
}

extern "C"
__global__ void parentKernel(unsigned int size, float *data)
{
    childKernel<<<1, 8>>>(threadIdx.x, data + threadIdx.x * 8);
    cudaDeviceSynchronize();
    __syncthreads();
}

Main/Test class:

import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuCtxSynchronize;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemFree;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoH;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoad;

import java.util.Arrays;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;

/**
 * A simple example showing how a kernel with dynamic parallelism 
 * can be loaded from a CUBIN file and launched.
 */
public class JCudaDynamicParallelism
{
    public static void main(String[] args)
    {
        JCudaDriver.setExceptionsEnabled(true);
        
        // Initialize a context for the first device
        cuInit(0);
        CUcontext context = new CUcontext();
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        cuCtxCreate(context, 0, device);
        
        // Load the precompiled CUBIN file 
        // (See JCudaDynamicParallelismKernel.cu for details)
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "JCudaDynamicParallelismKernel.cubin");
        
        // Obtain a function pointer to the "parentKernel" function.
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "parentKernel");

        // Define the nesting structure (note that the number
        // of child threads must match the value that is used
        // in the kernel)
        int numParentThreads = 8;
        int numChildThreads = 8;

        // Allocate the device data that will be filled by the kernel
        int numElements = numParentThreads * numChildThreads;
        CUdeviceptr deviceData = new CUdeviceptr();
        cuMemAlloc(deviceData, numElements * Sizeof.FLOAT);

        // Set up the kernel parameters: A pointer to an array
        // of pointers which point to the actual values.
        Pointer kernelParameters = Pointer.to(
            Pointer.to(new int[] { numElements }),
            Pointer.to(deviceData)
        );

        // Call the kernel function.
        int blockSizeX = numParentThreads;
        int gridSizeX = (numParentThreads + blockSizeX - 1) / blockSizeX;
        cuLaunchKernel(function,
            gridSizeX,  1, 1,      // Grid dimension
            blockSizeX, 1, 1,      // Block dimension
            0, null,               // Shared memory size and stream
            kernelParameters, null // Kernel- and extra parameters
        );
        cuCtxSynchronize();
        
        // Copy the device data to the host
        float hostData[] = new float[numElements];
        for(int i = 0; i < numElements; i++)
        {
            hostData[i] = i;
        }
        cuMemcpyDtoH(Pointer.to(hostData), 
            deviceData, numElements * Sizeof.FLOAT);
        
        // Compare the host data with the expected values
        float hostDataRef[] = new float[numElements];
        for(int i = 0; i < numParentThreads; i++)
        {
            for (int j=0; j < numChildThreads; j++)
            {
                hostDataRef[i * numChildThreads + j] = i + 0.1f * j;
            }
        }
        System.out.println("Result: "+Arrays.toString(hostData));
        boolean passed = Arrays.equals(hostData, hostDataRef);
        System.out.println(passed ? "PASSED" : "FAILED");

        // Clean up.
        cuMemFree(deviceData);
    }
}

If you think that this makes sense, I’d put it on the Samples page of the website.

There are several possible next steps:

  • Trying runtime linking
  • Trying to use PTX
  • Using the kernel API of a library like CUBLAS

However, I’m not sure when I will find the time to tackle each of them. I’d also like to continue with the attempts of mavenizing JCuda, put the samples and JCuda utilities on GitHub as well, and (off-topic?) create JOCL bindings for clBLAS, and … continue working on a few dozen other (smaller, not-directly-GPU-related) private projects.
BTW: The code snippet that you posted looked like a “generalization” of jcuda-vec. I like the idea of having one general kernel and “inlining” arbitrary operations, like a “parallel forall” (somehow like in thrust), but I guess that’s where the dynamic (runtime) linking is indeed required.

@Marco13

This looks good for the basic samples on the site. I’m not sure ptx is needed. I would just try to steer people in a certain direction, and if ptx is needed for some reason, get the “why”. You can’t satisfy everyone ;).

The cublas one might be interesting, but we can already use it directly. If they’re doing stuff that fancy, they’re likely going to be doing it in C, I imagine. The main use case I had was launching several kernels for varying strided operations (e.g. row-wise sums / oblong tensor dimension-wise operations).

I was originally computing the strides on the java side and then sending the information to the card, but that got expensive. I have since moved the striding operations over to the cuda side and now run them in each child kernel.

Getting back to my point about cublas: if I were running cublas operations that needed to be in the kernel, I would write it in C.

That might be just me here though.

Re: Maven: I’ve been deploying jcuda to maven central for nd4j for a while now. What I’ve found works in practice when deploying native deps is that precompiled binaries are usually good enough for most people. We sometimes see issues with linking against the right glibc, though.

Re: jcudavec. I have to have this op architecture. https://github.com/deeplearning4j/nd4j/blob/f39f670a4e369d9e8e32b38105921e0d5de8ec84/nd4j-jcublas-parent/nd4j-jcublas-7.5/src/main/java/org/nd4j/linalg/jcublas/kernel/KernelFunctions.java#L110-110 This is just a part of the abstraction etc. It would also allow people to write their own cuda kernels/operations. Happy to open up a separate thread on this as well.

@agibsonccc

The first samples usually used CUBINs (and in fact, some of them still do - maybe I can clean them up during the transition of the samples to GitHub). Later, I primarily used PTX, because it is far more flexible in terms of the target architecture. And in the spirit of “Write once, run everywhere”, I’d like to add examples that are as versatile as possible in this regard (but this means that I have to dig deeper into the JIT and linking topic…)
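To illustrate what this means in its simplest form (a sketch - the file and kernel names are placeholders, and this only applies to self-contained kernels that do not need device-side linking):

// Compile once, without committing to a particular GPU architecture:
//
//     nvcc -ptx MyKernel.cu -o MyKernel.ptx
//
// At load time, the driver JIT-compiles the PTX for whatever
// device is actually present:
CUmodule module = new CUmodule();
cuModuleLoad(module, "MyKernel.ptx");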

The cublas one might be interesting, but we can already use it directly. If they’re doing stuff that fancy, they’re likely going to be doing it in C, I imagine. The main use case I had was launching several kernels for varying strided operations (e.g. row-wise sums / oblong tensor dimension-wise operations).

This should mainly serve as an example, showing how to use any kernel-API library. Of course, most uses of the CUBLAS Kernel API can be “emulated” with multiple kernel launches, but … I guess there is a reason why NVIDIA introduced this recently: If one (of many) operations in a kernel is a dot product or matrix multiplication, it’s nice to have the option to simply call it, and not have to write it on your own, or do some ping-pong between host and device code, with the potential overhead of multiple kernel launches. The option to write everything in C is not applicable, and … somehow defeats the purpose of JCuda itself :wink:

Re: Maven: I’ve been deploying jcuda to maven central for nd4j for a while now. What I’ve found works in practice when deploying native deps is that precompiled binaries are usually good enough for most people. We sometimes see issues with linking against the right glibc, though.

Thanks for this hint. There is also the mavenized-jcuda project on GitHub (MysterionRise/mavenized-jcuda), so there are some examples. However, deploying natives to Maven always seems to be a bit of a hassle. I’m considering several options now, but did not proceed with the “mavenization” in the last few days…

Re: jcudavec. I have to have this op architecture. https://github.com/deeplearning4j/nd4j/blob/f39f670a4e369d9e8e32b38105921e0d5de8ec84/nd4j-jcublas-parent/nd4j-jcublas-7.5/src/main/java/org/nd4j/linalg/jcublas/kernel/KernelFunctions.java#L110-110 This is just a part of the abstraction etc. It would also allow people to write their own cuda kernels/operations. Happy to open up a separate thread on this as well.

I’m occasionally thinking about possible generalizations. In the end, one could even detect kernel signatures and automatically generate Dynamic Proxies for calling them. I haven’t studied in detail how far you went with the kernel calling infrastructure in nd4j, but from what I have seen so far (through a few glances at the code), it looks far more sophisticated than everything that I could do in my spare time…
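Just to sketch that thought (this is completely made up - only the kernel name is taken from the nd4j kernels above, and a real version would have to derive the launch configuration and the parameter marshalling from the kernel signature):

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import jcuda.Pointer;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import static jcuda.driver.JCudaDriver.*;

// An interface whose method names match the kernel names in a module...
interface Kernels
{
    void add_strided_float(int gridSizeX, int blockSizeX, Pointer kernelParameters);
}

// ...and a factory that creates a proxy forwarding each call to cuLaunchKernel
class KernelProxies
{
    static Kernels create(final CUmodule module)
    {
        InvocationHandler handler = new InvocationHandler()
        {
            @Override
            public Object invoke(Object proxy, Method method, Object[] args)
            {
                // Look up the kernel that has the same name as the called method
                CUfunction function = new CUfunction();
                cuModuleGetFunction(function, module, method.getName());
                cuLaunchKernel(function,
                    (Integer)args[0], 1, 1,  // Grid dimension
                    (Integer)args[1], 1, 1,  // Block dimension
                    0, null,                 // Shared memory size and stream
                    (Pointer)args[2], null); // Kernel- and extra parameters
                return null;
            }
        };
        return (Kernels)Proxy.newProxyInstance(
            Kernels.class.getClassLoader(),
            new Class[] { Kernels.class }, handler);
    }
}

A call like kernels.add_strided_float(gridSizeX, blockSizeX, parameters) would then go directly to the kernel of the same name.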