Trying to use dynamic parallelism

Hi, thanks for JCuda 5.5, my program works well again!

Now I am attempting to use dynamic parallelism, but it does not work in JCuda on an sm_35 GPU.
I successfully compiled .ptx file from .cu with sm_35 option, but cuModuleLoad(module, ptxFileName); fails (jcuda.CudaException: CUDA_ERROR_NO_BINARY_FOR_GPU).

Standard code without dynamic parallelism looks similar to this:

```
extern "C" __global__ void parent_func(int *data)
{
    int thread = threadIdx.x;
    data[thread] = data[thread] + 1;
}
```

Dynamic parallelism requires code in the kernel (.cu) like this:

```
extern "C" __global__ void child_func(int *data)
{
    ...
}

extern "C" __global__ void parent_func(int *data)
{
    child_func<<<1,1>>>(data);
}
```

And the call `child_func<<<1,1>>>(data);` in the .cu file fails (it compiles successfully, but does not work). Maybe the `<<<>>>` launch syntax can't be used in a .cu file for JCuda?


The `kernel<<<...>>>` call syntax cannot be used in Java, but the .cu file is compiled with NVCC to generate a PTX file, and from that point on, CUDA should not even "know" that it eventually received this data from Java: it is just passed to the CUDA library as-is.

Unfortunately, I don’t have a CC 3.5 GPU (not even at my workplace, where we at least have some newer ones), so I cannot run any tests for this at the moment.

Googling the error message did not bring any really helpful insights. So some guessing:

  • You are compiling the CU file into a PTX file with sm_35 and this works. Can you post the resulting PTX file (for the small example that you posted)?
  • Are you sure that the context which you created is for the right device (namely the one that has CC 3.5)? If there are multiple devices available, you might have picked the wrong one (at least, according to my interpretation of the error message).
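Regarding the first question: since PTX is plain text, it is easy to verify what the compiler actually produced by looking at the `.target` directive in the PTX file. A small, hypothetical helper sketch (this is not part of JCuda, just an illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PtxTargetCheck
{
    /**
     * Extracts the architecture from the ".target" directive of
     * the given PTX source (e.g. "sm_35"), or null if none is found.
     */
    static String ptxTarget(String ptxSource)
    {
        Matcher m = Pattern
            .compile("^\\s*\\.target\\s+(\\w+)", Pattern.MULTILINE)
            .matcher(ptxSource);
        return m.find() ? : null;
    }

    public static void main(String[] args)
    {
        // A typical PTX header, as emitted by NVCC:
        String ptx = ".version 3.2\n.target sm_35\n.address_size 32\n";
        System.out.println(ptxTarget(ptx)); // prints "sm_35"
    }
}
```

If this does not report the architecture that you expect, the problem is already on the compilation side, not in the module loading.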

I tried to find a plain CUDA C example where dynamic parallelism is used with the driver API, but there seems to be none. (One important step (for me) is always to check whether the same error happens with plain CUDA C, to see whether the problem is related to JCuda or whether it’s a general CUDA problem :wink: )

Maybe some more information can be obtained from the info or error log buffer that can be queried when loading the PTX file with cuModuleLoadDataJIT. (This is one of the few places where I felt that I had to use a workaround for the weird pointer handling, and I’m not sure whether the current solution is appropriate, but the functionality of the underlying method, cuModuleLoadDataEx, could be helpful here.) If necessary, I can create a simple example on Monday, showing how the JIT functionality may be used.

This .ptx file was compiled from

```
extern "C" __global__ void parent_func(int *data)
```

with the command **nvcc -m32 -ptx -arch sm_35 -dc -o filename.ptx**

So **cuModuleLoad(module, ptxFileName);** fails. That’s why I suspect that the problem is in JCuda.

Thank you!

OK, I can possibly upload a sample that loads a PTX file with cuModuleLoadDataJIT (today or tomorrow), maybe the logs will contain more information. Otherwise, I’ll try to create a similar test that tries to load the same PTX file using plain CUDA C, to see whether the same error appears there.


I just tried to load the PTX file with this program

```
package tests;


import jcuda.Pointer;
import jcuda.driver.*;

public class JITModuleLoadTest
    public static void main(String args[]) throws IOException
        String ptxFileName = "JCudaDynamicParallelismKernel.ptx";

        // Initialize the driver and create a context for the first device.
        CUcontext pctx = new CUcontext();
        CUdevice dev = new CUdevice();
        JCudaDriver.cuDeviceGet(dev, 0);
        JCudaDriver.cuCtxCreate(pctx, 0, dev);

        // Load the PTX file, with buffers for the JIT info and error logs.
        CUmodule module = new CUmodule();
        byte ptxData[] = toByteArray(new FileInputStream(ptxFileName));
        JITOptions jitOptions = new JITOptions();
        jitOptions.putInt(CUjit_option.CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES, 5000);
        jitOptions.putBytes(CUjit_option.CU_JIT_ERROR_LOG_BUFFER, new byte[5000]);
        jitOptions.putInt(CUjit_option.CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES, 5000);
        jitOptions.putBytes(CUjit_option.CU_JIT_INFO_LOG_BUFFER, new byte[5000]);
        jitOptions.putInt(CUjit_option.CU_JIT_TARGET, CUjit_target.CU_TARGET_COMPUTE_35);
        JCudaDriver.cuModuleLoadDataJIT(module,, jitOptions);

    /**
     * Fully reads the given InputStream and returns it as a byte array.
     *
     * @param inputStream The input stream to read
     * @return The byte array containing the data from the input stream
     * @throws IOException If an I/O error occurs
     */
    private static byte[] toByteArray(InputStream inputStream) throws IOException
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte buffer[] = new byte[8192];
        while (true)
            int read =;
            if (read == -1)
            baos.write(buffer, 0, read);
        return baos.toByteArray();
```

but it reports an error

    CU_JIT_ERROR_LOG_BUFFER=ptxas : fatal error : Unresolved extern function 'cudaGetParameterBuffer'

So I’m not sure how to circumvent this. One guess is that it MIGHT work when you compile it into a CUBIN file, because this should be the whole binary. It should be worth a try, at least…
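By the way, one way to double-check what kind of file is actually being passed to the `cuModuleLoad*` functions: a CUBIN is an ELF binary (it starts with the magic bytes `0x7F 'E' 'L' 'F'`), whereas PTX is plain text. A small, hypothetical helper sketch (again, not part of JCuda):

```java
public class ModuleImageCheck
{
    /**
     * Returns true if the given module data looks like a CUBIN
     * (CUBINs are ELF binaries, starting with 0x7F 'E' 'L' 'F'),
     * and false if it is presumably PTX, which is plain text.
     */
    static boolean isCubin(byte data[])
    {
        return data.length >= 4
            && data[0] == 0x7F && data[1] == 'E'
            && data[2] == 'L' && data[3] == 'F';
    }

    public static void main(String[] args)
    {
        // The first bytes of an ELF header, as found in a CUBIN file:
        byte elf[] = { 0x7F, 'E', 'L', 'F', 2, 1, 1 };
        System.out.println(isCubin(elf)); // prints "true"

        // The beginning of a typical PTX file:
        System.out.println(isCubin("//\n.version 3.2".getBytes())); // prints "false"
    }
}
```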


Hi, I compiled the .cu into a .cubin, but I get the same error (jcuda.CudaException: CUDA_ERROR_NO_BINARY_FOR_GPU) in `kernelLauncher = KernelLauncher.create(folder+"", "func", "");` when I use `<<<>>>`. Maybe there is some detail that I am getting wrong.
It would just be interesting to use dynamic parallelism, so I will wait for working examples from you in the future.

Hm… This is not really a satisfactory result. I’ll try to build an example with CUDA-C, but I have not yet seen an example that uses dynamic parallelism with the driver API…

Hi, can you build an example with dynamic parallelism (DP) from the CUDA 5.5 samples?
For example, cdpSimpleQuicksort (…\NVIDIA Corporation\CUDA Samples\v5.5\0_Simple\cdpSimpleQuicksort)
List of DP samples in the CUDA Toolkit 5.5 samples: cdpSimplePrint, cdpSimpleQuickSort, cdpAdvancedQuickSort, cdpBezierTesselation, cdpLUDecomposition, cdpQuadTree.
Thank you!

Sorry, this thread seems to have gotten lost :o I have sent myself a reminder, and I’ll try to do this on Monday. But still: I don’t have a device with CC 3.5, so I will hardly be able to test this at all…

OK, I tried to create a basic test for dynamic parallelism with the driver API. This may be a bit of a hassle, for several reasons, but … however:

I created a backup of the ‘cdpSimplePrint’ sample project, and just built a test based on this project. I changed the ‘’ main file to

```
#include <iostream>
#include <cstring>
#include <cstdio>
#include <cstdlib>
#include <cuda.h>
#include <builtin_types.h>

using namespace std;

CUdevice device;
CUcontext context;
CUmodule module;
CUfunction kernelFunction;

int main(int argc, char **argv)
{
    int error = 0;
    int max_depth = 2;
    int devID = 0;

    error = cuInit(0);
    if (error != 0) { printf("Error %d\n", error); return error; }

    error = cuDeviceGet(&device, devID);
    if (error != 0) { printf("Error %d\n", error); return error; }

    error = cuCtxCreate(&context, 0, device);
    if (error != 0) { printf("Error %d\n", error); return error; }

    string module_path = "SimplePrintKernel.cubin";
    error = cuModuleLoad(&module, module_path.c_str());
    if (error != 0) { printf("Error %d\n", error); return error; }

    error = cuModuleGetFunction(&kernelFunction, module, "cdp_kernel");
    if (error != 0) { printf("Error %d\n", error); return error; }

    int threadsPerBlock = 2;
    int blocksPerGrid   = 2;
    int depth = 0;
    int thread = 0;
    int parent_uid = 0;
    void *args[] = { &max_depth, &depth, &thread, &parent_uid };

    // Launch the kernel from the CPU.
    printf("Launching cdp_kernel() with CUDA Dynamic Parallelism:\n\n");

    error = cuLaunchKernel(kernelFunction,  blocksPerGrid, 1, 1,
                           threadsPerBlock, 1, 1,
                           0, NULL, args, NULL);
    if (error != 0) { printf("Error %d\n", error); return error; }

    error = cuCtxSynchronize();
    if (error != 0) { printf("Error %d\n", error); return error; }

    return 0;
}
```

(Yes, it is a VERY basic test…)

It tries to load a ‘SimplePrintKernel.cubin’ from the same directory. This CUBIN was created from this source code:

```
#include <stdio.h>

// Variable on the GPU used to generate unique identifiers of blocks.
__device__ int g_uids = 0;

// Print a simple message to signal the block which is currently executing.
__device__ void print_info(int depth, int thread, int uid, int parent_uid)
{
    if (threadIdx.x == 0)
    {
        if (depth == 0)
        {
            printf("BLOCK %d launched by the host\n", uid);
        }
        else
        {
            char buffer[32];

            for (int i = 0 ; i < depth ; ++i)
            {
                buffer[3*i+0] = '|';
                buffer[3*i+1] = ' ';
                buffer[3*i+2] = ' ';
            }
            buffer[3*depth] = '\0';
            printf("%sBLOCK %d launched by thread %d of block %d\n",
                   buffer, uid, thread, parent_uid);
        }
    }
}

// The kernel using CUDA dynamic parallelism.
// It generates a unique identifier for each block. Prints the information
// about that block. Finally, if the 'max_depth' has not been reached, the
// block launches new blocks directly from the GPU.
extern "C"
__global__ void cdp_kernel(int max_depth, int depth, int thread, int parent_uid)
{
    // We create a unique ID per block. Thread 0 does that and shares the value with the other threads.
    __shared__ int s_uid;

    printf("Thread %d\n", threadIdx.x);

    if (threadIdx.x == 0)
    {
        s_uid = atomicAdd(&g_uids, 1);
    }
    __syncthreads();

    // We print the ID of the block and information about its parent.
    print_info(depth, thread, s_uid, parent_uid);

    // We launch new blocks if we haven't reached the max_depth yet.
    if (++depth >= max_depth)
    {
        return;
    }
    cdp_kernel<<<gridDim.x, blockDim.x>>>(max_depth, depth, threadIdx.x, s_uid);
}
```

The command line for compiling the kernel into a CUBIN should be

**nvcc -m64 -cubin -arch sm_35 -lcudadevrt**

Of course, this is for a CC 3.5 device. For my test, I had to remove the recursive launch and compile it for CC 2.0. But I think that it COULD basically work to use dynamic parallelism like this, even from the driver API.

There are some sections about the compilation process in the CUDA and NVCC documentation ( , …), but admittedly, I have not (yet) read them in detail (it’s quite a lot, and there are probably only a few lines really relevant for this issue…).

However, if the above works, I think that with the same CUBIN file, it should work exactly the same way in JCuda.

Any information about whether it works (either in CUDA-C or in JCuda/Java, or both) would be helpful…

Unfortunately, I got the same error :(

Even with the CUDA-C version? Then I have to ask again: Do you have multiple devices, and may the wrong one be chosen? Can you run the original DP samples as-they-are?

I ran the original cdpSimplePrint from the ‘CUDA samples browser’, and it seems to work (see the attached image). Yes, I have only 1 GPU.

And the test program prints an error 209 (i.e. CUDA_ERROR_NO_BINARY_FOR_GPU) after
error = cuModuleLoad(&module, module_path.c_str());

After `KernelLauncher kernelLauncher = KernelLauncher.create("t_cuda//p_modelsCuda//", "cdp_kernel", "");`
I get an error in `checkResult(cuModuleLoadDataEx(module,, 0, new int[0], int[0])));`:
CUDA_ERROR_NO_BINARY_FOR_GPU at jcuda.utils.KernelLauncher.checkResult(

Another example - the same error in `cuModuleLoad(module, "t_cuda//p_modelsCuda//testold.ptx");`:
CUDA_ERROR_NO_BINARY_FOR_GPU at jcuda.driver.JCudaDriver.checkResult(
at jcuda.driver.JCudaDriver.cuModuleLoad(
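For reference, raw driver API error codes like the 209 above can be mapped back to their names. A minimal sketch, with the values taken from the CUresult enumeration in cuda.h (only a few codes relevant to module loading are listed here):

```java
public class CuResultNames
{
    /**
     * Returns the name for a few CUresult error codes;
     * the values are taken from the CUresult enum in cuda.h.
     */
    static String cuResultName(int code)
    {
        switch (code)
        {
            case 0:   return "CUDA_SUCCESS";
            case 200: return "CUDA_ERROR_INVALID_IMAGE";
            case 209: return "CUDA_ERROR_NO_BINARY_FOR_GPU";
            case 300: return "CUDA_ERROR_INVALID_SOURCE";
            case 301: return "CUDA_ERROR_FILE_NOT_FOUND";
            default:  return "UNKNOWN(" + code + ")";
        }
    }

    public static void main(String[] args)
    {
        System.out.println(cuResultName(209)); // prints "CUDA_ERROR_NO_BINARY_FOR_GPU"
    }
}
```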

It seems like in both cases you are using PTX files. As described above, it might be the case that this will work only with CUBIN files. Did you have the chance to run a test with the CUBIN file for the “” that I mentioned?

(BTW: A while ago, I considered using CUBIN files in the KernelLauncher instead of PTX. (CUBINs are also necessary for producing debug information). If dynamic parallelism only works with CUBINs, that could be another reason to change the KernelLauncher like this, but that’s not entirely clear at the moment…)

In some cases, your example launches after compiling to .ptx, although not in its original state, and it seems to work incorrectly. But I can’t understand how it works.
My own examples, based on examples from the internet, lead to the errors presented before.
So, in time, if I understand better how DP works, I will share some examples.
Thank you for help!

So for me, it’s still not clear which version works and which one does not, or whether anything that works in CUDA does or does not work in JCuda, but… maybe I’ll never find out.