JCuda driver API vs CUDA driver API

Hello! Why doesn’t JCuda provide the runtime API for CUDA itself (but only for libraries such as CUBLAS)? In the example code “JCuda driver API” a CUBIN file is loaded, whereas in the CUDA C Programming Guide 3.2 (Driver API) a PTX file is used. What determines this difference?


I’m not sure what you mean by support of the CUDA Runtime API. All the functions from the CUDA Runtime API are offered by JCuda. It can be used to interact with the runtime libraries CUBLAS, CUFFT, CUDPP, CURAND and CUSPARSE.

The main “limitation” of the JCuda version of the Runtime API compared to the “real” CUDA Runtime API in C is that you can not call your own kernels using the <<<…>>> syntax in JCuda. That’s basically because CUDA is not a library, but a programming language - it’s 99% equal to C, but it’s still a programming language of its own, with its own compiler. You simply can not use the <<<…>>> syntax in Java…

So if you only want to use the runtime libraries, you can do the memory management etc. with the JCuda Runtime API, and execute the runtime library functions, e.g. those of JCublas. If you want to write your own kernels, you have to use the Driver API. Fortunately, starting with CUDA 3.0, both APIs are interoperable, meaning that you can use the runtime libraries and apply your own kernels to the same data.

BTW: I already mentioned in a previous post that the differentiation between the Runtime and Driver API seems rather artificial to me. I’m not sure why NVIDIA decided to implement it that way. Both APIs are essentially equal, except for some renamings, and the possibility to launch your own kernels via the <<<…>>> calls with the Runtime API, or by loading modules with the Driver API - but I’m sure they had their reasons to do so…

The difference between CUBIN and PTX can (in a simplified form, don’t pin me down to this…) be described roughly like this:
The CUBIN file contains the precompiled, binary code. This format is specific to a certain architecture (meaning that it is specific to one generation of GPUs, e.g. for GeForce cards, or for Fermi cards…)
The PTX file is something like “CUDA assembler”. It is a more general description, and can be translated into a CUBIN file at runtime.

In fact, using a PTX file would be more flexible than using a CUBIN file, and I intend to update the samples to use PTX files in the future. In this case, the “Just-In-Time compilation” functions of the Driver API have to be used. I’ll try to do this step by step, after I have finished the update to CUDA 4.0.


Thank you for the answer. I have a few practical questions, so I wrote a simple application which uses my own kernel (a JCuda version of the vector addition from the CUDA C Programming Guide 3.2, chapter 3.2.1). Unfortunately I can’t compile my program (nvcc fatal : Cannot find compiler ‘cl.exe’ in PATH).

I use Windows XP SP3 and NetBeans 6.7.1. I have installed the CUDA Toolkit 3.2, the newest NVIDIA drivers for my GeForce GTX 560 Ti graphics card, and Microsoft Visual C++ 2008 Express Edition (configured and working properly with CUDA, following these instructions: http://gamelab.epitech.eu/blogtech/?p=13).

I copied the necessary JCuda DLLs to my NetBeans project location and added jcuda-0.3.2.a.jar and jcudaUtils-0.0.2.jar to the project. My project folder also contains the VectorAdd.cu file:


extern "C"
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	if (i < N)
		C[i] = A[i] + B[i];
}

My JCuda program:

package jcudatest;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.utils.KernelLauncher;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.*;

public class Main {

    public static void main(String[] args) {

        int x = 30;
        int size = Sizeof.FLOAT * x;

        int vectorA[] = new int[x];
        int vectorB[] = new int[x];
        int vectorC[] = new int[x];

        for(int i = 0; i < x; i++) {
            vectorA[i] = i;
            vectorB[i] = i;
        }

        Pointer vA = new Pointer();
        cudaMalloc(vA, size);

        Pointer vB = new Pointer();
        cudaMalloc(vB, size);

        Pointer vC = new Pointer();
        cudaMalloc(vC, size);

        cudaMemcpy(vA, Pointer.to(vectorA), size, cudaMemcpyHostToDevice);
        cudaMemcpy(vB, Pointer.to(vectorB), size, cudaMemcpyHostToDevice);

        KernelLauncher kernelLauncher = KernelLauncher.create("VectorAdd.cu", "VectorAdd", false);

        int threadsPerBlock = 256;
        int blockPerGrid = (x + threadsPerBlock - 1) / threadsPerBlock;

        kernelLauncher.setGridSize(blockPerGrid, 0);
        kernelLauncher.setBlockSize(threadsPerBlock, 0, 0);
        kernelLauncher.call(vA, vB, vC, x);

        cudaMemcpy(Pointer.to(vectorC), vC, size, cudaMemcpyDeviceToHost);

        //print result
    }
}

This seems to be specifically related to NVCC, and not directly to the JCuda code. What happens if you type
nvcc -cubin VectorAdd.cu -o VectorAdd.cubin
at the console, in the directory where the “VectorAdd.cu” is located?

If it brings the same error message, you may try adjusting the path. In the simplest case this might be solved as described here, by running vsvars32.bat. Otherwise, you may try to add the path to cl.exe manually (chances are high that there are other environment variables missing, but maybe it’s possible to get rid of that as well)


nvcc -cubin VectorAdd.cu -o VectorAdd.cubin brings the same error message. I found a solution: nvcc couldn’t find the Visual Studio compiler because of a missing path in the environment variables. I edited “Path” and added the missing entry: “C:\Program Files\Microsoft Visual Studio 9.0\VC\bin”.

But now I have another problem when I run my program:

Exception in thread “main” jcuda.CudaException: CUDA_ERROR_INVALID_SOURCE
at jcuda.utils.KernelLauncher.checkResult(KernelLauncher.java:1054)
at jcuda.utils.KernelLauncher.initModule(KernelLauncher.java:688)
at jcuda.utils.KernelLauncher.create(KernelLauncher.java:395)
at jcudatest.Main.main(Main.java:48)
Java Result: 1
BUILD SUCCESSFUL (total time: 0 seconds)

Any ideas ?


Yes, sorry, you also have to specify the “Compute Capability” of your device during compilation. For example, if your device has Compute Capability 2.1, you have to write
nvcc -cubin -arch sm_21 VectorAdd.cu -o VectorAdd.cubin

Hope that helps

OK, but I don’t want to use a console command to compile my cu files. Everything should be done through my program, so what can I change in the code? I assume that the KernelLauncher is responsible for the compilation of the “cu” files to “cubin” files: I indicate the cu file and the compilation (in this case “nvcc -cubin -arch sm_21 VectorAdd.cu -o VectorAdd.cubin”) is performed implicitly.

Of course, this command line usage was for now primarily intended as a test to see whether it’s working in general.

This problem has shown up several times recently, e.g. in this thread and this thread. In the latter, I already said that I’ll extend the KernelLauncher to automatically add the “-arch sm_XX” parameter to compile for the right platform, and to add support for PTX/Just-In-Time compilation. But at the moment, I’m rather busy, and not sure when I will find the time for this.

In any case, the KernelLauncher has the option to pass additional parameters to the NVCC. So you may call something like
KernelLauncher kernelLauncher = KernelLauncher.create("VectorAdd.cu", "VectorAdd", false, "-arch sm_21");

Of course, this is not desirable in the sense that it is not generically applicable for all Compute Capabilities. It would be possible to query the Compute Capability and add the appropriate argument to the KernelLauncher, but as I said, the KernelLauncher should do this automatically in future versions.


First of all, I found some silly mistakes in my program:

  • Sizeof.FLOAT
  • float* in the kernel
  • VectorAdd in the KernelLauncher (should be VecAdd)

I corrected these and set up the KernelLauncher once again:

 KernelLauncher kernelLauncher = KernelLauncher.create("VectorAdd.cu", "VecAdd", false, "-arch sm_21");

Then I got another error:

Exception in thread “main” jcuda.CudaException: CUDA_ERROR_INVALID_VALUE
at jcuda.utils.KernelLauncher.checkResult(KernelLauncher.java:1054)
at jcuda.utils.KernelLauncher.call(KernelLauncher.java:1027)
at jcudatest.Main.main(Main.java:55)
Java Result: 1
BUILD SUCCESSFUL (total time: 0 seconds)

I assumed that something was wrong with:

kernelLauncher.setGridSize(blockPerGrid, 0);
kernelLauncher.setBlockSize(threadsPerBlock, 0, 0);

My purpose was to set this up similarly to the code from the CUDA C Programming Guide:

int threadsPerBlock = 256; 
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
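The blocksPerGrid line is just an integer ceil division, and it can be checked in plain Java, without any CUDA involved (names here are only for illustration):

```java
public class GridSizeDemo {
    public static void main(String[] args) {
        int threadsPerBlock = 256;

        // Integer ceil division: rounds up so that all n elements are covered
        int n = 30;
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
        System.out.println(blocksPerGrid); // 1 block suffices for 30 elements

        n = 257;
        blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
        System.out.println(blocksPerGrid); // 2 blocks needed for 257 elements
    }
}
```

So for my x = 30 and 256 threads per block, the grid consists of a single block.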

When I comment out these two lines of code, the program runs without the exception, but returns an unexpected result. It should be:

VectorA{0, … ,29}
VectorB{0, … ,29}

VectorC{0, … ,58}
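Just to make that expectation explicit, a plain host-side reference computation (no CUDA involved, class name is only for illustration) gives:

```java
public class HostReferenceVecAdd {
    public static void main(String[] args) {
        int n = 30;
        int[] a = new int[n];
        int[] b = new int[n];
        int[] c = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = i;
        }
        // Reference result: c[i] = a[i] + b[i] = 2 * i
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
        System.out.println(c[0] + " ... " + c[n - 1]); // 0 ... 58
    }
}
```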

Unfortunately, the program returns:

VectorC{0, -35783105, -597688325, 2147483135, -67387670, -33558577, 2113910719, -545259521
-536870913, 2147331807, -2113569, -8913429, 1610595321, -152581, -538985473, -2146693
-403179587, -1289, -536888577, -142606422, -682133636, -49, -721420297, -67649553
2146942911, -134217925, -131090, 2013133818, -69, -168433409}


OK, so the compilation seems to work now - then to the actual program (I assume that the kernel should now accept int* values). Indeed, the setup of the grid/block sizes was slightly wrong: A block of size 256*0 contains 0 threads :wink: Instead of specifying the sizes as 0, they should be 1:
kernelLauncher.setGridSize(blockPerGrid, 1);
kernelLauncher.setBlockSize(threadsPerBlock, 1, 1);
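The reason is easy to see on the host side: the total number of threads in a block is the product of all three block dimensions, so a zero in any dimension describes an empty block. A tiny plain-Java illustration (no CUDA involved):

```java
public class BlockDimDemo {
    public static void main(String[] args) {
        // Total threads in a block = blockDim.x * blockDim.y * blockDim.z
        int empty = 256 * 0 * 0; // what setBlockSize(256, 0, 0) would describe
        int ok    = 256 * 1 * 1; // what setBlockSize(256, 1, 1) describes
        System.out.println(empty + " vs " + ok); // 0 vs 256
    }
}
```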


The nvcc flag:

"-arch sm_21"

is not necessary, since your kernel does not use any Compute Capability 2.1 features, not even 1.1 features, so leave it out. Always compile for the lowest possible architecture.


From my own experience, this mostly indicates indexing out of bounds, or wrong input parameters resulting in kernel errors, or wrong size settings of Pointers.

VectorC{0, -35783105, -597688325, 2147483135, -67387670, -33558577, 2113910719, -545259521

Same as above! Extreme, unexpected values are almost guaranteed for the above problems.
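One way such extreme values can arise - just as an illustration, not necessarily what happened in this particular run - is when float bit patterns are read back as int, which matches the earlier mix of a float* kernel with int arrays on the Java side:

```java
public class BitPatternDemo {
    public static void main(String[] args) {
        // If a kernel writes the float 5.0f into memory that the host then
        // reads back as int, the result is not 5, but the raw IEEE 754
        // bit pattern of 5.0f:
        int asInt = Float.floatToIntBits(2.0f + 3.0f);
        System.out.println(asInt); // 1084227584, not 5
    }
}
```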