Mix Driver & Runtime APIs

Hello everyone.

I heard that we can use the Driver API and the Runtime API together since CUDA 3.0.
I want to use as few Driver API calls as possible (only for preparing the .cubin files and creating the kernel functions).
So I want to use cudaConfigureCall() etc. instead of cuParamSetv() etc.

I’m trying to write a sample program. My kernel function “MatrixMul” works with a pure Driver API implementation of the kernel call, but the mixed implementation does not work: the kernel returns zeroes in the result array. cudaConfigureCall() and cudaSetupArgument() return “cudaSuccess”, but cudaLaunch() returns “cudaErrorInvalidDeviceFunction”.
Code:

public static void mulMatrixCUDrvCURun(float[] aCU, float[] bCU, float[] cCU, int n) throws IOException {
        
        int sizeMatrix=aCU.length;

        // Initialize the driver and create a context for the first device.
        JCudaDriver.cuInit(0);
        CUcontext context = new CUcontext();
        CUdevice device = new CUdevice();
        JCudaDriver.cuDeviceGet(device, 0);
        JCudaDriver.cuCtxCreate(context, 0, device);

        String cubinFileName = prepareCubinFile("myMatrMul.cu");

        // Load the CUBIN file.
        CUmodule module = new CUmodule();
        JCudaDriver.cuModuleLoad(module, cubinFileName);
        // Obtain a function pointer to the "MatrixMul" function.
        CUfunction function = new CUfunction();
        JCudaDriver.cuModuleGetFunction(function, module, "MatrixMul");

        // Allocate memory on the device using JCuda
        Pointer deviceA = new Pointer();
        Pointer deviceB = new Pointer();
        Pointer deviceC = new Pointer();
        JCuda.cudaMalloc(deviceA, sizeMatrix * Sizeof.FLOAT);
        JCuda.cudaMalloc(deviceB, sizeMatrix * Sizeof.FLOAT);
        JCuda.cudaMalloc(deviceC, sizeMatrix * Sizeof.FLOAT);

        // Copy memory from host to device using JCuda
        JCuda.cudaMemcpy(deviceA, Pointer.to(aCU), sizeMatrix * Sizeof.FLOAT, cudaMemcpyKind.cudaMemcpyHostToDevice);
        JCuda.cudaMemcpy(deviceB, Pointer.to(bCU), sizeMatrix * Sizeof.FLOAT, cudaMemcpyKind.cudaMemcpyHostToDevice);
        JCuda.cudaMemcpy(deviceC, Pointer.to(cCU), sizeMatrix * Sizeof.FLOAT, cudaMemcpyKind.cudaMemcpyHostToDevice);

        int offset = 0;
        dim3 blocks = new dim3(n / BLOCK_SIZE, n / BLOCK_SIZE, 1);
        dim3 threads = new dim3(BLOCK_SIZE, BLOCK_SIZE, 1);
        int err = JCuda.cudaConfigureCall(blocks, threads, 0, null);
        System.out.print(jcuda.runtime.cudaError.stringFor(err) + " ");
        err = JCuda.cudaSetupArgument(Pointer.to(deviceA), Sizeof.POINTER, offset);
        System.out.print(jcuda.runtime.cudaError.stringFor(err) + " ");
        offset += Sizeof.POINTER;
        err = JCuda.cudaSetupArgument(Pointer.to(deviceB), Sizeof.POINTER, offset);
        System.out.print(jcuda.runtime.cudaError.stringFor(err) + " ");
        offset += Sizeof.POINTER;
        err = JCuda.cudaSetupArgument(Pointer.to(deviceC), Sizeof.POINTER, offset);
        System.out.print(jcuda.runtime.cudaError.stringFor(err) + " ");
        offset += Sizeof.POINTER;
        err = JCuda.cudaSetupArgument(Pointer.to(new int[]{n}), Sizeof.INT, offset);
        System.out.print(jcuda.runtime.cudaError.stringFor(err) + " ");
        offset += Sizeof.INT;
        // Call the function.
        err=JCuda.cudaLaunch("MatrixMul");
        System.out.print(jcuda.runtime.cudaError.stringFor(err)+" ");
        JCuda.cudaThreadSynchronize();

        JCuda.cudaMemcpy(Pointer.to(cCU), deviceC, sizeMatrix * Sizeof.FLOAT, cudaMemcpyKind.cudaMemcpyDeviceToHost);

        // Clean up
        JCuda.cudaFree(deviceA);
        JCuda.cudaFree(deviceB);
        JCuda.cudaFree(deviceC);
    }
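
For completeness: prepareCubinFile is basically just a small wrapper that invokes NVCC to compile the .cu file into a .cubin file. Roughly, it is something like this sketch (simplified; my real helper also checks the nvcc output and may pass additional options like -arch):

    private static String prepareCubinFile(String cuFileName) throws IOException
    {
        // Derive the .cubin file name from the .cu file name
        String cubinFileName = cuFileName.replace(".cu", ".cubin");
        // Invoke NVCC to compile the kernel into a CUBIN
        String command = "nvcc -cubin " + cuFileName + " -o " + cubinFileName;
        Process process = Runtime.getRuntime().exec(command);
        try
        {
            int exitValue = process.waitFor();
            if (exitValue != 0)
            {
                throw new IOException(
                    "Could not create .cubin file, nvcc exit value: " + exitValue);
            }
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for nvcc", e);
        }
        return cubinFileName;
    }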
Where is my mistake? Thanks for the help.

Hello,

To my understanding, the cudaConfigureCall/cudaSetupArgument/cudaLaunch sequence is more or less only a replacement for the “kernel<<<…>>>(…)”-style invocation. I don’t know how the name that is given as the argument to ‘cudaLaunch’ is resolved internally, but I assume that it will not work, since there is no C compilation and linking involved when using JCuda.

Actually, that’s why the new feature of mixing the Runtime and Driver API was so important for JCuda: The only way to invoke your own kernels is to use the Driver API, and for earlier versions this meant that it was not possible to apply your own kernel to data that had, for example, been pre-processed with JCublas.

So for the actual invocation of your own kernels, you still have to use the Driver API, because only the Driver API allows you to explicitly load a module using cuModuleLoad and access the functions that it contains.
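
For example, the pure Driver API launch of your “MatrixMul” function would look roughly like this (a rough sketch, reusing the ‘function’, the device pointers, ‘n’ and BLOCK_SIZE from your snippet):

    // Set up the kernel arguments via the Driver API. This replaces the
    // cudaConfigureCall/cudaSetupArgument/cudaLaunch sequence from above.
    int offset = 0;
    JCudaDriver.cuParamSetv(function, offset, Pointer.to(deviceA), Sizeof.POINTER);
    offset += Sizeof.POINTER;
    JCudaDriver.cuParamSetv(function, offset, Pointer.to(deviceB), Sizeof.POINTER);
    offset += Sizeof.POINTER;
    JCudaDriver.cuParamSetv(function, offset, Pointer.to(deviceC), Sizeof.POINTER);
    offset += Sizeof.POINTER;
    JCudaDriver.cuParamSeti(function, offset, n);
    offset += Sizeof.INT;
    JCudaDriver.cuParamSetSize(function, offset);
    // Define the block shape, launch the grid, and wait for completion
    JCudaDriver.cuFuncSetBlockShape(function, BLOCK_SIZE, BLOCK_SIZE, 1);
    JCudaDriver.cuLaunchGrid(function, n / BLOCK_SIZE, n / BLOCK_SIZE);
    JCudaDriver.cuCtxSynchronize();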

Interestingly, there is no example in the NVIDIA SDK which really uses cudaLaunch. Maybe I can try to build such a sample and do some tests to see whether this sort of invocation would be possible in CUDA-C…

bye

Hello, Marco.
Thank you for your reply.
I think that such a way of invoking kernels is very interesting for writing simple code, especially for newcomers like me :slight_smile:

Interestingly, there is no example in the NVIDIA SDK which really uses cudaLaunch

I also haven’t found any examples of using cudaLaunch in the NVIDIA SDK. Actually, the CUDA specifications and articles about CUDA seem quite superficial to me. It’s very hard to find detailed information on some questions.
Could you please give me some links to more detailed specifications?

It’d be very exciting to know about the results of your tests (whether this sort of invocation would be possible).

Thank you. Bye.

Hello,

I ran a test using the simple “vectorAdd” example from the SDK. In plain C it works when replacing the actual kernel invocation


    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

with this code

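    // ALIGN_UP is the helper macro from the CUDA Programming Guide:
    // #define ALIGN_UP(offset, alignment) \
    //     (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)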
    cudaConfigureCall(blocksPerGrid, threadsPerBlock, 0, NULL);
    int offset = 0;
    void* ptr;
    ptr = (void*)(size_t)d_A;
    ALIGN_UP(offset, __alignof(ptr));
    cudaSetupArgument(&ptr, sizeof(ptr), offset);
    offset += sizeof(ptr);
    ptr = (void*)(size_t)d_B;
    ALIGN_UP(offset, __alignof(ptr));
    cudaSetupArgument(&ptr, sizeof(ptr), offset);
    offset += sizeof(ptr);
    ptr = (void*)(size_t)d_C;
    ALIGN_UP(offset, __alignof(ptr));
    cudaSetupArgument(&ptr, sizeof(ptr), offset);
    offset += sizeof(ptr);
    ALIGN_UP(offset, __alignof(N));
    int ip = N;
    cudaSetupArgument(&ip, sizeof(int), offset);
    offset += sizeof(N);
    cudaLaunch("VecAdd");

(the kernel had to be declared as ‘extern “C”’ for that, to avoid name mangling)

But I also tried removing the kernel from the source code, storing it in a separate file, compiling it into a CUBIN, loading it with the Driver API, and then executing the above code block. As expected, the kernel is not found this way.

When CUDA C code is compiled with NVCC, there are MANY (many many many) things happening under the hood. You can pass the “--keep” parameter to the NVCC call, and you will see that it generates dozens of large intermediate and stub files with lots of weird and magic code… I assume that the actual string “VecAdd” can not be resolved when all this does not happen during compilation and linking.

Unfortunately, I don’t know of any more specific documentation or help concerning cudaLaunch. The Reference Manual essentially only says “Yes, these functions do exist”. The Programming Guide does not mention them at all. And no samples are using them. A web search mainly brought up forum questions about how to use it … -_-

So in the end, I think you will have to use the Driver API to launch the kernels. Actually, I think this is not such a large drawback, since (as you can see in the above code example) the setup for the function call is nearly the same for the Runtime and the Driver API. The differences are mainly some replacements, e.g. using “cudaSetupArgument” instead of “cuParamSetv”.
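
For a single pointer argument, this replacement looks roughly like this (a sketch, using the names from your first post):

    // Runtime API style (the mixed approach from the first post):
    JCuda.cudaSetupArgument(Pointer.to(deviceA), Sizeof.POINTER, offset);
    // The equivalent step in Driver API style:
    JCudaDriver.cuParamSetv(function, offset, Pointer.to(deviceA), Sizeof.POINTER);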

But I agree: The invocation of kernels using the Driver API (or the Runtime API as described above) is tedious: When changing one kernel argument, you have to go through all the calls of ‘cuParamSetv’ and so on again. That’s why I once created the KernelLauncher example, which shows how this could be simplified. This class is not an “official” utility class, but only an example. I want to improve it to become more stable and make it a real utility class; admittedly, this task has been pending for quite a while now :o But I see that there is a strong demand for such a utility, so I’ll increase its priority. In the meantime, you might want to have a look at the current KernelLauncher. The “final” version will probably be very similar to the version that is already available.
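
Just to illustrate the idea (the method names here are only a sketch and may differ from the actual KernelLauncher class), such a utility can reduce the whole launch to a few lines:

    // Hypothetical usage of a KernelLauncher-style utility; the method
    // names are illustrative and not necessarily the real API:
    KernelLauncher launcher =
        KernelLauncher.create("myMatrMul.cubin", "MatrixMul");
    launcher.setGridSize(n / BLOCK_SIZE, n / BLOCK_SIZE);
    launcher.setBlockSize(BLOCK_SIZE, BLOCK_SIZE, 1);
    // A single varargs call replaces the whole cuParamSetv/cuParamSetSize sequence
    launcher.call(deviceA, deviceB, deviceC, n);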

bye

Hello, Marco.

Thanks for your detailed answer. Now everything is clear to me.

You can pass the “--keep” parameter to the NVCC call, and you will see that it generates dozens of large intermediate and stub files with lots of weird and magic code…

I’ve tried that. Magic things are happening, really. Maybe I’ll write to the CUDA developers and ask them to pay more attention to the specifications they write. But I suppose my opinion isn’t of great importance.
I used the KernelLauncher for my program. It’s very convenient and hides the Driver API work. Thanks for that class!

Are you the only JCuda developer?

Hello,

Well, I’m not sure if there is something to criticise: The functions (like cudaLaunch) exist, and they are working. For the case of JCuda, it might be helpful to know more about the internals of NVCC and how the function names are resolved internally. But admittedly: Even if this is documented somewhere, I’d probably not understand it :o It looks like a complex thing, judging from the output that is created with --keep. And in the end, for the pure application of the functions, it is not really necessary to know all that…

At the moment, I’m the only developer working on JCuda. But I have already started “cleaning up” some of the code and structures that are used internally, so that JCuda may sooner or later be put into a public repository, and others can participate in the development.

bye