Error when launching kernel using JCuda

Hi,

I have just begun working with CUDA and have recently been exploring JCuda. I have written a small test program to see how JCuda works. However, I have encountered an exception which puzzles me. Below is my Java code for the kernel calls. Apologies for the messy code ><


                        int numOfBlocks = 1536, offset = 0, numThreadperBlock = 512;
                        String cubinFilename = "/home/alan/DNSKernel.cubin";
                        JCudaDriver.cuInit(0);
                        CUcontext context = new CUcontext();
                        CUdevice dev = new CUdevice();
                        JCudaDriver.cuDeviceGet(dev,0);
                        JCudaDriver.cuCtxCreate(context, 0, dev);

                        //Load the cubin file
                        CUmodule module = new CUmodule();
                        JCudaDriver.cuModuleLoad(module, cubinFilename);
                        //Create function pointer to cuda function
                        CUfunction function = new CUfunction();
                        JCudaDriver.cuModuleGetFunction(function, module, "gpuAvgRTTcompute");

                        //Allocate memory for input data on GPU device
                        CUdeviceptr dev07 = new CUdeviceptr();
                        CUdeviceptr dev08 = new CUdeviceptr();
                        CUdeviceptr dev09 = new CUdeviceptr();
                        CUdeviceptr tempResult = new CUdeviceptr();
                        CUdeviceptr devfinalResult = new CUdeviceptr();
                        JCudaDriver.cuMemAlloc(dev07, yr07Qrt.length*Sizeof.FLOAT);
                        JCudaDriver.cuMemAlloc(dev08, yr08Qrt.length*Sizeof.FLOAT);
                        JCudaDriver.cuMemAlloc(dev09, yr09Qrt.length*Sizeof.FLOAT);
                        JCudaDriver.cuMemAlloc(tempResult, numOfBlocks*Sizeof.FLOAT);
                        JCudaDriver.cuMemAlloc(devfinalResult, 3*Sizeof.FLOAT);

                        //Copy input  data to GPU device array
                        JCudaDriver.cuMemcpyHtoD(dev07, Pointer.to(yr07Qrt), yr07Qrt.length*Sizeof.FLOAT);
                        JCudaDriver.cuMemcpyHtoD(dev08, Pointer.to(yr08Qrt), yr08Qrt.length*Sizeof.FLOAT);
                        JCudaDriver.cuMemcpyHtoD(dev09, Pointer.to(yr09Qrt), yr09Qrt.length*Sizeof.FLOAT);

                        //Pointer declarations
                        Pointer d07 = Pointer.to(dev07);
                        Pointer d08 = Pointer.to(dev08);
                        Pointer d09 = Pointer.to(dev09);
                        Pointer size07 = Pointer.to(new int[]{yr07Qrt.length});
                        Pointer size08 = Pointer.to(new int[]{yr08Qrt.length});
                        Pointer size09 = Pointer.to(new int[]{yr09Qrt.length});
                        Pointer tmpResult = Pointer.to(tempResult);
                        Pointer fResult = Pointer.to(devfinalResult);

                        JCudaDriver.cuFuncSetBlockShape(function, numThreadperBlock, 1, 1);

                        //Parameter setup for 1st kernel call
                        offset = JCudaDriver.align(offset, Sizeof.POINTER);
                        JCudaDriver.cuParamSetv(function, offset, d07, Sizeof.POINTER);
                        offset += Sizeof.POINTER;
                        offset = JCudaDriver.align(offset, Sizeof.POINTER);
                        JCudaDriver.cuParamSetv(function, offset, d08, Sizeof.POINTER);
                        offset += Sizeof.POINTER;
                        offset = JCudaDriver.align(offset, Sizeof.POINTER);
                        JCudaDriver.cuParamSetv(function, offset, d09, Sizeof.POINTER);
                        offset += Sizeof.POINTER;

                        offset = JCudaDriver.align(offset, Sizeof.INT);
                        JCudaDriver.cuParamSetv(function, offset, size07, Sizeof.INT);
                        offset += Sizeof.INT;
                        offset = JCudaDriver.align(offset, Sizeof.INT);
                        JCudaDriver.cuParamSetv(function, offset, size08, Sizeof.INT);
                        offset += Sizeof.INT;
                        offset = JCudaDriver.align(offset, Sizeof.INT);
                        JCudaDriver.cuParamSetv(function, offset, size09, Sizeof.INT);
                        offset += Sizeof.INT;

                        offset = JCudaDriver.align(offset, Sizeof.POINTER);
                        JCudaDriver.cuParamSetv(function, offset, tmpResult, Sizeof.POINTER);
                        offset += Sizeof.POINTER;

                        //Launch 1st kernel
                        JCudaDriver.cuParamSetSize(function, offset);
                        JCudaDriver.cuLaunchGrid(function, numOfBlocks, 1);
                        JCudaDriver.cuCtxSynchronize();

                        //Parameter setup for 2nd kernel call
                        JCudaDriver.cuModuleGetFunction(function, module, "gpuAvgRTTcompute2");
                        offset = 0;
                        offset = JCudaDriver.align(offset, Sizeof.POINTER);
                        JCudaDriver.cuParamSetv(function, offset, fResult, Sizeof.POINTER);
                        offset += Sizeof.POINTER;
                        offset = JCudaDriver.align(offset, Sizeof.POINTER);
                        JCudaDriver.cuParamSetv(function, offset, tmpResult, Sizeof.POINTER);
                        offset += Sizeof.POINTER;

                        JCudaDriver.cuParamSetSize(function, offset);
                        JCudaDriver.cuFuncSetBlockShape(function, numThreadperBlock, 1, 1);
                        JCudaDriver.cuLaunchGrid(function, 3, 1);
                        JCudaDriver.cuCtxSynchronize();

                        //Copy out the final results
                        JCudaDriver.cuMemcpyDtoH(Pointer.to(finalResults), devfinalResult,3*Sizeof.FLOAT);

However, I get the following error message when running the program:

Exception in thread "main" jcuda.CudaException: CUDA_ERROR_LAUNCH_FAILED
    at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:153)
    at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:723)
    at dnstest.main(dnstest.java:185)

I’m pretty sure the kernel code itself works, as I have tested it in plain CUDA. I would appreciate any help here. Thanks a lot.

Sorry for the extra post… I forgot to mention that the error is thrown at the first cuLaunchGrid() call.
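
Since the failure shows up at the first launch, one thing that can be ruled out cheaply is the parameter layout. The following is a minimal plain-Java sketch (no JCuda needed) that mirrors the align/add pattern from the code above and prints the byte offset of each parameter of gpuAvgRTTcompute. The class name ParamLayout and the helper offsets() are hypothetical, and the 8-byte pointer size is an assumption for a 64-bit platform; in the real program the value comes from Sizeof.POINTER.

```java
// Plain-Java sketch of the parameter layout used before the first launch.
// ParamLayout and offsets() are illustrative names, not part of JCuda.
final class ParamLayout {
    static final int POINTER_SIZE = 8; // assumption: 64-bit pointers
    static final int INT_SIZE = 4;

    // Round value up to the next multiple of alignment (a power of two).
    static int align(int value, int alignment) {
        return (value + alignment - 1) & ~(alignment - 1);
    }

    // Returns the byte offset of every parameter, followed by the total size
    // (the value that would be passed to cuParamSetSize).
    static int[] offsets(int[] sizes) {
        int[] result = new int[sizes.length + 1];
        int offset = 0;
        for (int i = 0; i < sizes.length; i++) {
            offset = align(offset, sizes[i]); // each parameter is aligned to its own size
            result[i] = offset;
            offset += sizes[i];
        }
        result[sizes.length] = offset;
        return result;
    }

    public static void main(String[] args) {
        // Three pointers, three ints, one pointer: the order in the code above.
        int[] sizes = {
            POINTER_SIZE, POINTER_SIZE, POINTER_SIZE,
            INT_SIZE, INT_SIZE, INT_SIZE,
            POINTER_SIZE
        };
        // prints [0, 8, 16, 24, 28, 32, 40, 48]
        System.out.println(java.util.Arrays.toString(offsets(sizes)));
    }
}
```

Under the 64-bit assumption the last pointer lands at offset 40 rather than 36, because four bytes of padding follow the three ints. The posted code already calls align() before every cuParamSetv(), so the layout itself looks consistent with this.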

Hi

I’m currently not at a CUDA capable PC, but can have a closer look at this on Sunday or Monday.

In any case, that’s a lot of code, and although (or because) it looks pretty straightforward it’s hard to guess what might be the reason for this error. Unfortunately, the CUDA_ERROR_LAUNCH_FAILED error is not very specific -_- … It might be helpful if I could reproduce the error, but I can’t make any promises right now. I’ll try to insert the code you posted into a test program (which calls an empty kernel); maybe that helps to locate the error…

bye
Marco

Hi Marco,

Thanks for the response. Yeah, sorry for the code dump; I’m not sure where the error lies. Thinking about it, I should provide some more info to make the picture more complete. Below are the headers of the two GPU kernel functions that I have written:


extern "C"
__global__ void gpuAvgRTTcompute(float *g_input07, float *g_input08, float *g_input09,  int arrSize07, int arrSize08, int arrSize09, float *gtempArray){

extern "C"
__global__ void gpuAvgRTTcompute2(float *g_output, float *gtempArray){



Essentially, what I was trying to do was compute the total sum of the values in each of the individual arrays (input07, input08, input09).
Hmm, I hadn’t thought of trying an empty kernel like you said, so I will try your method as well to see if I can identify the error. In any case, thanks Marco!

PS: JCuda is a really cool tool~ keep it up ^^
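
Because the kernels are described as computing one total per input array, a CPU reference is a simple way to check the three values copied back into finalResults once the launch succeeds. This is only a sketch: the class CpuReference, the method names and the tolerance are hypothetical, and it assumes each result is a plain sum over one array.

```java
// CPU reference for checking GPU sums. All names here are illustrative.
final class CpuReference {
    // Sum a float array, accumulating in double to limit rounding error.
    static double sum(float[] values) {
        double total = 0.0;
        for (float v : values) {
            total += v;
        }
        return total;
    }

    // True if the GPU value is within a relative tolerance of the CPU sum.
    // Float reductions on the GPU add in a different order, so exact
    // equality should not be expected.
    static boolean closeTo(float gpu, double cpu, double relTol) {
        return Math.abs(gpu - cpu) <= relTol * Math.max(1.0, Math.abs(cpu));
    }

    public static void main(String[] args) {
        float[] sample = {1.5f, 2.5f, 4.0f};
        double expected = sum(sample);
        System.out.println(expected);                       // prints 8.0
        System.out.println(closeTo(8.0f, expected, 1e-5));  // prints true
    }
}
```

In the program above this would be called once per input array, for example comparing finalResults[0] against sum(yr07Qrt).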


Hi,

Using Marco’s method, I managed to narrow down the error. It seems the error occurred because I failed to specify the size of a shared memory array used in the kernel code.

Currently, I am trying to use the cuFuncSetSharedSize() method of the driver API to set the shared memory size. I will post more info later on.
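
A minimal sketch of how the shared memory size might be worked out before the launch. It assumes the kernel declares its array as `extern __shared__ float ...[]` and uses one float per thread in the block; the kernel body was not posted, so that is a guess. The class SharedMemSize and the method sharedBytes() are hypothetical names. The JCuda call itself is shown only as a comment so the snippet runs without a GPU.

```java
// Computes the dynamic shared memory size for one block.
// Assumption: the kernel uses one float of shared memory per thread.
final class SharedMemSize {
    static final int FLOAT_SIZE = 4; // same value as jcuda.Sizeof.FLOAT

    // Bytes of shared memory needed for one float per thread.
    static int sharedBytes(int threadsPerBlock) {
        return threadsPerBlock * FLOAT_SIZE;
    }

    public static void main(String[] args) {
        int numThreadperBlock = 512; // block size from the code above
        int bytes = sharedBytes(numThreadperBlock);
        System.out.println(bytes); // prints 2048

        // With JCuda on the classpath, the size would be set on the function
        // handle before the launch, roughly like this:
        //   JCudaDriver.cuFuncSetSharedSize(function, bytes);
        //   JCudaDriver.cuLaunchGrid(function, numOfBlocks, 1);
    }
}
```

If the second kernel also uses an unsized shared array, it would need its own cuFuncSetSharedSize() call after cuModuleGetFunction() fetches gpuAvgRTTcompute2, since the setting belongs to the function handle.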