Error when launching kernel using JCuda

system · July 15, 2010 at 05:30

Hi,

I have just begun working with CUDA and have recently been exploring JCuda. I wrote a small test program to see how JCuda works, but I have run into an exception that puzzles me. Below is my Java code for the kernel calls. Apologies for the messy code ><


                        int numOfBlocks = 1536, offset = 0, numThreadperBlock = 512;
                        String cubinFilename = "/home/alan/DNSKernel.cubin";
                        JCudaDriver.cuInit(0);
                        CUcontext context = new CUcontext();
                        CUdevice dev = new CUdevice();
                        JCudaDriver.cuDeviceGet(dev,0);
                        JCudaDriver.cuCtxCreate(context, 0, dev);

                        //Load the cubin file
                        CUmodule module = new CUmodule();
                        JCudaDriver.cuModuleLoad(module, cubinFilename);
                        //Create function pointer to cuda function
                        CUfunction function = new CUfunction();
                        JCudaDriver.cuModuleGetFunction(function, module, "gpuAvgRTTcompute");

                        //Allocate memory for input data on GPU device
                        CUdeviceptr dev07 = new CUdeviceptr();
                        CUdeviceptr dev08 = new CUdeviceptr();
                        CUdeviceptr dev09 = new CUdeviceptr();
                        CUdeviceptr tempResult = new CUdeviceptr();
                        CUdeviceptr devfinalResult = new CUdeviceptr();
                        JCudaDriver.cuMemAlloc(dev07, yr07Qrt.length*Sizeof.FLOAT);
                        JCudaDriver.cuMemAlloc(dev08, yr08Qrt.length*Sizeof.FLOAT);
                        JCudaDriver.cuMemAlloc(dev09, yr09Qrt.length*Sizeof.FLOAT);
                        JCudaDriver.cuMemAlloc(tempResult, numOfBlocks*Sizeof.FLOAT);
                        JCudaDriver.cuMemAlloc(devfinalResult, 3*Sizeof.FLOAT);

                        //Copy input  data to GPU device array
                        JCudaDriver.cuMemcpyHtoD(dev07, Pointer.to(yr07Qrt), yr07Qrt.length*Sizeof.FLOAT);
                        JCudaDriver.cuMemcpyHtoD(dev08, Pointer.to(yr08Qrt), yr08Qrt.length*Sizeof.FLOAT);
                        JCudaDriver.cuMemcpyHtoD(dev09, Pointer.to(yr09Qrt), yr09Qrt.length*Sizeof.FLOAT);

                        //Pointer declarations
                        Pointer d07 = Pointer.to(dev07);
                        Pointer d08 = Pointer.to(dev08);
                        Pointer d09 = Pointer.to(dev09);
                        Pointer size07 = Pointer.to(new int[]{yr07Qrt.length});
                        Pointer size08 = Pointer.to(new int[]{yr08Qrt.length});
                        Pointer size09 = Pointer.to(new int[]{yr09Qrt.length});
                        Pointer tmpResult = Pointer.to(tempResult);
                        Pointer fResult = Pointer.to(devfinalResult);

                        JCudaDriver.cuFuncSetBlockShape(function, numThreadperBlock, 1, 1);

                        //Parameter setup for 1st kernel call
                        offset = JCudaDriver.align(offset, Sizeof.POINTER);
                        JCudaDriver.cuParamSetv(function, offset, d07, Sizeof.POINTER);
                        offset += Sizeof.POINTER;
                        offset = JCudaDriver.align(offset, Sizeof.POINTER);
                        JCudaDriver.cuParamSetv(function, offset, d08, Sizeof.POINTER);
                        offset += Sizeof.POINTER;
                        offset = JCudaDriver.align(offset, Sizeof.POINTER);
                        JCudaDriver.cuParamSetv(function, offset, d09, Sizeof.POINTER);
                        offset += Sizeof.POINTER;

                        offset = JCudaDriver.align(offset, Sizeof.INT);
                        JCudaDriver.cuParamSetv(function, offset, size07, Sizeof.INT);
                        offset += Sizeof.INT;
                        offset = JCudaDriver.align(offset, Sizeof.INT);
                        JCudaDriver.cuParamSetv(function, offset, size08, Sizeof.INT);
                        offset += Sizeof.INT;
                        offset = JCudaDriver.align(offset, Sizeof.INT);
                        JCudaDriver.cuParamSetv(function, offset, size09, Sizeof.INT);
                        offset += Sizeof.INT;

                        offset = JCudaDriver.align(offset, Sizeof.POINTER);
                        JCudaDriver.cuParamSetv(function, offset, tmpResult, Sizeof.POINTER);
                        offset += Sizeof.POINTER;

                        //Launch 1st kernel
                        JCudaDriver.cuParamSetSize(function, offset);
                        JCudaDriver.cuLaunchGrid(function, numOfBlocks, 1);
                        JCudaDriver.cuCtxSynchronize();

                        //Parameter setup for 2nd kernel call
                        JCudaDriver.cuModuleGetFunction(function, module, "gpuAvgRTTcompute2");
                        offset = 0;
                        offset = JCudaDriver.align(offset, Sizeof.POINTER);
                        JCudaDriver.cuParamSetv(function, offset, fResult, Sizeof.POINTER);
                        offset += Sizeof.POINTER;
                        offset = JCudaDriver.align(offset, Sizeof.POINTER);
                        JCudaDriver.cuParamSetv(function, offset, tmpResult, Sizeof.POINTER);
                        offset += Sizeof.POINTER;

                        JCudaDriver.cuParamSetSize(function, offset);
                        JCudaDriver.cuFuncSetBlockShape(function, numThreadperBlock, 1, 1);
                        JCudaDriver.cuLaunchGrid(function, 3, 1);
                        JCudaDriver.cuCtxSynchronize();

                        //Copy out the final results
                        JCudaDriver.cuMemcpyDtoH(Pointer.to(finalResults), devfinalResult,3*Sizeof.FLOAT);

However, when I run the program I get this error message:

Exception in thread "main" jcuda.CudaException: CUDA_ERROR_LAUNCH_FAILED
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:153)
at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:723)
at dnstest.main(dnstest.java:185)

I'm pretty sure the kernel code itself works, as I have tested it in plain CUDA. I would appreciate any help here. Thanks a lot.
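
For anyone checking the parameter setup above, the offset bookkeeping can be verified without a GPU. This is only a sketch in plain Java: the `align` helper is a stand-in written for illustration (assumed to round up to the next multiple, which is how `JCudaDriver.align` is used above), `ParamLayout` and `layout` are hypothetical names, and the 8-byte pointer size is an assumption for a 64-bit system.

```java
import java.util.Arrays;

public class ParamLayout {
    // Hypothetical stand-in for JCudaDriver.align: rounds offset up to the
    // next multiple of alignment (alignment must be a power of two).
    static int align(int offset, int alignment) {
        return (offset + alignment - 1) & ~(alignment - 1);
    }

    // Returns the byte offset of each parameter, aligning every parameter
    // to its own size. The last element is the total parameter size, i.e.
    // the value that would be passed to cuParamSetSize.
    static int[] layout(int[] sizes) {
        int[] offsets = new int[sizes.length + 1];
        int offset = 0;
        for (int i = 0; i < sizes.length; i++) {
            offset = align(offset, sizes[i]);
            offsets[i] = offset;
            offset += sizes[i];
        }
        offsets[sizes.length] = offset;
        return offsets;
    }

    public static void main(String[] args) {
        int ptr = 8;     // assumed pointer size in bytes (64-bit system)
        int intSize = 4; // size of a Java/CUDA int in bytes
        // Same order as gpuAvgRTTcompute: 3 pointers, 3 ints, 1 pointer
        int[] sizes = {ptr, ptr, ptr, intSize, intSize, intSize, ptr};
        System.out.println(Arrays.toString(layout(sizes)));
    }
}
```

Under these assumptions the last pointer lands at offset 40 rather than 36, because it has to be aligned to 8 bytes after the three ints, and the total size comes out as 48 bytes.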

system · July 15, 2010 at 05:33

Sorry for the extra post. I forgot to mention that the error is thrown at the first cuLaunchGrid() call.

Marco13 · July 15, 2010 at 11:51

Hi

I’m currently not at a CUDA-capable PC, but I can have a closer look at this on Sunday or Monday.

In any case, that’s a lot of code, and although (or because) it looks pretty straightforward, it’s hard to guess what might be causing this error. Unfortunately, the CUDA_ERROR_LAUNCH_FAILED error is not very specific -_- … It would help if I could reproduce the error, but I can’t make any promises right now. I’ll try to insert the code you posted into a test program that calls an empty kernel; maybe that helps to locate the error…

bye
Marco

system · July 15, 2010 at 18:55

Hi Marco,

Thanks for the response. Sorry for the code dump; I'm not sure where the error lies. On reflection, I should provide some more information to make the picture more complete. Below are the signatures of the two GPU kernels that I have written:


extern "C"
__global__ void gpuAvgRTTcompute(float *g_input07, float *g_input08, float *g_input09,  int arrSize07, int arrSize08, int arrSize09, float *gtempArray){

extern "C"
__global__ void gpuAvgRTTcompute2(float *g_output, float *gtempArray){



Essentially, what I am trying to do is compute the total sum of the values in each of the individual arrays (g_input07, g_input08, g_input09).
Hmm, I hadn't thought of trying an empty kernel as you suggested, so I will try that as well to see if I can identify the error. In any case, thanks Marco!

PS: JCuda is a really cool tool~ keep it up ^^

system · July 15, 2010 at 20:27

Hi,

Using Marco’s method, I managed to narrow down the error. It seems the error occurred because I failed to specify the size of a shared memory array used in the kernel code.

Currently, I am trying to use the cuFuncSetSharedSize() method of the driver API to set the shared memory size. I will post more info later.
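
A minimal sketch of the size calculation for that step, in plain Java so it runs without a GPU. It assumes the kernel declares its shared array as `extern __shared__ float` with one float per thread, which is not shown in the code posted above. `SharedSize` and `sharedBytesFor` are hypothetical names, and the constant 4 mirrors `Sizeof.FLOAT`.

```java
public class SharedSize {
    // Bytes per float; the same value as jcuda.Sizeof.FLOAT
    static final int FLOAT_BYTES = 4;

    // Bytes of dynamic shared memory one block needs if every thread
    // owns floatsPerThread float slots in the shared array.
    static int sharedBytesFor(int threadsPerBlock, int floatsPerThread) {
        return threadsPerBlock * floatsPerThread * FLOAT_BYTES;
    }

    public static void main(String[] args) {
        int numThreadperBlock = 512; // block size used in the code above
        int sharedBytes = sharedBytesFor(numThreadperBlock, 1); // assumption: 1 float per thread
        System.out.println(sharedBytes);

        // With JCuda on the classpath, the value would be set on the
        // function before the cuLaunchGrid call:
        // JCudaDriver.cuFuncSetSharedSize(function, sharedBytes);
    }
}
```

If the kernel needs more than one float per thread, only the second argument changes. The size has to be set for each function separately, so the second kernel would need its own call if it also uses an unsized shared array.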