JCuda kernel doesn't work for certain data dimensions

I am using JCuda to accelerate image processing.
The parallel algorithm is implemented as a CUDA kernel in C. It works for images with dimensions of 816*612 and 1632*1224. However, when I change the target image to one with dimensions of 408*306, the kernel fails, and the error message from JCuda is as follows:

Exception in thread "main" jcuda.CudaException: CUDA_ERROR_UNKNOWN
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:282)
at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:1795)

The kernel calling code is as below:
Pointer kernelParameters = Pointer.to(
Pointer.to(deviceInputsdata),
Pointer.to(deviceInputslist),
Pointer.to(deviceInputneighborshift),
Pointer.to(deviceInputbuckets),
Pointer.to(deviceInputweightMap),
Pointer.to(deviceInputnBuck),
Pointer.to(new int[]{numElements}),
Pointer.to(new int[]{bucketer.NeighborNum}),
Pointer.to(new float[]{sigmaS}),
Pointer.to(new float[]{sigmaR}),
Pointer.to(new float[]{smin}),
Pointer.to(deviceOuputmsRawData),
Pointer.to(deviceInputmodeTable),
Pointer.to(new int[]{Width})
);
int blockSizeX = 32;
int gridSizeX = (int)Math.ceil((double)numElements / blockSizeX);
cuLaunchKernel(function,
gridSizeX, 1, 1, // Grid dimension
blockSizeX, 1, 1, // Block dimension
0, null, // Shared memory size and stream
kernelParameters, null // Kernel- and extra parameters
);
cuCtxSynchronize();
I have tried different block sizes such as 32, 64 and 128. None of them works for me.

I don’t understand why this kernel works for images of bigger dimensions but fails for the smaller one.

Please help me! Thanks

Hello

Unfortunately, CUDA_ERROR_UNKNOWN is not very specific and thus does not say very much. There may be several reasons for it, and one of them is that an invalid memory region is written to. So one of the first debugging steps could be to comment out the whole kernel contents…

extern "C"
__global__ void theKernel(...)
{
    /* do nothing ...

     int x = blockIdx.x * blockDim.x + threadIdx.x; 
     int y = blockIdx.y * blockDim.y + threadIdx.y; 
     ...


    */
}

and see whether the error still happens.

Although (according to the parameter list) the kernel might be quite complex, some more information about what the kernel does might be helpful for further steps…

bye
Marco

Hi Marco

I have tried what you suggested and commented out the whole body of the kernel, and it runs without any errors.

The weirdest thing is that the kernel works well for some images with big dimensions, but crashes when the image is down-sampled to a smaller one.

The kernel executes the mean shift clustering algorithm for each pixel in the image.
The full version of the kernel code is too lengthy to be put here, so I have attached it to this reply.
Thanks very much!

[QUOTE=Marco13]Hello

Unfortunately, CUDA_ERROR_UNKNOWN is not very specific and thus does not say very much. There may be several reasons for it, and one of them is that an invalid memory region is written to. So one of the first debugging steps could be to comment out the whole kernel contents…

extern "C"
__global__ void theKernel(...)
{
    /* do nothing ...

     int x = blockIdx.x * blockDim.x + threadIdx.x; 
     int y = blockIdx.y * blockDim.y + threadIdx.y; 
     ...


    */
}

and see whether the error still happens.

Although (according to the parameter list) the kernel might be quite complex, some more information about what the kernel does might be helpful for further steps…

bye
Marco[/QUOTE]

OK, the kernel is indeed rather complex. I wonder whether it might be possible to split it into smaller functions. Or add some comments. Or use variable names other than „lN“, „yK“ or „hiLTr“. But regardless of that: One can almost be sure that somewhere an attempt is made to write to an invalid memory location. It’s a pity that there is no ArrayIndexOutOfBoundsException pointing at the right line :wink: From quickly skimming over the code, I noticed nothing obvious, but of course, there are many parameters, and some of them contain values that are used as indices. Maybe I can have another look tomorrow, but I will most likely not be able to really „verify“ the kernel code anyhow…

Dear Marco
Thank you very much for your great help.
You said variable names like „lN“ or „yK“ should not be used. Are there any rules for the names of variables in JCuda?
And I still don’t quite get why this kernel works for certain image dimensions but fails for others. If there is an attempt to write to an invalid memory location, shouldn’t the same thing happen for input images of any dimensions?

Again, thanks so much. I really appreciate your efforts.

Lei Wang

Sorry, there are no specific „rules“ about the variable names. I only mentioned this because nobody knows what „lN“ or „yk“ stands for. When they are called „numberOfNeighbors“ or „sizeOfArray“, it might be easier to find the error :wink: But since the kernel is too complex to be „verified“ by just reading it, there is no urgent need to rename the variables anyhow.

And I still don’t quite get why this kernel works for certain image dimensions but fails for others. If there is an attempt to write to an invalid memory location, shouldn’t the same thing happen for input images of any dimensions?

Writing to an invalid memory location may cause what is usually called „undefined behavior“: It might cause wrong values to be returned from the kernel. In other cases, it might seem to work without an error. Or it might cause a crash if the memory location is required for some other computation.

To emphasize this once more: The „CUDA_ERROR_UNKNOWN“ may have many possible reasons (and probably, only the people at NVIDIA really know which ones). But in all cases that I have experienced so far, this error was related to an attempt to write to invalid memory locations. I do not know whether this is the case here as well, but I assume that it is.

Theoretically, the easiest way to detect such an error is to use CUDA-MEMCHECK from http://docs.nvidia.com/cuda/cuda-memcheck/index.html (it is already contained in the CUDA Toolkit). It is possible to use this for JCuda applications as well, by specifying a Batch file as the „application“ to be started. For example, to debug the „JCudaVectorAdd“ example from the website, you can create a file called „runTest.bat“ that starts the program:


java -cp .;jcuda-0.5.0a.jar JCudaVectorAdd

And then run CUDA-MEMCHECK by typing


cuda-memcheck runTest.bat

at the console.

If there are writes to invalid memory locations, it will print something like


========= Invalid __global__ write of size 4
=========     at 0x000000a8 in add
=========     by thread (9,0,0) in block (0,0,0)
=========     Address 0x2001ffffc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:nvcuda.dll (cuLaunchKernel + 0x1e2) [0xf612]
=========     Host Frame:JCudaDriver-windows-x86_64.dll (Java_jcuda_driver_JCudaDriver_cuLaunchKernelNative + 0x16b) [0x85bb]
=========     Host Frame:[0x25823a8]
...

One still has to interpret this output, but at least one can see whether something is wrong, as long as it does not only print


========= ERROR SUMMARY: 0 errors

I have not yet tested this for more complex applications, but in general, it could be a first step to see whether there are invalid writes or not.

bye
Marco

Hi Marco
Okay, I got your point. I will try to make the variable names more reasonable.
Besides, I have tried to create the bat file. However, I always get an error when I try to execute it, like below:

"C:\Windows\system32>cuda-memcheck runMeanshift.b
========= CUDA-MEMCHECK
========= Error: Could not run runMeanshift.bat"

Should I put the bat file in a specific path?

Thanks
Lei Wang

You should start cuda-memcheck from the directory where your project is - and the BAT file should probably be in the same directory as your project. (Your project is most likely NOT in C:\Windows\system32 :wink: )

After opening the console:
C:\Windows\system32> cd \
C:\> cd WhereYourProjectIs
C:\WhereYourProjectIs> cuda-memcheck theBatFile.bat

Hello

I’m curious whether this problem was solved. If the error still occurs even if there are NO invalid memory accesses, please let me know.

bye

Dear Marco
Thanks for caring. Sorry for not replying sooner.
I have tried the cuda-memcheck tool. There are some invalid reads in the global memory space. I tried to use the deviceQuery sample to check the available global memory size for my GPU, but I didn’t find any useful information.
At the same time, the program sometimes works and sometimes fails when started from Eclipse. However, it always fails when started from the cmd console. I am totally lost about what is happening.
Please help me. Thanks!

Lei Wang

[QUOTE=Marco13]Hello

I’m curious whether this problem was solved. If the error still occurs even if there are NO invalid memory accesses, please let me know.

bye[/QUOTE]

Of course you don’t have to respond, especially when you are busy with hunting the bug :wink: I’m not sure how I can help you: on the one hand, I cannot test the kernel, and on the other hand, even if I COULD test it, I don’t know what it is supposed to do and how it was implemented. Didn’t the memcheck output help to find the line of the kernel where an invalid memory location is accessed…?

I just had some new findings.
If the image dimensions are small, I have to set a smaller block size like 16 to make it work.
Conversely, if the image is big, I have to set a bigger block size like 32 or 64 to make it work.
Are there any rules about this? For example, should the data dimensions be an integer multiple of the block size?

Lei Wang

There are no “general” rules about that, because you can choose much of the data layout on your own.

But now when you mention this: It might(!) have a simple reason (I should have noticed this earlier :o )

In your kernel, you are computing the thread index, and accessing the “modeTable” using this index:

        int i = blockIdx.x * blockDim.x + threadIdx.x;
....
        if (modeTable1D[i] != 1)

Depending on the size of the modeTable, this memory location might simply be invalid.

Assume that the size of modeTable1D is 100, and the blockSizeX is 32. Then this computation is performed:

int numElements =  100;
int blockSizeX = 32;
int gridSizeX = (int)Math.ceil((double)numElements / blockSizeX);

So the gridSizeX will be 4. That means that in the computation
int i = blockIdx.x * blockDim.x + threadIdx.x;
the value of ‘i’ will range from 0 to 4*32 - 1 = 127 - and thus, the accesses for i >= 100 will be outside of the “modeTable1D”!

Many kernels start with an explicit check whether the thread index is inside of the problem domain (see for example the http://jcuda.org/samples/JCudaVectorAddKernel.cu )

Thus, one step for resolving the problem could be to simply do the same check in your kernel:

int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= n)
{
    return; // Do not access elements that are outside of the modeTable1D!
}
....
if (modeTable1D[i] != 1)
...

(I assume that ‘n’ is actually the size of modeTable1D here.)

But of course, there might still be other errors. At least you might see in the memcheck-output whether the number of invalid read/write operations is reduced when you add this test.
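By the way, the launch configuration arithmetic can be checked with a small standalone Java snippet (the numbers are the hypothetical ones from the example above, not taken from your actual code):

```java
public class GridSizeDemo {
    public static void main(String[] args) {
        int numElements = 100;           // hypothetical size of modeTable1D
        int blockSizeX  = 32;
        // Same rounding-up computation as in the kernel launch code:
        int gridSizeX = (int) Math.ceil((double) numElements / blockSizeX);
        int totalThreads = gridSizeX * blockSizeX;
        System.out.println("gridSizeX    = " + gridSizeX);    // 4
        System.out.println("totalThreads = " + totalThreads); // 128
        // These threads receive an index i in [100, 127] - without the
        // "if (i >= n) return;" guard, they read past the end of the array:
        System.out.println("out of range = " + (totalThreads - numElements)); // 28
    }
}
```

So whenever numElements is not an exact multiple of the block size, there will be some excess threads, and the guard at the beginning of the kernel is what keeps them from touching invalid memory.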

Hi Marco
This is very helpful information. I neglected the effect of the ceil function.
I made the modification by adding the thread index check and simplified the kernel a little bit.
It works!! I will try the full version of the kernel later. I will let you know if there are any further problems.
Thank you very much. You are such an amazing expert!!

Lei Wang

Dear Marco
Thanks for your great help last time.
Today I have another problem when simulating a new algorithm.
I have a .cu file containing an extern "C" function which is supposed to run on the CPU and drive the kernel.
Is there any way in Java to call such a CPU-side function in a .cu file?

I hope I have made myself understood.

Thanks!
Lei Wang

[QUOTE=novakwang]Hi Marco
This is very helpful information. I neglected the effect of the ceil function.
I made the modification by adding the thread index check and simplified the kernel a little bit.
It works!! I will try the full version of the kernel later. I will let you know if there are any further problems.
Thank you very much. You are such an amazing expert!!

Lei Wang[/QUOTE]

Hello

I’m not entirely sure: Is the function that you want to call a CUDA kernel, or is it just a normal C function?

But regardless of that, as far as I know, there is probably no practical way to execute such a function from Java. (Theoretically, it MIGHT be possible to execute kernels on the CPU with things like http://code.google.com/p/gpuocelot/ but I cannot imagine how this would work from Java, so there is certainly no way that is practically applicable.)

I can imagine that there are many possible reasons why you want to execute the C function, but … isn’t it possible to port this function from C to Java?
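Of course it depends on what the function does, but plain host-side C code often ports almost mechanically. As a trivial illustration (this helper is purely hypothetical, not taken from your attached code):

```java
import java.util.Arrays;

public class PortDemo {
    // A hypothetical host-side C helper like
    //   void scaleWeights(float *w, int n, float s) {
    //       for (int i = 0; i < n; i++) w[i] *= s;
    //   }
    // becomes in Java (the array length replaces the explicit 'n'):
    static void scaleWeights(float[] w, float s) {
        for (int i = 0; i < w.length; i++) {
            w[i] *= s;
        }
    }

    public static void main(String[] args) {
        float[] w = {1.0f, 2.0f, 4.0f};
        scaleWeights(w, 0.5f);
        System.out.println(Arrays.toString(w)); // [0.5, 1.0, 2.0]
    }
}
```

The parts that actually launch kernels would of course stay on the JCuda side; only the surrounding control logic needs to be translated.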

bye

Hi Marco
Thanks very much for your explanation.
I understand, and I will try to port this function to Java.

Best
Lei Wang

[QUOTE=Marco13]Hello

I’m not entirely sure: Is the function that you want to call a CUDA kernel, or is it just a normal C function?

But regardless of that, as far as I know, there is probably no practical way to execute such a function from Java. (Theoretically, it MIGHT be possible to execute kernels on the CPU with things like http://code.google.com/p/gpuocelot/ but I cannot imagine how this would work from Java, so there is certainly no way that is practically applicable.)

I can imagine that there are many possible reasons why you want to execute the C function, but … isn’t it possible to port this function from C to Java?

bye[/QUOTE]

Marco,
Since you had referred me to this topic for cuda-memcheck utility, I have a question that I need to ask. (if this is the wrong place, feel free to move this reply to the correct place)

Currently, using cuda-memcheck, I only get the PC address of where the error occurs (e.g. „at 0x000000b0 in evaluate“).
I saw in memcheck’s docs that if the kernel is compiled with line information, I can see the actual line number where the error occurred.

I tried to create the KernelLauncher object as follows:

kernel = KernelLauncher.create("bin/facefilter/FaceFilterpp.cu", "evaluate", true, "-arch=sm_21", "-lineinfo");

but still, during memcheck, I do not see any line numbers :frowning: (I am creating PTX, obviously)
Is there a specific trick to it in JCuda, or am I doing something wrong here?
If there is no way to see the actual line number, how can I map the PC address to an actual instruction?

Thank you very much in advance.

Hello

I have not yet used memcheck extensively myself, but I will try to run a test in the next few days and see whether line numbers can be printed.

bye