JCuda[Driver].setExceptionsEnabled(false) still throwing exceptions

SirM2X · April 28, 2013 at 10:24

Hello,
Although I have these two lines in my code:

JCudaDriver.setExceptionsEnabled(false);
JCuda.setExceptionsEnabled(false);

JCuda is still throwing exceptions, with a random black „null“ somewhere in the middle of its output! Like below:

This error appears when I try to run a very simple kernel that looks pretty OK but fails to run:

```
__global__ void calculateFitness(int centroidCount, int dataPointCount,
						short ** relSet,
						double **distances,
						double *val
						)
{
	int index = threadIdx.x;
	if (index > centroidCount)
		return;

	// get number of points in the current cluster
	int count = 1;

	for( int i = 0 ; i < dataPointCount ; i++) {
		if (relSet[i][index] == 1)
			count ++;
	}

	// sum up the distances of the points that belong to this cluster
	double temp = 0;
	for( int i = 0 ; i < dataPointCount ; i++) {
		if (relSet[i][index] == 1)
		{
			temp += distances[i][index] / count;
		}
	}
	
	val[index] = temp;	
}
```

Actually, this kernel is called after 2 other kernels. The other two kernels finish without any problems.
This kernel is only called with a block size of 15 and 1 block in the grid. Also, centroidCount is 15 and dataPointCount is 5000. Both distances and relSet have 5000 rows and 15 columns.
I'm not sure what is wrong.
Thanks in advance.

Marco13 · April 29, 2013 at 03:23

Hello

The ‚random black null‘ is usually an output that was made with System.out.println(…). Note that the console (in Eclipse or other IDEs) receives its input from two sources: From System.err and from System.out. So when you write something like

System.err.println("One");
System.out.println("Two");

there is no guarantee that ‚One‘ will be printed before ‚Two‘, because they come from different streams. And consequently, during a larger output to System.err (like the Stack Trace), there may be some outputs from System.out mixed in.
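As a minimal sketch of how to rule out this interleaving while debugging, both streams can temporarily be pointed at the same PrintStream, so that the output arrives in program order. The class and method names here are made up for illustration, and the in-memory sink is only an assumption to keep the example self-contained:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

class StreamOrderDemo {

    // Redirects System.out and System.err into one shared stream, prints
    // to both, and returns everything that was written, in order.
    static String printThroughOneStream() {
        PrintStream originalOut = System.out;
        PrintStream originalErr = System.err;
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream shared = new PrintStream(sink, true);
        try {
            System.setOut(shared);
            System.setErr(shared);
            // Both calls now write into the same stream, so their
            // relative order is preserved.
            System.err.print("One\n");
            System.out.print("Two\n");
        } finally {
            // Always restore the original streams.
            System.setOut(originalOut);
            System.setErr(originalErr);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.print(printThroughOneStream());
    }
}
```

In a real program it is usually enough to call System.setErr(System.out) once at startup while hunting for the source of a stray output.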

In this case, this null might also explain what is wrong there: If the ‚null‘ is the result of a debugging output like
System.out.println(oneKernelParameter);
then it would also explain the LAUNCH_FAILED error, because one of the kernel parameters would be null.

Concerning the exception: The KernelLauncher is doing an additional check, independent of the Low-Level checks that are controlled with ‚setExceptionsEnabled‘. For the low-level calls that directly go to CUDA, you can always disable the exceptions and then do manual checks:

int errorCode = 0;

errorCode = JCudaDriver.cuLaunchKernel(function, ...);
if (errorCode != CUresult.CUDA_SUCCESS) { reportSomeError(); }

(This would have to be done for each and every CUDA function call, which is rather inconvenient - that’s why I added the automatic check and the option to throw an exception in case of an error).
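To make the manual variant a bit less tedious, the check can be moved into a small helper. This is only a sketch: the class and method names are hypothetical, and the local constant is a stand-in for CUresult.CUDA_SUCCESS (which is 0 in the CUDA driver API), so that the example runs without JCuda on the classpath:

```java
class ManualCheckSketch {

    // Stand-in for CUresult.CUDA_SUCCESS (assumption: used here only so
    // that the sketch compiles without the JCuda library).
    static final int CUDA_SUCCESS = 0;

    // Returns true if the error code signals success. Otherwise reports
    // the failing call on System.err and returns false.
    static boolean succeeded(int errorCode, String callName) {
        if (errorCode == CUDA_SUCCESS) {
            return true;
        }
        System.err.println(callName + " failed with error code " + errorCode);
        return false;
    }

    public static void main(String[] args) {
        // In real code, these values would be the return values of the
        // low-level calls, for example of JCudaDriver.cuLaunchKernel.
        int codeOfGoodCall = 0;
        int codeOfBadCall = 1; // any nonzero value signals an error

        System.out.println(succeeded(codeOfGoodCall, "firstCall"));
        System.out.println(succeeded(codeOfBadCall, "secondCall"));
    }
}
```

Each call still has to be wrapped by hand, so this does not remove the inconvenience, it only shortens the check to one line per call.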

But for the KernelLauncher#call method, this is not possible. Of course, this method could also return an error code, but this would seem clumsy… (BTW: With earlier CUDA versions, the ‚call‘ method did in fact make several Low-Level CUDA calls, and each one could have caused an error…).

So in the end, I don’t think that it should be possible to disable the exceptions in the KernelLauncher as well, because then you have no possibility to detect whether there was an error or not…

bye

SirM2X · April 29, 2013 at 05:37

Marco,
Thank you very much for your response and elaboration.
I am pretty sure the black „null“ is coming from somewhere inside JCuda (and not from my code, since I don’t output anything).
The exception idea is actually a very neat one. No one likes to manually check whether an error has occurred. In this case I was interested to see why my kernel wouldn’t execute.

On a side note, it seems that something had gone horribly wrong with CUDA. I finally got the kernel working by making a very stupid change:
instead of using the variable „dataPointCount“ (which was passed by JCuda as an int with the value 5000), I used the literal number 5000 in both of the loops, and suddenly the problem is solved!!! I have no idea why something like that might happen!!

Thank you again for your time and response.

Marco13 · April 29, 2013 at 08:49

Well, at least the KernelLauncher does not contain any “System.*.print” statements, but I can check the JNI part again.

Of course it should be possible to pass a simple ‘int’ to the kernel and use it inside the kernel accordingly. May I ask which environment (Operating System, Bitness…) you are using? Maybe I can create a small test to see whether there is a general problem with passing ints through the KernelLauncher (although I assume that I would have noticed something like this earlier, but … you never know…)

SirM2X · April 29, 2013 at 08:52

It’s really funny! As I mentioned before, the kernel I’m using is called after 2 other kernel invocations. They all use the same "int"s as their arguments. But for whatever reason, this kernel fails to launch using those provided values.
I’m using Linux Mint 14 with CUDA 5 if that helps.

Marco13 · April 29, 2013 at 12:51

In general, it’s not a good sign when a kernel launch is affected by a previous launch: If a kernel A works on its own, but does NOT work when it is launched after a kernel B, then it’s not unlikely that there is an error in kernel B. You might want to try running your program with CUDA-MEMCHECK, as described here: http://forum.byte-welt.de/showthread.php?p=19277#post19277 (I should probably add some information about this on the website…). But of course, this is just a guess, and a first attempt to explain this seemingly inexplicable behavior.

SirM2X · April 29, 2013 at 16:13

Thank you very much, Marco.
Every time I post here I learn something new… You’ve been really helpful, as always.

Cheers