Crashing Java SE

MrSampson · August 12, 2011, 05:13

Hi there,
I’m using JOCL 1.6 with the (brand spanking new) NVIDIA OpenCL 1.1 drivers on Java 6.26 SE, and I’ve bumped into a problem that I can’t get around.

When enqueueing (and presumably launching) one of my kernels, it crashes the Java Platform.

Everything else works just fine (program build, kernel definition/creation, context definition and so on), but I can’t uncrash the platform to see what’s really going on.

The call looks like this:
clEnqueueNDRangeKernel(commandQueue, kernelCentroidCalcEvenM, 3, null, gwsCentroidCalc, null, 0, null, null);

The kernel definition looks like this:

"__kernel void gpuCentroidCalc(" +
"    __global float* a," +
"    __global float* aCalc," +
"    int n," +
"    int k," +
"    int d," +
"    __global int* kCurrent," +
"    __global int* CentroidCount," +
"    __global float* centroids) {" +
...

Any tips on how to debug this thing and get it to not crash the Java platform?

Many thanks,
Oliver Sampson

Marco13 · August 12, 2011, 06:00

Hello

I have already downloaded the 1.1 drivers but have not tested them yet; I have set aside some time for this next week, and maybe I can run some first tests today. In any case, it would help to know whether the same code already worked with the 1.0 drivers (given that it does not use 1.1 features, of course). Otherwise, chances are high that something is simply wrong with the kernel, such as writing outside the bounds of an array. Did simple kernels (e.g. some of the samples from the website) work with the 1.1 drivers, or is it a general problem? Either way, I’ll run some tests as soon as possible and report back here on whether it works or whether the binaries need to be updated for the 1.1 drivers next week.

bye
Marco

MrSampson · August 12, 2011, 06:37

Thanks for the quick reply! I haven’t tested with the 1.0 drivers (and I suspect that rolling back won’t be all that easy).

What makes me think it’s more of a JOCL issue than an NVIDIA OpenCL issue is that the crash is happening in the Java VM. (I haven’t really looked into the JOCL internals, but I assume that JOCL is not much more than a wrapper for the calls that get sent to the OpenCL subsystem.) Previously, any syntax errors or calling errors were reported by the OpenCL runtime compiler, but in this case it doesn’t even get that far.

Of course, there could be other things going on in the JOCL code of which I’m not aware.

MrSampson · August 12, 2011, 06:55

So, I went back for another look to see if there was something stupid in my code that I missed, and sure enough there was.

When creating the kernel, I had forgotten to add the line for a newly added parameter, so the kernel in the source string had one more parameter than the kernel setup in my Java code. Now it runs, and I can start debugging again!

However, I don’t think that this is the kind of problem that should cause the Java VM to crash…

Thanks for all the great work on JOCL!

Marco13 · August 12, 2011, 07:25

EDIT: Wo-ho - I overlooked your last post, sorry…

Hello

Of course you should not roll back to older drivers! If necessary, I would update the JOCL binaries for the new version.

I just quickly installed the latest drivers (280.26, Windows 32-bit) and tested “JOCLSample_1_1.java” from the samples at jocl.org. It’s only a very small test with a very simple kernel, but it uses some of the features only available in OpenCL 1.1, and for me it worked with the NVIDIA platform (GeForce 8800) as well as the AMD platform (using the CPU).

And to emphasize this: I did not want to say that it’s an issue of the NVIDIA OpenCL drivers. It might have been an incompatibility between the JOCL library and the latest version of the drivers. But from the first test, this does not seem to be the case. (Of course, there might be something wrong with JOCL, but at least it does not seem to be a general problem with the new drivers).

You’re right: JOCL is intentionally only a very, very thin layer around the OpenCL functions. It does not add any error or sanity checks, and no convenience functions (although I also set aside the time for JOCL next week in order to make another attempt at wrapping the object-oriented JOCL from http://jogamp.org/jocl/www/ around my low-level JOCL…).

What I wanted to say: I have also seen crashes of the Java VM. Nasty, nasty crashes. But until now, in every case, I found out that I had made a mistake in the kernel. In most cases, as I already mentioned, this was due to null pointer accesses or writes outside of array bounds. OpenCL can detect pure compilation errors, but unfortunately other errors will not cause a graceful ‘NullPointerException’ or ‘ArrayIndexOutOfBoundsException’ to be thrown; instead they vaporize the VM.

At the moment, it’s not possible to say definitively whether the cause of the error is inside JOCL or inside the kernel. One easy next step would be to test one of the samples from the website, to see whether JOCL works for you in general. The following step could be to provide a small example program that calls your kernel and reproducibly crashes. (If you don’t want to post the code here, you can send me a PM or an email.)

bye
Marco

BTW: I have a strategy for writing my own kernels, which of course is not applicable to every type of kernel, but helped me a lot in the beginning: I created an abstract class that offers methods like the built-in OpenCL functions (“get_global_id(int dim)” etc.) and some additional helper functions, and that allows writing Java code that is at least very similar to the final OpenCL code. I have already started to extend this class to provide a “framework” for quick development of simple kernels (including benchmarking and debugging), but it is still far from being really usable. Maybe one day I’ll find the time to extend and publish it…
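
A minimal sketch of what such a class might look like, assuming a one-dimensional range and purely sequential execution. The names EmulatedKernel, VectorAddKernel and execute are hypothetical and are not taken from the unpublished class described above:

```java
import java.util.Arrays;

/** Hypothetical base class that mimics a 1D OpenCL NDRange in plain Java. */
abstract class EmulatedKernel {
    private int globalId;
    private int globalSize;

    /** Mirrors the OpenCL built-in get_global_id (only dimension 0 is emulated). */
    protected final int get_global_id(int dim) {
        return dim == 0 ? globalId : 0;
    }

    /** Mirrors the OpenCL built-in get_global_size (only dimension 0 is emulated). */
    protected final int get_global_size(int dim) {
        return dim == 0 ? globalSize : 1;
    }

    /** The kernel body, written to resemble the final OpenCL C code. */
    protected abstract void run();

    /** Runs the body once per work-item, one after another (no parallelism). */
    public final void execute(int globalWorkSize) {
        globalSize = globalWorkSize;
        for (int i = 0; i < globalWorkSize; i++) {
            globalId = i;
            run();
        }
    }
}

/** Example kernel: c[gid] = a[gid] + b[gid]. */
class VectorAddKernel extends EmulatedKernel {
    private final float[] a;
    private final float[] b;
    private final float[] c;

    VectorAddKernel(float[] a, float[] b, float[] c) {
        this.a = a;
        this.b = b;
        this.c = c;
    }

    @Override
    protected void run() {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }
}

class EmulatedKernelDemo {
    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = {10f, 20f, 30f, 40f};
        float[] c = new float[a.length];
        new VectorAddKernel(a, b, c).execute(c.length);
        System.out.println(Arrays.toString(c)); // prints [11.0, 22.0, 33.0, 44.0]
    }
}
```

The point of running the body in Java first is that an index error shows up as an ordinary ArrayIndexOutOfBoundsException with a stack trace instead of taking down the VM. Because the work-items run sequentially here, this kind of emulation says nothing about race conditions or barriers.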

Marco13 · August 12, 2011, 07:39

Well, it’s hardly possible to detect whether, for example, the given arguments match the arguments of the kernel. When you write code that, for example, accesses an array outside of its bounds, then anything can happen, and a VM crash is still by far the safest thing that can happen.
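
Simply counting the arguments is about the only part of this that host code could check by itself. A minimal sketch, assuming a hypothetical helper KernelArgTracker that is told how many parameters the kernel declares and records each index as it is set. It is not part of JOCL, and it cannot verify argument types or buffer sizes:

```java
import java.util.BitSet;

/** Hypothetical host-side bookkeeping for kernel arguments (not a JOCL class). */
final class KernelArgTracker {
    private final int expectedArgs;
    private final BitSet argsSet = new BitSet();

    /** expectedArgs is supplied by the caller; this sketch does not parse the kernel source. */
    KernelArgTracker(int expectedArgs) {
        if (expectedArgs < 0) {
            throw new IllegalArgumentException("expectedArgs must not be negative");
        }
        this.expectedArgs = expectedArgs;
    }

    /** Call this next to every place where a kernel argument is set. */
    void markSet(int index) {
        if (index < 0 || index >= expectedArgs) {
            throw new IndexOutOfBoundsException(
                "Kernel has " + expectedArgs + " arguments, but index " + index + " was set");
        }
        argsSet.set(index);
    }

    /** Call this right before enqueueing the kernel. */
    void checkComplete() {
        int firstMissing = argsSet.nextClearBit(0);
        if (firstMissing < expectedArgs) {
            throw new IllegalStateException(
                "Kernel argument " + firstMissing + " was never set");
        }
    }
}

class KernelArgTrackerDemo {
    public static void main(String[] args) {
        // gpuCentroidCalc from the first post declares 8 parameters.
        KernelArgTracker tracker = new KernelArgTracker(8);
        for (int i = 0; i < 7; i++) {
            tracker.markSet(i); // the eighth argument is "forgotten" on purpose
        }
        try {
            tracker.checkComplete();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "Kernel argument 7 was never set"
        }
    }
}
```

The expected count still has to be kept in sync with the kernel source by hand, so this only moves the mistake to a place where it fails with an exception.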

I have also experienced hangs that made a reboot necessary… you know: when the mouse pointer stops moving… On newer Windows versions this is handled more gracefully, with a message like “The display driver stopped responding…”, but on XP it can really cause nasty errors.

Some of these errors might be prevented by an OO-layer. For example, for something like

DeviceMemory d = allocate(100); // allocate 100 bytes
copyHostToDevice(h,d,200); // write 200 bytes

the memory size could be stored and verified. With JOCL, one only calls clEnqueueWriteBuffer, and if the memory size is wrong, it crashes, just as a C application would in this case.
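
A rough sketch of how such a check might look, assuming hypothetical DeviceMemory and CheckedTransfers classes. The actual OpenCL calls are left out and only indicated by comments, so this shows just the bookkeeping, not a working transfer:

```java
/** Hypothetical handle that remembers how many bytes were allocated. */
final class DeviceMemory {
    private final long sizeInBytes;

    DeviceMemory(long sizeInBytes) {
        if (sizeInBytes <= 0) {
            throw new IllegalArgumentException("Size must be positive: " + sizeInBytes);
        }
        this.sizeInBytes = sizeInBytes;
        // A real layer would create the buffer here (clCreateBuffer) and keep the cl_mem.
    }

    long size() {
        return sizeInBytes;
    }
}

/** Hypothetical OO-layer functions that validate sizes before touching the device. */
final class CheckedTransfers {
    private CheckedTransfers() {
    }

    static DeviceMemory allocate(long sizeInBytes) {
        return new DeviceMemory(sizeInBytes);
    }

    static void copyHostToDevice(byte[] host, DeviceMemory device, long numBytes) {
        if (numBytes < 0 || numBytes > host.length) {
            throw new IllegalArgumentException(
                "Cannot read " + numBytes + " bytes from a host array of " + host.length);
        }
        if (numBytes > device.size()) {
            throw new IllegalArgumentException(
                "Cannot write " + numBytes + " bytes into a buffer of " + device.size());
        }
        // A real layer would now call clEnqueueWriteBuffer with the verified size.
    }
}

class CheckedTransfersDemo {
    public static void main(String[] args) {
        DeviceMemory d = CheckedTransfers.allocate(100); // allocate 100 bytes
        byte[] h = new byte[200];
        try {
            CheckedTransfers.copyHostToDevice(h, d, 200); // try to write 200 bytes
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```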

But even with these checks that could be added in an OO-Layer: In the kernel, everybody is free to write anything…


int *pointer = 0;
pointer[-123] = 666;
while (true) { /* do nothing */ }

… and no one can prevent this…

bye
Marco