Recover device after crash


#1

Hi there

I’m setting up a testing framework and I was wondering if it’s possible to recover after the device has crashed.
For example, when testing a specific kernel I sometimes get a CL_OUT_OF_RESOURCES, which leaves the GPU unable to do any other work as long as I keep using the same context, command queue, etc.

I’ve now tried to destroy the context, … whenever such an exception is thrown, and then to recreate everything. But when recreating the context I get a CL_DEVICE_NOT_AVAILABLE.

Is there a way to reset a device so it can be used again after crashing? At the moment all the following test cases fail because they can’t use that device any more after the crash.

Thanks in advance

Stef


#2

Hello

JOCL is a very thin layer around OpenCL. So everything that can go wrong when, for example, passing invalid pointers to OpenCL can equally go wrong when passing invalid Pointers to JOCL. In many cases this means that errors are hardly recoverable: In the worst case (on WinXP) the screen may freeze (and by “freeze” I mean that even the mouse pointer does not move any more). On newer Windows versions, the worst case is the TDR message “The Display driver stopped responding…”. And it’s not unusual that an error in OpenCL causes the JVM to die painfully (leaving an hs_err*.txt file).

I assume that whether it’s possible to recover after a CL_OUT_OF_RESOURCES depends on what actually caused this error: The CL_OUT_OF_RESOURCES message seems to be some sort of “default” error message that is returned when anything goes wrong that prevents the device from continuing to work. I’ve seen this error message in many situations, for example, as a response to writing memory outside of the bounds of a cl_mem object. Actually, this has nothing to do with being “out of resources”, but this is what OpenCL then reports during the next kernel invocation attempt.

I could try to create some test cases that provoke this error and do some experiments, but in any case, the behavior will depend on the exact error conditions. For example, attempting to enqueue a kernel with a local memory size that is too large should return CL_OUT_OF_RESOURCES, but this should probably be recoverable. When the CL_OUT_OF_RESOURCES is returned due to an out-of-bounds access, it might be unrecoverable, because writing to invalid memory locations may corrupt the OpenCL state in a way that causes unspecified behavior anyhow.

Moreover, the exact behavior may even depend on the OS and OpenCL implementation. So at the moment, I cannot give any definite answer as to whether or how it is possible to recover from specific errors (or the unspecific ones like CL_OUT_OF_RESOURCES).

bye
Marco


#3

I run each kernel I created first on the GPU, then on the CPU, both in OpenCL. If, like you say, the GPU crash caused JOCL to more or less die, wouldn’t it be impossible for the CPU implementation to run?
Yet it does complete successfully without any errors whatsoever.

Keeping that in mind, wouldn’t it be theoretically possible to release all memory objects related to the context which crashed, release the command queue, the program and kernel objects of every “operation”, and then finally release the context, and still be able to recreate all of this using the device ID and the platform ID I saved earlier?
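
A minimal sketch of the teardown order described above, in plain Java so it runs without a device. ResourceScope and its methods are hypothetical names, not part of JOCL; with JOCL, each registered action would presumably wrap the matching release call (for example CL.clReleaseMemObject or CL.clReleaseContext). It only shows the bookkeeping: resources are released newest-first, so the context goes last, and one failing release does not stop the others. Whether the device accepts a new context afterwards is exactly the open question here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of JOCL: remembers how to release each
// resource and releases them newest-first, so the context goes last.
class ResourceScope {
    private static final class Entry {
        final String name;
        final Runnable release;

        Entry(String name, Runnable release) {
            this.name = name;
            this.release = release;
        }
    }

    private final Deque<Entry> entries = new ArrayDeque<>();

    // Records what happened during releaseAll(), in release order.
    final List<String> releaseLog = new ArrayList<>();

    // With JOCL, the action would wrap a call such as CL.clReleaseMemObject(mem).
    void register(String name, Runnable release) {
        entries.push(new Entry(name, release));
    }

    // Releases everything that was registered; a failing release is logged
    // and the remaining resources are still released.
    void releaseAll() {
        while (!entries.isEmpty()) {
            Entry e = entries.pop();
            try {
                e.release.run();
                releaseLog.add(e.name);
            } catch (RuntimeException ex) {
                releaseLog.add(e.name + " (release failed: " + ex.getMessage() + ")");
            }
        }
    }

    public static void main(String[] args) {
        ResourceScope scope = new ResourceScope();
        // Register in creation order; the print statements stand in for real release calls.
        for (String n : new String[] {"context", "commandQueue", "program", "kernel", "memObject"}) {
            scope.register(n, () -> System.out.println("release " + n));
        }
        scope.releaseAll();
    }
}
```

After releaseAll(), the test framework would create a fresh context and queue from the saved platform and device IDs and register them in a new scope for the next test case.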


#4

For one thing, I’m not sure about the level of separation between the GPU and the CPU implementation. Maybe AMD has managed to separate them in a way that allows the CPU part to continue its work, even if the GPU part crashed.

Apart from that, my remarks in the previous post also referred to the fact that the exact behavior will depend on the “type” of the crash. Not every exception is a “crash” in that sense. For example:

  • Passing a pointer like
    Pointer p = null;
    to one of the JOCL methods will throw a NullPointerException, which is of course not critical in that sense

  • Passing a pointer
    Pointer p = new Pointer(); // This is a C-“NULL”-Pointer
    or a memory object
    cl_mem mem = new cl_mem(); // Invalid, because it is uninitialized
    to a JOCL method will cause something like a CL_INVALID_VALUE, which should also not be critical and should be recoverable.

  • Passing a
    cl_mem mem = new cl_mem(); // Invalid, because it is uninitialized
    as one argument to a kernel invocation is perfectly legal on the one hand. BUT if the kernel attempts to write to this memory object, the behavior will be unspecified - it may range from a plain crash of OpenCL and the JVM to, not unlikely, a CL_OUT_OF_RESOURCES during the attempt to enqueue the next operation to the queue.
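
As a rough sketch, a test framework could sort failures into these three cases before deciding whether to reuse the device. FailureTriage, Severity and classify are hypothetical names for illustration, and the caller is assumed to extract the OpenCL status code from whatever exception was thrown (0 if there is none). The numeric constants are the values from the OpenCL headers. Treat the mapping as a heuristic: as said above, CL_OUT_OF_RESOURCES can also have harmless causes.

```java
// Hypothetical triage for a test harness; not part of JOCL.
class FailureTriage {
    enum Severity {
        HARMLESS,     // Java-side error, the device was never touched
        RECOVERABLE,  // OpenCL rejected the call, device state should be intact
        SUSPECT       // device state unknown, rebuild the context before continuing
    }

    // Error code values as defined in the OpenCL headers.
    static final int CL_SUCCESS = 0;
    static final int CL_OUT_OF_RESOURCES = -5;
    static final int CL_INVALID_VALUE = -30;

    // clStatus is the OpenCL status associated with the failure, or CL_SUCCESS
    // if the failure did not come from an OpenCL call.
    static Severity classify(Throwable failure, int clStatus) {
        if (failure instanceof NullPointerException || clStatus == CL_SUCCESS) {
            return Severity.HARMLESS;
        }
        if (clStatus == CL_OUT_OF_RESOURCES) {
            // May be the delayed symptom of an earlier out-of-bounds write.
            return Severity.SUSPECT;
        }
        // Argument errors such as CL_INVALID_VALUE are reported before anything runs.
        return Severity.RECOVERABLE;
    }

    public static void main(String[] args) {
        System.out.println(classify(new NullPointerException(), CL_SUCCESS));
        System.out.println(classify(new RuntimeException("invalid value"), CL_INVALID_VALUE));
        System.out.println(classify(new RuntimeException("out of resources"), CL_OUT_OF_RESOURCES));
    }
}
```

Only the SUSPECT case would then trigger a full release and recreation of the context before the next test case.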

To my understanding, what you described should be possible: The memory objects, kernels etc. that are related to one context describe actual “resources” that can be released. In contrast to that, the platform_id and device_id should be pure identifiers, without an associated state, and thus be reusable. (I’d have to verify this by looking up the spec again, but that’s how I understood it until now.)

bye
Marco