As far as I know, there is no way to synchronize between kernels from inside a kernel. All synchronization has to take place at the command queue level, for example using events. So the synchronization here could possibly be done with events - not exactly as you described, but ROUGHLY like this:
- Create 10 user events
- Enqueue the kernel0 on queue0
-- Once with an "eventWaitList" that contains event0
-- Once with an "eventWaitList" that contains event1
- Enqueue commands on queue1 that will set the status of the user events to CL_COMPLETE, one after another
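To make the ordering in the steps above concrete, here is a small host-side model in plain Java. It is NOT OpenCL code and makes no JOCL calls: as an assumption for illustration, each user event is modelled as a CompletableFuture, and queue0/queue1 as single-threaded executors (roughly like in-order command queues). The class name UserEventGatingSketch and the method run are made up for this sketch. In real JOCL code the corresponding calls would be clCreateUserEvent, the event wait list of clEnqueueNDRangeKernel, and clSetUserEventStatus.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical plain-Java model of the event-gating pattern (no OpenCL involved).
public class UserEventGatingSketch {

    // Runs the pattern with 'count' user events and returns the order
    // in which the gated "kernel0 launches" actually executed.
    static List<Integer> run(int count) throws Exception {
        // Single-threaded executors stand in for in-order command queues
        ExecutorService queue0 = Executors.newSingleThreadExecutor();
        ExecutorService queue1 = Executors.newSingleThreadExecutor();
        List<Integer> executed = Collections.synchronizedList(new ArrayList<>());
        try {
            // "Create user events" (clCreateUserEvent in real code)
            List<CompletableFuture<Void>> userEvents = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                userEvents.add(new CompletableFuture<>());
            }

            // "Enqueue kernel0 on queue0", once per event: launch i may
            // only start after user event i has been completed
            List<CompletableFuture<Void>> launches = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                final int id = i;
                launches.add(userEvents.get(i)
                    .thenRunAsync(() -> executed.add(id), queue0));
            }

            // "queue1" completes the user events one after another
            // (clSetUserEventStatus(event, CL_COMPLETE) in real code)
            CompletableFuture<Void> signaller = CompletableFuture.runAsync(() -> {
                for (CompletableFuture<Void> event : userEvents) {
                    event.complete(null);
                }
            }, queue1);

            // Wait until all events are signalled and all launches are done
            signaller.get();
            CompletableFuture.allOf(
                launches.toArray(new CompletableFuture[0])).get();
            return new ArrayList<>(executed);
        } finally {
            queue0.shutdown();
            queue1.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Launch order: " + run(10));
    }
}
```

In this model, no launch on queue0 runs before its event has been completed from queue1, and the launches run in the order in which the events were signalled. Whether a real OpenCL implementation behaves the same way in every detail is something I would still check against the spec.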
The "JOCLSample_1_1.java" from http://jocl.org/samples/samples.html shows some examples of user event handling; you may want to have a look at it.
However, the synchronization can be tricky if there is global memory involved. I'll have to look up the spec to see under which conditions (and how) this is possible. It's also not entirely clear to me whether you intended to use the global memory only to emulate a semaphore (which could be done using events), or whether you really wanted to use it for "communication" (in terms of data transfer)...?
The clCreateSubDevices function is part of OpenCL 1.2. Currently, JOCL supports only OpenCL 1.1. I'm already working on (and have basically finished) support for OpenCL 1.2, but since there are no official implementations of OpenCL 1.2 yet, I have not published the update. The AMD drivers contain an OpenCL 1.2 preview, but only for AMD GPUs, and at the moment I'm using NVIDIA. However, once an OpenCL 1.2 implementation is available, I'll finish and test the JOCL update and upload the new version.