I’m not sure whether you know how kernel invocation worked in CUDA 1.0-3.2: it was much more complicated and error-prone, because one had to specify the arguments individually, each with its size AND its alignment. It was a hassle, and it was one of the main reasons why I created the KernelLauncher utility class, which seems to be close to the „Builder style“ invocation of kernels and the varargs kernel call that you proposed.
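To give an idea of what "size AND alignment" meant in practice: every argument had to be placed at a byte offset that was rounded up to its alignment before it could be handed to the old parameter-setting functions. The following is only a plain-Java sketch of that bookkeeping, without any CUDA calls. The class and method names (`ArgLayoutSketch`, `alignUp`, `computeOffsets`) are made up for illustration, and the example sizes assume a 64-bit device pointer.

```java
import java.util.Arrays;

// Hypothetical helper, NOT part of JCuda: computes the byte offsets that
// a manual, per-argument kernel parameter setup has to keep track of.
class ArgLayoutSketch {

    // Round 'offset' up to the next multiple of 'alignment'.
    // Assumes that 'alignment' is a power of two.
    static int alignUp(int offset, int alignment) {
        return (offset + alignment - 1) & ~(alignment - 1);
    }

    // sizes[i] and alignments[i] describe argument i, in bytes.
    // Returns one offset per argument, followed by the total size.
    static int[] computeOffsets(int[] sizes, int[] alignments) {
        int[] result = new int[sizes.length + 1];
        int offset = 0;
        for (int i = 0; i < sizes.length; i++) {
            offset = alignUp(offset, alignments[i]);
            result[i] = offset;
            offset += sizes[i];
        }
        result[sizes.length] = offset;
        return result;
    }

    public static void main(String[] args) {
        // Assumed kernel signature: (int n, float* data, float factor)
        int[] sizes = {4, 8, 4};
        int[] alignments = {4, 8, 4};
        // Prints [0, 8, 16, 20]: the pointer is pushed from offset 4 to 8
        System.out.println(Arrays.toString(computeOffsets(sizes, alignments)));
    }
}
```

Getting a single one of these offsets wrong silently corrupts the arguments that the kernel sees, which is why this style of setup was so error-prone.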
However, the ‚kernelArgs‘ pointer of CUDA 4.0 greatly simplified the kernel argument setup. Of course, I could have simplified it even further, and could have introduced my own ‚KernelArgs‘ class to encapsulate (and hide) this pointer to pointers. But this is not a general solution. There are other methods that use pointers to pointers. For example, cuModuleLoadDataEx (in JCudaDriver) - and unfortunately, for this method I felt the necessity to introduce the JITOptions class. I’m really not happy with that. Other methods need arrays of pointers, like nppiBGRToYCrCb420_709CSC_8u_AC4P3R (in JNppi), and of course, more such methods could be introduced in future versions, even in the „core“ CUDA API. It should be possible to treat all these methods equally, as far as possible, and to offer the same possibilities as the C API. Simplifications here could be dangerous, because it’s hard to foresee how NVIDIA will change the CUDA API, and how such a „simplification“ could be adapted to such changes.
For the cuLaunchKernel case, I recently stumbled over something like that: I have an old GPU, with a low Compute Capability. And I did not consider the fact that newer devices with higher Compute Capability can allocate device memory IN kernels! The need to write back the pointer values into the pointer-to-pointers caused some headaches, but I hope that - although I’m not really satisfied with the current solution, and it needs to be cleaned up - this should now also work for cases where other methods might modify the „inner“ pointer values of a pointer-to-pointer.
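Roughly, the write-back problem looks like this. The following is only a plain-Java sketch with made-up names (`Ptr`, `fakeNativeCall`, `callWithWriteBack`), and it is not the actual JCuda/JNI implementation: it merely assumes that the "inner" pointers are copied into a raw array for the native call, so that any change made by the callee has to be copied back into the Java objects afterwards.

```java
// Hypothetical stand-ins, NOT the real jcuda.Pointer class: just enough
// to show why values must be written back after a native call.
class WriteBackSketch {

    // A Java-side pointer object wrapping a raw address.
    static class Ptr {
        long address;
        Ptr(long address) { this.address = address; }
    }

    // Stand-in for a native function that receives a pointer-to-pointers
    // and overwrites the first inner pointer (e.g. after an allocation).
    static void fakeNativeCall(long[] rawAddresses) {
        rawAddresses[0] = 0xCAFEL;
    }

    // Copies the inner addresses into a raw array, performs the call, and
    // writes the (possibly modified) addresses back into the Java objects.
    static void callWithWriteBack(Ptr[] pointers) {
        long[] raw = new long[pointers.length];
        for (int i = 0; i < pointers.length; i++) {
            raw[i] = pointers[i].address;
        }
        fakeNativeCall(raw);
        for (int i = 0; i < pointers.length; i++) {
            pointers[i].address = raw[i];
        }
    }

    public static void main(String[] args) {
        Ptr[] pointers = { new Ptr(0), new Ptr(42) };
        callWithWriteBack(pointers);
        // Without the second loop, pointers[0].address would still be 0
        System.out.println(Long.toHexString(pointers[0].address)); // cafe
        System.out.println(pointers[1].address);                   // 42
    }
}
```

A wrapper class that hides the pointer-to-pointers would have to do this write-back for every method that might modify the inner values, which is one reason for keeping the generic mechanism.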
In any case, I see your point about simplification (or maybe just adhering to Java conventions), but completely omitting the possibility to have pointer-to-pointers would reduce the expressiveness of the API in a way that simply cannot be accepted.
One thing that I learned in my life as a software developer: When it seems that there is an easy solution, this solution is likely to break later.
EDIT: In fact, even THIS thread could be split further, because it’s not about deadlocks any more, but about Pointer-to-Pointer and API aspects… I’ll consider this… 