OK, then I’d really like to analyze this in a more reproducible and focused way.
For potential bugs in JCuda, I consider the comparison to the original CUDA implementation as the "ground truth": when a bug appears in JCuda and the same bug also happens in plain CUDA, then I lean back and say: "A pity - go ask at the NVIDIA forum" (well, more seriously: I usually try to help where I can, but don’t feel as "responsible" in these cases). When the bug ONLY happens in JCuda, then this may cause some sleepless nights for me…
In this case, I’d also apply this to performance issues. The MNIST example in JCudnn is largely a 1:1 port of the original cuDNN MNIST example, so I’ll try to compare the performance of both. In its original form, the times will hardly be measurable, but maybe I can at least identify some "core" part (excluding setup and shutdown) and execute that core a few hundred times to get an averaged result.
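Just to sketch the kind of measurement I have in mind: run the "core" part repeatedly (after a warmup, and excluding one-time setup and shutdown) and report the average. The `runCore()` body here is only a placeholder standing in for the actual forward/backward pass of the MNIST example, not the real cuDNN code:

```java
// Minimal sketch of the averaging approach described above.
// The runCore() method is a hypothetical placeholder for the
// "core" of the MNIST example (forward/backward pass).
public class CoreBenchmark {

    // Placeholder workload - in the real comparison this would be
    // the inference/training core of the cuDNN/JCudnn MNIST example
    static double runCore() {
        double sum = 0;
        for (int i = 0; i < 100_000; i++) {
            sum += Math.sqrt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        int warmup = 10;  // let the JIT compile the hot path first
        int runs = 300;   // "a few hundred times"

        for (int i = 0; i < warmup; i++) {
            runCore();
        }
        long before = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            runCore();
        }
        long after = System.nanoTime();
        double avgMs = (after - before) / 1e6 / runs;
        System.out.printf("Average core time: %.3f ms over %d runs%n", avgMs, runs);
    }
}
```

For the actual GPU measurement, one would additionally have to synchronize (e.g. with `cudaDeviceSynchronize`) before stopping the timer, because kernel launches are asynchronous and the CPU-side timer would otherwise stop before the GPU work is finished.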
Unfortunately, I won’t be able to do this immediately, but depending on my progress with my other tasks, maybe I can give it a first try at the end of next week.