cuDNN version 5

OK, then I’d really like to analyze this in a more reproducible and focused way.

For potential bugs in JCuda, I consider the comparison to the original CUDA implementation as the "ground truth": when a bug appears in JCuda, and the same bug happens in plain CUDA as well, then I lean back and say: "A pity - go ask at the NVIDIA forum :stuck_out_tongue_winking_eye:" (well, more seriously: I usually try to help where I can, but I don’t feel as "responsible" in these cases). When the bug ONLY happens in JCuda, then this may cause some sleepless nights for me…

In this case, I’d apply the same standard to performance issues. The MNIST example in JCudnn is largely a 1:1 port of the original cuDNN MNIST example, so I’ll try to compare the performance of both. In its original form, the times will hardly be measurable, but maybe I can at least identify some "core" (excluding setup and shutdown) and execute this core a few hundred times to get an averaged result.
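The "repeat the core and average" idea could be sketched roughly as follows. Note that this is only an assumption of how such a harness might look: the `runCore()` body here is a CPU placeholder, not the actual JCudnn MNIST forward pass, and the class name and run counts are made up for illustration.

```java
// Hedged sketch of a timing harness for measuring a repeatable "core".
// runCore() is only a stand-in workload, NOT the real JCudnn MNIST code.
public class CoreBenchmark {

    // Placeholder workload; in the real benchmark this would be the
    // network's forward pass, with setup and shutdown excluded.
    static double runCore() {
        double s = 0;
        for (int i = 0; i < 10_000; i++) {
            s += Math.sqrt(i);
        }
        return s;
    }

    public static void main(String[] args) {
        final int warmupRuns = 50;  // let the JIT compile the hot path first
        final int timedRuns = 500;  // average over many runs

        double sink = 0;
        for (int i = 0; i < warmupRuns; i++) {
            sink += runCore();
        }

        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            sink += runCore();
        }
        long elapsedNs = System.nanoTime() - start;

        // Print the sink so the JIT cannot discard the measured work.
        System.out.printf("Average core time: %.3f us (sink=%.1f)%n",
            elapsedNs / (timedRuns * 1000.0), sink);
    }
}
```

For actual GPU work, the timed loop would additionally have to synchronize (e.g. with `JCuda.cudaDeviceSynchronize()`) before reading the clock, since kernel launches are asynchronous and the wall-clock time would otherwise only measure the launch overhead.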

Unfortunately, I won’t be able to do this immediately, but depending on my progress with my other tasks, I may be able to give it a first try at the end of next week.

Alright. I have run JCudnn 5 on a Linux server with a K40c, and the performance is the same as with JCudnn 4. I guess this is because cuDNN 5 is not optimized for the older architecture.

If you look at what NVIDIA says about cuDNN 5 on their homepage, it compares cuDNN 4 + K40 with cuDNN 5 + M40. That is just misleading.