[QUOTE=Marco13]Hardly anybody knows what Microsoft and NVIDIA are doing with their drivers „under the hood“. I still wonder where these 10-15% come from … and how you are measuring it - did you have dedicated timings/benchmarks under Windows 7 that you can now compare with your Windows 10 results?
Regarding cuDNN: I’m not aware of any further documentation. As I said when the cuDNN wrapper was requested for the first time:
Although „Deep Learning“ has become some sort of hype recently, the details are quite hard to grok, and particularly the details of the cuDNN API.
What does this refer to? Websearches involving JBlas and cuDNN do not bring any obvious results…
There certainly are people who really use cuDNN - often, I would guess, in close collaboration with NVIDIA. But actual cuDNN code samples are rare. So right now, I could only throw in some websearch results and do it like the authors: here you are - and make sure to get the parameters right, y’know? :rolleyes: The most serious application of cuDNN that I’m aware of is in Caffe, but the cuDNN backend is only a tiny part of that, and the source code does not look like something to quickly get started with.
For my side, I have to admit that I created the cuDNN wrapper upon request and ported the sample, but did not try to actively use cuDNN for custom projects until now. It would indeed be great if someone with a deeper understanding could provide some information (samples, tutorials) here…[/QUOTE]
I meant to say JCuda did the porting - i.e. you. I got my words twisted since I have both JBlas and JCuda on my mind at the moment.
I appreciate the feedback. Learning cuDNN has been a frustrating experience. I am trying to implement a domain-specific language for deep learning, and so far I have good results with the CPU version, but I want to run on the GPU since it is about 50 times faster than my CPU version. If nobody has documentation or an experience write-up on cuDNN, I will see what I can gather after learning it.
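To make the CPU-vs-GPU comparison concrete: the kind of kernel that dominates such a CPU baseline is the plain triple-loop matrix product. The sketch below is purely illustrative (the class and method names are mine, not from the actual DSL); it is this sort of loop that a GPU BLAS call replaces.

```java
// Illustrative only: a naive CPU matrix product of the kind that
// typically dominates fully-connected layers in a CPU-only network.
public class NaiveMatMul {

    // C = A * B, with A of shape m x k and B of shape k x n
    static double[][] matmul(double[][] a, double[][] b) {
        int m = a.length, k = b.length, n = b[0].length;
        double[][] c = new double[m][n];
        for (int i = 0; i < m; i++) {
            for (int p = 0; p < k; p++) {
                double aip = a[i][p]; // hoisted so the inner loop streams over a row of B
                for (int j = 0; j < n; j++) {
                    c[i][j] += aip * b[p][j];
                }
            }
        }
        return c;
    }
}
```

The loop order (i, p, j) keeps the inner loop walking rows of B contiguously, which already matters on the CPU; even so, a tuned GPU GEMM routinely outruns code like this by an order of magnitude or more on large matrices, consistent with the ~50x figure above.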
As for Caffe, its code is hard to decipher, which is why I want to build a DSL for deep learning in the first place.
About the performance difference between Win7 and Win10: the 10-15% is from memory, since I no longer have a Win7 machine to compare against, but I do have detailed timers for each and every operation in my LeNet run. Right now, I only call the GPU for the matrix product, and I recall the GPU portion used to take roughly 30-40 seconds, whereas now it is between 40 and 50 seconds.
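For what it's worth, the per-operation timers mentioned above can be as simple as accumulating `System.nanoTime()` deltas per label. This is a hypothetical sketch of that pattern, not the actual instrumentation from my code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of per-operation wall-clock accumulation,
// in the spirit of the "detailed timers" mentioned above.
public class OpTimer {
    private final Map<String, Long> totalNanos = new LinkedHashMap<>();

    // Run the operation and add its elapsed time under the given label.
    public void time(String label, Runnable op) {
        long start = System.nanoTime();
        op.run();
        long elapsed = System.nanoTime() - start;
        totalNanos.merge(label, elapsed, Long::sum);
    }

    // Total accumulated time for a label, in seconds (0 if never timed).
    public double seconds(String label) {
        return totalNanos.getOrDefault(label, 0L) / 1e9;
    }
}
```

One caveat when timing GPU calls this way: CUDA kernel launches are asynchronous, so the wrapper must time a span that ends after a synchronization point (e.g. after the result is copied back to the host), or the numbers will mostly measure launch overhead.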