Performance of JCuda on Windows 10

I have a problem with the performance of JCuda on Windows 10. I would appreciate any help or suggestions.

I have been using JCublas for matrix products on Windows 7 and got good performance.

However, after upgrading to Windows 10, the performance dropped significantly. Is there any reason for this?

The only relevant call I make is this:

cublasSgemm(handle, transA == 'n' ? CUBLAS_OP_N : CUBLAS_OP_T, transB == 'n' ? CUBLAS_OP_N : CUBLAS_OP_T, m, n, k, pAlpha, d_A, k, d_B, k, pBeta, d_C, m);
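
For completeness, a minimal, self-contained version of this pattern looks roughly like the following (the matrix sizes and data are placeholders, and I show only the no-transpose case; this is a sketch, not my actual code):

    import jcuda.Pointer;
    import jcuda.Sizeof;
    import jcuda.jcublas.cublasHandle;
    import static jcuda.jcublas.JCublas2.*;
    import static jcuda.jcublas.cublasOperation.*;
    import static jcuda.runtime.JCuda.*;

    public class SgemmTiming
    {
        public static void main(String[] args)
        {
            int m = 1024, n = 1024, k = 1024; // placeholder sizes
            float[] hostA = new float[m * k];
            float[] hostB = new float[k * n];
            float[] hostC = new float[m * n];

            cublasHandle handle = new cublasHandle();
            cublasCreate(handle);

            // Allocate device memory and copy the input matrices
            Pointer d_A = new Pointer();
            Pointer d_B = new Pointer();
            Pointer d_C = new Pointer();
            cudaMalloc(d_A, (long)m * k * Sizeof.FLOAT);
            cudaMalloc(d_B, (long)k * n * Sizeof.FLOAT);
            cudaMalloc(d_C, (long)m * n * Sizeof.FLOAT);
            cublasSetVector(m * k, Sizeof.FLOAT, Pointer.to(hostA), 1, d_A, 1);
            cublasSetVector(k * n, Sizeof.FLOAT, Pointer.to(hostB), 1, d_B, 1);

            Pointer pAlpha = Pointer.to(new float[]{ 1.0f });
            Pointer pBeta = Pointer.to(new float[]{ 0.0f });

            // C = alpha * A * B + beta * C (column-major, no transposition)
            long before = System.nanoTime();
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, pAlpha, d_A, m, d_B, k, pBeta, d_C, m);
            cudaDeviceSynchronize(); // the call is asynchronous
            long after = System.nanoTime();
            System.out.printf("SGEMM took %.3f ms%n", (after - before) / 1e6);

            cublasGetVector(m * n, Sizeof.FLOAT, d_C, 1, Pointer.to(hostC), 1);
            cudaFree(d_A);
            cudaFree(d_B);
            cudaFree(d_C);
            cublasDestroy(handle);
        }
    }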

I did not change anything other than upgrading to Windows 10 (not by choice; Windows just did it over my objection).

I reinstalled CUDA 7.5 specifically for Windows 10 afterwards, but it had no effect.

A remote diagnosis is difficult here (and I certainly won’t install Win 10 for testing this ;-)).

Pragmatic websearches for cuda OR cublas performance "windows 10" do not turn up any specific results or bug reports…

The usual, vague guess would be that it might be related to the driver. I’m not sure what the Windows 10 installation changes here. It may be the same driver in both cases, but IF not, consider updating to the most recent one.

A comparison to the native CUBLAS implementation might be interesting: basically, comparing the execution time of the JCublasSample from jcuda.org - Samples to the “simpleCUBLAS” sample from ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\7_CUDALibraries\simpleCUBLAS. One would have to fiddle around a bit to use the same matrix size, and to insert time measurements so that really only the execution time of the SGEMM is measured (a timing sketch follows the list below). However, this

  1. would only show whether there is an obvious, “large” difference (>5%)
  2. cannot be compared to the times that the native example had on Windows 7
  3. might not bring really relevant insights: Regardless of whether JCublas is slower than simpleCUBLAS or not, one does not know whether it was the same on Windows 7, or what might be the reason for this…
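
For the time measurement itself, CUDA events are the cleanest way to capture only the GPU execution of the SGEMM (excluding the memory transfers). A sketch, assuming that handle, the device pointers and the dimensions are set up as in the JCublasSample:

    // Requires: import jcuda.*; import jcuda.jcublas.cublasHandle;
    // import jcuda.runtime.cudaEvent_t;
    // import static jcuda.jcublas.JCublas2.*;
    // import static jcuda.jcublas.cublasOperation.*;
    // import static jcuda.runtime.JCuda.*;
    static float timeSgemm(cublasHandle handle, int m, int n, int k,
        Pointer pAlpha, Pointer d_A, Pointer d_B, Pointer pBeta, Pointer d_C)
    {
        cudaEvent_t start = new cudaEvent_t();
        cudaEvent_t stop = new cudaEvent_t();
        cudaEventCreate(start);
        cudaEventCreate(stop);

        cudaEventRecord(start, null); // null: default stream
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            m, n, k, pAlpha, d_A, m, d_B, k, pBeta, d_C, m);
        cudaEventRecord(stop, null);
        cudaEventSynchronize(stop);

        // Elapsed GPU time between the two events, in milliseconds
        float[] elapsedMs = { 0.0f };
        cudaEventElapsedTime(elapsedMs, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return elapsedMs[0];
    }

The same pattern (with the corresponding C calls) could be inserted into the native simpleCUBLAS sample, so that both sides measure exactly the same thing.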

But in general, and maybe most importantly: What is “significant”? Are we talking about 5%, or 50%, or 500%? (And how are you measuring this?)

Marco,

I greatly appreciate the reply. By significant, I mean 50%~100% slower. However, I did manage to improve things by upgrading to the latest NVIDIA driver for Windows 10, which turns out to be different from the one for Windows 7. For some reason, M$ didn’t automatically upgrade the video driver for me.

It is still about 10~15% slower than before, but not so bad now. I am new to this and didn’t know that CUDA 7.5 depends on the video driver for GPU computing. I thought CUDA had direct access to the GPU.

============================================================

An unrelated question about JCudnn: is there any documentation other than the example?

The reason I ask is that the sample program covers the forward pass only, and the functions for the backward gradients take numerous parameters that I am not sure how to interpret.
NVIDIA’s documentation is quite sparse in its explanations. I couldn’t locate more insightful documentation elsewhere, and yet a number of libraries have successfully built on cuDNN.

Since JBlas had ported cuDNN, I am sure someone here must have a good understanding of its usage. Perhaps someone has a summary they would be willing to share?

Many thanks.

Hardly anybody knows what Microsoft and NVIDIA are doing with their drivers „under the hood“. I still wonder where these 10-15% come from … and how you are measuring it - did you have dedicated timings/benchmarks under Windows 7 that you can now compare with your Windows 10 results?
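
By the way, one quick sanity check: JCuda can report which driver and runtime version CUDA actually sees, so you can verify that the update really arrived (a small sketch):

    import jcuda.runtime.JCuda;

    public class VersionCheck
    {
        public static void main(String[] args)
        {
            int[] driverVersion = { 0 };
            int[] runtimeVersion = { 0 };
            // The driver version is the highest CUDA version the driver supports
            JCuda.cudaDriverGetVersion(driverVersion);   // e.g. 7050 for CUDA 7.5
            JCuda.cudaRuntimeGetVersion(runtimeVersion);
            System.out.println("Driver version:  " + driverVersion[0]);
            System.out.println("Runtime version: " + runtimeVersion[0]);
        }
    }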

Regarding cuDNN: I’m not aware of any further documentation. As I said when the cuDNN wrapper was requested for the first time:

[QUOTE=Marco13]…for the case of cuDNN, I’m not convinced about the usefulness: There is a single sample, and I doubt that there will be more in the near future. The „core“ functions of the given sample are basically a chain of (uncommented) calls to methods with cryptic names and loads of (up to 13 (!)) parameters…

Hardly anybody would even just try to use cuDNN for solving custom problems, because (at least for me) the effort of „learning“ this API seems to be prohibitively large.
[/QUOTE]

Although „Deep Learning“ has become some sort of hype recently, the details are quite hard to grok, and particularly the details of the cuDNN API.

What does this refer to? Websearches involving JBlas and cuDNN do not bring any obvious results…

There certainly are people who really use cuDNN - I guess, often in close collaboration with NVIDIA. But actual cuDNN code samples are rare. So right now, I could only throw in some websearch results, and do it like the authors: Here you are - and make sure to get the parameters right, y’know? :rolleyes: The most serious application of cuDNN that I’m aware of is in caffe, but the cuDNN backend is only a tiny part of that, and the source code does not look like something to quickly get started with.

For my part, I have to admit that I created the cuDNN wrapper upon request and ported the sample, but have not actively used cuDNN for custom projects so far. It would indeed be great if someone with a deeper understanding could provide some information (samples, tutorials) here…

[QUOTE=Marco13]What does this refer to? Websearches involving JBlas and cuDNN do not bring any obvious results…[/QUOTE]

I meant to say JCuda did the porting – i.e. you. I mixed up my words since I have both JBlas and JCuda on my mind at the moment. :slight_smile:

I appreciate the feedback. Learning cuDNN has been a frustrating experience. I am trying to implement a domain-specific language for deep learning; so far I have good results with the CPU version, but I want to run on the GPU, since it is 50 times faster than my CPU version. If nobody has documentation or an experience write-up on cuDNN, I will see what I can gather after learning it myself.

As to Caffe, its code is hard to decipher, which is why I wanted to build a DSL for deep learning in the first place.

About the performance differential between Win7 and Win10: the 10~15% is based on my memory, since I no longer have a Win7 machine to compare with, but I have detailed timers for each and every operation I do with LeNet. Right now, I only call the GPU for matrix products, and I recall getting around 30~40 seconds for the GPU portion, whereas now it is between 40 and 50 seconds.

Yes, JCuda==Marco13, in this case :wink:

[ot]
There occasionally is this misconception that I should know CUDA or related libraries very well. For CUDA, I know the API, some basic application patterns, some of the surrounding infrastructure, like PTX, compilation, and „the theory of some best practices“. But I haven’t been using CUDA for real, larger, practical applications yet, and of course, I can’t be familiar with all CUDA-based libraries. I still don’t know what my „favorite“ CUDA-NPP function nppiYCbCr420ToYCbCr411_8u_P3P2R actually does (some color conversion thing…). When there are particular issues or requests, I try to read about it and provide help as far as I can, but … this is not always possible in all depth … which leads to… cuDNN:
[/ot]

I would have expected that once someone is somewhat familiar with the concepts (by using a CPU-based implementation), it might be easier to get started with cuDNN as well. At least, I assumed that the (cryptic) API functions are basically parallel implementations of common steps that appear during the learning pipeline, but maybe that’s wrong. However, cuDNN can still be considered as being a „relatively new“ technology, and the documentation and sample coverage will hopefully increase soon.
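
To illustrate the style of the API: even just describing an input tensor already follows the create-descriptor-then-configure pattern that every cuDNN operation builds on. A minimal (untested) sketch with JCudnn:

    import jcuda.jcudnn.cudnnHandle;
    import jcuda.jcudnn.cudnnTensorDescriptor;
    import static jcuda.jcudnn.JCudnn.*;
    import static jcuda.jcudnn.cudnnTensorFormat.CUDNN_TENSOR_NCHW;
    import static jcuda.jcudnn.cudnnDataType.CUDNN_DATA_FLOAT;

    public class CudnnDescriptorSketch
    {
        public static void main(String[] args)
        {
            cudnnHandle handle = new cudnnHandle();
            cudnnCreate(handle);

            // Describe a batch of 64 single-channel 28x28 images (LeNet-style input)
            cudnnTensorDescriptor inputDesc = new cudnnTensorDescriptor();
            cudnnCreateTensorDescriptor(inputDesc);
            cudnnSetTensor4dDescriptor(inputDesc,
                CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                64, 1, 28, 28); // n, c, h, w

            // Every actual operation (convolution, pooling, activation, and
            // their backward passes) takes such descriptors plus device
            // pointers - which is where the long parameter lists come from.

            cudnnDestroyTensorDescriptor(inputDesc);
            cudnnDestroy(handle);
        }
    }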

(They published cuDNN v5 recently (the update is on my TODO list); maybe it contains some more examples.)

LeNet was mentioned in the thread that I linked to above, alongside things like Theano, Torch and Caffe. I think this will be one of the main tasks in the near future: hiding the nitty-gritty details of the cuDNN API behind a convenient abstraction layer (a rough sketch of what I mean follows below). Still, someone has to know this stuff - and particularly well when he intends to write a wrapper for it.
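
Just to sketch what I mean by such an abstraction layer (all names here are hypothetical - nothing like this exists in JCudnn):

    import jcuda.Pointer;

    // Hypothetical: each layer sets up its cuDNN descriptors once, and hides
    // the raw calls and their long parameter lists behind a forward/backward pair
    interface Layer
    {
        void forward(Pointer input, Pointer output);
        void backward(Pointer gradOutput, Pointer gradInput);
    }

Each implementation (convolution, pooling, activation, …) would then internally chain the corresponding cuDNN calls.
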
Sorry that I can’t provide more specific or focussed help here.