cuDNN wrapper?

Are there any plans to create a cuDNN wrapper?

I tried it myself using SWIG, but my knowledge of C++ is not sufficient.

Hello,

I wasn’t even aware that cuDNN existed. There are several libraries for which one could consider providing Java support - but I’ll first have to check whether this really makes sense. (For example, I also started JNpp, but the complexity of the API poses challenges for maintenance as well as for potential users.) I’ll have a look at cuDNN and see whether mapping it to Java may be appropriate.

Thanks for this hint!
Marco

Thank you! If I can help with the Java part, please let me know.

To avoid leaving this unanswered: I had a short look at cuDNN, particularly at the sample. I find that the actual topic is quite interesting. In fact, I’ve been working on a project related to „some form of neural networks“ for quite a while now, and the MNIST data set was one „benchmark“ test case. But for the case of cuDNN, I’m not convinced about the usefulness: There is a single sample, and I doubt that there will be more in the near future. The „core“ functions of the given sample are basically a chain of (uncommented) calls to methods with cryptic names and loads of (up to 13 (!)) parameters…

Like this…
[spoiler]

cudnnSetTensor4dDescriptor(srcTensorDesc,
                           tensorFormat,
                           dataType,
                           n, c, h, w);
cudnnSetFilter4dDescriptor(filterDesc,
                           dataType,
                           conv.outputs,
                           conv.inputs,
                           conv.kernel_dim,
                           conv.kernel_dim);
cudnnSetConvolution2dDescriptor(convDesc,
                                // srcTensorDesc,
                                // filterDesc,
                                0, 0, // padding
                                1, 1, // stride
                                1, 1, // upscale
                                CUDNN_CROSS_CORRELATION);
cudnnGetConvolution2dForwardOutputDim(convDesc,
                                      srcTensorDesc,
                                      filterDesc,
                                      &n, &c, &h, &w);
cudnnSetTensor4dDescriptor(dstTensorDesc,
                           tensorFormat,
                           dataType,
                           n, c, h, w);
cudnnGetConvolutionForwardAlgorithm(cudnnHandle,
                                    srcTensorDesc,
                                    filterDesc,
                                    convDesc,
                                    dstTensorDesc,
                                    CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
                                    0,
                                    &algo);
cudnnGetConvolutionForwardWorkspaceSize(cudnnHandle,
                                        srcTensorDesc,
                                        filterDesc,
                                        convDesc,
                                        dstTensorDesc,
                                        algo,
                                        &sizeInBytes);
cudnnConvolutionForward(cudnnHandle,
                        &alpha,
                        srcTensorDesc,
                        srcData,
                        filterDesc,
                        conv.data_d,
                        convDesc,
                        algo,
                        workSpace,
                        sizeInBytes,
                        &beta,
                        dstTensorDesc,
                        *dstData);

[/spoiler]

Hardly anybody would even try to use cuDNN for solving custom problems, because (at least for me) the effort of „learning“ this API seems to be prohibitively large. And even if there were maybe 50 people in the world who seriously consider using this library: I doubt that there are many Java programmers among them.

However, since you asked the question, I at least have to consider that you might be one of them :wink: So did you specifically plan to use this for a certain application, or did you „just want to try it out“?

It would probably not be sooo much effort to feed the header into code generators and build something like JCuDNN, but at the moment, I think that there are (many) tasks in my queue with a (much) better effort-to-usefulness ratio…

Here is some interesting (and not too mathematical :wink: ) background on DNN (or convolutional neural networks): Convolutional Neural Networks (LeNet) — DeepLearning 0.1 documentation
Having read this page, the names of the cuDNN methods now make sense to me.

My impression is that DNN is quite hot at the moment, because of the recent breakthrough in image recognition, but I might be biased since I am subscribed to quite a few newsfeeds about this topic :wink:

I asked for a cuDNN wrapper since I am porting the Theano framework (Welcome — Theano 0.7 documentation), which is written in Python, to Scala. Having cuDNN included would be nice, but the framework is already useful enough without it. I can always add it afterwards, when it becomes available.

FYI, I use the JCublas and JCudaVec libraries a lot, for which many thanks!

More resources to read, thanks for that, I will have a look (maybe not everything that has been researched about this topic was in the book about neural networks that I read 20 years ago :wink: - and recently, I was rather focussed on an implementation of a SOM, so no „hot topic“ there either). Is your port already available on some public SCM?

You can find the first attempt I did 2 years ago here: https://bitbucket.org/RoelVanderPaal/algebra_old

I will be working over the next weeks on integrating JCuda into the improved version. I will ping you here when this version is also publicly available.

I also started integrating GPU support into my SOM stuff (but with JOCL, as only a few simple custom kernels are necessary, and no BLAS). It’s certainly a challenge to span the range between idiomatic, interface-heavy Java (or the even more abstract, functional Scala) and this close-to-metal procedural GPU stuff. I would consider it ideal if it were possible to transparently switch between CPU and GPU computations, but … this requires things that “you would simply not do on the CPU” - primarily, things like “releasing memory/resources”. I’m curious to see which approaches you are going to take.
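
Just to illustrate what I mean by “transparently switching”: a minimal sketch of such a backend abstraction (all names here are made up, not from any existing library), where the GPU side forces resource handling into the interface:

// Hypothetical sketch of a backend abstraction (all names made up).
// The point is that the GPU side forces an explicit "release" step into
// the interface that a pure CPU implementation would never need.
interface Vector
{
    int size();
    void release(); // a no-op on the CPU, mandatory on the GPU
}

interface Backend
{
    Vector create(float data[]);                  // may copy to device memory
    void add(Vector a, Vector b, Vector result);  // runs on the CPU or as a kernel
    float[] read(Vector v);                       // may copy back to the host
    void shutdown();                              // release the context, handles, ...
}
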

Hello,

I spent a couple of hours wrapping up cuDNN using JavaCPP. Since it references cuBLAS and the CUDA API itself, I also had to wrap those up together, but at runtime it shouldn’t matter if we use JCuda. I ported the mnistCUDNN.cpp sample to test it out, and it works just fine. It’s all available here:
https://github.com/bytedeco/javacpp-presets/tree/master/cuda

(I am targeting CUDA 6.5 because that’s what the current version of cuDNN uses, but I’ve made sure that the presets as they are can generate wrappers for CUDA 7.0 as well.) To try them out, first get the latest source code for JavaCPP at https://github.com/bytedeco/javacpp and run mvn install to install that, before running mvn install on the presets. It should work the same on Linux, Mac OS X, and Windows, but I’ve only tested it on Linux x86-64 for now (assuming everything can be found in /usr/local/cuda/).

I plan to wrap up other parts of CUDA while I’m at it, but the API is quite low level and a bit rough around the edges, so I believe there would be room for collaboration. Let me know what you think! Thanks

Samuel

(EDIT: I copied this post for some off-topic discussion to http://forum.byte-welt.net/byte-welt-projekte-projects/swogl-jcuda-jocl/jcuda/16661-java-wrappers-cuda-javacpp.html )

@saudet
Interesting project, with the additional advantage of being able to add it as a Maven dependency.

I could not build it on OS X, however; I created an issue for this: https://github.com/bytedeco/javacpp-presets/issues/57
I’d gladly help with testing this on OS X, so feel free to ping me if needed.

[ot]
These missing symbols appear occasionally. I noticed them in NPP, for example ( https://devtalk.nvidia.com/default/topic/774389/cuda-programming-and-performance/functions-missing-in-npp-lib-npp-on-win32-cuda-6-5-14/ ), but also in other libraries
[/ot]

I am also in need of cuDNN. The library may seem esoteric, and in fact it is not meant to be used by most people who are looking for a deep neural network library. However, NVIDIA put in significant effort to provide optimized implementations of many of the popular neural network layers (in particular convolution, which is very processor-intensive and difficult to implement well… much like a good GEMM, but even more difficult), and the major NN libraries have adopted it. I have written my own NN library in Java (hooray Java! Most of the other libraries are for craptastic scripting languages like Lua.) which uses JCuda and JCublas, and I would like to adopt cuDNN as well. I plan to open-source it soon. If it is not too difficult for you, I would very much appreciate a quick (even rough) port; otherwise I’ll have the double work of understanding cuDNN and learning how your build scripts/framework work. Or, we can collaborate.

I’ll also chime in to beg for Maven integration. I have a work-around that uses an in-project Maven repository, but it is so easy to support officially that I would implore you to do it. I’ve also written a simple little Maven plugin that finds all the .cu files in my sources and compiles them if they’re out of date. Even more awesome would be a plugin that generates Java proxies for the kernels, but I haven’t tried writing that yet. (The proxy would be rather primitive and direct, mapping each device function to a method with a signature like myKernel(blockSize, gridSize, stream, sharedMem, Pointer arg1, Pointer arg2, Pointer arg3); see the sketch below.)
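
To make that idea a bit more concrete, such a generated proxy for a (hypothetical) kernels.cu could look roughly like this - the interface name and the kernel method are made up, only the JCuda types are real:

import jcuda.Pointer;
import jcuda.driver.CUstream;
import jcuda.runtime.dim3;

// Hypothetical interface that the Maven plugin could generate from kernels.cu:
// one method per __global__ function, launch configuration up front,
// one Pointer per kernel parameter.
interface KernelsCuProxy
{
    void myKernel(dim3 blockSize, dim3 gridSize, CUstream stream, int sharedMemBytes,
        Pointer arg1, Pointer arg2, Pointer arg3);
}
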

Yes, I also received an (indirect) request for “JCudnn” via mail. It seems that there are more people interested in that than I originally thought. I’ll try to put it a bit higher on my “todo” list.

Regarding the Maven integration: Isn’t https://github.com/MysterionRise/mavenized-jcuda what you’re looking for primarily? (As discussed in http://forum.byte-welt.net/byte-welt-projekte-projects/swogl-jcuda-jocl/jcuda/11567-parent-project-jcuda-usage-3.html#post119479 ).

Using a proxy for calling the kernels was something that I also considered a while ago. I think that the KernelLauncher class from the jcuda.org - Utilities is already nearly that (except that it does not implement a certain interface).

I haven’t tried Mavenized JCuda since it didn’t support 7.0 for a long time. Anyway, it doesn’t actually count as being on Maven since it’s not in Maven Central.

The main benefit of a compile-time generated proxy is that it lists all the functions in your .cu file with the correct signatures (or, at least, the correct number of arguments). You get syntax errors if the names or signatures change, and you don’t need to create and then save all of the KernelLaunchers to fields. It’s also easier to instantiate (since the proxy would know where to find the fatbin). Besides that, I’m not sure the KernelLauncher API is ideal. A “KernelLauncher” is both a function pointer and a way to access the multiple functions in a module? I would refactor it into a class “Module” (the return value of one of many static methods that load CUDA code) and a class “Kernel”, roughly as sketched below. Dynamically-loaded kernels have their uses, so proxies don’t need to displace KernelLauncher.
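
In (purely hypothetical) code, the split I have in mind would be roughly this - class names and signatures are just a sketch, not an existing API:

import jcuda.Pointer;

// Sketch of the Module/Kernel split described above (all names made up).
class Module
{
    // One of several possible static factory methods that load CUDA code
    static Module loadFatbin(String fileName) { /* cuModuleLoad ... */ return new Module(); }
    static Module compile(String cuSourceCode) { /* run nvcc, then load */ return new Module(); }

    // A Module gives access to the functions it contains
    Kernel getKernel(String functionName) { /* cuModuleGetFunction ... */ return new Kernel(); }
}

class Kernel
{
    // Launch configuration and invocation
    Kernel configure(int gridSizeX, int blockSizeX) { return this; }
    void call(Pointer... arguments) { /* cuLaunchKernel ... */ }
}
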

Maven is great. I’m surprised you’re not using it.

I’m using Maven for some of my other libraries - for example, for JOCL. But from my experience there, Maven does not play very well with libraries that involve native binaries. They are simply not built during the “normal” build phase, and they HAVE to be built on different machines. (I know there are plugins for native stuff - but it’s a hassle in any case.)

(EDIT: A follow-up thread about proxies for kernel launches has been split from this one: http://forum.byte-welt.net/byte-welt-projekte-projects/swogl-jcuda-jocl/jcuda/16789-proxies-kernel-launches.html )

Hi guys,

BTW, it’s one of JavaCPP’s goals to figure out something that works well with Maven. It’s not perfect yet, obviously, but I welcome any ideas, so if you notice anything that could be done better, please do let me know! Thank you

Samuel

[QUOTE=Marco13]To avoid leaving this unanswered: I had a short look at cuDNN, particularly at the sample. I find that the actual topic is quite interesting. In fact, I’ve been working on a project related to “some form of neural networks” for quite a while now, and the MNIST data set was one “benchmark” test case. But for the case of cuDNN, I’m not convinced about the usefulness: There is a single sample, and I doubt that there will be more in the near future. The “core” functions of the given sample are basically a chain of (uncommented) calls to methods with cryptic names and loads of (up to 13 (!)) parameters…

[…]

Hardly anybody would even try to use cuDNN for solving custom problems, because (at least for me) the effort of “learning” this API seems to be prohibitively large. And even if there were maybe 50 people in the world who seriously consider using this library: I doubt that there are many Java programmers among them.[/QUOTE]
Sorry, but I don’t quite agree; on the contrary, I would say that cuDNN is very useful for research & development in Deep Learning (which is becoming increasingly popular and important). cuDNN is now also supported by important DL frameworks/libs like Theano, Caffe, Torch, etc. Most DL and DNN developers are using CUDA and, increasingly (either directly or indirectly), cuDNN.

cuDNN is actually the reason why I stumbled upon JCuda, and I was rather astonished that there seems to be no wrapper yet…
Maybe you could give a rough estimate of whether and when cuDNN support can be expected?

Thanks,
Felix

@Felix3 Yes, in the meantime I also noticed that „Deep Learning“ seems to be some sort of new buzzword in certain communities, and … „you can’t live without it any more“ :wink: Some tasks (e.g. also the update for CUDA 7.5) had higher priority until now, and I already scheduled some other task for the coming week, but as the API of cuDNN does not seem to be very complex structurally (but only conceptually), creating something like „JCudnn“ should not be sooo much effort. I’ll have another look and drop a note here and on the website when it is available.

@RoelVanderPaal @alexd457 @Felix3

A first version of JCudnn has been uploaded.

The source code is at https://github.com/jcuda/jcudnn (but I still have to fiddle a bit with the makefiles, to make building it a bit easier)
The binaries (currently, for Windows 64bit) are available in the downloads section of the website, at jcuda.org - Downloads
The only sample that is available until now, namely the MNIST digit recognition, has been ported and is available in the samples section at jcuda.org - Samples
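
To give a rough impression of how the port looks on the Java side: the descriptor setup from the C sample above translates more or less one-to-one. A minimal sketch (the MNIST-like dimensions are just an example):

import static jcuda.jcudnn.JCudnn.*;
import static jcuda.jcudnn.cudnnDataType.CUDNN_DATA_FLOAT;
import static jcuda.jcudnn.cudnnTensorFormat.CUDNN_TENSOR_NCHW;
import jcuda.jcudnn.JCudnn;
import jcuda.jcudnn.cudnnHandle;
import jcuda.jcudnn.cudnnTensorDescriptor;

public class JCudnnFirstSteps
{
    public static void main(String args[])
    {
        JCudnn.setExceptionsEnabled(true);

        // Create the cuDNN handle
        cudnnHandle handle = new cudnnHandle();
        cudnnCreate(handle);

        // Describe a 1x1x28x28 input tensor (e.g. one MNIST image), NCHW, float
        cudnnTensorDescriptor srcTensorDesc = new cudnnTensorDescriptor();
        cudnnCreateTensorDescriptor(srcTensorDesc);
        cudnnSetTensor4dDescriptor(srcTensorDesc,
            CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 1, 28, 28);

        // Clean up
        cudnnDestroyTensorDescriptor(srcTensorDesc);
        cudnnDestroy(handle);
    }
}
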

(Disclaimer: I have no idea what this library is actually doing. Reading more about this, starting with the links that RoelVanderPaal provided, is still on my “todo” list. But maybe the current state is enough to gather some feedback).

Hi Marco,

thanks for the very fast response and the work you have already invested! Sorry for my rather late reply (I was busy with other projects…).
I tried to run the sample today from my Eclipse project and got an exception in JCudnn.setExceptionsEnabled(true):

“No resource found with name ‘/lib/JCudnn-windows-x86_64.dll’”

The jcudnn-0.7.5.jar had been added as an external JAR file to the Eclipse project, and the Native Library Location had been set properly too, just like for all other JCuda JARs.
The other JARs (jcuda, jcublas, jcurand, …) can be initialized without problems and apparently find their DLLs, but jcudnn does not.
Setting the JVM parameter -Djava.library.path=… did not help either.

I had a quick look at the source on GitHub but could not find an obvious reason, and I checked the spelling of the DLL - that seems to be OK too…

Any ideas?

Thanks,
Jürgen

*** Edit ***

Did a quick run in the debugger and it seems that the JCudnn-windows-x86_64 DLL is corrupted or invalid…