ClojureCUDA

Hi @Marco13 ,

I took a little time to create an initial version of Clojure CUDA integration. After one afternoon of development it only covers memory copying, mostly because I had to read a bit of literature to translate from the CL way of doing things that I'm used to. The good news is that JCuda works flawlessly so far :slight_smile:

The main thing that annoys me is CUDA's insistence on doing the context/thread management itself. That makes it easier for beginners, but harder when I need more fine-tuned control. Also, the many cuMemcpyXtoY variants are a bit less pleasing than OpenCL's copying methods with flags, but I can live with that and make it polymorphic in Clojure (memcpy-host! vs memcpy!). Also, when I called a procedure with the wrong type of pointer and got CUDA error 999, the CUDA drivers stopped working properly and couldn't even initialize. Only a system restart helped. Again, this is due to CUDA, not JCuda. In similar cases with OpenCL, recovering the AMD drivers was much easier.

You can see the code at GitHub - uncomplicate/clojurecuda: Clojure library for CUDA development
The tests with the examples are in this file: clojurecuda/core_test.clj at master · uncomplicate/clojurecuda · GitHub

If you have time to take a glance at the code and give me some feedback, that would be great :slight_smile:

Nice! I'll probably not be able to give profound feedback, because I've never really used Clojure (I think I tried out your CL-related work once, but I am a complete noob apart from that). But I'll take this as a chance to see whether I can get it up and running :slight_smile:

I'm surprised that you consider this easier. In fact, I think that the thread dependencies and the context management are among the trickiest things of CUDA in direct comparison to CL. There is no need to even care about threads in OpenCL, whereas in CUDA you have to be aware of the contexts and cuCtxPushCurrent/cuCtxPopCurrent (fortunately, this became a bit simpler, I think in CUDA 5.0 or so).
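To make that concrete, here is a minimal driver-API sketch (not from ClojureCUDA or JCuda, error checking omitted) of the push/pop juggling needed when a single context is shared between host threads:

```c
#include <cuda.h>

CUcontext ctx;

void setup(void) {                // runs on the "main" host thread
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);    // ctx becomes current on this thread
    cuCtxPopCurrent(&ctx);        // detach it so other threads can borrow it
}

void worker(void) {               // runs on a different host thread
    cuCtxPushCurrent(ctx);        // make ctx current here
    CUdeviceptr d;
    cuMemAlloc(&d, 1024);         // driver calls now implicitly target ctx
    cuMemFree(d);
    cuCtxPopCurrent(&ctx);        // give it back when done
}
```

The point is that the "current" context is implicit, thread-local state: every thread that wants to touch the GPU has to push it first, which is exactly the bookkeeping that OpenCL makes explicit through its context/queue arguments.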

The different flavors of the memory copy calls can indeed be overwhelming. However, only a handful are really "important", as the *Async, 2D/3D and texture variants are only for very specific application cases, which I doubt would be routed through an (even thin) abstraction layer. (You will probably mainly use the Driver API anyhow. For the JNI part of JCuda, the most annoying thing is that exactly the same set of functions is available in the Runtime API - with slightly different names and signatures, but the same functionality. But that basically applies to the whole API, which is the same for Runtime and Driver, except for the few kernel/module/context management functions that only exist in the Driver API.)

On the one hand, this is a side effect of JCuda being a very thin layer around CUDA (i.e. NO abstraction). Whatever can go wrong in CUDA will directly affect JCuda and the JVM. (A typed API could explicitly differentiate between host and device pointers, but this was not the intention behind the current JCuda implementation.) This behavior, of course, is … unfortunate: Java programmers are used to seeing helpful stack traces, and are often even able to continue using the application (or at least to shut it down in a graceful way). But the slightest error in (J)Cuda will often crash the JVM, painfully. There is basically no way to avoid this, because

  1. It's nearly impossible (or at least, not practical) to "detect" whether a pointer points to host or device memory (a small illustration follows after this list)
  2. In the kernels, there is no control whatsoever - they are basically C, and you can read and write to arbitrary (invalid) memory locations there…
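To illustrate point 1, here is a hypothetical driver-API sketch (nothing here comes from JCuda itself; it assumes a context is already current): a CUdeviceptr is just an integer handle, so the API has no way to reject an arbitrary host address handed to it, and the problem can only surface at runtime, if at all:

```c
#include <cuda.h>
#include <stdint.h>
#include <stdio.h>

void demo(void) {
    float host[256];
    // A host address disguised as a device pointer - this compiles and
    // type-checks without complaint, because CUdeviceptr is only a number.
    CUdeviceptr bogus = (CUdeviceptr)(uintptr_t)host;
    // The mistake can only be caught (maybe) when the call actually runs;
    // a kernel dereferencing such a pointer can just as well corrupt memory
    // or take the whole process down instead.
    CUresult r = cuMemsetD32(bogus, 0, 256);
    printf("cuMemsetD32 returned %d\n", (int)r);
}
```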

I’ll have a look at the repo ASAP, and … try to give it a try :wink:

BTW: The yellow message at the top of the forum main page says:
TODAY, Saturday 4th of March, the Forum will be updated, and you'll have to request a new password via mail. (If there is any trouble, you have some of my mail addresses - just in case.)

Hey, thanks for those clarifications.

Regarding CUDA's thread management: I agree, it's overcomplicated, but I meant that beginners can just run any example code without thinking too much, and it works. The problem is when multiple threads are needed (basically all real-life code), and then we have to juggle those implicit things. In OpenCL, as you said, you have to take care of contexts/queues from the beginning, so you always know what's there - I like that approach much more, at least so far.

The duplicate API: isn't the Driver API intended for separate host/kernel code, and the Runtime API for CUDA's "everything mixed up in one file" approach? In that sense, a duplicate API makes sense if a unified API couldn't work for them, I guess. The thing that I don't get is why you included the Runtime API in JCuda at all, and when I should use it instead of the Driver API.

If you want to run ClojureCUDA, please look at the project.clj file - every dependency that is a SNAPSHOT (my small commons lib) you'll have to clone from GitHub and build yourself with lein midje (for the tests) and then lein install.

EDIT: … and you'll have to install Leiningen, Clojure's Maven alternative: https://leiningen.org/

Another EDIT: My JVM never crashed, and even all the JCuda JVM methods worked. It's just that init would always fail. I later discovered that this CUDA driver hangover happens every time the computer wakes from suspend mode, and the only remedy is a reboot, at least according to the internet and my (short) experience. Not a problem for servers, but a huge annoyance for developers who reboot once a month (me).

First: I've never encountered the "hangs" that you describe, but quick web searches suggest that this is specific to Linux and certain driver versions - hopefully, NVIDIA will find a fix for this.

> The duplicate API: isn't the Driver API intended for separate host/kernel code, and the Runtime API for CUDA's "everything mixed up in one file" approach? In that sense, a duplicate API makes sense if a unified API couldn't work for them, I guess. The thing that I don't get is why you included the Runtime API in JCuda at all, and when I should use it instead of the Driver API.

Up to CUDA 3.0, the Runtime and Driver APIs were completely separate. So it was NOT possible to do

cuMemAlloc(&pointer, ...);     // allocate with the Driver API...
cudaMemcpy(... pointer ...);   // ...and copy with the Runtime API

I'm not sure what their original intention behind separating them was. It was explained hand-wavingly, with something like "the Runtime API is for normal users, the Driver API offers more low-level control".

But since CUDA 3.0, they are interoperable. Of course, that's a good thing, but … the APIs are now ridiculously redundant. The only differences are that

  • In the Runtime API, you can use the <<<...>>> launch syntax
  • In the Driver API, you can handle contexts/modules/kernel functions manually

For C applications, using the Runtime API makes sense when you want to use the <<<...>>> syntax, or when you only want to use the runtime libraries (CUBLAS, CUFFT etc.).
In JCuda, there is no <<<...>>> syntax, so there is hardly a reason to use the Runtime API at all, because you can do everything in the Driver API (and still use the runtime libraries) - see the sketch below.
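To make that difference concrete, here is a hedged sketch of the same launch in both APIs; the kernel name scale, the module handle and the launch configuration are purely illustrative, not anything taken from JCuda or ClojureCUDA:

```c
#include <cuda.h>

// A trivial kernel; extern "C" keeps the name unmangled so the Driver API
// can look it up as "scale".
extern "C" __global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Runtime API: the <<<...>>> syntax hides module loading and kernel lookup.
void launch_runtime(float *devX, float a, int n) {
    scale<<<(n + 255) / 256, 256>>>(devX, a, n);
}

// Driver API: load the function from a module and launch it explicitly -
// essentially what a JCuda/ClojureCUDA program has to spell out.
void launch_driver(CUmodule module, CUdeviceptr devX, float a, int n) {
    CUfunction f;
    cuModuleGetFunction(&f, module, "scale");
    void *args[] = { &devX, &a, &n };
    cuLaunchKernel(f,
                   (n + 255) / 256, 1, 1,  // grid dimensions
                   256, 1, 1,              // block dimensions
                   0, NULL,                // shared memory, stream
                   args, NULL);
}
```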

If you only want to use the runtime libraries in JCuda, then you could use the Runtime API as well, but the difference to the Driver API is negligible. (I even think that cuMemcpyDtoH is more clear and less error prone than using the cudaMemcpyKind flag, as in cudaMemcpy(somePointer, someOtherPointer, ..., cudaMemcpyDeviceToHost), but that may be subjective.)
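For comparison, a minimal sketch of the same device-to-host copy in both styles (variable names are illustrative, error checking omitted):

```c
#include <cuda.h>          // Driver API
#include <cuda_runtime.h>  // Runtime API

void copy_back(float *host, CUdeviceptr devDrv, void *devRt, size_t bytes) {
    // Driver API: the copy direction is encoded in the function name.
    cuMemcpyDtoH(host, devDrv, bytes);

    // Runtime API: one generic function, direction passed as a flag -
    // mixing up the flag or the argument order is an easy mistake to make.
    cudaMemcpy(host, devRt, bytes, cudaMemcpyDeviceToHost);
}
```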

EDIT: I had already installed Leiningen (and did the general setup) for my Neanderthal tests, but I'll have to allocate some time to dive deeper into all this.

Some good news for Windows users of Neanderthal: from the next version (0.9.0) it will replace ATLAS with MKL, which is faster and easier to install on Windows.