Compatibility of JCuda and JOCL memory pointers

dragandj · March 1, 2017 at 05:06

Hi Marco,

Thank you for releasing the latest JCUDA on Maven Central. I will soon start making a Clojure library that uses it. Since I already have a Clojure layer for JOCL (ClojureCL), I wonder whether buffers on an Nvidia GPU that I create with JOCL can somehow be referenced from JCUDA, and vice versa. That is, are JOCL and JCUDA interoperable, or would I have to map/unmap/copy those buffers through the host memory, even though they are both in the global memory of the same (Nvidia) GPU?

According to this thread, it is not something that depends only on JOCL/JCUDA, but is a deeper issue in how the drivers implement OpenCL: Trying to mix in OpenCL with CUDA in NVIDIA’s SDK template - Stack Overflow

It’s always better to ask the creator

Marco13 · March 1, 2017 at 09:16

That’s hard to say. When in doubt, one has to assume that the answer is „no“.

(I think that this question already came up several years ago, but I couldn’t find it right now).

The link in the Stack Overflow answer is dead, but one quickly finds the corresponding threads in the NVIDIA forum: Interoperability OpenCL/CUDA - CUDA Programming and Performance - NVIDIA Developer Forums. There (and in other places), a suggestion is made, namely to share the data via OpenGL: Both APIs offer (more or less) direct access to GL buffers. They have a common format, and they already reside in the GPU memory - so this could be a way to avoid memory copies.

But still, this would be fiddly: The GL memory has to be mapped in both APIs. So the workflow would ROUGHLY (!) be…

cl_mem joclMemory = ...;
CUdeviceptr jcudaMemory = ...;

// Initialization:
int glBuffer = gl.createBuffer();
cuGraphicsGLRegisterBuffer(...glBuffer...);
joclMemory = clCreateFromGLBuffer(...glBuffer...);

// Using in JCuda
cuGraphicsMapResources(...);
cuGraphicsResourceGetMappedPointer(...);
doSomeCudaCall(jcudaMemory); // The actual usage
cuGraphicsUnmapResources(...);

// Using in JOCL
clEnqueueAcquireGLObjects(...);
doSomeJOCLCall(joclMemory); // The actual usage
clEnqueueReleaseGLObjects(...);

This has several implications:
It requires synchronization (some cuCtxSynchronize and clFinish calls are omitted here)
Important: It requires a GL context!
Important: Some (or all?) operations must be made on the GL thread, while the context is current!
Last but not least: There is no guarantee that this will not actually copy the memory!

NVIDIA will certainly not specify something like this. In fact, for OpenCL, the memory handling is supposed to be rather „transparent“. In many cases, you cannot say on which device a cl_mem actually resides (although you can apply some „common sense“, particularly when there is only one device in the first place). But even then, IIRC, the mapping operations between OpenGL/CUDA and OpenGL/OpenCL often do not explicitly say what is happening „under the hood“ (for good reasons, probably: They want to have the freedom to change the behavior, and want to prevent users from making brittle assumptions…)

All that being said:

I think that the path over OpenGL has some considerable caveats, and I would not „recommend“ it. Even if it worked in a small test, it would be hard to make statements about the reliability or the behavior in the future.

You mentioned

I’d have to map/unmap/copy those buffers through the host memory

The point of mapping the memory may be important here. For pinned host memory (as, for example, in jcuda-samples/JCudaSamples/src/main/java/jcuda/runtime/samples/JCudaRuntimeMemoryBandwidths.java at master · jcuda/jcuda-samples · GitHub ), the transfer rate for larger memory blocks is ~13GB per second, so I think that the potential overhead for going over the host may be small enough to justify going this (probably) easier path, in contrast to the (possibly (!) faster, but far more complicated) path over OpenGL.
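To put the ~13 GB/s figure into perspective, here is a back-of-the-envelope estimate of what staging a buffer through pinned host memory would cost. This is only a sketch: the class and method names are made up for illustration, the bandwidth is the number quoted above rather than a measurement on any particular system, 1 GB is assumed to mean 1e9 bytes, and latency and synchronization overhead are ignored.

```java
/** Rough cost model for staging a buffer through pinned host memory (illustrative only). */
final class HostStagingEstimate {
    /** Bandwidth quoted in the post; assumes 1 GB = 1e9 bytes. */
    static final double BYTES_PER_SECOND = 13e9;

    /** Estimated time in milliseconds for one copy between device and host. */
    static double oneWayMillis(long bytes) {
        return bytes / BYTES_PER_SECOND * 1e3;
    }

    /** Device -> host -> device, i.e. two copies of the same size. */
    static double roundTripMillis(long bytes) {
        return 2 * oneWayMillis(bytes);
    }

    public static void main(String[] args) {
        long bytes = 256L * 1024 * 1024; // a 256 MiB buffer
        System.out.printf("one way: %.1f ms, via host: %.1f ms%n",
                oneWayMillis(bytes), roundTripMillis(bytes));
    }
}
```

Under these assumptions, a 256 MiB buffer costs roughly 20 ms per copy, so about 40 ms for the detour over the host. Whether that is acceptable depends on how long the kernels on either side run.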

I started creating a basic test in which I plan to evaluate the options here (i.e. a very basic example/test showing the GL-based path and the pinned-memory-based path, possibly also comparing the performance), but as usual, I cannot give a deadline for this.

dragandj · March 1, 2017 at 09:35

Thank you very much for that detailed explanation, Marco. It seems that mapping would be the go-to approach when interop is required. Frankly, I had hoped that it wouldn’t be necessary because, however fast the data transfer is, it is still waaaay slower than simply enqueuing the kernel on data that is already there. Of course, the best thing would be if we had OpenCL equivalents for (most of) NVIDIA’s libraries, so that interop wouldn’t be necessary, but I guess that is still a long way off.

On the other hand, there is some movement on Nvidia’s part to support OpenCL 2.0. The newest Windows drivers come with (partial) beta support for 2.0, although it still

I hope I’ll have some useful feedback when I start actually working with JCuda and the rest of cu libraries.

Marco13 · March 1, 2017 at 12:37

dragandj:

Thank you very much for that detailed explanation, Marco. It seems that mapping would be the go-to approach when interop is required. Frankly, I had hoped that it wouldn’t be necessary because, however fast the data transfer is, it is still waaaay slower than simply enqueuing the kernel on data that is already there. Of course, the best thing would be if we had OpenCL equivalents for (most of) NVIDIA’s libraries, so that interop wouldn’t be necessary, but I guess that is still a long way off.

That’s true. Unfortunately, the clSPARSE development basically seems to have stopped (there are a few newer commits in the „develop“ branch, though).

Thanks for this hint, I didn’t know that. In general, NVIDIA’s lack of commitment to OpenCL is somewhat disappointing. They basically already have all the code in CUFFT and CUSPARSE - porting this to CL should be comparatively easy. But I think that they are aware that these runtime libraries are one of the main reasons why someone might choose CUDA instead of OpenCL, so it’s unlikely that they will make any public/open source contributions here.

dragandj · March 1, 2017 at 13:02

Regarding clSPARSE, maybe it is already well-rounded, so it does not need much improvement to be really useful. The trouble is that their build process is a bit odd, so we both failed in the first attempt at compiling it. However, since the binary is available, it is definitely (in my case) because of not trying hard enough to dig up what the problem was. I expect that once I create sparse support for Neanderthal based on MKL for the CPU and clSPARSE for NVIDIA, I will be able to approach it more seriously and get it to build, and (hopefully) be able to give you a hint about what went wrong the last time.

In general, I am optimistic regarding this, since, due to Clojure’s (and Java’s) dynamic nature compared to C/C++, we do not even need the perfect sparse/fft/dnn library with 100% of the features. Basically, getting the right 50% of the features done properly would boost this ecosystem really well, since the existing libraries lack even that.

Marco13 · March 1, 2017 at 13:51

On Windows, I got the build running, but the only results were
https://github.com/clMathLibraries/clSPARSE/issues/197 and
https://github.com/clMathLibraries/clSPARSE/issues/198
I think I intended to try it with the pre-built binaries, but haven’t done so - these first results hadn’t been so encouraging, and I’m hesitant to invest much time when the underlying library may become obsolete soon.
But you’re right: There are not many alternatives.
(This refers to GPU-based libraries. Of course, there are tons of Sparse Matrix / FFT libraries for plain Java…)

However, sparse matrices are certainly more challenging, not only regarding the low-level implementation, but also regarding the API that may be exposed in Java or Clojure: The different possible representations of matrices require dedicated “handles” for each of them (i.e. no longer plain float[] matrix arrays). However, at least CSR is so common that it could be a common denominator. (That’s also the only one that is currently contained in https://github.com/jcuda/jcuda-matrix-utils …)
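For readers who have not worked with CSR: it stores only the non-zero entries, row by row, in three flat arrays. The following is a minimal Java sketch of what such a “handle” could look like, together with a matrix-vector product. The class name CsrMatrixSketch and its field names are illustrative and are not taken from jcuda-matrix-utils or any other library.

```java
import java.util.Arrays;

/** Minimal CSR (compressed sparse row) matrix; all names are illustrative only. */
final class CsrMatrixSketch {
    final int rows;
    final int cols;
    final float[] values;      // non-zero entries, stored row by row
    final int[] columnIndices; // column of each entry in 'values'
    final int[] rowPointers;   // row r occupies [rowPointers[r], rowPointers[r + 1])

    CsrMatrixSketch(int rows, int cols, float[] values,
                    int[] columnIndices, int[] rowPointers) {
        if (rowPointers.length != rows + 1
                || values.length != columnIndices.length
                || rowPointers[rows] != values.length) {
            throw new IllegalArgumentException("Inconsistent CSR arrays");
        }
        this.rows = rows;
        this.cols = cols;
        this.values = values;
        this.columnIndices = columnIndices;
        this.rowPointers = rowPointers;
    }

    /** Computes y = A * x on the host. */
    float[] multiply(float[] x) {
        if (x.length != cols) {
            throw new IllegalArgumentException("Expected a vector of length " + cols);
        }
        float[] y = new float[rows];
        for (int r = 0; r < rows; r++) {
            float sum = 0;
            for (int k = rowPointers[r]; k < rowPointers[r + 1]; k++) {
                sum += values[k] * x[columnIndices[k]];
            }
            y[r] = sum;
        }
        return y;
    }

    public static void main(String[] args) {
        // | 1 0 2 |
        // | 0 0 3 |
        // | 4 5 0 |
        CsrMatrixSketch a = new CsrMatrixSketch(3, 3,
                new float[] {1, 2, 3, 4, 5},
                new int[] {0, 2, 2, 0, 1},
                new int[] {0, 2, 3, 5});
        // prints [3.0, 3.0, 9.0]
        System.out.println(Arrays.toString(a.multiply(new float[] {1, 1, 1})));
    }
}
```

On the GPU, the same three arrays would simply live in device memory, which is why the format maps so directly onto both CUDA and OpenCL libraries.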

dragandj · March 1, 2017 at 14:46

As for the requirement for dedicated „handles“, I already do that for dense matrices. That is the point of the library: the user works with various high-level abstractions that handle all the ByteBuffer fiddling underneath. I suppose that the same principle would be usable for sparse matrices, with a bit (or a lot) more work to make it elegant. As for the API of the underlying Java library, the thinner and more low-level it is, the better.

Marco13 · March 1, 2017 at 16:39

Well… it’s difficult. I’ve always been amused by the ridiculous difference between…

the Matrix class in JAMA: Matrix
and “the” “Matrix” class in UJMP: https://ujmp.org/latest-release/ujmp-core/apidocs/org/ujmp/core/doublematrix/impl/DefaultDenseDoubleMatrix2D.html
The latter is what most closely resembles the JAMA Matrix class, but extends 8 other classes and implements a whopping 77 (seventy-seven) interfaces :eek:

You can go really far with abstractions (and that’s not a bad thing), and still, although UJMP can basically wrap every other matrix library (about 15 already being available), I think that using it for wrapping GPU-based matrices would still not be possible. Designing a Matrix library with the GPU in mind would probably enforce some usage patterns that would seem odd for plain Java. (Most obviously: There has to be some “dispose()” method somewhere …).
But maybe this is just another dimension in the design space that is often dominated by the choice between “performance” and “genericity”…

In any case: I could imagine that having a look at the clSPARSE API and the CUSPARSE API might help to get an idea what they have in common, and what an API (i.e. a set of interfaces) would have to look like if it was supposed to be implemented with either of them. But that’s just a gut feeling - I’m sure that you already have some ideas, also from your experience with the dense matrix case.

dragandj · March 2, 2017 at 02:08

I’m aware of the many variations of matrix classes in Java. That’s why I took a somewhat different, if a bit opinionated, approach:

1) Neanderthal does not offer a „universal“ matrix wrapper or API. Its implementation is specifically targeted at raw buffer storage and BLAS and related capabilities. It IS possible to use pure Java implementations, and its interfaces support a non-BLAS way of doing things; I just don’t provide the implementations or encourage that approach.
2) Similarly to how this is done in BLAS, it separates the storage structure of various matrices and the algorithms into pluggable objects. For example, it has RealGEMatrix, which deals with column or row storage in buffers (DirectByteBuffer by default). It can be dynamically paired with a CBLAS-based engine (which requires a column- or row-oriented raw memory buffer) that I provide, or with any user-provided engine, perhaps implemented in pure Java, that knows how to compute on such structures. Similarly, there is CLGEMatrix, which does a similar (but not 100% the same) thing in the OpenCL space with cl buffers. It is paired with a CLBlast-based engine, or can be (and was) paired with my custom engine coded with ClojureCL (JOCL). RealGEMatrix and CLGEMatrix have some similarities and some differences, and that’s OK. There are transfer methods to move data around. They are not and should not be 100% transparent, because they should be used at the same time in hybrid CPU/GPU code. I am really satisfied with how it has worked so far in the sufficiently complex code that I wrote. What’s more, I can see in detail how I would be able to integrate non-buffer-based, non-BLAS-based Java code that I control. Of course, a large part of this is due to BLAS being highly standardized. It will probably be more difficult to do this smoothly for sparse matrices.
3) Which leads us to sparse. I’m aware that it is more complex and not so standardized, but as I understand it, even there we have some standard data formats (CSR) and some standard operations that are used in most leading libraries. Let’s first cover those, and later see about the more exotic stuff. Also, I expect that there are variations in the implementations and even the APIs of the different native/GPU libraries that implement those, but that’s what the engines in my library are for - to smooth out those differences.
4) What would be more difficult is what many other JVM libraries try to do: enabling the integration of arbitrary Java matrix libraries into the ecosystem. I am NOT making that easier. On purpose. Many of them are just poor quality. Many are just slight variations of the same „universal api“ approach. Moreover, most of them just lack functionality beyond BLAS-equivalent stuff. If they have solvers, the coverage is haphazard. They are orders of magnitude slower than native libraries. What I am trying to do is: enable the state of the art stuff in Clojure/JVM. No more and no less. That means the integration with native libraries and GPU libraries. That does not mean support for each and every legacy pure Java implementation that some enterprise needs for some arbitrary business reason. That also means the functionality that has been proven to be useful. It doesn’t mean that I need to support any brilliant idea that might or might not turn out to be right. Also, I accept that the library would not satisfy some critical requirements for some projects - notably the „pure Java, no native stuff“, or „must run without installation on Raspberry Pi“. I designed the library to be able to adapt even to such stuff, but I do not optimize towards that.
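The separation of storage and pluggable engines described above might look roughly like this when translated to plain Java. All names here (DenseStorage, Engine, PureJavaEngine) are hypothetical and are not Neanderthal’s actual API, which is written in Clojure; the sketch only illustrates the idea that the storage object knows nothing about algorithms, and that an engine can be swapped without touching the storage.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

/** Column-major dense storage in a direct buffer; contains no algorithms. */
final class DenseStorage {
    final int rows;
    final int cols;
    final DoubleBuffer data;

    DenseStorage(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = ByteBuffer.allocateDirect(rows * cols * Double.BYTES)
                .order(ByteOrder.nativeOrder())
                .asDoubleBuffer();
    }

    double get(int r, int c) { return data.get(c * rows + r); }

    void set(int r, int c, double v) { data.put(c * rows + r, v); }
}

/** An engine implements the algorithms for one kind of storage. */
interface Engine {
    /** Computes c = a * b. */
    void multiply(DenseStorage a, DenseStorage b, DenseStorage c);
}

/** Stand-in for a native (for example CBLAS-backed) engine, using plain loops. */
final class PureJavaEngine implements Engine {
    @Override
    public void multiply(DenseStorage a, DenseStorage b, DenseStorage c) {
        if (a.cols != b.rows || c.rows != a.rows || c.cols != b.cols) {
            throw new IllegalArgumentException("Dimension mismatch");
        }
        for (int i = 0; i < a.rows; i++) {
            for (int j = 0; j < b.cols; j++) {
                double sum = 0;
                for (int k = 0; k < a.cols; k++) {
                    sum += a.get(i, k) * b.get(k, j);
                }
                c.set(i, j, sum);
            }
        }
    }
}

final class EngineDemo {
    public static void main(String[] args) {
        DenseStorage a = new DenseStorage(2, 2);
        a.set(0, 0, 1); a.set(0, 1, 2); a.set(1, 0, 3); a.set(1, 1, 4);
        DenseStorage identity = new DenseStorage(2, 2);
        identity.set(0, 0, 1); identity.set(1, 1, 1);
        DenseStorage c = new DenseStorage(2, 2);
        Engine engine = new PureJavaEngine(); // could be replaced by a native engine
        engine.multiply(a, identity, c);
        System.out.println(c.get(1, 0)); // element (1,0) of a * I
    }
}
```

A GPU counterpart would use a different storage class (holding a device buffer instead of a DoubleBuffer) and its own engine, with explicit transfer methods between the two kinds of storage.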

All in all, instead of being megalomaniac, I try to take a start small, think big approach. I want to provide the Ford T of numerical computing in Clojure (and JVM). Something that maybe covers 20% of the functionality, but the 20% that is the state of the art, and required by 90% of the applications anyway.

Marco13 · March 2, 2017 at 05:17

The latter is an important point, and in line with the goals that you stated in 1). There have been approaches for making things like data transfer „transparent“. I think that this is a nice goal. And at other places, I advocated for making this transparent. But this transparency has to happen at a different level - not at „our“ level (the Java/Clojure layer), but one level deeper. If NVIDIA or AMD continue with the Unified Memory developments, or the transparency that OpenCL aims at in the context of „Heterogeneous Computing“, then it may one day be possible to create a „cl_mem“ or „Pointer“, without having to care whether it is in host- or device memory. (Eventually, this may be the same memory!).
But right now, this is not the case. And the control that is necessary there must be passed through the API - even if this means that there is an un-Java-ish „dispose()“ method. (Some tried to leave this to the GC. It does not work.)
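One common way to make such a dispose() method tolerable in Java is to implement AutoCloseable, so that try-with-resources releases the native memory deterministically instead of waiting for the garbage collector. The sketch below uses a hypothetical DeviceBuffer class, with a counter standing in for the real allocation and release calls (for example cuMemAlloc/cuMemFree in JCuda, or clCreateBuffer/clReleaseMemObject in JOCL), so that it runs without a GPU.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical handle to device memory with explicit, deterministic release. */
final class DeviceBuffer implements AutoCloseable {
    /** Stand-in for the driver's bookkeeping: the number of live allocations. */
    static final AtomicInteger LIVE = new AtomicInteger();

    private final long sizeInBytes;
    private boolean released;

    DeviceBuffer(long sizeInBytes) {
        this.sizeInBytes = sizeInBytes;
        LIVE.incrementAndGet(); // a real implementation would allocate device memory here
    }

    long size() {
        if (released) {
            throw new IllegalStateException("Buffer was already released");
        }
        return sizeInBytes;
    }

    /** The "dispose()" method; calling it more than once is harmless. */
    @Override
    public void close() {
        if (!released) {
            released = true;
            LIVE.decrementAndGet(); // a real implementation would free device memory here
        }
    }

    public static void main(String[] args) {
        try (DeviceBuffer buffer = new DeviceBuffer(1024)) {
            System.out.println(buffer.size() + " bytes, live: " + LIVE.get());
        }
        System.out.println("live after the block: " + LIVE.get());
    }
}
```

The important property is that the release happens at a known point in the program, which matters because the garbage collector only sees the small Java object and has no idea how much device memory is attached to it.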

If the goal was to create „THE ultimate Sparse-Matrix-Abstraction“, then you would have to consider not only CSR, but also CCS, BCRS, CDS, JDS, SKS, COO and things like „HYB“ from NVIDIA, which is a vendor-specific format, mainly intended to squeeze out the last 0.x% of some „theoretical peak FLOPS“ in benchmarks.
But clearly stating that this is not the goal makes it possible to define the actual goal more clearly: Basically every Sparse Matrix Lib supports CSR, which is a common denominator, probably covers >80% of all application cases out of the box (and close to 100% if one accepts that there may be a conversion step from „some exotic format“ to CSR).
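As an illustration of such a conversion step, here is a minimal sketch that turns coordinate (COO) triplets into the three CSR arrays using a counting sort over the rows. The class and method names are made up for this example, and it assumes that there are no duplicate (row, column) entries that would have to be summed up.

```java
import java.util.Arrays;

/** Illustrative COO -> CSR conversion; not taken from any particular library. */
final class CooToCsrSketch {
    /** The three arrays that make up a CSR matrix. */
    static final class Csr {
        final int[] rowPointers;
        final int[] columnIndices;
        final double[] values;

        Csr(int[] rowPointers, int[] columnIndices, double[] values) {
            this.rowPointers = rowPointers;
            this.columnIndices = columnIndices;
            this.values = values;
        }
    }

    static Csr convert(int numRows, int[] cooRows, int[] cooCols, double[] cooValues) {
        int nnz = cooValues.length;
        int[] rowPointers = new int[numRows + 1];
        // Count the entries of each row, shifted by one for the prefix sum
        for (int k = 0; k < nnz; k++) {
            rowPointers[cooRows[k] + 1]++;
        }
        // Prefix sum: rowPointers[r] becomes the index of the first entry of row r
        for (int r = 0; r < numRows; r++) {
            rowPointers[r + 1] += rowPointers[r];
        }
        // Scatter each entry to the next free slot of its row.
        // Entries of one row keep their input order (columns are not sorted).
        int[] next = Arrays.copyOf(rowPointers, numRows);
        int[] columnIndices = new int[nnz];
        double[] values = new double[nnz];
        for (int k = 0; k < nnz; k++) {
            int dest = next[cooRows[k]]++;
            columnIndices[dest] = cooCols[k];
            values[dest] = cooValues[k];
        }
        return new Csr(rowPointers, columnIndices, values);
    }

    public static void main(String[] args) {
        // The matrix | 1 0 2 |, | 0 0 3 |, | 4 5 0 | with entries in arbitrary order
        Csr csr = convert(3,
                new int[] {2, 0, 1, 0, 2},
                new int[] {0, 0, 2, 2, 1},
                new double[] {4, 1, 3, 2, 5});
        System.out.println(Arrays.toString(csr.rowPointers));   // [0, 2, 3, 5]
        System.out.println(Arrays.toString(csr.columnIndices)); // [0, 2, 2, 0, 1]
        System.out.println(Arrays.toString(csr.values));        // [1.0, 2.0, 3.0, 4.0, 5.0]
    }
}
```

The conversion is linear in the number of non-zero entries plus the number of rows, so it is usually negligible compared to the solve that follows.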

dragandj;145330:

Moreover, most of them just lack functionality beyond blas-equivalent stuff. If they have solvers, the coverage is haphazard. They are orders of magnitude slower than native libraries. What I am trying to do is: enable the state of the art stuff in Clojure/JVM. No more and no less.
…
Also, I accept that the library would not satisfy some critical requirements for some projects - notably the „pure Java, no native stuff“, or „must run without installation on Raspberry Pi“. I designed the library to be able to adapt even to such stuff, but I do not optimize towards that.

All in all, instead of being megalomaniac, I try to take a start small, think big approach. I want to provide the Ford T of numerical computing in Clojure (and JVM). Something that maybe covers 20% of the functionality, but the 20% that is the state of the art, and required by 90% of the applications anyway.

That’s also in line with the aforementioned points. There are cases where an „ultimate abstraction“ like in UJMP may come in handy. But I also think that in many (or most) cases, the application areas are far more narrow: Sparse (Double or Float) CSR Matrices, and certain BLAS/Solving operations, that can often be used in a „black box“ fashion: Collect the input (usually, a double-CSR). Run the solver. Obtain the output (usually, a double-CSR). If someone needs a solver for ND-Matrices with BigDecimal entries, then they can use UJMP, but (or because) this will not be implementable efficiently on the GPU anyhow.

A short (off topic) side note:

The forum software will be updated soon. Just as a hint, that

the forum might (!) be offline for a while in the next days
your password might be reset and you may receive a new one via mail
(hoping that your mail address is valid)

dragandj · March 2, 2017 at 07:10

On a related note: What do you think about binding to another library that offers some of the mentioned functionality, like ViennaCL or clMAGMA? They are certainly not abandonware, and might be relied upon to be there for a long time. I just haven’t used them, so I am not sure how feasible it is to create thin wrappers, or how well they are suited for JNI bindings, since I suppose they use lots of C++ stuff. Have you considered them?

Marco13 · March 2, 2017 at 09:29

As for ViennaCL, the difficulty here is indeed that they are very C++-heavy. Their goal was to be idiomatic for C++ and the STL (i.e. lots of templates). Something like this…


typedef float        ScalarType;

// Define a few GPU vectors using ViennaCL
viennacl::vector<ScalarType> vcl_vec1(10);
...
vcl_s1 = viennacl::linalg::inner_prod(vcl_vec1, vcl_vec2);

can hardly be mapped to Java. Of course, one could try to “mimic” the library, but I think that would not make much sense: The difference between the C++ and the Java version would be so large, that it would be justified (and “better”) to aim at an idiomatic Java API here. (One could implement this Java layer based on a JNI path to ViennaCL, but maybe it would then even make more sense to just pick the kernel code from ViennaCL and offer these kernels directly - but I’m not familiar enough with ViennaCL to say whether this would be reasonable)

I didn’t have clMAGMA on the radar, though. From a quick glance at ONE example, it looks like it might lend itself more to be ported with a “thin” layer (as for “thicker” layers, the question about the thickness (i.e. level of abstraction) becomes more important). It doesn’t seem to be sooo active (the last change was >1 year ago, and most of the last changes are related to makefiles). I’d consider it, but admittedly, will most likely not find the time to maintain something like “JclMAGMA” (I’m already falling behind with JOCL updates - OpenCL 2.2 is already on its way, and I haven’t even updated to OpenCL 2.1…)

dragandj · March 2, 2017 at 09:47

Just to be clear, I am not trying to talk you into implementing something because I think it might be useful to me. What you already do with JOCL/JOCLBlast/JCuda is of high value! As for OpenCL 2.2/2.1, are there any drivers that support those? The latest news that I know of is that Intel only recently added support for 2.0. AMD has supported it in Catalyst since 2014, but their (newer!) AMDGPU-PRO drivers are still on 1.2 as far as I know.

So, from the theoretical point of view, JOCL is a bit behind. But, from the practical POV, it is leading the pack. Even if JOCL 2.2 was available today, I doubt anyone would notice the difference from 2.0, or am I missing something?

What I think is VERY useful is the stuff that you and Cedric provided with JOCLBlast/CLBlast! That is a library of high quality that really fills the gap. The next big gap is LAPACK/sparse support for OpenCL on the JVM. ViennaCL or clMAGMA could fill it to some degree, but they are clunky and hard to integrate, and you confirmed my suspicions. Once I need that functionality, I plan to try to port some of their kernels using ClojureCL, but luckily I have lots of other more effective stuff on the TODO list before I reach that point → namely more support for MKL-based functionality on the CPU, and ClojureCUDA (based on your just-in-time JCuda 0.8.0 Maven release).

Thank you again for your generous code contributions to the community!

Marco13 · March 2, 2017 at 11:21

You’re not the only one who asks about additional libraries, ViennaCL being one of them, and Thrust being another (the latter is also too template-heavy to be portable to Java in a reasonable way). However, CUB, as requested in Building wrappers to CUB calls · Issue #11 · jcuda/jcuda-main · GitHub, may be something that I could tackle soon, or at least see which of the approaches could make the most sense. BTW: The update for OpenCL 2.1/2.2 is not „pressing“ in that sense, but I’d like to avoid having to take several steps at once, so I’ll try to update to 2.1 before there is a real, pressing demand for 2.2.

I’m looking forward to seeing what you’re going to do with JCuda, and any feedback that may result from that.