About JCuda and JavaCPP


#1

There has been a comment in one of the JCuda GitHub issues:

Actually more important is that I suggest to merge/make obsolete this project in favor of JavaCPP and propose to developers of this project and jopencl as well to focus on JavaCPP stuff as it has much wider focus… just proposal making sense from my perspective.

This proposal is not entirely new: Two years ago, @saudet started a thread about his CUDA bindings based on JavaCPP, and we had a nice and interesting discussion (some of which went into technical details that may not be immediately relevant here, but are of course related to some of the advantages and disadvantages of both solutions).


In general, I agree that a “universal” solution for accessing CUDA from Java would be desirable. And it should be obvious that JavaCPP tackles the problem of accessing C++ libraries from Java in the most generic sense, which can have many obvious advantages: The solutions generated with JavaCPP can be …

  • more generic in their solutions and interfaces
  • more robust, due to being applied to vastly different application patterns
  • Important: interoperable, because they are built on the same technical basis
  • Also important: mostly auto-generated, so they can be updated far more quickly than my manual bindings

In contrast to that, JCuda went through several evolutionary stages. Originally, I started with naively creating bindings for CUBLAS, which seemed to be a low-hanging fruit due to the simple API and the huge performance improvements that it promised. Later, I added bindings for CUFFT. But before creating more individual bindings for the other CUDA libraries, I decided to put this on a common basis, namely JCuda.

Of course, this history implies some legacy. I did not have much experience with JNI before that, and the nature of CUDA bears some really challenging points. If I could start again, from scratch, I would solve some things differently. This mainly refers to all aspects of the memory handling (basically everything surrounding the Pointer class), but also to other details. In contrast to the manual JCuda bindings, JavaCPP is so generic that certain problems simply do not appear in the first place. (And I probably would not have started JCublas or JCuda if JavaCPP had been available 8 years earlier.)

However, I will not abandon JCuda so quickly. I know that there are “~several” users of JCuda out there - although I don’t have a precise overview here. I only know that from support requests, and by some (scientific) papers/theses that are citing JCuda. (Some of them are erroneously citing a paper about a library called JCUDA, which is unrelated to JCuda, but the code snippets in the papers indicate that they are actually using JCuda and not JCUDA). And although it may sound a bit balky, defiant or pathetic: I consider it as a responsibility towards these users to continue to work on JCuda.

Time will show whether or not the two approaches can converge or how they will coexist. In the “best” case, they could both become obsolete, namely when an interoperability layer for accessing native libraries from Java is built into the language/standard API/JVM itself. This might also happen on a different level of abstraction. Although the project Sumatra is basically abandoned, the project Panama is still active, and only a few days ago, there was a post about Experimental support for CUDA. This looks quite promising.

(I could now also mention some of the “advantages” that JCuda has in contrast to JavaCPP. Some of them would be subjective, like the fact that the manual 1:1 mapping of the CUDA API may allow a bit more control over what the API will look like (at the cost of slower update cycles). Others might become obsolete if JavaCPP-CUDA gained more attention - for example, the fact that there are actual samples that show how the library may be used to accomplish certain tasks. But I agree that these “advantages” do not necessarily outweigh the lower-level technical advantages that JavaCPP and its automated code-generation have, compared to the hand-crafted JNI bindings of JCuda).


#2

Hi Marco,

Thanks for the detailed overview. I generally agree with your points, I would just add two things:

  1. Responsibility to the users. Thank you very much for caring. As for me, I would gladly invest time to port my libraries to a better library that you’d retire JCuda in favor of, IF that library really offers a better approach.

  2. However, I am not sure if javacpp is actually better at one important point: performance. I am not sure that it is worse, either, but I did some benchmarks with nd4j, their library that uses javacpp openblas bindings, and the overhead of nd4j was huge: on the order of a dozen microseconds. Of course, that overhead may come from nd4j itself and not from the javacpp-openblas bindings, but it is something that should be confirmed before deciding whether javacpp is actually more desirable than JCuda. It is certainly more desirable for libraries that change quickly, but CUDA seems to be backward compatible and it does not add lots of changes with each release, so JCuda’s ergonomics (and perhaps performance) might still be desirable.


#3

Hi @dragandj,

Performance is definitely at the top of the decision criteria. I will ask Samuel to discuss the nd4j benchmarking experience with you when he has time.

In general, I think it would be good to set up a project where selected algorithms are implemented with both solutions, so that they can be compared in terms of both code semantics and performance. I will set up the project on GitHub and publish it here. Or maybe I will ask to create the project under JavaCPP, if that works better.

Of course, this thread is still here for the theoretical stuff as well, so thanks @Marco13 for opening this discussion again. Let’s see the outcome.


#4

Admittedly, I’m a bit surprised to see performance as being a priority (but it is also confirmed by the proposal to create a benchmarks repo). I would have considered the API to be an important point, too.

Of course, the whole goal of using the GPU is to increase performance, but I always assumed that for most application patterns, the performance penalty of the Java binding layer should be negligible, compared to the two big bottlenecks: Kernel launches/runs, and memory copies.

To put it that way: In order to allocate 100MB on the device and fill it with host data, there are two function calls involved. The function call overhead and possible marshalling/unmarshalling/type conversion of function arguments will likely not play a role here.
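To put rough numbers on the 100MB case, here is a back-of-envelope sketch in plain Java. It does not call CUDA at all. The bandwidth of 10 GB/s and the per-call binding overhead of 1000 ns are assumed, illustrative figures (not measurements), and the class and method names are made up for this example.

```java
// Back-of-envelope model only. Every constant here is an assumed, illustrative figure.
final class CopyOverheadEstimate {

    /** Nanoseconds needed to move `bytes` at `gigabytesPerSecond` (1 GB = 1e9 bytes). */
    static double transferNanos(long bytes, double gigabytesPerSecond) {
        // bytes / (GB/s * 1e9 bytes/GB) seconds, times 1e9 ns/s: the 1e9 factors cancel.
        return bytes / gigabytesPerSecond;
    }

    /** Fraction of the total time that is spent in binding-layer call overhead. */
    static double overheadFraction(int calls, double overheadNanosPerCall, double workNanos) {
        double overhead = calls * overheadNanosPerCall;
        return overhead / (overhead + workNanos);
    }

    public static void main(String[] args) {
        long bytes = 100L * 1000 * 1000; // 100 MB payload
        double bandwidth = 10.0;         // ASSUMED host-to-device bandwidth in GB/s
        double perCallNanos = 1000.0;    // ASSUMED (pessimistic) binding overhead per call

        double copyNanos = transferNanos(bytes, bandwidth);
        // Two calls: one allocation, one copy. The allocation time itself is ignored here.
        double fraction = overheadFraction(2, perCallNanos, copyNanos);

        System.out.println("copy time in ms:      " + copyNanos / 1e6);
        System.out.println("overhead share in %:  " + fraction * 100);
    }
}
```

Under these assumptions, the two calls account for a tiny fraction of a percent of the total time. The picture changes once the payload shrinks or the number of calls grows, which is exactly the part that depends on the application.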

But sure, this depends on the application, and in this regard, it will be difficult to create a “sensible” benchmark. There are several variables, namely the ratio between copy and compute and their possible overlap, the size of the allocations, the number of kernel launches and the workload that is imposed by each kernel invocation.

I could imagine the repo to contain some “Hello World” example, showing the basic usage (maybe with the famous “vector addition” - simple, straightforward, but covers the most important operations).

Beyond that, there could be the actual “benchmark” code. Depending on the time and effort that can be invested here, this might be something that involves the parameters mentioned above, and varies them in some sort of cartesian product. For example, if there are many allocations/copies and kernel launches, and each kernel only takes a few milliseconds, then every overhead may become critical. (But how realistic is this?).
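As a minimal sketch of what such a cartesian product of parameters could look like: the following plain Java snippet only enumerates the configurations and does not run anything on the GPU. The parameter names (allocationBytes, kernelLaunches, workPerKernel) and the value ranges are hypothetical placeholders for the variables mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: the parameter names and value ranges are hypothetical.
final class BenchmarkGrid {

    /** One point in the parameter space of the benchmark. */
    static final class Config {
        final long allocationBytes;
        final int kernelLaunches;
        final int workPerKernel;

        Config(long allocationBytes, int kernelLaunches, int workPerKernel) {
            this.allocationBytes = allocationBytes;
            this.kernelLaunches = kernelLaunches;
            this.workPerKernel = workPerKernel;
        }

        @Override
        public String toString() {
            return allocationBytes + " bytes, " + kernelLaunches
                + " launches, " + workPerKernel + " work units per kernel";
        }
    }

    /** Enumerates every combination of the given parameter values. */
    static List<Config> cartesianProduct(long[] sizes, int[] launches, int[] work) {
        List<Config> configs = new ArrayList<>();
        for (long s : sizes) {
            for (int l : launches) {
                for (int w : work) {
                    configs.add(new Config(s, l, w));
                }
            }
        }
        return configs;
    }

    public static void main(String[] args) {
        long[] sizes = {1L << 10, 1L << 20, 1L << 27}; // 1 KiB, 1 MiB, 128 MiB
        int[] launches = {1, 100, 10_000};
        int[] work = {1, 1_000};
        for (Config c : cartesianProduct(sizes, launches, work)) {
            System.out.println(c); // a real benchmark would run and time each configuration here
        }
    }
}
```

Each configuration would then be run once with each binding, so that the results can be compared point by point.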

In any case, I’m curious to see the outcome of the benchmarks.


#5

API as priority -> no question, but performance as well.

Thank you for your suggestions on the benchmark design. I am also curious about the benchmark outcome :)

The JCuda vs JavaCPP benchmark repository is located here:

If I have time, I will also include another candidate that I use for comparison (not the main focus of the benchmark):
Parallel Java 2
provided by Prof. Alan Kaminsky
Rochester Institute of Technology—Department of Computer Science
https://www.cs.rit.edu/~ark/pj2.shtml


#6

There already are some benchmark results published that involve JCuda, for example, these PDFs:

But they have a different focus than what we’re currently aiming at: They are comparing CUDA/JCuda/OpenCL/JOCL/MPI/Aparapi etc., but as far as I know, none of them covers JavaCPP.


#7

I see a couple of cases when that overhead can become more than negligible:

  1. Various “supporting” functions that deal with parameters, etc. If the CUDA driver takes, say, 100 or 300 ns to do these, an overhead of 1000 ns is much heavier than an overhead of 100 ns (all hypothetical figures). For one call, that may not be much, but if you need to call this repeatedly, it adds up quickly.

  2. There can be many cases when the actual kernel runs in microseconds, or nanoseconds, especially in kernels that do not require global reductions. In such cases, setting the parameters and enqueuing the kernel is where the bottleneck is. Let’s say that the CUDA driver overhead for such an operation is 10 microseconds and that the actual computation on the GPU is 10 microseconds. An overhead of 1 microsecond is OK, 10 microseconds is so-so, while 100 microseconds makes the Java application 5 times slower than its C++ counterpart from the start (not OK).

  3. I at least try to structure my applications so the data gets transferred to the GPU only once, then sits on the GPU while many kernels get enqueued to work on it. These invocations are very fast, and any overhead on the order of many microseconds is a disaster!
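The arithmetic behind the second case can be written down as a small sketch, using the hypothetical figures from above (10 microseconds driver overhead, 10 microseconds GPU computation). The class and method names are made up for this example, and the model simply adds the binding overhead on top of the native time.

```java
// Simple additive model with the hypothetical figures from the list above.
final class LaunchOverheadSlowdown {

    /**
     * Ratio of the time per launch through a binding layer to the time per launch
     * in native code, assuming the binding overhead is simply added on top.
     */
    static double slowdown(double driverMicros, double kernelMicros, double bindingMicros) {
        double nativeTime = driverMicros + kernelMicros;
        return (nativeTime + bindingMicros) / nativeTime;
    }

    public static void main(String[] args) {
        double driver = 10.0; // hypothetical driver overhead per launch, in microseconds
        double kernel = 10.0; // hypothetical GPU computation time, in microseconds
        for (double binding : new double[] {1.0, 10.0, 100.0}) {
            System.out.println(binding + " us binding overhead -> total time is "
                + slowdown(driver, kernel, binding) + " times the native time");
        }
    }
}
```

With these figures, the ratios are 1.05, 1.5 and 6.0: a binding overhead of 100 microseconds adds five times the native 20 microseconds on top of each launch.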


#8

Just for reference, a simple JNI call has an overhead of about a dozen (let’s say 15) nanoseconds. If you add some objects, normally a ByteBuffer, it may be around 20-30 ns. That covers most stuff that we need to create a thin CUDA wrapper.

Of course, when you involve array copying, classes, and more complex Java/C++ mappings, it tends to skyrocket… But most of those things that are needed for a general C++ mapping solution are not strictly necessary for large parts of a CUDA wrapper.


#9

The case of “many” kernel invocations was also one that I already mentioned above, but considered it as somewhat unrealistic or at least unusual. For me, the classical application pattern for CUDA/GPU is still the basic one:

  • Pump a lot of data to the device
  • Do the kernel invocations, which are expensive (for the kernel itself, and not so much for the launch overhead)
  • Copy an often “smaller” result back to the host

I’m sure that there are cases where many kernel invocations appear between the copy operations, but admittedly, I’m not familiar enough with the variety of applications to name one of those. I’d think that the setup and invocation of a kernel is comparatively expensive even in plain CUDA, and thus, that one would try to do as much work in one kernel launch as possible. Nevertheless, if in doubt, the kernel invocation overhead could (and should) be measured with an artificial benchmark.
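A very rough sketch of what the core of such an artificial benchmark could look like in plain Java is shown below. The class name and the stand-in call are hypothetical: the Runnable here does not launch a kernel, it only marks the place where the call through the binding would go.

```java
// Naive timing loop, as a sketch. The measured call is a stand-in, NOT a real kernel launch.
final class CallOverheadTimer {

    /** Mean wall-clock time per call in nanoseconds, measured after a warm-up phase. */
    static double meanNanosPerCall(Runnable call, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            call.run(); // give the JIT a chance to compile the call path
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            call.run();
        }
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        long[] sink = new long[1]; // keeps the stand-in from being optimized away entirely
        // Hypothetical stand-in: replace with the kernel launch of the binding under test.
        Runnable standIn = () -> sink[0] += System.nanoTime() & 1;
        double nanos = meanNanosPerCall(standIn, 100_000, 1_000_000);
        System.out.println("mean ns per call: " + nanos + " (sink: " + sink[0] + ")");
    }
}
```

A loop like this is sensitive to JIT compilation, dead-code elimination and timer resolution, so for numbers that are supposed to be compared between the two bindings, a harness like JMH would be the safer choice. Since kernel launches are asynchronous, a real benchmark would also have to decide whether to synchronize inside or outside the timed region.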

As for other functions, the only points that could make a noticeable difference are the marshalling/unmarshalling of the function arguments, which usually just involves some casts, or reading a long field (namely, an address that is stored in a pointer). This is also because, as you already noted, the CUDA API is basically a C API and not a C++ one, where the binding itself may be more challenging in general.

In any case, one could argue that there is no practical way to achieve a faster invocation (that is, a lower pure call overhead) than with JNI. But still, the memory management (that is, particularly the handling of the kernel parameters) may have some degrees of freedom here. I’m curious to see the first results.


#10

I do parallel MCMC (Markov Chain Monte Carlo) simulations and other probabilistic stuff, which runs many fairly complex kernels. However, the GPU hardware is so fast that they still run quickly, so the kernel enqueueing is not negligible at all.