Hello world

I just tried out JOCL today. I tried the sample, and it works! Thanks to the developers!
I've been wanting to put my underutilised GPU to work; so far I've tolerated slow Java, even though I use an ExecutorService with threads. Finally I'm able to use something much better than that.

It turns out that passing single scalar arguments is a little tricky. The kernel:

__kernel void mat_mul(const int N,
		__global float *A, __global float *B, __global float *C) {
	int i, j, k;
	i = get_global_id(0);
	j = get_global_id(1);
	float tmp = 0.0f;
	// C(i, j) = sum(over k) A(i,k) * B(k,j)
	for (k = 0; k < N; k++) {
		tmp += A[i * N + k] * B[k * N + j];
	}
	//printf("i: %d,\tj: %d\n", i, j);
	C[i * N + j] = tmp;
}

To pass that N, I need to create a Pointer object and pass N as an array with a single element. Host code:

int N = 10;
Pointer pN = Pointer.to(new int[]{N});
Pointer srcA = Pointer.to(srcArrayA);
Pointer srcB = Pointer.to(srcArrayB);
Pointer dst = Pointer.to(dstArray);
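
For completeness, this is roughly how those pointers are then used; a minimal sketch, assuming memA, memB and memC are the cl_mem buffers created for srcArrayA, srcArrayB and dstArray as in the samples (hypothetical names):

// The scalar N is passed by value via pN with Sizeof.cl_int;
// the buffers are passed as pointers to their cl_mem handles.
clSetKernelArg(kernel, 0, Sizeof.cl_int, pN);
clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(memA));
clSetKernelArg(kernel, 2, Sizeof.cl_mem, Pointer.to(memB));
clSetKernelArg(kernel, 3, Sizeof.cl_mem, Pointer.to(memC));

// One work-item per element of the N x N result matrix:
long[] globalWorkSize = { N, N };
clEnqueueNDRangeKernel(commandQueue, kernel, 2, null, globalWorkSize, null, 0, null, null);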

The rest is very similar to the examples at
http://www.jocl.org/samples/samples.html

Good to hear that you find JOCL useful!

There are certainly cases where 8 CPU cores are heating up under the load of several threads from an ExecutorService, while the GPU could offer 10 TFLOPS, which remain unused. The cases where GPUs play out their full potential are a bit narrow (and I wrote a bit more about that in https://stackoverflow.com/a/22868938/3182664 , even though the question only asked about CUDA), but there are many applications that can benefit from the GPU.


It’s true that passing single-number arguments to a kernel is a bit inconvenient. But there is a reason for that:

The OpenCL clSetKernelArg function always expects a pointer to the actual argument value. In C/C++ this is simple: You can just take the address of a variable, using &:

int someInt = 123;
clSetKernelArg(kernel, 0, sizeof(int), &someInt);

Now, I did consider offering convenience functions in the Pointer class. One could have a method like this:

public static Pointer to(int value) { ... }

so that one could call

int someInt = 123;
clSetKernelArg(kernel, 0, Sizeof.cl_int, Pointer.to(someInt));

The reason why I explicitly decided not to offer this method is simple: It would not have the behavior that people would likely expect. For a sequence of calls like this…

int someInt = 123;
Pointer pointer = Pointer.to(someInt);
someInt = 999;
clSetKernelArg(kernel, 0, Sizeof.cl_int, pointer);

people might expect the pointer to point to 999 afterwards - because that’s what similar C/C++ code would do. But in Java, there is no such thing as a “pointer to an int value”. Using the int[] array-based method makes this clearer. A sequence like

int someInt[] = { 123 };
Pointer pointer = Pointer.to(someInt);
someInt[0] = 999;
clSetKernelArg(kernel, 0, Sizeof.cl_int, pointer);

does have the desired effect.


More generally speaking: If I had to re-design JOCL from scratch, I’d implement the whole Pointer class differently. Other JNI-oriented libraries have classes like IntReference or IntPointer that more closely resemble the C/C++ world. But I think that for OpenCL/JOCL, the current Pointer class is sufficient, despite the slightly clumsy way of passing single-value arguments to kernels.
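
Just as an illustration: a minimal sketch of what such a wrapper could look like on top of the existing Pointer class (hypothetical, not part of JOCL):

import org.jocl.Pointer;

// Hypothetical IntReference, not part of JOCL: the value lives in a
// single-element array, so later set() calls are visible through a
// previously obtained Pointer.
public class IntReference {
    private final int[] data;

    public IntReference(int value) { data = new int[] { value }; }
    public void set(int value) { data[0] = value; }
    public int get() { return data[0]; }
    public Pointer getPointer() { return Pointer.to(data); }
}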

Thanks for clarifying! I think the current solution is fine; it would probably help to document this somewhere, so that those looking through the Javadocs etc. would find it.
I actually figured this out by reviewing the samples http://www.jocl.org/samples/JOCLReduction.java and http://www.jocl.org/samples/reduction.cl. Those really helped.

I did a little naive benchmark comparing this kernel against Java:

Matrix size | Flops         | OpenCL dur (ns) | OpenCL GFlops | Java dur (ns) | Java GFlops | Speedup
------------|---------------|-----------------|---------------|---------------|-------------|---------
10          | 1,900         | 518,227         | 0.004         | 5,217,628     | 0.000       | 10.068x
100         | 1,990,000     | 340,564         | 5.843         | 59,498,475    | 0.033       | 174.706x
1000        | 1,999,000,000 | 72,972,074      | 27.394        | 2,129,508,273 | 0.939       | 29.183x

This is basically square matrix multiplication; the flop counts above correspond to N^2 * (2N - 1), i.e. N multiplications and N - 1 additions for each of the N^2 output elements. My OpenCL code is probably not very optimised.
The equivalent Java code does everything in Java, except that it uses 8 concurrent threads via an ExecutorService.
The speedup with OpenCL is very substantial. The GPU is an Nvidia GTX 1070, the CPU an Intel Haswell i7-4790 (4 cores x 2 hyperthreads = 8 hardware threads).
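
For reference, a minimal sketch of the kind of CPU baseline described above (not my exact benchmark code; the names are illustrative): each task computes a contiguous band of rows of C = A * B.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// N x N row-major matrices; the rows of C are split across wg threads.
static void matMulJava(float[] A, float[] B, float[] C, int N, int wg)
        throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(wg);
    int rowsPerTask = N / wg; // assumes wg divides N, as in the snippet further below
    for (int t = 0; t < wg; t++) {
        int from = t * rowsPerTask;
        int to = from + rowsPerTask;
        pool.execute(() -> {
            for (int i = from; i < to; i++) {
                for (int j = 0; j < N; j++) {
                    float tmp = 0.0f;
                    for (int k = 0; k < N; k++) {
                        tmp += A[i * N + k] * B[k * N + j];
                    }
                    C[i * N + j] = tmp;
                }
            }
        });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
}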

Oops, new users cannot attach files. Linked here instead:
java opencl matrix multiplication tests

(Attached via an edit by Marco13:)

matmul.zip (3.6 KB)

The results look a little odd between size 10 and size 100, given that I used an ExecutorService with 8 threads in Java.
The Java results behave just like the OpenCL ones: when the problem size is very small, e.g. 10x10, there is overhead in setting up the executor service and threads, so the GFlops are very low.
At 100x100, both Java and OpenCL benefit from parallelization, but vectorization in OpenCL takes that much further than what is possible in Java: a 174x speedup!
I'd guess that at 1000x1000, perhaps due to my rather unoptimised OpenCL code and the driver setup, memory latency is limiting things somewhat. But it is still a 29x speedup for nearly 2 billion flops; the choice is clear, OpenCL is the outright winner here.

Oops, there is a goof in my Java code: I made it determine the number of threads based on:

int wg = 8;                  // start from 8 hardware threads
if (N < 8) {
    wg = N;                  // tiny matrices: at most one thread per row
} else {
    while (N % wg != 0) {    // halve until the thread count divides N evenly
        wg /= 2;
        if (wg == 1) break;
    }
}

where wg is the number of concurrent threads. Hence, this benchmark is biased at best:
for 10x10 it runs on 2 threads, since 2 is the largest of {8, 4, 2} that divides 10 evenly;
for 100x100 it runs on 4 threads (100 % 4 == 0);
for 1000x1000 it runs on 8 threads (1000 % 8 == 0).
My guess is that the differences would be smaller than the numbers suggest. Nevertheless, the 1000x1000 case is the most representative of the large-matrix case, and a 29x speedup vs. 8 Java threads on the CPU is still very significant. For 10x10 and 100x100, the real speedups are likely much smaller: perhaps for 10x10 we can divide by 4 (~3x), and for 100x100 divide by 2 (~87x instead of 174x).

For your local, personal estimation, such a benchmark may be … ok. But as you noticed, there are many caveats. I have not (yet) looked into the code in detail, but setting up such a benchmark is difficult. On the one hand, this refers to the infrastructure (time measurement, proper synchronization etc.). On the other hand, it refers to one aspect that is crucial for estimating the performance improvement that can be achieved with the GPU, namely whether you take the time for memory copies into account or not. When comparing the time of the CPU and the GPU, one should probably always list

  • the time for the CPU
  • the time for the GPU, without memory copies
  • the time for the GPU, including memory copies

(all this determines whether your computation of the FLOPs makes sense - again, I haven’t yet looked at this in detail).
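
To make the distinction concrete, here is a minimal sketch of how the two GPU timings could be taken with JOCL; it assumes the usual static import of org.jocl.CL.*, and that commandQueue, kernel, the buffers memA/memB/memC and the host pointers srcA/srcB/dst are set up as in the samples (hypothetical names, n = N * N):

long t0 = System.nanoTime();
// Blocking writes: the host-to-device copies are finished when these return
clEnqueueWriteBuffer(commandQueue, memA, CL_TRUE, 0, n * Sizeof.cl_float, srcA, 0, null, null);
clEnqueueWriteBuffer(commandQueue, memB, CL_TRUE, 0, n * Sizeof.cl_float, srcB, 0, null, null);

long t1 = System.nanoTime();
clEnqueueNDRangeKernel(commandQueue, kernel, 2, null, new long[]{ N, N }, null, 0, null, null);
clFinish(commandQueue); // wait for the kernel before reading the clock

long t2 = System.nanoTime();
// Blocking read: the device-to-host copy of the result
clEnqueueReadBuffer(commandQueue, memC, CL_TRUE, 0, n * Sizeof.cl_float, dst, 0, null, null);
long t3 = System.nanoTime();

System.out.println("GPU, kernel only:      " + (t2 - t1) + " ns");
System.out.println("GPU, including copies: " + (t3 - t0) + " ns");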

Beyond that, there are many possible optimizations and different ways of implementing a matrix multiplication on the GPU and the CPU. You may have seen JOCLBlast (https://github.com/gpu/JOCLBlast), which is a binding to CLBlast (https://github.com/CNugteren/CLBlast), and the author of the latter has published a tutorial on that: the OpenCL matrix-multiplication SGEMM tutorial (https://cnugteren.github.io/tutorial/pages/page1.html).

And, to state the obvious: In order to compare the performance “objectively”, you’d also have to know the exact type of the CPU and GPU this is running on.

I’ll try to have a closer look at your code ASAP. Maybe I can do this alongside some other tasks: Bringing the JOCL Samples to GitHub has been on my TODO list for quite a while now. I might have a bit of time for that in the next 1-2 weeks, but cannot promise it yet. Beyond that, there’s also a “JOCLUtils” library that is yet to be published … more on that in the other thread :wink:

Thanks! I’m using Java for various casual tasks, basically for trials & experiments rather than any serious code. This foray into OpenCL is simply because I’ve got a GPU installed. I have used it with Python and the various libraries that use OpenCL under the hood, but I’ve meddled in Java more often and am more familiar with it.
I started wondering whether it is, after all, worth the effort to make a detour and do heterogeneous programming in Java + OpenCL. I’m not too bothered by precise performance statistics; more than that, I’m curious to find out if this is worth the trouble. This experiment is proving beyond most (perhaps all) doubt that there are very real benefits to integrating Java + OpenCL. OpenCL these days runs on the CPU as well, hence a GPU is not an absolute necessity, but it is really good to have. On the notion of ‘worth the effort’: performance statistics like those for multiplying 100x100 matrices, where OpenCL delivers about 100 times the performance of Java (8 concurrent CPU threads) based on System.nanoTime(), simply ‘smash the glasses’. I.e. for this particular metric, GPU-accelerated OpenCL on a GTX 1070 is about 100 times faster than an Intel i7-4790 running 8 concurrent Java threads, one on each hyperthread (virtual core). (I did some code fixes offline so that it really runs 8 Java threads, and verified it.) This is astonishing performance!
The JVM is the real bottleneck, and that is exactly what makes Java + OpenCL heterogeneous programming well worth it. Java is known for being ‘hard to break’, given that there are no raw pointers and garbage collection has hidden away tons of developer memory leaks. This is perhaps a very good combination: ‘hard to break’ + binary-level high performance with OpenCL.

OK. But regarding the last part: You should be aware that nearly every JNI library, and particularly one like JOCL that allows you to execute your own pieces of code (kernels), drills a hole into the JVM. When you set up your kernel parameters in the wrong way, you may crash the JVM; for example, a kernel that writes beyond the end of a buffer can take down the whole process instead of throwing an exception (and there’s hardly a way to avoid that…).

@ag123 The samples have been moved from the website to the newly created JOCLSamples repo: https://github.com/gpu/JOCLSamples

These are currently really only the original standalone samples (with minor cleanups). But maybe I can add further samples (and things like the benchmark that you mentioned) later.