Wrapper class

hi there,

Are there any plans to perhaps add some wrapper classes, while leaving all the existing classes the same?
A motivation is that using OpenCL seems to involve a lot of boilerplate code.
I’m thinking that, for the simplest case, we could have a wrapper class or maybe an interface with a callback method like void compute(cl_context context); the user could then implement the interface or extend the class and override the method. And/or perhaps a class representing the device, so that we can call device.getContext().
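Something like this rough sketch (the names are just made up to illustrate the idea):

import org.jocl.cl_context;

public interface ComputeTask
{
    // called by the wrapper once platform, device and context are set up
    void compute(cl_context context);
}

The wrapper would then do the platform, device and context setup, and simply call compute(context) on whatever the user passes in.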

Of course, with a wrapper or utility class we’d lose some flexibility, but it may lead to tidier code.
And for the cases that are less generic, one can always fall back to the standard interfaces.

There are not only plans. Such classes already exist. But they are far too rough around the edges to be published.

You mentioned „leaving the existing classes the same“, and interfaces. The „core“ of the current state of my local „JOCLUtils“ project indeed consists mainly of static utility methods. It already allows writing code like this:

cl_platform_id platform = Platforms.getPlatforms().get(0);
cl_device_id device = Devices.getDevices(platform).get(0);
cl_context context = Contexts.create(platform, device);
    
cl_mem srcA = Mems.create(context, srcArrayA);
cl_mem dst = Mems.create(context, 1000 * Sizeof.cl_float);
            
cl_kernel kernel = Kernels.createFromSource(context, programSource, "sampleKernel");

This can already be much more concise than the „low-level API“.

And these static methods avoid one problem that was the main reason why I hesitated to create „real“, „object oriented“ wrappers. It may seem tempting to work towards something like this pseudocode:

Queue queue = Wrapper.createQueue(Device.withHighestFlops());
float result[] = queue.read(array).compute(kernel).write(result);

But there’s one thing that makes a „high-level, object oriented wrapper“ really, really difficult:

State!

Assuming that such a Wrapper-API should be 100% thread-safe, to be used and mixed arbitrarily with Java Threads, ThreadPoolExecutors and so on, it can be really challenging to avoid race conditions.

Beyond that, there are many more details. An obvious one is the Garbage Collector. One might think that „some reference counting could work here“, but the answer is: no, it does not work and will never work. Any attempt to do automatic resource management on the GPU with Java is doomed to fail.
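Just to illustrate the core of the problem (a sketch, with invented names): the Java-side object is only a tiny handle, so the GC feels no pressure to collect it, no matter how large the real allocation on the device is:

import static org.jocl.CL.*;
import org.jocl.*;

class ManagedBuffer
{
    private final cl_mem mem; // a few bytes on the Java heap...

    ManagedBuffer(cl_context context, long size)
    {
        // ...but possibly hundreds of megabytes on the device
        mem = clCreateBuffer(context, CL_MEM_READ_WRITE, size, null, null);
    }

    @Override
    protected void finalize()
    {
        // May run far too late, or never: the device can run out of memory
        // long before the GC decides to collect these tiny handle objects.
        clReleaseMemObject(mem);
    }
}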

However, I agree that people could benefit greatly from some sort of convenience wrapper, even if it is not really „Object Oriented“, but only a set of utility functions.

As mentioned in the other thread: Maybe I’ll have a bit of time for that soon, so I might be able to publish an early state of the core of the „JOCLUtils“ project, as some sort of „preview version“, to gather feedback.

hi Marco13,
If need be, use synchronized methods or sections, or perhaps offer two sets of the affected classes. I know that this sounds like a bummer, but after all, if this bothers some people, they can always go back to the low-level code.
For now, as I’m pretty much a novice in OpenCL, I’d think the real benefit of OpenCL is in use cases where most of the acceleration is done in the OpenCL kernel code. That would make any slowdown caused by synchronized methods less severe. The other trouble with synchronized methods, of course, is deadlocks. If that is ‚hard to deal with‘, one way is perhaps to document the affected classes with ‚not thread safe‘ disclaimers, so that people know to test them well when using them, to avert complications. A tiny sketch of the kind of synchronized facade I mean follows below.
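Roughly like this (just a sketch, all names invented):

import static org.jocl.CL.*;
import org.jocl.*;

class SharedQueue
{
    private final cl_command_queue queue;

    SharedQueue(cl_command_queue queue)
    {
        this.queue = queue;
    }

    // Only one host thread at a time may touch the underlying queue
    synchronized void run(cl_kernel kernel, long globalWorkSize)
    {
        clEnqueueNDRangeKernel(queue, kernel, 1, null,
            new long[]{ globalWorkSize }, null, 0, null, null);
        clFinish(queue);
    }
}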

A utility function is what I’m thinking of for now: some of the ‚boilerplate‘ code, such as the platform and device initialization, can be placed in utility functions. It seems that once the context is determined, it can be passed on to the compute routine. This may help the classes ‚look cleaner‘, by keeping the code that does the actual processing in a class of its own. Something like the sketch below.
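For example (a rough sketch with the plain API; I just picked the first platform and the first GPU device as arbitrary defaults):

import static org.jocl.CL.*;
import org.jocl.*;

public class SimpleInit
{
    public static cl_context createDefaultContext()
    {
        CL.setExceptionsEnabled(true);

        // Pick the first platform
        cl_platform_id platforms[] = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);

        // Pick the first GPU device of that platform
        cl_device_id devices[] = new cl_device_id[1];
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 1, devices, null);

        // Create the context for that device
        cl_context_properties properties = new cl_context_properties();
        properties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);
        return clCreateContext(properties, 1, devices, null, null, null);
    }
}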

Taking a cue from the Python world, I think we can first address the ‚single threaded‘ use case.
Take Keras and TensorFlow, for instance: they have Python front ends, but the way they are used, at least superficially, is ‚single threaded‘ (i.e. no multiple threads). ‚Deep down‘ in those codes, I speculate, they basically assemble the buffers and kernels and call out to the GPUs, but the tasks are also mostly ‚single threaded‘. This possibly makes sense, as typical Jupyter-notebook scripts look very much like sequential batch processing to me: they first load the data and then process it sequentially, possibly in batches. I’d guess this works because those CNN workloads are very much an iterative, sequential solution search. The only parts that are vectorized and massively parallel are ‚deep down‘, where the matrix multiplications etc. are worked out, and I’d guess the input is pretty much processed sequentially as a batch as well. One of the ways in which I think that parallelism is hidden is that the input matrices can be huge, e.g. x * y * z, where z indexes the individual samples. If that whole matrix can be loaded into a buffer, the GPU can literally process all of it in a vectorized fashion. This gives an illusion of highly parallel processing for sequential input data.

Either way, thanks very much for all that effort; I’m not expecting anything, really. I just casually stumbled into this while trying to pick up OpenCL, and I’d say this is a really good effort :smiley:

Comparing this to the Python world may bring a message across, but if you ever tried to use Keras in a multithreaded environment, you may have noticed odd behavior, „crashes“, and you may have found the quirky workarounds that have to be applied. Many people have worked hard to make „machine learning with Python“ look easy. It is not.

However, it’s certainly true that using multiple host threads already is somewhat „advanced“. The vast majority of use cases is comparatively simple, and the pattern is nearly boringly common:

  • upload some data to the GPU (usually some float[] arrays)
  • call some kernels on the data
  • download some result from the GPU (usually, into another float[] array)

For things like a matrix multiplication, this single chain of calls can already be dramatically faster than doing it on the CPU (even when taking the time for the memory copies into account!).
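In plain JOCL, that chain looks roughly like this (a sketch, assuming that the context, command queue, kernel and the array size n have already been set up, and that the kernel takes the two buffers as its arguments):

import static org.jocl.CL.*;
import org.jocl.*;

float srcArray[] = new float[n];
float dstArray[] = new float[n];

// Upload: create a device buffer that is filled from the host array
cl_mem srcMem = clCreateBuffer(context,
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    n * Sizeof.cl_float, Pointer.to(srcArray), null);
cl_mem dstMem = clCreateBuffer(context,
    CL_MEM_WRITE_ONLY, n * Sizeof.cl_float, null, null);

// Compute: set the kernel arguments and enqueue the kernel
clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(srcMem));
clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(dstMem));
clEnqueueNDRangeKernel(queue, kernel, 1, null,
    new long[]{ n }, null, 0, null, null);

// Download: read the result back into the host array (blocking)
clEnqueueReadBuffer(queue, dstMem, CL_TRUE, 0,
    n * Sizeof.cl_float, Pointer.to(dstArray), 0, null, null);

clReleaseMemObject(srcMem);
clReleaseMemObject(dstMem);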

You may know that I’m also developing https://github.com/jcuda , and (similarly to JOCL) it tries to expose the whole underlying API, with all its complexity (but also, all its flexibility). There are libraries like „CUDA4J“, https://www.ibm.com/support/knowledgecenter/en/SSYKE2_8.0.0/com.ibm.java.api.80.doc/com.ibm.cuda/index.html that only expose a tiny, tiny fraction of the CUDA API, but it’s exactly the part that allows you to cover the simple use cases sketched above.

However, in any case, a first version of the JOCLUtils could consist of the „stateless“ (and thus, thread safe) collection of static utility methods. How far one can or should go beyond that (in terms of an Object-Oriented wrapper) is a different question.

Thanks, I’ll check out jcuda in due course.
I’d guess this is in part because the GPU and the driver themselves are very much designed that way, i.e. the earlier GPUs aren’t exactly designed for multithreaded access. It would seem then that for that hardware we’d need to serialize the GPU access, perhaps with a queue etc. My guess is that one design pattern is to use a singleton, e.g. Driver.getInstance() etc., and add the requests to a queue (see the sketch below). This of course would be a big bottleneck in multithreaded environments. The only relief in this case is that kernels normally execute fast. I.e. it probably is a bad idea to queue a million kernels that each add two numbers, rather than placing the numbers in two arrays and letting OpenCL take care of it. Unfortunately, there are quite a lot of use cases in which the calls are, after all, for small pieces of work, and in particular every kernel in the queue is a different kernel for a small piece of work. These probably won’t benefit from OpenCL.
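The sketch I have in mind (all names invented): funnel all GPU work through a single worker thread, so the underlying OpenCL objects are only ever touched by one thread:

import java.util.concurrent.*;

public class GpuDriver
{
    private static final GpuDriver INSTANCE = new GpuDriver();

    // All requests are queued up and executed by one worker thread
    private final ExecutorService queue = Executors.newSingleThreadExecutor();

    private GpuDriver() { }

    public static GpuDriver getInstance()
    {
        return INSTANCE;
    }

    // Submit a piece of GPU work; submissions are executed in order
    public Future<?> submit(Runnable gpuWork)
    {
        return queue.submit(gpuWork);
    }
}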

It seems the more recent GPUs and OpenCL stacks may allow multithreaded access, but I’d guess that depends on the GPU and its driver stack.

But literally, as in my other thread, it isn’t that running things on the CPU is slow; it is just that the JVM, possibly along with all the synchronized low-level (system) dependencies and the Java execution and memory management processes themselves, adds significant overhead to conventional CPU-intensive compute. The same matrix multiplication in Java with 8 threads delivers less than 1 GFLOPS, while OpenCL easily exceeded that. I’ve used things like jblas a bit, where the BLAS routines are interfaced to OpenBLAS and execute with CPU features like SSE, AVX2 etc.; the speedups are similar, it’s just that those libraries are not as generic as OpenCL is these days. I’ve previously run C code on an early-generation AMD64 Athlon CPU; even there, when it is done in C, it easily delivers more than a couple of GFLOPS on the (single-core) CPU.

I think, and it would seem, OpenCL is here to stay and will become more pervasive. Even some supercomputers these days, as I read in the articles, are basically based on OpenCL. They would otherwise require proprietary compute SDKs, which could take a lot of effort when integrating conventional apps that were, say, scale-tested on small systems. OpenCL may thus open the world to ‚supercomputing as a service‘, and one may be able to run, say, a kernel remotely on a petaflops supercomputer and get the results back (possibly even over the internet) :wink:

Sure, the original idea of GPGPU was abusing the GPU with its shaders and texture storage for general computations. This opened the path to modern GPU computing, but with CUDA and OpenCL the idea has been taken much further. Particularly, for OpenCL, the idea was generalized to „heterogeneous computing“ and „abstract computing devices“, and it should be possible to exploit all of these.

In fact, the API of OpenCL itself is already rather object-oriented. Well, as object-oriented as it can be in C: you have memory objects and kernel objects, and throw combinations of them into a „queue object“… It seems to lend itself to a more convenient, object-oriented abstraction layer in an OO language like Java, but the devil is in the detail. (It’s somewhat easier for C++, e.g. see https://github.khronos.org/OpenCL-CLHPP/ , but not so much…).

„Tuning“ a system so that it achieves the highest possible performance is still something where quite some research has to be done. One can dive deeply into this, starting here:

Regarding the performance of BLAS with Java or C++: I’m pretty sure that there are still some myths around. Comparing a time-tested, raw, low-level C library that (may have been ported from Fortran and) has been optimized for decades in order to squeeze out the last 0.00x% of FLOPS and be ranked highest in some artificial LINPACK benchmark against a „Java matrix multiplication that uses executor services … for multithreading and such“ may not be fair. (No offense, you certainly did not intend to make an unfair comparison, but considering that this is a point where cache lines, prefetching, branch prediction, SSE, AVX and other things come into play, Java is simply one abstraction level higher…)

The gist that you linked to also takes this to another level, as it mixes OpenCL and MPI. There are different levels on which the parallelization can take place, that’s for sure. (Several years ago, I created https://github.com/javagl/HazelcastMatMul , and once considered implementing the „lowest level“ with CUBLAS or so, but never tried that, due to some mix of competition, lack of demand and lack of time…)