JOCLBLAS - Java bindings for clBLAS

EDIT: That was wrong

[spoiler]
A short (non-) update regarding CLBlast: It’s far away from being compilable on Windows. I just updated to the latest VS version, but it still complains about many usages of constexpr and the initializer lists. Although I could probably (try to) create the bindings without compiling the actual library, I think that this does not make so much sense. A pity, because the actual API (at least from the header) looked nicely clean and simple…

(Maybe I’ll try it one day in a VirtualBox or so, but I think that cleaning up JOCL, polishing JOCLBLAS and creating JOCLSPARSE are clearer goals here).
[/spoiler]

It seems that VS2015 still used some of the VS2013 toolchains … just having another look at all this…

EDIT2: So after just a few hours of downloading, installing, restarting and messing around with the configuration, it finally compiled :slight_smile:

It compiled, but only a DLL file :frowning:

I wonder how this is supposed to be used on Windows: The library does not „dllexport“ any function at all.

I’m not an expert here, but this seems odd to me. I’ll ask the author about this…

… and he responded that he’ll look into this during the weekend.

Yes, Cedric is very responsive. I hope he’ll solve that problem quickly, because CLBlast seems quite clean compared to clBLAS, and, as a plus, it has decent performance on NVIDIA, too!

A quick report: clblast builds in 10-ish SECONDS on linux, and the resulting libclblast.so size is 1.5 MB.

Yes, we had some mail conversation. He added the required declspecs in the “development” branch, but they are not yet entirely right (he cannot regularly test on Windows, the same as for me with Mac+Linux…). I already did the changes locally and compiled the DLL and the required LIB (and indeed, it’s comparatively small). I’ll use this one to actually build and test the JOCLBlast library today, assuming that the declspecs will be added in one or the other form (maybe I’ll just fork and send a pull request - that’s the githubby way of doing this … :o )

@dragandj

OK, after a bit of a hassle during the builds, and some other issues that are still discussed via mail, the first, early version of JOCLBlast has been pushed to https://github.com/gpu/JOCLBlast

A first SGEMM sample, which was basically created by replacing the actual SGEMM call from the corresponding JOCLBLAS sample with that from JOCLBlast:

Basic sample
[spoiler]

package org.jocl.samples.blast;

import static org.jocl.CL.*;
import static org.jocl.blast.CLBlast.CLBlastSgemm;
import static org.jocl.blast.Layout.kRowMajor;
import static org.jocl.blast.Transpose.kNo;

import java.nio.FloatBuffer;
import java.util.Locale;

import org.jocl.*;
import org.jocl.blast.CLBlast;

public class JOCLBlastSample
{
    private static cl_context context;
    private static cl_command_queue commandQueue;

    /**
     * The entry point of this sample
     * 
     * @param args Not used
     */
    public static void main(String args[])
    {
        CL.setExceptionsEnabled(true);
        CLBlast.setExceptionsEnabled(true);

        defaultInitialization();
        
        // Create the host input data:
        // Matrix A with size MxK
        // Matrix B with size   KxN
        // Matrix C with size M x N
        int M = 4;
        int N = 3;
        int K = 5;
        float A[] =  
        {
            11, 12, 13, 14, 15,
            21, 22, 23, 24, 25,
            31, 32, 33, 34, 35,
            41, 42, 43, 44, 45,
        };
        float B[] = 
        { 
            11, 12, 13,
            21, 22, 23,
            31, 32, 33,
            41, 42, 43,
            51, 52, 53,
        };
        float C[] = 
        {
            11, 12, 13,
            21, 22, 23,
            31, 32, 33,
            41, 42, 43, 
        };
        
        // Create the device input buffers
        cl_mem memA = clCreateBuffer(context, CL_MEM_READ_ONLY, 
            M * K * Sizeof.cl_float, null, null);
        cl_mem memB = clCreateBuffer(context, CL_MEM_READ_ONLY, 
            K * N * Sizeof.cl_float, null, null);
        cl_mem memC = clCreateBuffer(context, CL_MEM_READ_WRITE, 
            M * N * Sizeof.cl_float, null, null);

        // Copy the host data to the device
        clEnqueueWriteBuffer(commandQueue, memA, CL_TRUE, 0, 
            M * K * Sizeof.cl_float, Pointer.to(A), 0, null, null);
        clEnqueueWriteBuffer(commandQueue, memB, CL_TRUE, 0, 
            K * N * Sizeof.cl_float, Pointer.to(B), 0, null, null);
        clEnqueueWriteBuffer(commandQueue, memC, CL_TRUE, 0, 
            M * N * Sizeof.cl_float, Pointer.to(C), 0, null, null);

        // Execute GEMM:
        // C = alpha * A * B + beta * C
        float alpha = 10;
        float beta = 20;
        cl_event event = new cl_event();
        CLBlastSgemm(
            kRowMajor, kNo, kNo, M, N, K, alpha, 
            memA, 0, K, 
            memB, 0, N, beta, 
            memC, 0, N, 
            commandQueue, event);
        
        // Wait for the computation to be finished
        // XXX CLBlast does not set the event properly.
        // This would cause a CL_INVALID_EVENT error 
        //clWaitForEvents( 1, new cl_event[] { event });

        // Copy the result data back to the host
        float result[] = new float[M*N];
        clEnqueueReadBuffer(commandQueue, memC, CL_TRUE, 0, 
            M * N * Sizeof.cl_float, Pointer.to(result), 0, null, null);

        // Print the inputs and the result
        System.out.println("A:");
        print2D(FloatBuffer.wrap(A), K);

        System.out.println("B:");
        print2D(FloatBuffer.wrap(B), N);

        System.out.println("C:");
        print2D(FloatBuffer.wrap(C), N);
        
        System.out.println(
            "Result of C = " + alpha + " * A * B + " + beta + " * C:");
        print2D(FloatBuffer.wrap(result), N);

        // Clean up
        clReleaseMemObject(memA);
        clReleaseMemObject(memB);
        clReleaseMemObject(memC);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);        
    }
    
    /**
     * Default OpenCL initialization of the context and command queue
     */
    private static void defaultInitialization()
    {
        // The platform, device type and device number
        // that will be used
        final int platformIndex = 0;
        final long deviceType = CL_DEVICE_TYPE_ALL;
        final int deviceIndex = 0;

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Obtain the number of platforms
        int numPlatformsArray[] = new int[1];
        clGetPlatformIDs(0, null, numPlatformsArray);
        int numPlatforms = numPlatformsArray[0];

        // Obtain a platform ID
        cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_platform_id platform = platforms[platformIndex];

        // Initialize the context properties
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);
        
        // Obtain the number of devices for the platform
        int numDevicesArray[] = new int[1];
        clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
        int numDevices = numDevicesArray[0];
        
        // Obtain a device ID 
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
        cl_device_id device = devices[deviceIndex];

        // Create a context for the selected device
        context = clCreateContext(
            contextProperties, 1, new cl_device_id[]{device}, 
            null, null, null);
        
        String deviceName = getString(device, CL_DEVICE_NAME);
        System.out.printf("CL_DEVICE_NAME: %s\n", deviceName);
        
        // Create a command-queue
        commandQueue = clCreateCommandQueue(
            context, device, 0, null);

    }
    
    /**
     * Print the given buffer as a matrix with the given number of columns
     * 
     * @param data The buffer
     * @param columns The number of columns
     */
    private static void print2D(FloatBuffer data, int columns)
    {
        StringBuffer sb = new StringBuffer();
        for (int i=0; i<data.capacity(); i++)
        {
            sb.append(String.format(Locale.ENGLISH, "%5.1f ", data.get(i)));
            if (((i+1)%columns)==0)
            {
                sb.append("\n");
            }
        }
        System.out.print(sb.toString());
    }
    
    private static String getString(cl_device_id device, int paramName)
    {
        // Obtain the length of the string that will be queried
        long size[] = new long[1];
        clGetDeviceInfo(device, paramName, 0, null, size);

        // Create a buffer of the appropriate size and fill it with the info
        byte buffer[] = new byte[(int)size[0]];
        clGetDeviceInfo(device, paramName, buffer.length, Pointer.to(buffer), null);

        // Create a string from the buffer (excluding the trailing \0 byte)
        return new String(buffer, 0, buffer.length-1);
    }

}

[/spoiler]
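To check what the sample prints, the same operation can be computed on the host. This is only a minimal sketch in plain Java, not part of JOCLBlast: the class name `SgemmReference` and the method `sgemmRowMajor` are made up for this purpose. It assumes row-major storage and no transposition, like the `kRowMajor` / `kNo` arguments in the sample, and uses the same input matrices.

```java
public class SgemmReference
{
    /**
     * Computes alpha * A * B + beta * C for row-major matrices
     * A (M x K), B (K x N) and C (M x N). The result is returned
     * as a new array, the inputs are not modified.
     */
    static float[] sgemmRowMajor(int M, int N, int K,
        float alpha, float A[], float B[], float beta, float C[])
    {
        float result[] = new float[M * N];
        for (int r = 0; r < M; r++)
        {
            for (int c = 0; c < N; c++)
            {
                // Dot product of row r of A and column c of B
                float sum = 0;
                for (int k = 0; k < K; k++)
                {
                    sum += A[r * K + k] * B[k * N + c];
                }
                result[r * N + c] = alpha * sum + beta * C[r * N + c];
            }
        }
        return result;
    }

    public static void main(String args[])
    {
        // The same inputs as in the sample
        float A[] = {
            11, 12, 13, 14, 15,
            21, 22, 23, 24, 25,
            31, 32, 33, 34, 35,
            41, 42, 43, 44, 45 };
        float B[] = {
            11, 12, 13,
            21, 22, 23,
            31, 32, 33,
            41, 42, 43,
            51, 52, 53 };
        float C[] = {
            11, 12, 13,
            21, 22, 23,
            31, 32, 33,
            41, 42, 43 };
        float result[] = sgemmRowMajor(4, 3, 5, 10, A, B, 20, C);
        for (int i = 0; i < result.length; i++)
        {
            System.out.printf("%9.1f%s", result[i], (i % 3 == 2) ? "\n" : " ");
        }
    }
}
```

The values printed here should match the "Result of C = ..." matrix from the sample, since all intermediate values are small integers that a float represents exactly.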

Now (in addition to the issues regarding CLBlast itself), there are some infrastructure points for JOCLBLAS and JOCLBlast, e.g. whether the builds (CMake files) work as desired on all platforms - and of course, some documentation (a proper readme.md, at least) and some real tests have to be added. I’ll try to schedule this ASAP. (Eventually, I also wanted to update the JOCL utilities and finally bring them to GitHub…)

Hey, this is great! Thank you. I’ll definitely integrate this into Neanderthal once it stabilizes (not because I think it is early, but because I am working on some other library at the moment).
Is there a way to handle tuning parameters from Java code?

BTW, the NAR Maven plugin (nar-maven-plugin) is serving me well for easy multiplatform builds AND easy native library loading. When used with various other maven stuff (profiles, etc.), it really can achieve a simple mvn clean, mvn install, mvn deploy workflow, with different settings for different OS. An example of how I am using it for integration with ATLAS BLAS is in the GitHub repository uncomplicate/neanderthal-atlas (JNI bindings for ATLAS BLAS and LAPACK).

Regarding NAR, I had a look at this quite a while ago, and from skimming over your POM just now, it’s still not clear how the actual compilation takes place. Most native libraries have a CMake file, and it probably does not make sense to include “everything that the CMake Find scripts do” in some configuration file. In the end, for me, the conclusion basically was

  • Maven wants to build a JAR file
  • In order to build the final (one-in-all) JAR-File, the native libraries must be present
  • The native libraries have to be built on different OSes, and thus cannot be built in a single maven run

I also don’t see to what extent the NAR should help with loading the library, but maybe I’ll have to take another look at the documentation…

Something like this:

  1. The external libraries (CLBlast in this case) are being built by their cmake, make or whatever, as you already do.
  2. nar pom is for building the JNI/Java library that also links the library (dll, so, dylib) built using 1.
  3. Nar uses a third-party project[1], set as a dependency, that automagically generates the class that is responsible for loading the appropriate library for the appropriate OS at runtime. You don’t even have to know about this, nar handles this completely automatically.
  4. nar packages it as any good old Java jar library + one native jar per OS (actually they produce .nar files, but I use a maven plugin to rename that to jar). You have to do that for each OS
  5. I use maven shade plugin to include the content for all platforms in one uberjar, and this is the only jar the user needs.
  6. Now, all that can be done by hand, but my process now is just one “mvn clean install” per os, copy the native-dependent jars to the main machine (Linux in my case) and one mvn install to get the final jar.

[1]

<dependency>
    <groupId>org.scijava</groupId>
    <artifactId>native-lib-loader</artifactId>
    <version>2.1.3</version>
</dependency>
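Step 3 above, picking the right library for the OS at runtime, roughly boils down to something like the following. This is only a sketch and not the code that NAR or native-lib-loader actually generate: the class `PlatformLibraryName`, the `/natives/<os>-<arch>/` layout and the directory names are assumptions made up for illustration. Only `System.mapLibraryName` and the `os.name` / `os.arch` system properties are standard Java.

```java
import java.util.Locale;

public class PlatformLibraryName
{
    /**
     * Maps the value of the "os.name" system property to a short
     * directory name. The names are arbitrary choices for this sketch.
     */
    static String osDirectory(String osNameProperty)
    {
        String os = osNameProperty.toLowerCase(Locale.ENGLISH);
        // Check for Mac first, because "darwin" contains "win"
        if (os.contains("mac") || os.contains("darwin"))
        {
            return "macos";
        }
        if (os.contains("win"))
        {
            return "windows";
        }
        if (os.contains("linux"))
        {
            return "linux";
        }
        return "unknown";
    }

    /**
     * Builds the (hypothetical) resource path of a native library
     * inside the JAR. Note that System.mapLibraryName always uses the
     * naming convention of the platform that the JVM is running on.
     */
    static String resourcePath(
        String baseName, String osNameProperty, String archProperty)
    {
        return "/natives/" + osDirectory(osNameProperty) + "-"
            + archProperty + "/" + System.mapLibraryName(baseName);
    }

    public static void main(String args[])
    {
        System.out.println(resourcePath("JOCLBlast",
            System.getProperty("os.name"),
            System.getProperty("os.arch")));
    }
}
```

The resource found under such a path still has to be unpacked to a real file before it can be passed to `System.load`.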

OK, so there is one JAR per OS in the end? How does it handle the dependency to the other DLL? For example, the JOCLBlast.dll requires the CLBlast.dll - will/can this be included in the JAR as well? (At least, one of the last changes in the “LibLoader” was the possibility to load dependent DLLs before loading the actual JNI DLL…)

And… the build process for JOCL is also not so complicated: CMake for the native library, and then “mvn clean package”, which will pack the Java part (and all native libraries that are available then) together.

But there are different options that might even be better. For example, for JCuda, the native libraries are deployed as dedicated artifacts (but this is also just experimental - JCuda is not in Maven Central yet…).

Actually, there is one jar IN TOTAL at the end. You can see what the internal structure with native libs looks like here: https://clojars.org/repo/uncomplicate/neanderthal-native/0.5.0/

I guess that, in this case, the simplest solution would be to statically compile joclblast.dll, so there is no need for separate clblast.dll, but even if both libs are needed, clblast could be included as a resource. Additional bonus is that nar also versions the native library, so it would actually be joclblast-0.1.0.dll.

On the other hand, I only suggested NAR because it seemed to me that the current process is not completely straightforward for other potential contributors (I know that it is easy for you :)) - C tools are not that well known among Java developers and there are (were?) a few manual steps that require detailed hand holding. Potentially, it may not work so well on other OS without additional configuration. With detailed instructions that won’t be a problem.

[QUOTE=dragandj]Actually, there is one jar IN TOTAL at the end. You can see how the internal structure with native libs looks like here: Clojars Repository: uncomplicate/neanderthal-native/0.5.0/
[/quote]

OK, but still all natives have to be present before this can be built. For JCuda I considered one option that is used in a similar form for other libraries, namely that each native library is contained in its own JAR as its own Maven Artifact, and the main JAR only has a (platform-dependent) dependency to the JAR that contains the natives. This has the advantage that one can create(!) and use, for example, version „0.1.0“ with its dependency to the „windows-natives-0.1.0“, and later, the „linux-natives-0.1.0“ may be added, without affecting the existing JARs. But the fragmentation may be an issue, and in many cases, I’d also prefer a „one size fits all“ JAR.

This may work for CLBlast, but not in every case. It’s not always possible or desirable to do static linking (due to the file size, license issues etc). However, I considered extending the „LibUtils“ that are now used in JOCL and JCuda to become a more general, standalone library. This raises some questions (the versioning („0.1.0“) only being one of them - this is currently not solved perfectly), but it seems that the „native-lib-loader“ already covers most of these aspects.

There are some points of which I’m not sure whether they are covered by the „native-lib-loader“. Particularly, how and when it unpacks the libraries. In the end, they have to be loaded as files with System.load. And this has to be done in the „reverse dependency order“. So it FIRST has to load the CLBlast.dll, and THEN can load the JOCLBlast.dll. Where does it unpack these libraries? Into the default „temp“ directory? Will it unpack them each time that the JAR is loaded, or will it detect that the library already has been unpacked before, with the same version? (There had been some issues with Temp files on Windows: The (temporary) DLLs in the Temp directory could not be deleted at program exit, only with some odd workarounds). I’ll have to take a look at the source code for this (or, maybe, simply try it out…)
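To make these questions a bit more concrete: the behavior that I’d expect from such a loader could be sketched roughly like this. This is NOT how the „native-lib-loader“ or the JOCL „LibUtils“ are known to work. The class `DependentLibraryLoader`, the directory name `hypothetical-natives` and the resource prefix are assumptions for illustration. The sketch unpacks into a per-version directory below the default temp directory, skips files that have been unpacked before, and calls `System.load` in the order in which the file names are given (dependencies first).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DependentLibraryLoader
{
    /**
     * Unpacks the given library files from the classpath into
     * <tmp>/hypothetical-natives/<version>/ and loads them in the
     * given order. Dependencies must be listed first, for example
     * "CLBlast.dll" before "JOCLBlast.dll".
     */
    static void loadInOrder(Class<?> anchor, String resourcePrefix,
        String version, String... fileNames) throws IOException
    {
        Path directory = Paths.get(System.getProperty("java.io.tmpdir"),
            "hypothetical-natives", version);
        Files.createDirectories(directory);
        for (String fileName : fileNames)
        {
            Path target = directory.resolve(fileName);

            // Only unpack if this version was not unpacked before.
            // (A real implementation would also have to deal with
            // partially written files from an interrupted run.)
            if (!Files.exists(target))
            {
                String resource = resourcePrefix + fileName;
                try (InputStream in = anchor.getResourceAsStream(resource))
                {
                    if (in == null)
                    {
                        throw new IOException(
                            "Resource not found: " + resource);
                    }
                    Files.copy(in, target);
                }
            }
            // System.load requires an absolute path to a real file
            System.load(target.toAbsolutePath().toString());
        }
    }

    public static void main(String args[])
    {
        try
        {
            // Hypothetical resource layout and version
            loadInOrder(DependentLibraryLoader.class,
                "/natives/windows-amd64/", "0.1.0",
                "CLBlast.dll", "JOCLBlast.dll");
        }
        catch (IOException e)
        {
            System.out.println("Could not load libraries: " + e.getMessage());
        }
    }
}
```

Since the files are kept in a versioned directory and never deleted, this sidesteps the problem with deleting temporary DLLs at program exit, at the cost of leaving the files on disk.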

No.
(Sorry :wink: ). I hate it when people say: „I know how to do it, for me it’s easy“. It should be easy for everyone. I tried my best to bring this on a standard track of using CMake and Maven. And I think that this is a huge step forward, compared to the first versions of JOCL, which contained some odd make files that I had copied from some OpenCL samples for the native part, and NO build files for the Java part at all :wink:

And if there are possible improvements of the build process (maybe through something like the NAR plugin), I’m always open to hear about them :slight_smile: particularly from people who really use them.

I think I already told you that the NAR plugin was already considered as an option when I first tried to bring JOCL into Maven Central, but that back then, it did not seem as mature as now. There also had been some alternatives in discussion. Even going as far as trying to run the build directly from Maven using the CMake maven plugin. But in the end, the process of

  1. Building the natives
  2. Packing the natives into the JAR
seemed to be the simplest, and this is the current approach, without the NAR plugin. However, I’ll re-consider the NAR plugin, also in view of the questions mentioned above, although I’m not sure which benefits the NAR plugin itself brings exactly when the natives are built beforehand. The loading mechanism may be interesting, you mentioned

that automagically generates the class that is responsible for loading the appropriate library for the appropriate OS

(although this is done by the „native-lib-loader“, and does not seem to be related to the NAR plugin directly) : What does this class look like? (I had a look at a decompiled version of „NarSystems“ in your Neanderthal JAR, but where does the source code of this class come from?)

[QUOTE=Marco13]

(although this is done by the “native-lib-loader”, and does not seem to be related to the NAR plugin directly) : What does this class look like? (I had a look at a decompiled version of “NarSystems” in your Neanderthal JAR, but where does the source code of this class come from?)[/QUOTE]

It is generated by nar inside the target directory of the project as a part of maven build process.

OK, I’ll try how well “native-lib-loader” works with dependent DLLs (they don’t seem to say anything about that in their readme, at least). If it works, and the NAR plugin simplifies its integration (that is, its invocation during the Maven build), then this might be the next step.

OK, if I can help with some of that please ask, although it seems to me that you’ll quickly be better informed about this than me. :slight_smile:

Not necessarily. The subset of my projects that are “visible” (via https://github.com/jcuda, https://github.com/gpu/ and https://github.com/javagl ) are only the tip of the iceberg, and I’m iterating through these (and the “invisible” ones ;-)) in a round-robin fashion. Sometimes, things are delayed by other, more pressing things. However, the issue of a proper build process / mavenization is relevant for JOCL and even more for JCuda, and if something like the NAR could offer a clean solution here, that would be great. Maybe I can have another look at NAR during the weekend.

A side note: The most recent “development” branch of CLBlast contains the fixes that are required for building the proper lib on Windows, and the fix regarding the events (they are now filled with the proper events that are created internally).

Hi Marco,

Now that JOCLBlast is (mostly) working like a charm, how near on your horizon are JOCLSparse and JOCLFFT?

Yes, they are still on my “todo” list. I’d start with JOCLSparse, and already had a look at the headers etc., but it may be a bit more fiddly than the plain BLAS bindings. But recently I’ve been busy with several other tasks - ranging from nearly “recreational” things, over technical ones (CUDA 8 RC was published, so JCuda will need an update, too), and … sooner or later I’ll have to find a new job. That’s annoying.

However: “(mostly)” - well… I asked Cedric Nugteren whether he has an idea what might be wrong with loading the library on Mac. Particularly, what these “rpath” settings in the CLBlast makefile aim at. Right now, I’m not even sure whether the problem is caused by JOCL not having such settings, or by the fact that the settings in CLBlast would have to be different for our (admittedly somewhat exotic) application case. (Or, the worst case, whether it’s not possible to load a library like that at all on Mac - but that’s hard to imagine).

The fallback of static linking for 0.7.2 still exists, but I wouldn’t be surprised if it was just some minor magical CMake setting to get it running in the current form…