JOCLBLAS - Java bindings for clBLAS


#1

A first version of JOCLBLAS has just been pushed to https://github.com/gpu/JOCLBLAS

Note that this library is still under construction

It is intended to provide Java bindings (based on JOCL) for clBLAS, from https://github.com/clMathLibraries/clBLAS

A first preview of the library (and the matching JOCL library) may be obtained from the following links:

http://jocl.org/jocl-0.2.0-RC01-SNAPSHOT.jar
http://jocl.org/jocl-blas-0.0.1-SNAPSHOT.jar

(Note that these are intended as a preview, and may be removed when the consolidated, non-SNAPSHOT versions are available in Maven Central)

The JARs contain the native libraries of JOCL, JOCLBLAS and clBLAS for Windows, 64 bit. The native libraries should be loaded transparently at runtime, so it should be sufficient to just add the JARs to the classpath, as usual.

Here is a small sample (basically, the example from the clBLAS repository) performing an SGEMM operation:

/*
 * JOCLBLAS - Java bindings for clBLAS
 * 
 * Copyright 2016 Marco Hutter - http://www.jocl.org/
 */
package org.jocl.blas;

import static org.jocl.CL.*;
import static org.jocl.blas.CLBLAS.clblasSetup;
import static org.jocl.blas.CLBLAS.clblasSgemm;
import static org.jocl.blas.CLBLAS.clblasTeardown;
import static org.jocl.blas.clblasOrder.clblasRowMajor;
import static org.jocl.blas.clblasTranspose.clblasNoTrans;

import java.nio.*;
import java.util.*;

import org.jocl.*;

public class JOCLBLASSample
{
    private static cl_context context;
    private static cl_command_queue commandQueue;

    /**
     * The entry point of this sample
     * 
     * @param args Not used
     */
    public static void main(String args[])
    {
        defaultInitialization();
        
        CLBLAS.setExceptionsEnabled(true);
        clblasSetup();
        
        // Create the host input data:
        // Matrix A with size M x K
        // Matrix B with size K x N
        // Matrix C with size M x N
        int M = 4;
        int N = 3;
        int K = 5;
        float A[] =  
        {
            11, 12, 13, 14, 15,
            21, 22, 23, 24, 25,
            31, 32, 33, 34, 35,
            41, 42, 43, 44, 45,
        };
        float B[] = 
        { 
            11, 12, 13,
            21, 22, 23,
            31, 32, 33,
            41, 42, 43,
            51, 52, 53,
        };
        float C[] = 
        {
            11, 12, 13,
            21, 22, 23,
            31, 32, 33,
            41, 42, 43, 
        };

        // Create the device input buffers
        cl_mem memA = clCreateBuffer(context, CL_MEM_READ_ONLY, 
            M * K * Sizeof.cl_float, null, null);
        cl_mem memB = clCreateBuffer(context, CL_MEM_READ_ONLY, 
            K * N * Sizeof.cl_float, null, null);
        cl_mem memC = clCreateBuffer(context, CL_MEM_READ_WRITE, 
            M * N * Sizeof.cl_float, null, null);

        // Copy the host data to the device
        clEnqueueWriteBuffer(commandQueue, memA, CL_TRUE, 0, 
            M * K * Sizeof.cl_float, Pointer.to(A), 0, null, null);
        clEnqueueWriteBuffer(commandQueue, memB, CL_TRUE, 0, 
            K * N * Sizeof.cl_float, Pointer.to(B), 0, null, null);
        clEnqueueWriteBuffer(commandQueue, memC, CL_TRUE, 0, 
            M * N * Sizeof.cl_float, Pointer.to(C), 0, null, null);

        // Execute GEMM:
        // C = alpha * A * B + beta * C
        float alpha = 10;
        float beta = 20;
        cl_event event = new cl_event();
        cl_event[] events = { event };
        clblasSgemm(clblasRowMajor, clblasNoTrans, clblasNoTrans, M, N, K,
            alpha, memA, 0, K, memB, 0, N, beta, memC, 0, N, 1,
            new cl_command_queue[] { commandQueue }, 0, null, events);

        // Wait for the computation to be finished
        clWaitForEvents(1, events);

        // Copy the result data back to the host
        float result[] = new float[M*N];
        clEnqueueReadBuffer(commandQueue, memC, CL_TRUE, 0, 
            M * N * Sizeof.cl_float, Pointer.to(result), 0, null, null);

        // Print the inputs and the result
        System.out.println("A:");
        print2D(FloatBuffer.wrap(A), K);

        System.out.println("B:");
        print2D(FloatBuffer.wrap(B), N);

        System.out.println("C:");
        print2D(FloatBuffer.wrap(C), N);
        
        System.out.println(
            "Result of C = " + alpha + " * A * B + " + beta + " * C:");
        print2D(FloatBuffer.wrap(result), N);

        // Clean up
        clReleaseMemObject(memA);
        clReleaseMemObject(memB);
        clReleaseMemObject(memC);
        clblasTeardown();
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);        
    }
    
    /**
     * Default OpenCL initialization of the context and command queue
     */
    private static void defaultInitialization()
    {
        // The platform, device type and device number
        // that will be used
        final int platformIndex = 0;
        final long deviceType = CL_DEVICE_TYPE_ALL;
        final int deviceIndex = 0;

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Obtain the number of platforms
        int numPlatformsArray[] = new int[1];
        clGetPlatformIDs(0, null, numPlatformsArray);
        int numPlatforms = numPlatformsArray[0];

        // Obtain a platform ID
        cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_platform_id platform = platforms[platformIndex];

        // Initialize the context properties
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);
        
        // Obtain the number of devices for the platform
        int numDevicesArray[] = new int[1];
        clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
        int numDevices = numDevicesArray[0];
        
        // Obtain a device ID 
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
        cl_device_id device = devices[deviceIndex];

        // Create a context for the selected device
        context = clCreateContext(
            contextProperties, 1, new cl_device_id[]{device}, 
            null, null, null);
        
        String deviceName = getString(device, CL_DEVICE_NAME);
        System.out.printf("CL_DEVICE_NAME: %s%n", deviceName);
        
        // Create a command-queue for the selected device
        commandQueue = clCreateCommandQueue(
            context, device, 0, null);

    }
    
    /**
     * Print the given buffer as a matrix with the given number of columns
     * 
     * @param data The buffer
     * @param columns The number of columns
     */
    private static void print2D(FloatBuffer data, int columns)
    {
        StringBuilder sb = new StringBuilder();
        for (int i=0; i<data.capacity(); i++)
        {
            sb.append(String.format(Locale.ENGLISH, "%5.1f ", data.get(i)));
            if (((i+1)%columns)==0)
            {
                sb.append("\n");
            }
        }
        System.out.print(sb.toString());
    }
    
    private static String getString(cl_device_id device, int paramName)
    {
        // Obtain the length of the string that will be queried
        long size[] = new long[1];
        clGetDeviceInfo(device, paramName, 0, null, size);

        // Create a buffer of the appropriate size and fill it with the info
        byte buffer[] = new byte[(int)size[0]];
        clGetDeviceInfo(device, paramName, buffer.length, Pointer.to(buffer), null);

        // Create a string from the buffer (excluding the trailing \0 byte)
        return new String(buffer, 0, buffer.length-1);
    }
    
    
}

#2

Hi Marco,

This is fantastic stuff that is still missing in the Java ecosystem.

Do you have any benchmarks? How far are you from the full support?

I develop a Clojure matrix library, and currently maintain my own kernels for the GPU part. Performance is satisfactory, and I learned a lot building it, but I do not have the time and resources to go for full clBLAS coverage, and especially clSPARSE coverage. I would prefer to be able to reuse clBLAS, but at the time it seemed so undocumented that it was easier to go for my own implementation than to get clBLAS to work. So, jocl-clblas would be very useful. When (approximately) do you plan to bring it to the state of at least alpha quality? I would be very interested in using it in my Neanderthal library.

Do you also plan to include clSPARSE support as part of this library, and if not, do you plan to support it as a separate library?

Thumbs up for great work!


#3

Hello,

Yes, I know that there is some demand for such a library.

I made a first “benchmark” for another thread here, comparing a simple SGEMM with JCublas to the SGEMM of JOCLBLAS - I’ll repeat the results here:

ROUGH first benchmark results
[spoiler]


benchmark | cols | rows  | iterations | GFLOPS     | avg.ms  | MB/s HtoD  | MB/s DtoH  | 
    Sgemm |  200 | 40000 |         50 |  756.70776 | 4.22885 | 2845.26855 | 2798.70239 | 
SgemmJOCL |  200 | 40000 |         50 |  669.12274 | 4.78238 | 2745.60156 | 2726.60913 | 
    Sgemm |  250 | 25600 |         50 |  638.10065 | 5.01488 | 2837.22876 | 2794.69263 | 
SgemmJOCL |  250 | 25600 |         50 |  793.85791 | 4.03095 | 2712.33130 | 2690.46387 | 
    Sgemm |  300 | 17777 |         50 | 1323.16248 | 2.41834 | 2826.13184 | 2795.17651 | 
SgemmJOCL |  300 | 17777 |         50 |  781.54242 | 4.09429 | 2694.78687 | 2723.56519 | 
    Sgemm |  350 | 13061 |         50 |  998.84589 | 3.20364 | 2809.91382 | 2783.36011 | 
SgemmJOCL |  350 | 13061 |         50 |  749.62042 | 4.26875 | 2728.29834 | 2743.68384 | 
    Sgemm |  400 | 10000 |         50 | 2221.35034 | 1.44057 | 2791.08643 | 2771.02197 | 
SgemmJOCL |  400 | 10000 |         50 |  725.81610 | 4.40883 | 2675.41724 | 2681.30737 | 
    Sgemm |  450 |  7901 |         50 | 1227.35168 | 2.60716 | 2788.84180 | 2774.18750 | 
SgemmJOCL |  450 |  7901 |         50 |  746.02606 | 4.28927 | 2644.92798 | 2682.72095 | 
    Sgemm |  500 |  6400 |         50 | 1062.76782 | 3.01101 | 2779.85767 | 2730.93726 | 
SgemmJOCL |  500 |  6400 |         50 |  841.04102 | 3.80481 | 2611.94556 | 2625.33936 | 
    Sgemm |  550 |  5289 |         50 | 1750.34875 | 1.82812 | 2802.30005 | 2779.72803 | 
SgemmJOCL |  550 |  5289 |         50 |  792.39508 | 4.03819 | 2622.79761 | 2643.74268 | 
    Sgemm |  600 |  4444 |         50 | 1330.41821 | 2.40502 | 2767.75464 | 2725.53857 | 
SgemmJOCL |  600 |  4444 |         50 |  762.01740 | 4.19896 | 2598.58179 | 2596.15625 | 
    Sgemm |  650 |  3786 |         50 | 2412.81152 | 1.32591 | 2766.19580 | 2752.69019 | 
SgemmJOCL |  650 |  3786 |         50 |  821.62775 | 3.89370 | 2609.78198 | 2597.44043 | 
    Sgemm |  700 |  3265 |         50 | 1846.03345 | 1.73328 | 2893.15894 | 2813.14136 | 
SgemmJOCL |  700 |  3265 |         50 |  926.34888 | 3.45410 | 2606.59619 | 2603.51611 | 
    Sgemm |  750 |  2844 |         50 | 1580.02527 | 2.02497 | 2733.06934 | 2723.69434 | 
SgemmJOCL |  750 |  2844 |         50 |  613.42462 | 5.21580 | 2477.55054 | 2530.30054 | 
    Sgemm |  800 |  2500 |         50 | 2223.13721 | 1.43941 | 2762.05078 | 2747.63965 | 
SgemmJOCL |  800 |  2500 |         50 |  812.36914 | 3.93910 | 2503.89331 | 2546.23364 | 
    Sgemm |  850 |  2214 |         50 | 1768.19531 | 1.80932 | 2766.73242 | 2719.52100 | 
SgemmJOCL |  850 |  2214 |         50 |  862.27576 | 3.71022 | 2616.78467 | 2544.56372 | 
    Sgemm |  900 |  1975 |         50 | 2471.87134 | 1.29436 | 2816.65308 | 2696.80762 | 
SgemmJOCL |  900 |  1975 |         50 |  937.53320 | 3.41268 | 2583.44409 | 2469.92651 | 
    Sgemm |  950 |  1772 |         50 | 1810.81567 | 1.76631 | 2807.92944 | 2666.83936 | 
SgemmJOCL |  950 |  1772 |         50 |  953.76654 | 3.35350 | 2495.50513 | 2473.54785 | 
    Sgemm | 1000 |  1600 |         50 | 1708.23010 | 1.87328 | 2724.58472 | 2677.20776 | 
SgemmJOCL | 1000 |  1600 |         50 |  772.83923 | 4.14058 | 2556.13867 | 2443.28369 |

[/spoiler]

As also mentioned in the other thread, I was a bit disappointed, but of course, CUBLAS is heavily tweaked and optimized for NVIDIA GPUs, and far less versatile than OpenCL. This comes at a price, I guess. However, this is really the result of a first, quick run, so take it with a grain of salt. (I’m particularly curious to compare the clBLAS-based version running on a CPU against some standard Java implementation)
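For reference, the GFLOPS column is derived from the standard 2·M·N·K operation count of a GEMM, divided by the measured time (the table rows are consistent with M=rows and N=K=cols). A small sketch of the computation - the class and method names are just made up for illustration:

```java
// Sketch: how the GFLOPS figure for a C = alpha*A*B + beta*C benchmark
// is derived. This uses the standard 2*M*N*K operation count (one
// multiply and one add per inner-product term); the alpha/beta updates
// are not counted here.
public class GemmGflops
{
    static double gemmGflops(long m, long n, long k, double milliseconds)
    {
        double flops = 2.0 * m * n * k;
        return flops / (milliseconds * 1e-3) / 1e9;
    }

    public static void main(String[] args)
    {
        // Example: the last table row, rows=1600, cols=1000, 4.14058 ms
        // avg. - this reproduces the ~773 GFLOPS printed there.
        System.out.println(gemmGflops(1600, 1000, 1000, 4.14058));
    }
}
```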


Regarding the coverage of the functions and the “version” of the library: Actually, it should already be complete. But I’m always hesitant about making things public. I have never ever released something with a version number “1.0.0”.

And of course I also intend to create similar bindings for clSPARSE, clFFT and clRNG, with priority on clSPARSE. One unknown here is my code generator. It has grown over the years, and it’s a huge mess with ridiculous generalizations in it, and it’s hard to maintain the “configurations/setups” for generating the different bindings. I already started cleaning this up, aiming at making it more “pragmatic” (it will never be used for anything other than generating the JNI binding code, anyhow), but have not yet figured out a precise timeline.


BTW: Actually, I would have preferred to “implement” a JOCL-based BLAS only based on kernels - some sort of “BLAS kernel repository”, consisting of a bunch of pure .CL files that can be loaded and executed as desired. Maybe with a thin convenience wrapper, but without (yet another) native library. But I think that such a repo hardly exists for BLAS, and even less so for sparse BLAS kernels. I tried to figure out where the clBLAS kernels reside, but they are built “magically”, and can probably not be loaded as simple CL files.
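Just to illustrate what I have in mind: an entry of such a repository could be a single, dependency-free kernel string, together with a plain-Java reference implementation to verify it against. This is a naive sketch of my own (one work-item per element of C, row-major), of course nowhere near the performance of the generated clBLAS kernels:

```java
// Sketch of a "kernel repository" entry: a naive, generic, row-major
// SGEMM kernel as a plain CL string, plus a plain-Java reference
// implementation that kernel results could be verified against.
// Completely unoptimized - only for illustration.
public class NaiveGemm
{
    public static final String SGEMM_KERNEL_SOURCE =
        "__kernel void sgemm(int M, int N, int K, float alpha,\n" +
        "    __global const float *A, __global const float *B,\n" +
        "    float beta, __global float *C)\n" +
        "{\n" +
        "    int row = get_global_id(0);\n" +
        "    int col = get_global_id(1);\n" +
        "    if (row >= M || col >= N) return;\n" +
        "    float sum = 0.0f;\n" +
        "    for (int k = 0; k < K; k++)\n" +
        "    {\n" +
        "        sum += A[row * K + k] * B[k * N + col];\n" +
        "    }\n" +
        "    C[row * N + col] = alpha * sum + beta * C[row * N + col];\n" +
        "}\n";

    /** Plain-Java reference: C = alpha * A * B + beta * C (row-major) */
    public static void sgemmReference(int M, int N, int K,
        float alpha, float A[], float B[], float beta, float C[])
    {
        for (int row = 0; row < M; row++)
        {
            for (int col = 0; col < N; col++)
            {
                float sum = 0.0f;
                for (int k = 0; k < K; k++)
                {
                    sum += A[row * K + k] * B[k * N + col];
                }
                C[row * N + col] = alpha * sum + beta * C[row * N + col];
            }
        }
    }
}
```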

bye
Marco


#4

[QUOTE=Marco13;128238]Hello,

Yes, I know that there is some demand for such a library.

I made some first “benchmark”, for a another thread here, comparing a simple SGEMM with JCublas to the SGEMM of JOCLBLAS - I’ll repeat the results here:

ROUGH first benchmark results
[spoiler]


benchmark | cols | rows  | iterations | GFLOPS     | avg.ms  | MB/s HtoD  | MB/s DtoH  | 
    Sgemm |  200 | 40000 |         50 |  756.70776 | 4.22885 | 2845.26855 | 2798.70239 | 
SgemmJOCL |  200 | 40000 |         50 |  669.12274 | 4.78238 | 2745.60156 | 2726.60913 | 
...
    Sgemm | 1000 |  1600 |         50 | 1708.23010 | 1.87328 | 2724.58472 | 2677.20776 | 
SgemmJOCL | 1000 |  1600 |         50 |  772.83923 | 4.14058 | 2556.13867 | 2443.28369 |

[/spoiler]

As also mentioned in the other thread, I was a bit disappointed, but of course, CUBLAS is heavily tweaked and optimized for NVIDIA GPUs, and far less versatile than OpenCL. This comes at a price, I guess. However, this is really the result of a first, quick run, so take it with a grain of salt. (I’m particularly curious to compare the clBLAS-based version running on a CPU against some standard Java implementation)

[/quote]

clBLAS times seem about expected here. I hope you optimized it for your architecture, and I think the benchmarks should also include 2^n dimensions that are usually more performant than round x10 dimensions.

What bugs me is why the jocl-clblas versions are so slow. My library does not have any significant overhead, despite also being called from Java and using JOCL for the API calls.


[QUOTE=Marco13;128238]
Regarding the coverage of the functions and the “version” of the library: Actually, it should already be complete. But I’m always hesitant with making things public. I have never ever released something with a version number “1.0.0”.

And of course I also intend to create similar bindings for clSPARSE, clFFT and clRNG, with priority on clSPARSE. One unknown here is my code generator. It has grown over the years, and it’s a huge mess with ridiculous generalizations in it, and it’s hard to maintain the “configurations/setups” for generating the different bindings. I already started cleaning this up, aiming at making it more “pragmatic” (it will never be used for something else than generating the JNI binding code, anyhow), but have not yet figured out a precise timeline.
[/QUOTE]


Now, THOSE would be a godsend, even with the overhead and the slowdown! But, it’s better if the source of the overhead is tracked down by then :slight_smile:

[QUOTE=Marco13;128238]
BTW: Actually, I would have preferred to “implement” a JOCL-based BLAS only based on kernels - some sort of “BLAS kernel repository”, consisting of a bunch of pure .CL files that can be loaded and executed as desired. Maybe with a thin convenience wrapper, but without (yet another) native library. But I think that such a repo does hardly exist for BLAS, and even less for sparse BLAS kernels. I tried to figure out where the clBLAS kernels reside, but they are built “magically”, and can probably not be loaded as simple CL files.
[/QUOTE]

Wouldn’t we all? :slight_smile:

The problem is that the kernels have to be extremely low-level and architecture-dependent. clBLAS does it by relying heavily on C macros and a bunch of C code generation - that’s why you couldn’t find the CL files. There are no pure CL files with kernels there, just a horrible hodge-podge of strings inside C++ files that contain C code full of macros.

If you go the pure-kernel route, there are a couple of things to have in mind:

  1. Much of the code logic is not in the kernels, but on the host, so you’ll have much more to do than just generating API-to-kernel-call bindings.
  2. Each architecture needs separate host and kernel implementations, or you go the macro route like clBLAS and end up with a mess.
  3. Those kernels need a lot, lot, lot of debugging and fine tuning.

You can see my implementation and benchmarks at Neanderthal - Fast Native Matrix and Linear Algebra in Clojure
For now, it is not a good idea for me to try to migrate to jocl-clblas, because of the performance overhead, but when that is solved, jocl-clblas would be THE Java library for such computations.


#5

[QUOTE=dragandj;128239]clBLAS times seem about expected here. I hope you optimized it for your architecture, and I think the benchmarks should also include 2^n dimensions that are usually more performant than round x10 dimensions.

What bugs me is why jocl-clblas versions are so slow. My library does not have any significant overhead despite also being called from Java, and using JOCL for api calls.
[/quote]

There might be a misunderstanding (although I’m not sure). In these benchmarks…


benchmark | cols | rows  | iterations | GFLOPS     | avg.ms  | MB/s HtoD  | MB/s DtoH  | 
...
    Sgemm | 1000 |  1600 |         50 | 1708.23010 | 1.87328 | 2724.58472 | 2677.20776 | 
SgemmJOCL | 1000 |  1600 |         50 |  772.83923 | 4.14058 | 2556.13867 | 2443.28369 |

[inline]Sgemm[/inline] is the performance of the JCublas-based SGEMM and
[inline]SgemmJOCL[/inline] is the performance of the JOCLBLAS-based SGEMM.

I did not do a comparison between clBLAS and JOCLBLAS. But I expect the performance to be basically the same, as the overhead for the JNI calls should be negligible for such “large” matrices.

You mentioned
[QUOTE=dragandj;128239]I hope you optimized it for your architecture[/QUOTE]
I’m not sure what you mean, but I just compiled clBLAS as-it-is, from the given makefiles, without diving deeper into possible optimization flags. And I’m not sure whether this would be appropriate at all: My intention was to include the [inline]clBLAS.dll[/inline] (and the respective Linux/Mac libraries) in the JAR, so that people can simply add the JAR to their classpath, and don’t have to manually compile a (native) dependency on their own and put it into the (LD_LIBRARY_)PATH.

BTW: Eventually, this should also be available at Maven, therefore I extended the native library loader so that it can load native dependencies (like the clBLAS.DLL) that are required for loading the actual binding library (JOCLBLAS.DLL), as in

String dependentLibraryNames[] = { "clBLAS" };
LibUtils.loadLibrary("JOCLBLAS_0_0_1", dependentLibraryNames);

But this also has to be cleaned up a bit (fortunately, this is not “public”, so refactorings are not so critical in this case).

I guess this also refers to the (assumed) overhead in the benchmark, but again: This was JCublas vs. JOCLBLAS, and not clBLAS vs. JOCLBLAS. If this is particularly interesting, I could do a dedicated benchmark comparing clBLAS and JOCLBLAS, but I am not sure what the point would be here: The actual kernel execution time (when measured with CL event profiling) should be exactly the same. The only relevant part would be the overhead that is introduced by the JNI calls. And indeed, this will introduce an overhead, and it will be a considerable overhead for certain application patterns. For example, a loop like

for (int i=0; i<veryOften; i++) multiply(someSmallMatricesOfSize3x3);

Doing this in Java with JOCLBLAS will probably be far slower than doing this in C with clBLAS.
However, even in C this will likely be far slower with clBLAS than when the matrix multiplication is done in plain C!

Or to state it somewhat suggestively:

  • The overhead of a kernel call (compared to a plain C implementation) may be compensated when the matrix size exceeds 10x10
  • The overhead of a JNI-call and a kernel call (compared to a plain Java implementation) may be compensated when the matrix size exceeds 20x20
    (These numbers are just guesses, of course).

Some libraries, like CUBLAS, therefore have “Batched” versions of certain functions - for example, cublasSgemmBatched - in order to avoid the launch overhead for the case of many, small matrices.

[QUOTE=dragandj;128239]
The problem is that the kernels have to be extremely low-level and architecture-dependent. clBLAS does it by heavy relying on C macros and a bunch of C code generation - that’s why you couldn’t find the CL files. There are no pure CL files with kernels there, just a bunch of horrible hodge-podge of strings inside C++ files that contain C code full of macros.

If you go the pure-kernel route, there are a couple of things to have in mind:

  1. Much of the code logic is not in kernels, but on the host, so you’ll have much more to do than to generate API to kernel calls.
  2. Each architecture needs separate host and kernels implementations, or you are going the macro route like clBLAS and end up with a mess.
  3. Those kernels need a lot, lot, lot debugging and fine tuning.

You can see my implementation and benchmarks at Neanderthal - Fast Native Matrix and Linear Algebra in Clojure
For now, it is not a good idea for me to try to migrate to jocl-clblas, because of the performance overhead, but when that is solved, jocl-clblas would be THE Java library for such computations.[/QUOTE]

Yes, when I tried to find the kernels in clBLAS (naively hoping for, but of course not expecting, a directory called “kernels”, with plain, dependency-free files like [inline]gemm.cl[/inline], [inline]dot.cl[/inline], etc.) I eventually found the macro trickery that they use there. But I still have to read things like https://github.com/clMathLibraries/clBLAS/wiki/AutoGemm more thoroughly…

On the one hand, this might seem reasonable: There are too many degrees of freedom to have a single, plain [inline]gemm.cl[/inline] file covering all devices, architectures, shared memory sizes, matrix sizes and parameters. On the other hand, it’s a pity :frowning: The idea of OpenCL is exactly that: You should not have to care about the target device (so much), but instead write a generic kernel with generic instructions for any (parallel) “computing device” with certain features. It should then be the job of the OpenCL implementation’s compiler to make the best out of this kernel.

Once I had a glimpse at the CUBLAS source code, and (from a code maintenance standpoint) it’s a mess: It’s abusing macro trickery to turn the C-preprocessor into a code generator, causing the (seemingly simple) SGEMM operation to become unreadable.

Obviously, Guy Blelloch’s NESL was not only 20, but maybe 30 years ahead of its time :wink:


#6

Even better that the benchmarks were JOCL vs CUBLAS (I somehow missed that part).

However, regarding the optimization: As far as I know, you do not have to recompile clBLAS, but you have to optimize the kernel generation and parameters at runtime. The idea of OpenCL (in my understanding, and it seems to be how it works) is not that you write one piece of code that is optimal on all platforms, but that you write code in one language and one set of APIs that can be optimized for many platforms. So, either you:

  1. write general code that works OKish or pathetically on most platforms, but far from optimal
  2. write dedicated, optimized kernels for each platform
  3. optimize parts of the code for each platform using macros (ugly and not always enough)
  4. write some code generation tool, as they do in clBLAS (even uglier)

Just to mention, regarding the solution you mentioned of having one kernel working for all architectures: any kernel can work on any architecture, but the speed difference can be several orders of magnitude…

I prefer option 2, since there are not 1000 architectures for our use-case, but generally a couple - AMD GPU, CPU, nvidia GPU.

You can read more about clBLAS tuning at Cedric Nugteren | OpenCL matrix-multiplication tutorial for Kepler

*** Edit ***

If you follow the link to the tutorial, you’ll see that clBLAS does not perform well on nvidia, but is much faster on amd. Not as fast as cublas, but, let’s say 30% slower than cublas on comparable nvidia hardware instead of a couple times slower.


#7

This is probably related to (or covered by) the “AutoGemm” part that I referred to. The goal until now was just to get the bindings running, and I hope that these optimization options will not collide with the approach taken so far.

[QUOTE=dragandj;128263]
The idea of OpenCL (in my understanding, and it seems to be how it works) is not that you write one code that is optimal on all platforms, but that you write code in one language and set of APIs that could optimize for many platforms. So, either you:

  1. write general code that work OKish or pathetically for most platforms, but far from optimal
  2. write a dedicated optimized kernels for each platform
  3. optimize parts of the code for each platform using macros (ugly and not always enough)
  4. write some code generation tool, as they do in clBLAS (even uglier)

Just to mention, the solution that you mentioned, of having one kernel working for all architectures: any kernel can work on any architecture, but the speed difference can be even several orders of magnitude…

I prefer option 2, since there are not 1000 architectures for our use-case, but generally a couple - AMD GPU, CPU, nvidia GPU.
[/QUOTE]

Until now, I thought that the main difference would be the one between GPU and CPU, and I didn’t expect “significant” differences between AMD and NVIDIA GPUs, but it seems that this is not necessarily true, after…

[QUOTE=dragandj;128263]
You can read more about clBLAS tuning at Cedric Nugteren | OpenCL matrix-multiplication tutorial for Kepler

If you follow the link to the tutorial, you’ll see that clBLAS does not perform well on nvidia, but is much faster on amd. Not as fast as cublas, but, let’s say 30% slower than cublas on comparable nvidia hardware instead of a couple times slower.[/QUOTE]

… quickly skimming over these sites. Until now, I really only looked at the figures, but maybe I got something wrong there (the tutorial is quite elaborate - certainly worth allocating some more time to read it thoroughly). In any case, I’ll try to proceed with my other tasks, and with the code generator, aiming at full support of the clMathLibraries - further optimizations of these libraries should then “transparently” become available in the JOCL versions.


#8

Not only must the kernels be specific to AMD and nvidia - even for one vendor, different generations may need different kernels. Some differences can be covered by macros, but some may be at a broader level and require different algorithms. CPU and GPU require completely different approaches for most stuff.


#9

Well, certain parameters, for example, the ubiquitous BLOCK_SIZE for the size of some shared memory region, will depend on the device. And I think that’s OK. Querying some device parameters and doing some string replacements or #defines before compiling the kernel is fine.
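A sketch of what I mean - the heuristic and the values are made up, only for illustration; in practice the local memory size would come from clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE, and the resulting string would be passed as the build options of clBuildProgram:

```java
// Sketch: derive a BLOCK_SIZE #define from a device parameter and pass
// it to the kernel compiler as a build option. The heuristic (two
// square float tiles must fit into local memory, block size capped at
// 32) is made up, only for illustration.
public class KernelBuildOptions
{
    static int chooseBlockSize(long localMemSizeBytes)
    {
        int blockSize = 4;
        // Double the block size while two BLOCK_SIZE x BLOCK_SIZE
        // float tiles (for A and B) still fit into local memory
        while (blockSize < 32
            && 2L * (blockSize * 2) * (blockSize * 2) * Float.BYTES <= localMemSizeBytes)
        {
            blockSize *= 2;
        }
        return blockSize;
    }

    static String buildOptions(long localMemSizeBytes)
    {
        // The returned string is intended as the "options" argument
        // of clBuildProgram
        return "-DBLOCK_SIZE=" + chooseBlockSize(localMemSizeBytes);
    }

    public static void main(String[] args)
    {
        // For a device reporting 32 KB of local memory:
        System.out.println(buildOptions(32 * 1024)); // -DBLOCK_SIZE=32
    }
}
```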

The larger problem I see is when kernels are tweaked in a way that cannot be covered with such a simple parameter set, e.g.

  • Using cl_float on a GPU but cl_float4 on a CPU (because they can be translated to some AVX instructions)
  • Using shared memory on a GPU, but no shared memory on a CPU (because “shared memory” does not exist on a CPU, and is emulated with “global memory”)
  • Even worse, and affecting the host code as well: Using image buffers instead of normal cl_mem buffers for certain operations…

But maybe it’s simply the usual trade-off between genericity and performance…
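To sketch how such a divergence affects the host code: the dispatch then has to happen before the kernel is even loaded, roughly like this (the file names are hypothetical entries of the imagined kernel repository; the device type constants are the CL_DEVICE_TYPE_* values from cl.h):

```java
// Sketch: selecting a kernel variant based on the device type - the
// kind of dispatch that a simple #define can no longer cover. The file
// names are hypothetical.
public class KernelVariants
{
    // Values of CL_DEVICE_TYPE_GPU / CL_DEVICE_TYPE_CPU from cl.h
    static final long DEVICE_TYPE_GPU = 1 << 2;
    static final long DEVICE_TYPE_CPU = 1 << 1;

    /** Pick the kernel file for the given device type */
    static String gemmKernelFileFor(long deviceType)
    {
        if ((deviceType & DEVICE_TYPE_GPU) != 0)
        {
            // GPU: tiled kernel using local ("shared") memory
            return "gemm_gpu_local.cl";
        }
        // CPU: no local memory, rely on vectorization (e.g. float4)
        return "gemm_cpu_float4.cl";
    }
}
```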


#10

Any updates regarding the release of JOCL and preview of JOCLBLAS?


#11

Not directly. Recently, I cleaned up my code generator (actually aiming at making http://jvulkan.org/ come true :wink: but) with JOCLBLAS and JOCLSPARSE in mind. I’ll continue with this over the weekend, and hopefully can create an updated version of JOCLBLAS next week (and maybe a first shot at JOCLSPARSE shortly after that).


#12

Thanks :slight_smile: Although my primary concern is the final Win/Linux/MacOSX release of JOCL 2.0, the next step is trying to integrate Neanderthal (a JOCL-based matrix library) with JOCLBLAS and JOCLSPARSE.

*** Edit ***

BTW, did you look at CLBlast? It aims to be a much cleaner (and faster on NVIDIA!) implementation of BLAS for OpenCL, and it is fairly complete. https://github.com/CNugteren/CLBlast


#13

The final release (including all binaries) depends on the contributions (particularly for MacOS). Recently, I at least downloaded VirtualBox, so maybe sooner or later I can create the Linux binaries myself. (Apart from that, OpenCL is officially already at version 2.1 now…)

I wasn’t aware of CLBlast - there are some other matrix libraries for OpenCL (the most well-known probably being ViennaCL), but not all of them lend themselves to being accessed from Java directly. One of the major issues here is templates. CLBlast also uses them, but it also offers a pure C API, so I’ll definitely have a closer look at that one.


#14

You probably meant “MacOS binaries”, since the Linux binaries are already there - I use them, for great good :slight_smile:


#15

I meant that for future versions, I will then be able to create the Linux binaries immediately, together with the Windows ones (although there are some quick contributors for the Linux ones :wink: I’d really like to be able to just trigger a release and immediately have all binaries in place…)


#16

Just a short update: I ran the generator over clSPARSE, and it uses some rather … “unusual” concepts, so I’ll have to revisit this later.

But the C API of CLBlast indeed looks very clean and straightforward, so these bindings could be easier to create. It seems to omit some of the degrees of freedom that clBLAS has - primarily, the option to dispatch to multiple command queues and to enqueue the operations with given cl_event wait lists. I guess that for most applications, these are not really important. They will certainly be relevant when you’re doing HPC with multiple GPUs and sophisticated synchronization, but I guess most people simply want to call the basic BLAS routines as they are.

(I didn’t yet try to compile CLBlast - I just used the headers to see what the API looks like. But I’ll try to compile it and create the bindings in the next few days)


#17

Great news! If there are any issues with CLBlast, please contact the author - he is very responsive.

He has just released the new version :slight_smile:


#18

A small update for JOCLBLAS has just been pushed, but I’m still in the process of … “maintenance” of some parts.

The update also refers to the latest version of clBLAS. I’m not sure whether last time (end of 2015) I somehow omitted the “AutoGemm” part (which they refer to in the wiki), but this time:

  • I had to install Python
  • The compilation took longer
  • The resulting DLL is a whopping 18 megabytes!
    Packing such a large library (and maybe Linux/MacOS binaries of similar sizes) into a JAR (which, in the end, might be ~60 MB) leaves a somewhat uncanny feeling. But maybe this is outweighed by the potential convenience of having one über-JAR that can just be used on any OS…
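The “one JAR for all OSes” approach boils down to storing the native libraries as resources and picking the right one at runtime. A minimal sketch of just the path-selection logic - the `/lib/...` resource layout and the method name here are hypothetical, and the actual JOCL loader is considerably more elaborate (it also extracts the resource to a temporary file before calling System.load):

```java
public class NativeLibraryPath
{
    /**
     * Compute a (hypothetical) resource path of the native library
     * inside the JAR, based on the OS name and architecture as
     * reported by the "os.name" and "os.arch" system properties.
     */
    static String resourcePath(String osName, String osArch)
    {
        String os;
        String extension;
        if (osName.startsWith("Windows"))
        {
            os = "windows";
            extension = ".dll";
        }
        else if (osName.startsWith("Mac"))
        {
            os = "apple";
            extension = ".dylib";
        }
        else
        {
            os = "linux";
            extension = ".so";
        }
        // Normalize the architecture name reported by some JVMs
        String arch = osArch.equals("amd64") ? "x86_64" : osArch;
        return "/lib/" + os + "-" + arch + "/JOCLBLAS" + extension;
    }

    public static void main(String[] args)
    {
        System.out.println(resourcePath(
            System.getProperty("os.name"),
            System.getProperty("os.arch")));
    }
}
```

With ~18 MB per clBLAS binary, three OSes at similar sizes plus the (small) JOCL and JOCLBLAS natives is how the ~60 MB estimate above comes about.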

In any case, I’ll try to tackle CLBlast tomorrow. Although I usually don’t like “publishing” stuff that hasn’t reached a certain level of maturity, some early tests of (and maybe feedback on) the different libraries may be helpful in the long run.


#19

Marco, where should I look for those binaries/source code? What is the official source repository?


#20

The source code is at https://github.com/gpu/JOCLBLAS/

The build instructions are similar to those of JOCL: there is a CMake file and a Maven POM (although both still have to be reviewed - particularly the CMake file, for the dependency on clBLAS).

Pre-compiled binaries (and source code archives etc.) will be added to the website, jocl.org, once things have settled a bit.