JCublas DSYRK and DGEMM benchmark

I am doing a lot of processing and it is taking days, even with CUDA, so maybe I ought to get a more powerful GPU card. I also did some research and it seems there are faster CUBLAS GEMM versions, and maybe I can take advantage of a tuned GEMM in my SYRK implementation. I do everything in double precision. But it seems from my own tests that there isn't much room for improvement - and I need it 100x faster!

I would be interested if anybody could report performance with the following code, together with details of the GPU they are using. It might be useful for others looking at purchasing a GPU card.

Thanks.

package datamining;
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */

import java.util.concurrent.*;
import java.util.Random;
import java.util.*;

import jcuda.*;
import jcuda.runtime.*;
import static jcuda.runtime.JCuda.*;
import jcuda.jcublas.JCublas;
import static jcuda.jcublas.JCublas2.*;
import static jcuda.jcublas.cublasOperation.*;
import static jcuda.jcublas.cublasFillMode.*;
import static jcuda.driver.JCudaDriver.*;
import jcuda.jcublas.cublasHandle;
import jcuda.driver.*;
import java.util.Arrays;
import jcuda.jcublas.JCublas2;

//import jcuda.utils.KernelLauncher;
/**
 *
 * @author Nigel
 */
public class TestCuda {

    private static int numTasks = Runtime.getRuntime().availableProcessors();

    private TestCuda() {
        super();
    }

    public static void main(String[] args) {

        System.getProperties().list(System.out);
        Pointer pointer = new Pointer();
        JCuda.cudaMalloc(pointer, 4);
        JCuda.cudaFree(pointer);

        int[] nDevices = new int[1];
        CUdevice[] devices;
        JCudaDriver.setExceptionsEnabled(true);
        JCuda.setExceptionsEnabled(true);
        JCublas2.setExceptionsEnabled(true);
        JCudaDriver.cuInit(0);
        int count = JCudaDriver.cuDeviceGetCount(nDevices);
        if (nDevices[0] == 0) {
            System.out.println("No GPU devices found" + count + " " + nDevices[0]);
            return;
        }

        System.out.println("Total number of devices: " + nDevices[0]);
        devices = new CUdevice[nDevices[0]];
        for (int dev = 0; dev < nDevices[0]; dev++) {
            devices[dev] = new CUdevice();
            int ireturn = JCudaDriver.cuDeviceGet(devices[dev], dev);
            int[] major = new int[1];
            int[] minor = new int[1];
            ireturn = JCudaDriver.cuDeviceComputeCapability(major, minor, devices[dev]);
            System.out.println("Version: " + String.format("%d.%d", major[0], minor[0]));
            cudaDeviceProp deviceProp = new cudaDeviceProp();
            JCuda.cudaGetDeviceProperties(deviceProp, dev);
            System.out.println(deviceProp.toFormattedString());
        }

        long current;
        int total = 40000 * 200 * 200;
        int nCols = 200;
        int nIts = 5;
        while (nCols < 1000) {
            int nRows = total / (nCols * nCols);
            double dh_A[] = createRandomDoubleData(nRows * nCols);
            double dh_B[] = new double[nRows * nCols];
            System.arraycopy(dh_A, 0, dh_B, 0, nRows * nCols);
            double dh_C[] = createRandomDoubleData(nCols * nCols);
            System.out.println("rows " + nRows + " iterations " + nIts + " numTasks " + numTasks + " nCols " + nCols);
            Pointer d_A = new Pointer();
            Pointer d_B = new Pointer();
            Pointer d_C = new Pointer();

            JCublas.cublasAlloc(nRows * nCols, Sizeof.DOUBLE, d_A);
            JCublas.cublasAlloc(nCols * nCols, Sizeof.DOUBLE, d_C);
            current = System.currentTimeMillis();
            for (int j = 0; j < nIts; j++) {
                JCublas.cublasSetVector(nRows * nCols, Sizeof.DOUBLE, Pointer.to(dh_A), 1, d_A, 1);
                JCublas.cublasDsyrk('U', 'T', nCols, nRows, 1.0d, d_A, nRows, 0.0d, d_C, nCols);
                JCublas.cublasGetVector(nCols * nCols, Sizeof.DOUBLE, d_C, 1, Pointer.to(dh_C), 1);
            }
            current = System.currentTimeMillis() - current;
            System.out.println("CUBLAS DSYRK " + current);
 
            JCublas.cublasFree(d_A);
            JCublas.cublasFree(d_C);
//        
            current = System.currentTimeMillis();
            JCublas.cublasAlloc(nRows * nCols, Sizeof.DOUBLE, d_A);
            JCublas.cublasAlloc(nRows * nCols, Sizeof.DOUBLE, d_B);
            JCublas.cublasAlloc(nCols * nCols, Sizeof.DOUBLE, d_C);
            for (int j = 0; j < nIts; j++) {
                JCublas.cublasSetVector(nRows * nCols, Sizeof.DOUBLE, Pointer.to(dh_A), 1, d_A, 1);
                JCublas.cublasSetVector(nRows * nCols, Sizeof.DOUBLE, Pointer.to(dh_B), 1, d_B, 1);
                JCublas.cublasSetVector(nCols * nCols, Sizeof.DOUBLE, Pointer.to(dh_C), 1, d_C, 1);
                JCublas.cublasDgemm('T', 'N', nCols, nCols, nRows, 1.0d, d_A, nRows, d_B, nRows, 0.0d, d_C, nCols);
                JCublas.cublasGetVector(nCols * nCols, Sizeof.DOUBLE, d_C, 1, Pointer.to(dh_C), 1);
            }
            current = System.currentTimeMillis() - current;
            System.out.println("CUBLAS DGEMM " + current);
            JCublas.cublasFree(d_A);
            JCublas.cublasFree(d_B);
            JCublas.cublasFree(d_C);
//
        
            try {
                cublasHandle handle = new cublasHandle();
                cublasCreate(handle);
                Pointer pAlpha = Pointer.to(new double[]{1.0d});
                Pointer pBeta = Pointer.to(new double[]{0.0d});

                cudaMalloc(d_A, nRows * nCols * Sizeof.DOUBLE);
                cudaMalloc(d_C, nCols * nCols * Sizeof.DOUBLE);
                current = System.currentTimeMillis();
                for (int j = 0; j < nIts; j++) {
                    cublasSetVector(nRows * nCols, Sizeof.DOUBLE, Pointer.to(dh_A), 1, d_A, 1);
                    cublasSetVector(nCols * nCols, Sizeof.DOUBLE, Pointer.to(dh_C), 1, d_C, 1);
                    cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_T, nCols, nRows, pAlpha, d_A, nRows, pBeta, d_C, nCols);
                    cublasGetVector(nCols * nCols, Sizeof.DOUBLE, d_C, 1, Pointer.to(dh_C), 1);
                }
                current = System.currentTimeMillis() - current;
                System.out.println("CUBLAS2 DSYRK " + current);
                cudaFree(d_A);
                cudaFree(d_C);
        
//
                cudaMalloc(d_A, nRows * nCols * Sizeof.DOUBLE);
                cudaMalloc(d_B, nRows * nCols * Sizeof.DOUBLE);
                cudaMalloc(d_C, nCols * nCols * Sizeof.DOUBLE);
                current = System.currentTimeMillis();
                for (int j = 0; j < nIts; j++) {
                    cublasSetVector(nRows * nCols, Sizeof.DOUBLE, Pointer.to(dh_A), 1, d_A, 1);
                    cublasSetVector(nRows * nCols, Sizeof.DOUBLE, Pointer.to(dh_B), 1, d_B, 1);
                    cublasSetVector(nCols * nCols, Sizeof.DOUBLE, Pointer.to(dh_C), 1, d_C, 1);
                    cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, nCols, nCols, nRows, pAlpha, d_A, nRows, d_B, nRows, pBeta, d_C, nCols);
                    cublasGetVector(nCols * nCols, Sizeof.DOUBLE, d_C, 1, Pointer.to(dh_C), 1);
                }
                current = System.currentTimeMillis() - current;
                System.out.println("CUBLAS2 DGEMM " + current);
                cudaFree(d_A);
                cudaFree(d_B);
                cudaFree(d_C);
                cublasDestroy(handle);
            } catch (Exception e) {
                System.out.println(e);
            }
            nCols += 1;
        }
        JCublas.cublasShutdown();

    }

    private static double[] createRandomDoubleData(int n) {
        Random random = new Random();
        double x[] = new double[n];
        for (int i = 0; i < n; i++) {
            x[i] = random.nextDouble();
        }
        return x;
    }

}

*** Edit ***

My impression is that with the dimensions I have, data transfer is not the issue. Data transfer goes as nRows x nCols, while processing goes as nRows x nCols x nCols.
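
A back-of-the-envelope sketch of that scaling, using the first case from the output below (nRows = 40000, nCols = 200):

long nRows = 40000, nCols = 200;
long transferBytes = nRows * nCols * 8;        // doubles copied host to device per iteration, about 61 MB
long multiplyAdds = nRows * nCols * nCols;     // about 1.6e9 multiply/adds per SYRK/GEMM call
System.out.println("MB per copy   : " + transferBytes / (1024 * 1024));
System.out.println("multiply/adds : " + multiplyAdds);

So the copy is tens of megabytes per iteration, while each call does on the order of a billion multiply/adds, which is why I think the compute dominates.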

For example I get the following (note the jump when nCols goes to 209 - 208 is a multiple of 16):



Device properties:
    name=GeForce GT 640
    totalGlobalMem=4294967296
    sharedMemPerBlock=49152
    regsPerBlock=65536
    warpSize=32
    memPitch=2147483647
    maxThreadsPerBlock=1024
    maxThreadsDim=[1024, 1024, 64]
    maxGridSize=[2147483647, 65535, 65535]
    clockRate=797000
    totalConstMem=65536
    major=3
    minor=0
    textureAlignment=512
    texturePitchAlignment=32
    deviceOverlap=1
    multiProcessorCount=2
    kernelExecTimeoutEnabled=1
    integrated=0
    canMapHostMemory=1
    computeMode=cudaComputeModeDefault
    maxTexture1D=65536
    maxTexture1DMipmap=16384
    maxTexture1DLinear=134217728
    maxTexture2D=[65536, 65536]
    maxTexture2DMipmap=[16384, 16384]
    maxTexture2DLinear=[65000, 65000, 1048544]
    maxTexture2DGather=[16384, 16384]
    maxTexture3D=[4096, 4096, 4096]
    maxTexture3DAlt=[2048, 2048, 16384]
    maxTextureCubemap=16384
    maxTexture1DLayered=[16384, 2048]
    maxTexture2DLayered=[16384, 16384, 2048]
    maxTextureCubemapLayered=[16384, 2046]
    maxSurface1D=65536
    maxSurface2D=[65536, 32768]
    maxSurface3D=[65536, 32768, 2048]
    maxSurface1DLayered=[65536, 2048]
    maxSurface2DLayered=[65536, 32768, 2048]
    maxSurfaceCubemap=32768
    maxSurfaceCubemapLayered=[32768, 2046]
    surfaceAlignment=512
    concurrentKernels=1
    ECCEnabled=0
    pciBusID=1
    pciDeviceID=0
    pciDomainID=0
    tccDriver=0
    asyncEngineCount=1
    unifiedAddressing=1
    memoryClockRate=891000
    memoryBusWidth=128
    l2CacheSize=262144
    maxThreadsPerMultiProcessor=2048
    streamPrioritiesSupported=0
    globalL1CacheSupported=0
    localL1CacheSupported=1
    sharedMemPerMultiprocessor=49152
    regsPerMultiprocessor=65536
    managedMemory=1
    isMultiGpuBoard=0
    multiGpuBoardGroupID=0
    
rows 40000 iterations 5 numTasks 8 nCols 200
CUBLAS DSYRK 768
CUBLAS DGEMM 1219
CUBLAS2 DSYRK 765
CUBLAS2 DGEMM 1218
rows 39602 iterations 5 numTasks 8 nCols 201
CUBLAS DSYRK 746
CUBLAS DGEMM 1297
CUBLAS2 DSYRK 747
CUBLAS2 DGEMM 1261
rows 39211 iterations 5 numTasks 8 nCols 202
CUBLAS DSYRK 741
CUBLAS DGEMM 1296
CUBLAS2 DSYRK 740
CUBLAS2 DGEMM 1262
rows 38826 iterations 5 numTasks 8 nCols 203
CUBLAS DSYRK 742
CUBLAS DGEMM 1274
CUBLAS2 DSYRK 743
CUBLAS2 DGEMM 1190
rows 38446 iterations 5 numTasks 8 nCols 204
CUBLAS DSYRK 733
CUBLAS DGEMM 1190
CUBLAS2 DSYRK 731
CUBLAS2 DGEMM 1168
rows 38072 iterations 5 numTasks 8 nCols 205
CUBLAS DSYRK 734
CUBLAS DGEMM 1209
CUBLAS2 DSYRK 732
CUBLAS2 DGEMM 1155
rows 37703 iterations 5 numTasks 8 nCols 206
CUBLAS DSYRK 735
CUBLAS DGEMM 1209
CUBLAS2 DSYRK 735
CUBLAS2 DGEMM 1176
rows 37340 iterations 5 numTasks 8 nCols 207
CUBLAS DSYRK 733
CUBLAS DGEMM 1193
CUBLAS2 DSYRK 733
CUBLAS2 DGEMM 1151
rows 36982 iterations 5 numTasks 8 nCols 208
CUBLAS DSYRK 723
CUBLAS DGEMM 1174
CUBLAS2 DSYRK 729
CUBLAS2 DGEMM 1152
rows 36629 iterations 5 numTasks 8 nCols 209
CUBLAS DSYRK 762
CUBLAS DGEMM 1366
CUBLAS2 DSYRK 762
CUBLAS2 DGEMM 1280


*** Edit ***

My back of the envelope makes this around 12.6 GFlops, if you include multiply and add operations.

1,600,000,000 multiply/adds repeated 5 times takes 1.2 secs.

So that is about 6.4 billion multiply/adds per second, i.e. roughly 12.6 GFLOPS if you count the multiply and the add as separate flops.
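
The same estimate as a small sketch, using the rounded figures from above (so the result is only approximate):

long multiplyAdds = 1_600_000_000L;            // roughly nRows * nCols * nCols per DGEMM call
double secondsPerCall = 1.2 / 5;               // the 5 iterations above took about 1.2 s
double gflops = 2.0 * multiplyAdds / secondsPerCall / 1e9;
System.out.printf("~%.1f GFLOPS%n", gflops);   // about 13 with these rounded inputs - same ballpark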

Seems very far short of other benchmarks I have seen.

*** Edit ***

GTX Titan (not the X) gives 1306 GFLOPS on DGEMM, according to "Titan’s Compute Performance (aka Ph.D Lust)" in NVIDIA’s GeForce GTX Titan Review, Part 2: Titan’s Performance Unveiled.

If correct, that is the kind of performance I need. But I’m not going to go out and spend $1000 without some due diligence.

Just a short note: I’ll run the benchmark later today, it will be on a GTX 970

Thanks. Main issue seems to be that fp64 performance varies widely, and is not often included in performance comparisons. The number of fp64 units compared to fp32 seems to vary a lot. I guess most of us here are looking at compute performance, whereas the main market (unless you are a government research lab) is fp32 graphics.

A quick look shows the GTX 970 has fp64 at 1/32 of fp32, whereas the Titan has 1/3 of fp32. Big difference. The Titan also seems cheaper.

I’m not sure how you derived the numbers for the FP64/32 relations (or, if you just found them - like me - on sites like this one: How they did their benchmarks). But I agree, double precision is simply not required in most purely graphical applications, and it obviously took (and still takes) a while until the demand for double precision in purely compute-based applications causes a shift in the focus of the GPU manufacturers.

First of all, I did a quick run of your benchmark on the GTX 970


Device properties:
    name=GeForce GTX 970
    totalGlobalMem=4294967296
    sharedMemPerBlock=49152
    regsPerBlock=65536
    warpSize=32
    memPitch=2147483647
    maxThreadsPerBlock=1024
    maxThreadsDim=[1024, 1024, 64]
    maxGridSize=[2147483647, 65535, 65535]
    clockRate=1253000
    totalConstMem=65536
    major=5
    minor=2
    textureAlignment=512
    texturePitchAlignment=32
    deviceOverlap=1
    multiProcessorCount=13
    kernelExecTimeoutEnabled=1
    integrated=0
    canMapHostMemory=1
    computeMode=cudaComputeModeDefault
    maxTexture1D=65536
    maxTexture1DMipmap=16384
    maxTexture1DLinear=134217728
    maxTexture2D=[65536, 65536]
    maxTexture2DMipmap=[16384, 16384]
    maxTexture2DLinear=[65000, 65000, 1048544]
    maxTexture2DGather=[16384, 16384]
    maxTexture3D=[4096, 4096, 4096]
    maxTexture3DAlt=[2048, 2048, 16384]
    maxTextureCubemap=16384
    maxTexture1DLayered=[16384, 2048]
    maxTexture2DLayered=[16384, 16384, 2048]
    maxTextureCubemapLayered=[16384, 2046]
    maxSurface1D=65536
    maxSurface2D=[65536, 32768]
    maxSurface3D=[65536, 32768, 2048]
    maxSurface1DLayered=[65536, 2048]
    maxSurface2DLayered=[65536, 32768, 2048]
    maxSurfaceCubemap=32768
    maxSurfaceCubemapLayered=[32768, 2046]
    surfaceAlignment=512
    concurrentKernels=1
    ECCEnabled=0
    pciBusID=1
    pciDeviceID=0
    pciDomainID=0
    tccDriver=0
    asyncEngineCount=2
    unifiedAddressing=1
    memoryClockRate=3505000
    memoryBusWidth=256
    l2CacheSize=1835008
    maxThreadsPerMultiProcessor=2048
    streamPrioritiesSupported=0
    globalL1CacheSupported=0
    localL1CacheSupported=0
    sharedMemPerMultiprocessor=98304
    regsPerMultiprocessor=65536
    managedMemory=1
    isMultiGpuBoard=0
    multiGpuBoardGroupID=0

The results for the range that you referred to (around 208) are here:


rows 37340 iterations 5 numTasks 4 nCols 207
CUBLAS DSYRK 266
CUBLAS DGEMM 375
CUBLAS2 DSYRK 281
CUBLAS2 DGEMM 375
rows 36982 iterations 5 numTasks 4 nCols 208
CUBLAS DSYRK 281
CUBLAS DGEMM 375
CUBLAS2 DSYRK 282
CUBLAS2 DGEMM 375
rows 36629 iterations 5 numTasks 4 nCols 209
CUBLAS DSYRK 266
CUBLAS DGEMM 391
CUBLAS2 DSYRK 281
CUBLAS2 DGEMM 391

There are some caveats, however. Although the relative time for the memory transfers should become smaller, they probably should not be neglected completely. I considered extending the benchmark (in general, the current JCuda samples are lacking some benchmarks anyhow).

Such extensions could involve

  • separate timings for compute and memory transfer
  • easily comparing the float- and double performance
  • computing the GFLOPS automatically
  • varying step sizes and parameters
  • … (infinitely many points could be listed here, but this would be a start)…

Note that the actual performance may also (heavily) depend on the parameters that I mentioned. I’m pretty sure that the peak performances (that tend to be reported in benchmarks, by the manufacturers) are only reached for very specific constellations. Particularly, I’m pretty sure that these GFLOPS will usually refer to square matrices. In this case, the “transposition parameter” can always be chosen to match the “perfect” memory layout (e.g. 'N' for the first matrix and … maybe 'T' for the second one? This is something that could nicely be analyzed with an extended benchmark…). I’m also pretty sure that in these benchmarks, they will only use matrices that have the perfect “power of two” sizes for the respective block/grid size. I once had access to the source code of CUBLAS 1.x and saw that they employed quite some tricks in order to cover all cases as efficiently as possible, but in general, “perfect performance” likely requires the “perfect input” for the target device.

I’ll try to allocate some time for such an extended benchmark; this could indeed bring some interesting insights. (There are some other items in my queue … related ones (Java Bindings for clBLAS, for example), but also unrelated ones … still, I guess a first version of such a benchmark should not take too long).

Interesting. The speedups you found wouldn’t make me go out and purchase a 970. I think I got the fp64/fp32 figures from anandtech.com - let me check and put in links… here we go:

AMD Radeon and NVIDIA GeForce FP32/FP64 GFLOPS Table | Geeks3D

NVIDIA’s GeForce GTX Titan Review, Part 2: Titan’s Performance Unveiled

The NVIDIA GeForce GTX 970 Review: Featuring EVGA

I think what surprises me is the difficulty I had finding any decent benchmarks for DGEMM. I would have expected to find benchmark figures for most GPUs with a variety of rows/columns. Presumably Oak Ridge etc. did lots of benchmarks, but they are not public.

It would be great if there were a rich JCuda user with a Titan who could try the benchmark. Note, again, the Titan is very different from the Titan X. A Titan Z would be nice - I think it is two Titans.

Yes, I thought about adding fp32 benchmarks, but they aren’t really of much interest to me. Easy enough to do.

Yes, a lot of tests seem to be about square matrices. My matrices are 150,000 x 200 or thereabouts.

But really this is not a JCuda or Java issue (at least I hope it isn’t), it should be of interest to any CUDA developer.

Not sure what my next step is - earn some money so I can afford a Titan?

A bit of investigation shows M2090 prices seem to have fallen drastically, so this might be a good compromise.

Apart from the fact that cooling and the power supply are going to be an issue… it is not designed for desktops but for servers!

Admittedly, I don’t have an overview of all the cards and their versions and street prices (and the sometimes confusing marketing names - you mentioned „Titan X“, but I’m not even sure whether this really is what is called the „GTX Titan“ elsewhere…)

Of course, when it comes to new CUDA versions, or even new Compute Capability or GL versions, or other „significant“ innovations, that may be interesting, but … as long as the difference between two cards is „only“ the number of cores, the clock frequency and/or the memory and its bandwidth, then it is not sooo relevant for me. I cannot afford to (and simply do not want to) buy a new card every few months just because „it is faster“. (My previous one was a GeForce 8800, and there, the lack of functionalities and the low CC became too pressing, obviously - I couldn’t even test the CC 2.0 features…).

The tables e.g. on Wikipedia contain loads of information (unfortunately, not in a consistent format for the different series, so it’s a hassle to compare them). Regarding double precision and its relative speed compared to single precision, I stumbled over the statement

Double precision performance of the GTX Titan & GTX Titan Black is either 1/3 or 1/24 of single-precision performance depending on a user-selected configuration option in the driver that boosts single-precision performance if double-precision is set to 1/24 of single-precision performance,[9] while other Kepler chips’ double precision performance is fixed at 1/24 of single-precision performance.[10] GeForce 700 series Maxwell chips’ double precision performance is 1/32 of single-precision performance.[11]

(from GeForce 700 series - Wikipedia )

However, I might even soon get access to an M6000 card in my office. This is one of these „Quadro“ cards, and coincidentally, „6000“ is not only the name, but also the price (in €uros!) of the card :rolleyes: (Sounds like a lot of burnt money to me, but it may be justified for some use cases…)

Regarding the benchmarks: It’s indeed a bit surprising that so little information/comparisons can be found for the different cards. I’m not so involved here either, and could only speculate about the reasons.

  • Benchmarking and performance tweaking is an art
  • For a comparison of the cards, one needs to have all the cards available
  • Most GEMM benchmarks seem to aim at comparing libraries, and not so much hardware…

  • The latter might be relevant here indirectly: Assuming that one can say how much of the theoretical peak performance (aka FLOPS) a library (like CUBLAS) can achieve on a certain card, then one could (theoretically) compute the FLOPS that it would achieve on a different card, taking into account the clock frequency and number of cores (but of course, it might become trickier when also considering e.g. the memory bandwidth or so). A small sketch of this scaling idea follows right after this list.
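
A very rough sketch of that scaling idea (purely illustrative; the helper name is made up, and it deliberately ignores memory bandwidth, fp64 unit ratios and all the other caveats):

// Hypothetical helper: scale a measured GFLOPS value from one card to another,
// using only the number of cores and the clock frequency
static double estimateGflops(double measuredGflops,
        int coresMeasured, double clockMeasuredMHz,
        int coresTarget, double clockTargetMHz)
{
    return measuredGflops
        * (coresTarget * clockTargetMHz)
        / (coresMeasured * clockMeasuredMHz);
}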

The list on the Geeks3D website looks extensive, but also seems to be only a compilation of the theoretical flops, and not created with real benchmarks.

A while ago, I started creating a small benchmarking library (not targeting CUDA, just so…). I’ll see whether I can polish it during the weekend to become usable. Otherwise, I’ll try to extend the existing benchmark manually in the way that I described above.

Yes, Titan X and Titan are completely different, but I’m not sure how Titan and Titan Black relate. All very confusing. Very difficult to get sensible information. Double precision floating point seems particularly difficult to compare.

I do find it surprising that nobody in the CUDA community has set up suitable benchmarks and gathered performance results. Surely not that difficult, just needs a bit of organising. ‘Here is a benchmark, please run it, return the output results, and provide the following information about the hardware’. Simple. Not sure why NVidia don’t do it themselves. At least run SGEMM and DGEMM on every card with some version of CUBLAS, and show results for different matrix sizes. I would even accept all square matrices. Just Do It.

*** Edit ***

Having said all this, and having considered various alternatives including the power supply and the actual provision of graphics (which the M2090 doesn’t do - it is purely compute), I think I will wait a bit for prices to fall, and my preferred option would be a Titan (not X). It is a strange mixture of compute and consumer card. I would need to add a power supply to my desktop box. I have also read that mixing cards in a single setup can cause problems. If I am going to change, it has to be worth the effort and time and hassle.

Fun couple of days (not) learning about these things. I hate computers. You have a bright idea, then a few days later you have spent a lot of money and nothing works at all, not a beep. In my early days I even tried to build a computer (NASCOM I think) and I just about got a flicker on the television screen. Nightmare.

Again, I think comparisons on this level mainly aim at comparing libraries (as long as we’re not talking about “standardized” / “reference” implementations of benchmarks like LAPACK). Most graphics cards are (albeit being used as workhorses) still viewed as consumer products, and the performance is not so much measured in FLOPS (and even less double-precision FLOPS), but rather in FPS or arbitrary 3DMark scores.

It’s also clear that there are caveats for an objective comparison. On the one hand, one would have to make sure to always use the same CUBLAS version, to make the results comparable (although neglecting possible other influences, like the OS or host system). But on the other hand, one might ask: If there is a new feature in CUDA/CUBLAS that makes GEMM on certain high-end cards even faster (e.g. some larger shared memory or cache or whatnot), why shouldn’t it be taken into account? One could say: The cards ARE so fast, why throttle them down with an old library?

In any case, nobody will take seriously any result that was created with a JCublas-based benchmark, because it adds another layer of indirection. But I’m curious nevertheless.

I agree. But some process which gathers information and makes it available would surely be useful. It isn’t a technical problem; you just have to be careful to give all the necessary information (library version, card version, card settings, etc.) and put it out there. It could include the Intel Phi as well, and multi-threaded Xeons.

I think the issue is more of a business problem - too many competing interests, information is power, consultants want to be paid to evaluate, etc. etc.

I would pay $10 for access to such a database if it included DGEMM. Are there 100,000 people like me? Probably not!

*** Edit ***

At least Intel makes some attempt.

Intel® Xeon Phi™ Coprocessor SGEMM* and DGEMM* Throughput

Quite impressive figures, as far as I can see at a quick glance.

*** Edit ***

I found this

http://wgropp.cs.illinois.edu/courses/cs598-s15/lectures/lecture03.pdf

which led me to this

HPC Challenge Benchmark Results - Condensed Results - Base and Optimized Runs - 372 Systems - Generated on Sat Dec 5 06:00:44 2015

but there don’t seem to be many desktop systems here, and I didn’t see any nVidia figures.

*** Edit ***

These are more for systems rather than individual consumer cards and single GPUs!

*** Edit ***

Also this

but again not much mention of nVidia

*** Edit ***

OK, I’m still learning - the M2090 is not designed for a workstation; it has no graphics output. But the C series is - C2050, C2070, C2075.

NVIDIA Tesla Product Literature - overviews and documentation computing products | NVIDIA

*** Edit ***

Still rather expensive!

*** Edit ***

Maybe Quadro is the route

http://www.anandtech.com/show/9096/nvidia-announces-quadro-m6000-quadro-vca-2015

fp64 looks interesting

*** Edit ***

Apart from the prices… on eBay, the 6000 is £300 upwards, the K5200 £860 upwards, and the K6000 £1400 upwards!

The 6000 only has 448 cores, although its fp64 is 1/2 of fp32.

I found a Titan Black on eBay for just over £400, so I bought it. It should arrive Wednesday, so after I purchase a bigger power supply and install it I will have some benchmarks. It seemed a pretty good deal. My Christmas present to myself. Yes, I know, sad, but I have other presents lined up as well…

SSDs? RAM modules? :smiley:

During the weekend, I continued a bit with my „benchmarking library“, but am not yet entirely sure where this will go. The focus there was not so much on timing - at least, not primarily. For these cases, there are Caliper, JMH & Co. This library originally aimed at benchmarking algorithms, in terms of input/output parameters and „qualities“. But coincidentally, this may also serve well for timing in CUDA, because there, the timings will not be computed „externally“, by some framework that measures the execution time of some method, but with internal mechanisms - namely, CUDA events.

However, in the meantime, I also created a first sketch of a benchmark with JCublas. Compared to what you originally posted… well, there are not sooo many similarities.

  • Obligatory cleanups (methods, comments…)
  • Removed Driver API parts. Should not be necessary.
  • Using only JCublas2. There should not be much difference between JCublas and JCublas2 performance-wise, but JCublas2 is the „new, official“ NVIDIA CUBLAS API, so I’d use it…
  • Timing with CUDA events
  • Separate timing of compute and memory transfers
  • Summaries with bandwidths (MB/s, Host-to-Device and vice versa) and GFLOPs

However, there are several degrees of freedom, and I’m not sure about some aspects (e.g. the FLOPS computation for DSYRK). It would be interesting to see the influence of different memory layouts, the transposed and non-transposed cases etc., and the single-precision versions are still missing (I’d add them when the parametrizations of the double-precision ones are clearer), but… it’s a start, and may contain some building blocks that could be used for a „real“ benchmark application.

package jcuda.jcublas.misc;

import static jcuda.jcublas.JCublas2.cublasCreate;
import static jcuda.jcublas.JCublas2.cublasDestroy;
import static jcuda.jcublas.JCublas2.cublasDgemm;
import static jcuda.jcublas.JCublas2.cublasDsyrk;
import static jcuda.jcublas.cublasFillMode.CUBLAS_FILL_MODE_UPPER;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_N;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_T;
import static jcuda.runtime.JCuda.cudaEventCreate;
import static jcuda.runtime.JCuda.cudaEventDestroy;
import static jcuda.runtime.JCuda.cudaEventElapsedTime;
import static jcuda.runtime.JCuda.cudaEventRecord;
import static jcuda.runtime.JCuda.cudaEventSynchronize;
import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaMalloc;
import static jcuda.runtime.JCuda.cudaMemcpy;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToHost;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

import java.util.Random;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas2;
import jcuda.jcublas.cublasHandle;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaDeviceProp;
import jcuda.runtime.cudaEvent_t;

public class JCublasBenchmarks01
{
    /**
     * The entry point of this benchmark
     * 
     * @param args Not used
     */
    public static void main(String[] args)
    {
        // Enable exceptions and omit subsequent error checks
        JCuda.setExceptionsEnabled(true);
        JCublas2.setExceptionsEnabled(true);

        System.getProperties().list(System.out);
        printDeviceInformation();
        
        int iterations = 5;
        for (int cols = 100; cols <= 1000; cols += 100)
        {
            for (int rows = 100; rows <= 1000; rows += 100)
            {
                testDsyrk(rows, cols, iterations);
                testDgemm(rows, cols, cols, iterations);
            }
        }
    }
    
    /**
     * Helper class to collect timing information
     */
    static class Timing
    {
        long bytesHtoD;
        float msHtoD;
        
        long ops;
        float msCall;
        
        long bytesDtoH;
        float msDtoH;
        
        float msJavaMeasure;
        
        void print()
        {
            float mbPerS_HtoD = (float) bytesHtoD / msHtoD / 1024;
            float mbPerS_DtoH = (float) bytesDtoH / msDtoH / 1024;
            
            System.out.printf("Bytes     H->D : %14d
", bytesHtoD);
            System.out.printf("Time (ms) H->D : %14.3f
", msHtoD);
            System.out.printf("MB/s      H->D : %14.3f
", mbPerS_HtoD);

            System.out.printf("Ops            : %14d
", ops);
            System.out.printf("Time (ms)      : %14.3f
", msCall);
            System.out.printf("GFLOPS         : %14.3f
", getGFLOPS());
            
            System.out.printf("Bytes     D->H : %14d
", bytesDtoH);
            System.out.printf("Time (ms) D->H : %14.3f
", msDtoH);
            System.out.printf("MB/s      D->H : %14.3f
", mbPerS_DtoH);
            
            float totalMs = msHtoD + msCall + msDtoH;
            System.out.printf("Total time (ms): %14.3f
", totalMs);
            System.out.printf("     (Java (ms): %14.3f)
", msJavaMeasure);
        }
        
        float getGFLOPS()
        {
            float gflops = (float)ops / msCall / 1e6f;
            return gflops;
        }
    }
    
    /**
     * Print information about all available CUDA devices
     */
    private static void printDeviceInformation()
    {
        int deviceCount[] = { 0 };
        JCuda.cudaGetDeviceCount(deviceCount);
        if (deviceCount[0] == 0)
        {
            System.err.println("No devices found");
            System.exit(1);
        }
        System.out.println("Found "+deviceCount[0]+" devices");
        
        for (int device = 0; device < deviceCount[0]; device++)
        {
            System.out.println("Properties of device " + device + ":");
            cudaDeviceProp deviceProperties = new cudaDeviceProp();
            JCuda.cudaGetDeviceProperties(deviceProperties, device);
            System.out.println(deviceProperties.toFormattedString());
        }
    }
    
    
    private static void testDsyrk(
        int rows, int cols, int iterations)
    {
        // Create the CUBLAS handle
        cublasHandle handle = new cublasHandle();
        cublasCreate(handle);

        // Create the events for the time measures
        cudaEvent_t startEvent = new cudaEvent_t();
        cudaEvent_t stopEvent = new cudaEvent_t();
        cudaEventCreate(startEvent);
        cudaEventCreate(stopEvent);

        // The (constant) parameters passed to the main call
        final Pointer alpha = Pointer.to(new double[] { 1.0d });
        final Pointer beta = Pointer.to(new double[] { 0.0d });
        
        // Create the host data
        double h_A[] = createRandomDoubleData(rows * cols);
        double h_C[] = new double[rows * rows];

        // Create the device data
        Pointer d_A = new Pointer();
        Pointer d_C = new Pointer();
        cudaMalloc(d_A, h_A.length * Sizeof.DOUBLE);
        cudaMalloc(d_C, h_C.length * Sizeof.DOUBLE);

        // Variables for the time measures
        float[] ms =  { 0.0f };
        Timing timing = new Timing();

        long beforeNs = System.nanoTime();
        for (int i=0; i<iterations; i++)
        {
            // Copy the host data to the device, recording the time
            cudaEventRecord(startEvent, null);
            cudaMemcpy(d_A, Pointer.to(h_A), 
                rows * cols * Sizeof.DOUBLE,
                cudaMemcpyHostToDevice);
            cudaEventRecord(stopEvent, null);
            cudaEventSynchronize(stopEvent);
            cudaEventElapsedTime(ms, startEvent, stopEvent);
            timing.msHtoD += ms[0];
            timing.bytesHtoD += (long)rows * cols * Sizeof.DOUBLE;
    
            // Perform the main call, recording the time
            cudaEventRecord(startEvent, null);
            cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_T, 
                rows, cols, alpha, d_A, cols, beta, d_C, rows);
            cudaEventRecord(stopEvent, null);
            cudaEventSynchronize(stopEvent);
            cudaEventElapsedTime(ms, startEvent, stopEvent);
            timing.msCall += ms[0];
            timing.ops += (long)rows * rows * cols;
    
            // Copy the device data to the host, recording the time
            cudaEventRecord(startEvent, null);
            cudaMemcpy(Pointer.to(h_C), d_C, 
                rows * rows * Sizeof.DOUBLE,
                cudaMemcpyDeviceToHost);
            cudaEventRecord(stopEvent, null);
            cudaEventSynchronize(stopEvent);
            cudaEventElapsedTime(ms, startEvent, stopEvent);
            timing.msDtoH += ms[0];
            timing.bytesDtoH += (long)rows * rows * Sizeof.DOUBLE;
        }
        long afterNs = System.nanoTime();
        timing.msJavaMeasure = (afterNs - beforeNs) / 1e6f;

        // Clean up
        cudaFree(d_A);
        cudaFree(d_C);
        cudaEventDestroy(startEvent);
        cudaEventDestroy(stopEvent);
        cublasDestroy(handle);
        
        System.out.printf(
            "DSYRK, (%5d x %5d) * (%5d x %5d) = (%5d x %5d), " + 
            "%3d iterations: %14.3f GFLOPS
",
            rows, cols, cols, rows, rows, rows, 
            iterations, timing.getGFLOPS());
        //timing.print();
    }
    
    
    private static void testDgemm(
        int rowsA, int colsA, int colsB, int iterations)
    {
        int rowsB = colsA;
        int rowsC = rowsA;
        int colsC = colsB;
        
        // Create the CUBLAS handle
        cublasHandle handle = new cublasHandle();
        cublasCreate(handle);

        // Create the events for the time measures
        cudaEvent_t startEvent = new cudaEvent_t();
        cudaEvent_t stopEvent = new cudaEvent_t();
        cudaEventCreate(startEvent);
        cudaEventCreate(stopEvent);

        // The (constant) parameters passed to the main call
        final Pointer alpha = Pointer.to(new double[] { 1.0d });
        final Pointer beta = Pointer.to(new double[] { 0.0d });
        
        // Create the host data
        double h_A[] = createRandomDoubleData(rowsA * colsA);
        double h_B[] = createRandomDoubleData(rowsB * colsB);
        double h_C[] = new double[rowsC * colsC];

        // Create the device data
        Pointer d_A = new Pointer();
        Pointer d_B = new Pointer();
        Pointer d_C = new Pointer();
        cudaMalloc(d_A, h_A.length * Sizeof.DOUBLE);
        cudaMalloc(d_B, h_B.length * Sizeof.DOUBLE);
        cudaMalloc(d_C, h_C.length * Sizeof.DOUBLE);

        // Variables for the time measures
        float[] ms =  { 0.0f };
        Timing timing = new Timing();

        long beforeNs = System.nanoTime();
        for (int i=0; i<iterations; i++)
        {
            // Copy the host data to the device, recording the time
            cudaEventRecord(startEvent, null);
            cudaMemcpy(d_A, Pointer.to(h_A), 
                rowsA * colsA * Sizeof.DOUBLE,
                cudaMemcpyHostToDevice);
            cudaEventRecord(stopEvent, null);
            cudaEventSynchronize(stopEvent);
            cudaEventElapsedTime(ms, startEvent, stopEvent);
            timing.msHtoD += ms[0];
            timing.bytesHtoD += (long)rowsA * colsA * Sizeof.DOUBLE;
            
            cudaEventRecord(startEvent, null);
            cudaMemcpy(d_B, Pointer.to(h_B), 
                rowsB * colsB * Sizeof.DOUBLE,
                cudaMemcpyHostToDevice);
            cudaEventRecord(stopEvent, null);
            cudaEventSynchronize(stopEvent);
            cudaEventElapsedTime(ms, startEvent, stopEvent);
            timing.msHtoD += ms[0];
            timing.bytesHtoD += (long)rowsB * colsB * Sizeof.DOUBLE;
            
            cudaEventRecord(startEvent, null);
            cudaMemcpy(d_C, Pointer.to(h_C), 
                rowsC * colsC * Sizeof.DOUBLE,
                cudaMemcpyHostToDevice);
            cudaEventRecord(stopEvent, null);
            cudaEventSynchronize(stopEvent);
            cudaEventElapsedTime(ms, startEvent, stopEvent);
            timing.msHtoD += ms[0];
            timing.bytesHtoD += (long)rowsC * colsC * Sizeof.DOUBLE;
    
            // Perform the main call, recording the time
            cudaEventRecord(startEvent, null);
            // Note: cublasDgemm expects (m, n, k) = (rowsA, colsB, colsA) here
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 
                rowsA, colsB, colsA, alpha, d_A, rowsA, 
                d_B, rowsB, beta, d_C, rowsC);
            cudaEventRecord(stopEvent, null);
            cudaEventSynchronize(stopEvent);
            cudaEventElapsedTime(ms, startEvent, stopEvent);
            timing.msCall += ms[0];
            timing.ops += (long)rowsA * colsA * colsB * 2;
    
            // Copy the device data to the host, recording the time
            cudaEventRecord(startEvent, null);
            cudaMemcpy(Pointer.to(h_C), d_C, 
                rowsC * colsC * Sizeof.DOUBLE,
                cudaMemcpyDeviceToHost);
            cudaEventRecord(stopEvent, null);
            cudaEventSynchronize(stopEvent);
            cudaEventElapsedTime(ms, startEvent, stopEvent);
            timing.msDtoH += ms[0];
            timing.bytesDtoH += (long)rowsC * colsC * Sizeof.DOUBLE;
        }
        long afterNs = System.nanoTime();
        timing.msJavaMeasure = (afterNs - beforeNs) / 1e6f; 

        // Clean up
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        cudaEventDestroy(startEvent);
        cudaEventDestroy(stopEvent);
        cublasDestroy(handle);
        
        System.out.printf(
            "DGEMM, (%5d x %5d) * (%5d x %5d) = (%5d x %5d), " + 
            "%3d iterations: %14.3f GFLOPS
",
            rowsA, colsA, rowsB, colsB, rowsC, colsC, 
            iterations, timing.getGFLOPS());
        //timing.print();
    }
    

    private static double[] createRandomDoubleData(int n)
    {
        Random random = new Random();
        double x[] = new double[n];
        for (int i = 0; i < n; i++)
        {
            x[i] = random.nextDouble();
        }
        return x;
    }

}

I have plenty of RAM, and there is no disc access in my application. But I have to be careful object creation and garbage collection.

The benchmark looks a lot better thought out than my attempt in terms of code. But I would increase the maximum row and column size by quite a lot (up to maybe 50,000). I tried to vary things so that the number of ops was the same (nrow * ncolumn * ncolumn), so that the timings (number of flops) were similar - surely nobody is interested in 100x100? Hence I had a fixed number (nrow * ncolumn * ncolumn) and varied the row number and calculated the column number, or vice versa.

So I would radically rethink your overall structure of


   for (int cols = 100; cols <= 1000; cols += 100)
        {
            for (int rows = 100; rows <= 1000; rows += 100)
            {
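
Something along these lines instead - a sketch of the constant-work parametrization, mirroring what my original code did (the method names are the ones from your benchmark, and iterations is assumed to be defined):

long total = 40000L * 200 * 200;                 // fixed number of multiply/adds per call
for (int cols = 200; cols < 1000; cols++) {
    int rows = (int) (total / ((long) cols * cols));
    testDsyrk(rows, cols, iterations);
    testDgemm(rows, cols, cols, iterations);
}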

The flops for syrk are 2 * nrows * ncolumns * (ncolumns+1) / 2 if you count a multiply and an add as separate flops. For each element of the resulting upper triangular matrix, you have to do a dot product of two arrays of length nrows. There are ncolumns * (ncolumns+1) / 2 upper triangular elements.

If you did the equivalent using gemm, it would be 2 * nrows * ncolumns * ncolumns.

The data transfer is nrows*ncolumns.

I did think about doing a whole bunch of tests and then doing a regression against nrows * ncolumns and nrows * ncolumns * (ncolumns+1). This would then give you a formula for timings based on (a) the overhead, (b) the data transfer and (c) the calculation.
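
In code, the op counts I mean would look something like the following sketch (a multiply and an add counted as separate flops; whether you count the triangle or the full matrix is exactly the convention that has to be fixed for the benchmark):

static long syrkFlops(long nRows, long nCols) {
    // one dot product of length nRows per upper-triangular element,
    // and there are nCols * (nCols + 1) / 2 such elements
    return 2L * nRows * nCols * (nCols + 1) / 2;
}

static long gemmFlops(long nRows, long nCols) {
    // the full nCols x nCols result computed via GEMM
    return 2L * nRows * nCols * nCols;
}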

This stuff is bread and butter for me, been doing it for nearly 40 years.

5 is also a bit low for the number of iterations.

There’s nothing better than RAM. Except for… more RAM :smiley: (I also grabbed 32GB, just to be able to casually throw in some java -Xmx25g ... here and there)

Of course, the code until now was only a start. I tried to isolate the two functions so that they may be called with different parameters (although some parameters are still missing) and print out the desired timing information. And I already suspected that my maths for the DSYRK FLOPs was wrong, but did not study the algorithm in detail.

Concerning the parametrization: I just chose some parameters that allowed me to quickly test it and get some first results. I think 100x100 may be reasonable, but of course, more cases have to be tested.

[ot]
In fact, this benchmarking lib aims at supporting exactly that: Parametrizations and parameter ranges. E.g. something like

List<ValueSet> valueSets = ValueSets.create()
    .interpolate("rows", 1000, 4000, 300)
    .derive("cols", "rows", r -> 40000 / (Integer)r)
    .interpolate("alpha", 0.0, 1.0, 0.25)
    .interpolate("beta", 0.0, 1.0, 0.25)
    .build();

would generate „ValueSets“ (i.e. parameters) where

  • „rows“ ranges from 1000 to 4000 (in steps of 300)
  • „cols“ is always „40000/rows“,
  • „alpha“ and „beta“ range from 0.0 to 1.0 (in steps of 0.25)

Yeah, well, this would build the cartesian product of these domains - so quite a lot of benchmark runs… The results could be collected and dumped into a CSV, each result possibly consisting of multiple values (e.g. the D->H/Call/H->D timings that I already added). I’m not sure whether this lib will eventually make any sense, but this CUBLAS benchmark at least serves as a nice potential use case, regardless of whether I’ll use it for that or not.
[/ot]

I’ll update the FLOPS computation for the syrk case, and think about the parameterizations, but … not today, probably…

OK, that looks good. Before throwing RAM at a problem I always try to minimise memory needs. In Java, things like reusing objects are very important. I not only reuse thread pools, I reuse arrays, and expand them when necessary. It is a big mistake to specify too much memory; you will spend time garbage collecting.

When I was benchmarking Java concurrency, I found I got a lot of performance benefit from very careful implementation. Try to avoid creation of classes whenever possible, it makes a difference. Make a pool of classes which will be executed, don’t create a new class each time you need a new thread. Etc.

Yes, it is time for bed. Yes, CUDA benchmarking and GPU card benchmarking should not be the responsibility of JCUDA. Can you make a business out of benchmarking and make some money?

Some off-topic here as well, but:

It depends. Excessively and blindly pooling things should probably not be the „baseline“ of the development strategy. Java has a garbage collector, and it is good, and it is becoming better. Java has escape analysis, and it is good, and it is becoming better. It has been a while since Brian Goetz wrote in Java theory and practice: Urban performance legends, revisited:

Allocation is faster than you think, and getting faster

Object pooling is now a serious performance loss for all but the most heavyweight of objects, and even then it is tricky to get right without introducing concurrency bottlenecks.

But even though Brian Goetz certainly knows his stuff, I have observed cases where avoiding allocations significantly increased the performance. And this refers to simple cases that one could think should be covered by the Escape Analysis. Creating and throwing away millions of simple objects like „Point2D“ or „Double“ (!) can lead to unacceptable „Full GC“ pauses. But I guess your „pooling“ did not refer to these cases. For larger, heavyweight objects, there may be cases where one could consider some sort of pooling, hand in hand with a careful analysis of whether it is really advantageous.

As you might guess: The purpose of JCuda was not to make money. (I’m not sure what its purpose was, though). There certainly are companies that turned benchmarking into a business model. A careful and reliable performance analysis takes time, many resources (hardware), and lots of effort, and it’s hard (or even impossible) to convince people that the results are really reliable (and not distorted by, say, things like garbage collection ;)). So the benchmarks that can be created for CUDA based on JCuda will at best be considered as „hints“, or rough indications (even though I personally think that the pure execution time of a GEMM, especially when it is measured with CUDA events, is largely independent of the thin Java/JNI layer that is wrapped around the actual call).

Totally off topic - yes, I have seen, for example in 2D graphics, how Double or Point2D creation can have a significant effect. That is why I like to have a ‘cache’ of an array which is expanded and recreated when necessary. But I guess I was talking about multithreading, where you have a pool of 8 threads, and you submit thousands of tasks to it. You first of all want a subtask which is defined as a class, and not created on the fly (I am terrible at Java terminology - I mean don’t do

        pool.submit(new Callable() {

but have an explicit callable class, maybe an array of callables, and instantiate it. These little things can be quite important if each task is quite short.
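
Roughly what I mean is something like this (an illustrative sketch only, not code from my application):

import java.util.concurrent.Callable;

// One reusable task per worker, reconfigured instead of re-created for every submission
class SumTask implements Callable<Double> {
    private double[] data;
    private int from, to;

    SumTask configure(double[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
        return this;
    }

    @Override
    public Double call() {
        double sum = 0;
        for (int i = from; i < to; i++) {
            sum += data[i];
        }
        return sum;
    }
}

and then submit the same instances again and again via pool.submit(task.configure(...)), rather than allocating a new anonymous Callable per submission (taking care, of course, not to reconfigure a task while it is still running).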

I still say take care of memory and performance will take care of itself.

It depends, and it is difficult. On the one hand, when measuring the performance really shows that, e.g. GC pauses are reduced, there’s hardly a doubt. I think the Point2D example is a nice one, because I thought about this particular one a lot. Modeling a nice Point2D or Point3D class that is easy and convenient to use, „nicely“ designed in OOP terms (also regarding mutability and immutability), and efficient at the same time is challenging. For the simple case, one could create methods like

class Point2D {
    double x, y;

    // create and return a new point
    Point2D add(Point2D p) { Point2D r = new Point2D(); r.x = x + p.x; r.y = y + p.y; return r; }
    Point2D mul(double d) { Point2D r = new Point2D(); r.x = x * d; r.y = y * d; return r; }
}

in order to allow nice, convenient call chains like

Point2D result = point.add(other).mul(2).add(yetAnother).mul(3);

but of course, this is horrible in terms of the performance and GC overhead that it may cause. (Again, something like this would/should be covered by Escape Analysis, but investigations show that this is often not the case - at least for „less trivial“ application patterns).

A pattern that I occasionally use to alleviate this problem is to optionally allow passing in a reference that will store the result, e.g. in my „Geom“ library:
public static Point2D add(Point2D p0, Point2D p1, Point2D result)
that allows re-using existing instances, if desired.
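
A hypothetical usage of that variant (the class name Geom and the mul(p, factor, result) method are only assumed here for illustration):

Point2D result = new Point2D();    // allocated once, outside any hot loop
Geom.add(p0, p1, result);          // no temporary Point2D created by the call
Geom.mul(result, 2.0, result);     // the result object can also be reused as input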

Explicit pooling is a bit more subtle. Someone has to manage this pool. How should this be done, e.g. in the Point2D case? You’d have to keep track of the objects, and explicitly „release“ them (to put them back into the pool), and I can’t imagine how this should be used conveniently. You mentioned
take care of memory and performance will take care of itself.
This is now more true than ever. CPUs and GPUs are darn fast, but often simply have to wait for their data. So I agree that one of the keys to performance is in the memory management. But it’s far from trivial, and particularly in the case of Java. To quote a recommendation by Brian Goetz again: „Write dumb code“. The key point is: The JVM and its memory management and optimizations are tailored to work well with „usual object-oriented patterns“. Taking over the control here may interfere with optimizations that the JVM might do better than the average programmer (and if not now, then maybe in its next version).

E.g. in the example of the „Callable“ that you referred to: The single object allocation will not take much time, although, of course, it may become relevant when you really submit many, many of these Callables, and they all do very little work. (Then, one could consider „grouping“ the tasks differently as well, but that’s a design issue). And even if it imposes an overhead now: This pattern became far more omnipresent with the ForkJoinPools of Java 7 and the Lambdas of Java 8, so it’s not unlikely that a lot of work will be invested into detecting and optimizing these cases in future Java versions - and then, any „manual“ optimization may become obsolete (or even have a negative effect due to cache misses when accessing pooled objects etc. - it’s difficult).

But of course, all this does not apply to all cases. For large, complex objects that are expensive to create, this may be a different story anyhow.

On topic: :wink:

I did not continue with the benchmark yesterday (wasted some time with other stuff), but can probably continue tomorrow and during the weekend.

Still off topic, but maybe relevant to those looking for performance. I have been thinking about how to upgrade my computer to use the Titan. I have an Acer TC-603, with an i7 4770 3.4 GHz and 12 GB memory. The CPU is fine; it is only worth upgrading if I start going for dual or quad Xeon, which would break my bank. I looked at refurbished Dell or HP workstations, but the CPUs were a bit old.

So I bought a new case (Corsair 300R) and a new power supply (Corsair RM850, 850 W). This is because the existing PSU was too marginal (500 W, but probably a good model, made by FSP) and the original case was a bit tight, so I was worried about air flow.

The next problem is that the Acer motherboard takes a 12-pin supply, but standard ATX power supplies are 24-pin. So I am ordering a 24-to-12-pin converter, which will take a few days to arrive.

I will then transfer all the kit into the new case. The total cost will be about £600, which is not too bad. The original PC was maybe £500, so I will have a top-of-the-range CUDA machine for not much more than £1000.

So give me a couple of weeks and I will give a benchmark figure for the Titan Black.

Now, if somebody wants to give me a quad-socket Supermicro motherboard with 4 of the latest Xeons, plus 32 GB memory, then I will install it and give a benchmark figure for doing it all in pure multi-threaded Java…

BTW the Titan can use PCI Express version 3 with 16 lanes; my motherboard is 16-lane, but I am not sure whether it is PCIe v2 or v3. These things may make a difference to the data transfer rate. I am not sure how I can determine my motherboard’s PCIe version; there are no manuals for it that I can find.