JCublas DSYRK and DGEMM benchmark

It’s hard to tell whether upgrading really makes sense. I remember upgrading my 386DX40, doubling the RAM from 4 to 8 MB, or adding a larger HDD (350 MB in addition to the original 105 MB one). But later, it did not seem worth the hassle. I try to get a well-balanced system, close to (but not quite) high end. Instead of changing the graphics card after 2 years (potentially causing driver and compatibility issues, and potentially ending up with an overpowered GPU paired with a too-slow CPU), I’d rather wait another year and get a whole new system. But of course, when you focus entirely on GPU power and use the card as a plain, raw computing device (as the name of your data mining package might indicate), then it may be worthwhile to upgrade a single component.

However, I continued with the benchmark a little bit, and tried to apply this "benchmarking library" to the JCublas benchmark. (I still hesitate to call it a library - it’s mainly a bunch of classes computing the Cartesian product of some input parameter ranges). The current state/idea is to have benchmark classes like

public class JCublasBenchmarkDgemm extends AbstractBenchmark
{
    JCublasBenchmarkDgemm()
    {
        super("Dgemm");
    }

    @Override
    public ValueSet execute(ValueSet inputValueSet)
    {
        int rows = inputValueSet.getAsInt("rows");
        int cols = inputValueSet.getAsInt("cols");
        int iterations = inputValueSet.getAsInt("iterations");
        
        Timing timing = testDgemm(rows, cols, cols, iterations);
        
        MutableValueSet result = ValueSets.createDefault();
        result.put("GFLOPS", timing.getGFLOPS());
        result.put("MB/s HtoD", timing.getMBPerSecondHtoD());
        result.put("MB/s DtoH", timing.getMBPerSecondDtoH());
        return result;
    }
    
    private static Timing testDgemm(
        int rowsA, int colsA, int colsB, int iterations)
    {
        ...
    }
}

and then have a few lines in the main to generate the input space and collect the results as a CSV

        List<ValueSet> valueSets = ValueSets.create()
            .interpolate("cols", 200, 1000, 50)
            .derive("rows", "cols", v -> (40000 * 200 * 200) / ((Integer)v * (Integer)v))
            .add("iterations", 25)
            .build();
        
        List<Benchmark> benchmarks = Arrays.asList(
            new JCublasBenchmarkDgemm(), new JCublasBenchmarkDsyrk());
        BenchmarkRunner benchmarkRunner = BenchmarkRunner.create();
        benchmarkRunner.runBenchmarks(benchmarks, valueSets);
        
        List<ValueSet> resultValueSets = benchmarkRunner.getResultValueSets();
        System.out.println(
            ValueSetsPrinter.createString(resultValueSets, ";"));
        ValueSetsPrinter.printCSV(resultValueSets, "results.csv");

However, I still think that there’s something odd with the SYRK benchmark. I adapted the GFLOPS computation as you suggested, but I’ll have to re-read your original code to understand what you did there. (The GEMM and the SYRK are somewhat unrelated, aren’t they?) I mean, you seem to have swapped the rows and columns for the SYRK case, and … using different numbers there does not really make sense to me at the moment. Or was the idea to use GEMM parameters that cause the same operation/workload to be done as with SYRK?
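(For reference, the operation counts behind the GFLOPS columns are roughly the following - only a sketch of my assumption, not necessarily identical to your original computation:)

// Sketch of the operation counts behind the GFLOPS columns (my assumption,
// not necessarily identical to the original benchmark):
// GEMM: C = A*B with A (m x k), B (k x n) performs about 2*m*n*k FLOPs.
// SYRK: C = A^T*A with A (k x n) only fills one triangle of the n x n result,
//       i.e. roughly n*(n+1)*k FLOPs - about half of the corresponding GEMM.
public class FlopCounts
{
    public static double gemmFlops(long m, long n, long k)
    {
        return 2.0 * m * n * k;
    }

    public static double syrkFlops(long n, long k)
    {
        return (double)n * (n + 1) * k;
    }

    public static double gflops(double flops, double milliseconds)
    {
        return (flops / (milliseconds * 1e-3)) * 1e-9;
    }
}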

(BTW: This library is far from being publishable, but I’d just upload it as a plain JAR somewhere, in case someone wants to test it)

I was trying to compare gemm and syrk on the same problem, AtA, so it is a particular use of gemm. gemm is of course more general than that.

I decided to keep my existing system as is (too many complications trying to get Acer kit into a new box; there isn't even an Acer motherboard (mobo) manual). The other parts for building a new system are arriving tomorrow: a Gigabyte mobo and an i7 5820k, which has 6 cores.

I know my syrk implementation is correct, I’ve checked against my Java code, plus other simple tests. I read somewhere that CUDA2 gemm is optimised, but syrk is not, so it may be worthwhile doing parts of the matrix multiply using gemm. But none of my tests bears out this advice. CUDA and CUDA2 seem to have exactly the same gemm performance. Maybe they were talking about a later version?

If they had the same level of code optimisation, syrk should take about half the time of gemm for the same matrices. But of course it doesn't work out that way. My benchmark figures, and to a greater extent yours, have syrk taking a bit longer than half.

This could be because of (a) less optimised syrk code (b) latency or data transfer not insignificant (c) something else!

I think the exact CUDA code to implement BLAS is confidential, partly because some of these optimisations were provided confidentially. No problem, I doubt I will ever code in CUDA directly; I just want to access the library functions through JCUDA.

(It is called "JCuda", or maybe "jcuda" adhering to GitHub naming conventions … but there are libraries called jCUDA and JCUDA out there that have nothing to do with JCuda)

I see, now this makes a bit more sense. However, I’ll still have to review the benchmark (I changed some things during my cleanups, based on some "guesses" about your intention…)

Certainly (if you refer to CUBLAS and CUBLAS2), there is basically no difference between the two regarding the implementation. CUBLAS was introduced, and later they refactored the API, introducing a header "cublas_v2.h" (adding the CUBLAS "handles"), but both are implemented in the same way - also see cuBLAS :: CUDA Toolkit Documentation
However, they certainly did optimizations to CUBLAS (and CUBLAS_v2) between the different CUDA versions - particularly, they added optimizations for newer hardware architectures.
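(Just to illustrate the API difference with a small, self-contained sketch - tiny, hypothetical example data, no error checks, and certainly not how one would structure real code:)

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas;
import jcuda.jcublas.JCublas2;
import jcuda.jcublas.cublasHandle;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_N;

// Schematic comparison of the legacy CUBLAS API and the CUBLAS_v2 API,
// as exposed by JCublas and JCublas2.
public class GemmApiComparison
{
    public static void main(String[] args)
    {
        int n = 4;
        double[] hostA = new double[n * n];
        double[] hostB = new double[n * n];
        double[] hostC = new double[n * n];
        for (int i = 0; i < n * n; i++)
        {
            hostA[i] = i;
            hostB[i] = 1;
        }

        JCublas.cublasInit();
        Pointer dA = new Pointer();
        Pointer dB = new Pointer();
        Pointer dC = new Pointer();
        JCublas.cublasAlloc(n * n, Sizeof.DOUBLE, dA);
        JCublas.cublasAlloc(n * n, Sizeof.DOUBLE, dB);
        JCublas.cublasAlloc(n * n, Sizeof.DOUBLE, dC);
        JCublas.cublasSetVector(n * n, Sizeof.DOUBLE, Pointer.to(hostA), 1, dA, 1);
        JCublas.cublasSetVector(n * n, Sizeof.DOUBLE, Pointer.to(hostB), 1, dB, 1);

        // Legacy API: no handle, scalars are passed by value
        JCublas.cublasDgemm('n', 'n', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n);

        // CUBLAS_v2 API: a handle is created once, scalars are passed via pointers
        cublasHandle handle = new cublasHandle();
        JCublas2.cublasCreate(handle);
        Pointer pAlpha = Pointer.to(new double[]{ 1.0 });
        Pointer pBeta = Pointer.to(new double[]{ 0.0 });
        JCublas2.cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            n, n, n, pAlpha, dA, n, dB, n, pBeta, dC, n);
        JCublas2.cublasDestroy(handle);

        JCublas.cublasGetVector(n * n, Sizeof.DOUBLE, dC, 1, Pointer.to(hostC), 1);
        JCublas.cublasFree(dA);
        JCublas.cublasFree(dB);
        JCublas.cublasFree(dC);
        JCublas.cublasShutdown();
        System.out.println("C[0] = " + hostC[0]);
    }
}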

The source code of newer CUBLAS versions is confidential. But I once stumbled over the source code of CUBLAS 1.1 (back then, I think, it was not as strictly confidential as it is now … well, I found it on some Chinese website, not sure how legitimate this was … at least I won’t share significant parts of it, because… oh, yeah, who knows…). And I can say that they obviously used optimizations that are far beyond what an average programmer can reasonably do. They made heavy use of macros and all that, and likely used some code generation tool for different block and matrix sizes, transposed and non-transposed cases etc. … it’s really unreadable.

BTW: I’m not sure when I can allocate more time for the benchmarking things (there are some other rather pressing tasks right now), but I can at least pack everything into a ZIP this week and provide it for you if you want to test it.

Marco, I don't really expect you to spend any time on this, I am only posting because some others may be interested.

Redid the original benchmark with my Titan Black, figures below. Much faster, but maybe some overheads now being exposed? DGEMM for CUBLAS2 is consistently faster than CUBLAS.

 rows 40000 iterations 5 numTasks 12 nCols 200
CUBLAS DSYRK 111
CUBLAS DGEMM 222
CUBLAS2 DSYRK 113
CUBLAS2 DGEMM 171
rows 39602 iterations 5 numTasks 12 nCols 201
CUBLAS DSYRK 111
CUBLAS DGEMM 217
CUBLAS2 DSYRK 102
CUBLAS2 DGEMM 180
rows 39211 iterations 5 numTasks 12 nCols 202
CUBLAS DSYRK 109
CUBLAS DGEMM 214
CUBLAS2 DSYRK 118
CUBLAS2 DGEMM 163
rows 38826 iterations 5 numTasks 12 nCols 203
CUBLAS DSYRK 109
CUBLAS DGEMM 260
CUBLAS2 DSYRK 106
CUBLAS2 DGEMM 186
rows 38446 iterations 5 numTasks 12 nCols 204
CUBLAS DSYRK 110
CUBLAS DGEMM 196
CUBLAS2 DSYRK 112
CUBLAS2 DGEMM 167
rows 38072 iterations 5 numTasks 12 nCols 205
CUBLAS DSYRK 100
CUBLAS DGEMM 190
CUBLAS2 DSYRK 99
CUBLAS2 DGEMM 169
rows 37703 iterations 5 numTasks 12 nCols 206
CUBLAS DSYRK 98
CUBLAS DGEMM 213
CUBLAS2 DSYRK 114
CUBLAS2 DGEMM 159
rows 37340 iterations 5 numTasks 12 nCols 207
CUBLAS DSYRK 106
CUBLAS DGEMM 230
CUBLAS2 DSYRK 111
CUBLAS2 DGEMM 173
rows 36982 iterations 5 numTasks 12 nCols 208
CUBLAS DSYRK 102
CUBLAS DGEMM 189
CUBLAS2 DSYRK 101
CUBLAS2 DGEMM 158
rows 36629 iterations 5 numTasks 12 nCols 209
CUBLAS DSYRK 106
CUBLAS DGEMM 197
CUBLAS2 DSYRK 102
CUBLAS2 DGEMM 165
 

*** Edit ***

Here is the performance with three times the number of rows. It more or less takes three times as long, which is encouraging.

rows 120000 iterations 5 numTasks 12 nCols 200
CUBLAS DSYRK 309
CUBLAS DGEMM 633
CUBLAS2 DSYRK 313
CUBLAS2 DGEMM 523
rows 118800 iterations 5 numTasks 12 nCols 201
CUBLAS DSYRK 325
CUBLAS DGEMM 618
CUBLAS2 DSYRK 292
CUBLAS2 DGEMM 480
rows 117630 iterations 5 numTasks 12 nCols 202
CUBLAS DSYRK 309
CUBLAS DGEMM 598
CUBLAS2 DSYRK 299
CUBLAS2 DGEMM 483
rows 116470 iterations 5 numTasks 12 nCols 203
CUBLAS DSYRK 310
CUBLAS DGEMM 645
CUBLAS2 DSYRK 310
CUBLAS2 DGEMM 515 

Here is the same test as above, but with fp64 turned OFF. It shows the dramatic performance boost from using the Titan Black dp units, which make this board ideally suited to dp CUDA.

rows 120000 iterations 5 numTasks 12 nCols 200
CUBLAS DSYRK 807
CUBLAS DGEMM 1078
CUBLAS2 DSYRK 770
CUBLAS2 DGEMM 946
rows 118800 iterations 5 numTasks 12 nCols 201
CUBLAS DSYRK 761
CUBLAS DGEMM 1049
CUBLAS2 DSYRK 752
CUBLAS2 DGEMM 927
rows 117630 iterations 5 numTasks 12 nCols 202
CUBLAS DSYRK 760
CUBLAS DGEMM 1044
CUBLAS2 DSYRK 746
CUBLAS2 DGEMM 902
rows 116470 iterations 5 numTasks 12 nCols 203
CUBLAS DSYRK 743
CUBLAS DGEMM 990
CUBLAS2 DSYRK 738
CUBLAS2 DGEMM 888 

Overall a very useful performance boost, I can now get back to my real work!

The fact that CUBLAS2 seems to be faster than CUBLAS here is a bit surprising. I could only offer some hand-waving guesses for this (e.g. that creating the "handle" does some one-time setup work that otherwise has to be done for each new call, or so). (A profiler run might bring additional insights here, but this is a bit fiddly). I’m still curious about the performance difference between single and double precision. Maybe I can continue with adding this to the set of benchmarks next week or during the holidays, depending on the progress with my other tasks.

CUBLAS2 is only faster for dgemm, not dsyrk, which would indicate some gemm optimisations being done, and nothing to do with handles… To be honest, I ran out of time to do this work a week ago, but if you want me to run some code against a Titan I can do so.

OK, some more tests. The GeForce Titan has meant that the SYRK is no longer the bottleneck, so I need to explore ways to speed up other code. I had already written a test harness to compare different Java implementations, so I looked at that again. Conclusions (this is all with the i7 5820k, which has 6 cores; my matrix is 150,000 x 200, and the multiplication results in a 200 x 200 matrix, either the full matrix or just the upper triangle):

  1. Anything which has inner loops like

                 for (int k = 0; k < nRows; k++) {
                    s += a[i][k] * b[k][j];
                }

has terrible performance. This is well known. It takes about 80 secs in my test.

  2. Doing tricks like copying arrays, unrolling loops etc. improves performance. I guess the CPU and the Java version are critical here. To generate the upper triangular matrix, I got down to about 1.6 secs, from about 3-4 secs without these tricks.

  3. DSYRK had an equivalent performance of about 0.2 secs. So the GPU sped up the code by a factor of 10 or so.

  4. With multithreaded Java, I doubled the speed to about 0.8 secs. So the GPU is about 4 times faster than multithreaded Java. I was running the CPU at 3.3 GHz, which is standard, but it can be overclocked to about 4 GHz.

  5. There are probably better ways to multithread the code which are more efficient - at present I multithread over the 200 columns, but I may be better off multithreading over the 150,000 rows (see the sketch after this list).

  6. I now need to look at the non-GPU code and try to speed it up. There aren’t any CUBLAS or CuSparse methods I can use - my main bottleneck is multiplying a matrix by a diagonal matrix, and CuSparse doesn’t implement the diagonal sparse matrix storage scheme.
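(For reference, the column-wise multithreading mentioned in point 5 looks roughly like the following - a simplified sketch of the general pattern with made-up names, not my actual production code:)

import java.util.concurrent.*;

// Rough sketch: computing the upper triangle of C = A^T * A with one
// task per result column, submitted to an ExecutorCompletionService.
// A is stored row-major as double[nRows][nCols].
public class ColumnParallelAta
{
    public static double[][] ata(double[][] a, int nThreads) throws Exception
    {
        int nRows = a.length;
        int nCols = a[0].length;
        double[][] c = new double[nCols][nCols];
        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        ExecutorCompletionService<Void> completion =
            new ExecutorCompletionService<Void>(executor);

        for (int col = 0; col < nCols; col++)
        {
            final int j = col;
            completion.submit(() ->
            {
                // Fill the entries c[i][j] with i <= j of result column j
                for (int i = 0; i <= j; i++)
                {
                    double s = 0;
                    for (int k = 0; k < nRows; k++)
                    {
                        s += a[k][i] * a[k][j];
                    }
                    c[i][j] = s;
                }
                return null;
            });
        }
        for (int col = 0; col < nCols; col++)
        {
            completion.take().get(); // wait for all column tasks to finish
        }
        executor.shutdown();
        return c;
    }
}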

But even if I speed up all the rest of the Java to take zero time, I will only get an overall speedup by a factor of 2 or so. Getting big improvements will take some more radical changes, maybe with CUDA coding, which I don’t really want to do.

To quote you here: "Take care of memory and performance will take care of itself" - but here not referring to memory management in the broader sense, but to caching. You likely know this, but the performance of such a loop will heavily depend on the memory layout: in the matrix "b", the accesses b[k][j] for consecutive k jump from row to row, which is just the opposite of the (row-wise) storage format. The difference between

for (int r=0; r<rows; r++)
{
    for (int c=0; c<cols; c++)
    {
        process(matrix[r][c]);
    }
}

and

for (int c=0; c<cols; c++)
{
    for (int r=0; r<rows; r++)
    {
        process(matrix[r][c]);
    }
}

can easily be an order of magnitude. (Unfortunately, it’s not always possible to arbitrarily choose the storage format of a matrix)
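To make this concrete, here is a small, self-contained test one could run (sizes are arbitrary, and the timing is deliberately crude - just enough to see the effect):

// Minimal sketch comparing row-wise and column-wise traversal of a
// row-major double[][] matrix. The row-wise version touches memory
// sequentially and is typically much faster.
public class TraversalOrderTest
{
    public static void main(String[] args)
    {
        int rows = 4000;
        int cols = 4000;
        double[][] matrix = new double[rows][cols];

        long t0 = System.nanoTime();
        double sumRowWise = 0;
        for (int r = 0; r < rows; r++)
        {
            for (int c = 0; c < cols; c++)
            {
                sumRowWise += matrix[r][c];
            }
        }
        long t1 = System.nanoTime();

        double sumColWise = 0;
        for (int c = 0; c < cols; c++)
        {
            for (int r = 0; r < rows; r++)
            {
                sumColWise += matrix[r][c];
            }
        }
        long t2 = System.nanoTime();

        System.out.printf("row-wise   : %8.2f ms (sum %s)%n", (t1 - t0) * 1e-6, sumRowWise);
        System.out.printf("column-wise: %8.2f ms (sum %s)%n", (t2 - t1) * 1e-6, sumColWise);
    }
}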

  2. Doing tricks like copying arrays, unrolling loops etc. improves performance. I guess the CPU and the Java version are critical here. To generate the upper triangular matrix, I got down to about 1.6 secs, from about 3-4 secs without these tricks.
  4. With multithreaded Java, I doubled the speed to about 0.8 secs. So the GPU is about 4 times faster than multithreaded Java. I was running the CPU at 3.3 GHz, which is standard, but it can be overclocked to about 4 GHz.
  5. There are probably better ways to multithread the code which are more efficient - at present I multithread over the 200 columns, but I may be better off multithreading over the 150,000 rows.

If these tests exist as simple, easily testable examples, I’d be interested to see them (but of course, you don’t have to post them publicly if you don’t want to)

  6. I now need to look at the non-GPU code and try to speed it up. There aren’t any CUBLAS or CuSparse methods I can use - my main bottleneck is multiplying a matrix by a diagonal matrix, and CuSparse doesn’t implement the diagonal sparse matrix storage scheme.

Hm. I’m not an expert here, but what do you mean when you say it "doesn’t implement" it? There are several supported storage formats in CUSPARSE. I can’t say much about their pros and cons, but even though they usually aim at "general" sparse matrices (and not at diagonal ones in particular), they are at least all suitable for representing diagonal sparse matrices…
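(For example - just to illustrate the point, independent of any particular CUSPARSE call - an n x n matrix that is purely diagonal fits trivially into the common CSR format:)

// Sketch: the CSR representation of a purely diagonal n x n matrix consists
// of n values, the column indices 0..n-1 and the row pointers 0..n.
static void diagonalToCsr(double[] diagonal,
    double[] csrValues, int[] csrRowPtr, int[] csrColInd)
{
    int n = diagonal.length;         // csrValues, csrColInd: length n
    for (int i = 0; i < n; i++)      // csrRowPtr: length n + 1
    {
        csrValues[i] = diagonal[i];  // the i-th non-zero is the i-th diagonal entry
        csrColInd[i] = i;            // ... and it sits in column i
        csrRowPtr[i] = i;            // exactly one non-zero per row
    }
    csrRowPtr[n] = n;
}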

But even if I speed up all the rest of the Java to take zero time, I will only get an overall speedup by a factor of 2 or so. Getting big improvements will take some more radical changes, maybe with CUDA coding, which I don’t really want to do.

Sure, Amdahl will kick in at a certain point, but depending on the application case, Gustafson may come to the rescue :wink: What are the other parts for which you are considering a dedicated CUDA solution? Of course, "real" CUDA (own kernels) tends to involve a considerable effort (compared to just using the building blocks of the runtime libraries). But maybe there are intermediate steps that could benefit from being executed on the GPU. And even if this on its own does not bring a good speedup, it may still be advantageous if, for example, it means that one "Device → Host copy, (Java computation), Host → Device copy" round trip can be omitted.

I have been doing benchmarking of pure Java code for about 5 years; the test code is about 3600 lines long, and I’m sure nobody wants it! At one time I looked at the Jama library and could quite easily beat it. My test bench tries lots of different ways to code the same thing, and I took the best for my production code. I can send it privately, but be warned! When you see it, you will realise you don't really want it.

One thing I have discovered - always check the dp setting on the Titan Black, as it seems a system restart resets it to sp. Windows 10 has a habit of restarting without warning, and there seems to be no way to prevent this in the Home edition - really irritating when there is an upgrade and my system restarts after it has been processing for 2 days…

I have tried to multithread my other bottlenecks (without CUDA) and they are too small to speed up - the overhead of threading is too high. I am using an ExecutorCompletionService. So I will leave it.

The big difference would be to multi thread at a higher level, but then I need to check my code is thread safe, and I will be calling CUDA from multiple threads, which is something I would rather not do as I don’t see too many people here doing it.

Basically I need a technical assistant to write all this up and continue the work, but that isn’t going to happen any time soon.

Long story short - multi threaded Java (without CUDA) is surprisingly fast if done right. CUDA is only useful for me for matrix multiplication. I just wish the compiler would use more AVX code to do things like dot product. Maybe one day.

*** Edit ***

ps by ‘doesn’t support’ I mean the DIA format, which was in CUSP but not CuSparse. I could use CuSparse, but I am running out of steam, and it probably won’t help much anyway.

[QUOTE=NigelEssence]I have been doing benchmarking of pure Java code for about 5 years; the test code is about 3600 lines long, and I’m sure nobody wants it! At one time I looked at the Jama library and could quite easily beat it. My test bench tries lots of different ways to code the same thing, and I took the best for my production code. I can send it privately, but be warned! When you see it, you will realise you don't really want it.
[/QUOTE]

Well, if it’s not a simple test case, I’d probably not have/take the time to dig into it anyhow, so that’s no problem.

I was just curious, because I’m also interested in Java performance optimizations in general. When you talk about experiments and production code, there probably are the usual trade-offs involved: Often, one can just boil down and hard-wire everything that has to be done for one particular task, solely operating on 1D double arrays etc. - but any maintainability and flexibility is simply gone then, and it’s hard to tell whether a few percent of performance are worth this. The (somewhat idealistic) idea of a nice OO design may limit the performance in the first place, but may allow one to more easily swap the backend for a better one (e.g. a library that is parallelized, or runs on the GPU, or is itself backed by another native implementation).

For example, on the one hand, I liked the approach of something like https://ujmp.org/, which is engineered to the limit. Some would certainly say it is over-engineered when comparing, e.g., the Jama Matrix class to the intimidating UJMP Matrix interface, but that’s a matter of choice.

You probably know the Java Matrix Benchmark, which did some detailed and interesting comparisons of the most well-known Java matrix libraries for different tasks and on different hardware setups.

I have tried to multithread my other bottlenecks (without CUDA) and they are too small to speed up - the overhead of threading is too high. I am using an ExecutorCompletionService. So I will leave it.

That was one thing that my question about the code aimed at: I was curious how you attempted the parallelization (you mentioned that you did it over the columns, but even there, there are many degrees of freedom … )

The big difference would be to multi thread at a higher level, but then I need to check my code is thread safe, and I will be calling CUDA from multiple threads, which is something I would rather not do as I don’t see too many people here doing it.

At least, it should be possible, but as usual, it adds a whole new dimension of complexity. Some people are keen on pointing out the advantages of overlapping compute and memory transfer (roughly: copying new data to the GPU while a kernel is still running). But this is 1. not applicable to all tasks, 2. requires manual and careful handling of streams and events, and 3. I’m not sure how well this works (and, if it works, how beneficial it is) when using runtime libraries like CUBLAS.

I think using CUDA with multiple host threads is mainly interesting (or doable without many brain twists) when you have several CUDA cards in your system. Then you can just use one host thread for each one, throw some work on them, and wait until they are done, independently.

Basically I need a technical assistant to write all this up and continue the work, but that isn’t going to happen any time soon.

Long story short - multi threaded Java (without CUDA) is surprisingly fast if done right. CUDA is only useful for me for matrix multiplication. I just wish the compiler would use more AVX code to do things like dot product. Maybe one day.

Not knowing the details, this could be one of the points that I referred to before: you’d probably not want to do two operations with CUBLAS and, in between, copy the data back to Java just to do a dot product. Although a dot product on the GPU is usually not faster than on the CPU (and maybe even slower), it would be better to keep the memory where it is.
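(For example - only a sketch with made-up sizes and no error handling: if the vectors already live on the GPU as part of a larger CUBLAS pipeline, the dot product can be computed right there, without copying the vectors back:)

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas2;
import jcuda.jcublas.cublasHandle;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

// Sketch: computing a dot product directly on device data with CUBLAS,
// so the (long) vectors never have to be copied back to the host.
public class DeviceDotSketch
{
    public static void main(String[] args)
    {
        int n = 1000;
        double[] hostX = new double[n];
        double[] hostY = new double[n];
        for (int i = 0; i < n; i++)
        {
            hostX[i] = 1;
            hostY[i] = 2;
        }

        cublasHandle handle = new cublasHandle();
        JCublas2.cublasCreate(handle);

        Pointer dX = new Pointer();
        Pointer dY = new Pointer();
        cudaMalloc(dX, n * Sizeof.DOUBLE);
        cudaMalloc(dY, n * Sizeof.DOUBLE);
        cudaMemcpy(dX, Pointer.to(hostX), n * Sizeof.DOUBLE, cudaMemcpyHostToDevice);
        cudaMemcpy(dY, Pointer.to(hostY), n * Sizeof.DOUBLE, cudaMemcpyHostToDevice);

        // With the default (host) pointer mode, only the scalar result is
        // written back to a small host array - the vectors stay on the device
        double[] result = new double[1];
        JCublas2.cublasDdot(handle, n, dX, 1, dY, 1, Pointer.to(result));
        System.out.println("dot = " + result[0]);

        cudaFree(dX);
        cudaFree(dY);
        JCublas2.cublasDestroy(handle);
    }
}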

Admittedly, I personally don’t care so much about things like AVX - my bigger hope is that all these memory transfers will soon become obsolete. The first steps towards sharing memory between CPUs and GPUs have been taken. There are some caveats, but I think that the demand will cause some stable, convenient solutions to become available here soon.

ps by 'doesn’t support' I mean the DIA format, which was in CUSP but not CuSparse. I could use CuSparse, but I am running out of steam, and it probably won’t help much anyway.

I see, sorry, I didn’t know that "THE diagonal format (DIA)" was one of these many, many special sparse matrix formats - I thought you just referred to "some" diagonal (banded) matrix.

I have sent you privately my test code, it is more a large number of small tests rather than anything complex. Do with it what you wish.

One of my biggest problems with other Java matrix libraries is that they don’t store upper triangular matrices in a 1D array.
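(For illustration, the kind of 1D packed storage meant here - a minimal sketch that keeps only the upper triangle, column by column, similar to the packed layout used by LAPACK:)

// Minimal sketch of 1D packed storage for an upper triangular n x n matrix:
// only the n*(n+1)/2 entries with i <= j are kept, column by column
// (similar to the "packed" layout used by LAPACK for uplo = 'U').
public class PackedUpperTriangular
{
    final int n;
    final double[] data;

    PackedUpperTriangular(int n)
    {
        this.n = n;
        this.data = new double[n * (n + 1) / 2];
    }

    // Index of element (i, j) with 0 <= i <= j < n
    int index(int i, int j)
    {
        return i + j * (j + 1) / 2;
    }

    double get(int i, int j)
    {
        return data[index(i, j)];
    }

    void set(int i, int j, double value)
    {
        data[index(i, j)] = value;
    }
}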

I did most of this work a few years ago, so may not be up to date with matrix libraries.

multA is one of the bigger bottlenecks now. I used CUDA to do AtA, but I first have to do A = W*B, where W is diagonal, of dimension nRows, with nRows being 150,000. It would be nice to do both operations within CUBLAS, just transferring B and W, but I couldn’t see any easy way to do this. nCols is about 200.

In fact, I do this repeatedly with the same B but changing W. So I would really like to transfer B outside the loop, and then inside the loop transfer W and calculate AtA.

My test code has a single thread and multiple thread version of multA, but multi thread is slower than single thread.

It may be that any further discussions should be private, as nobody else seems very interested!

[QUOTE=NigelEssence]I have sent you privately my test code, it is more a large number of small tests rather than anything complex. Do with it what you wish.

In fact, I do this repeatedly with the same B but changing W. So I would really like to transfer B outside the loop, and then inside the loop transfer W and calculate AtA.
[/QUOTE]
Thanks, I’ll try to allocate some time for this and have a look, and maybe also look for options for how this particular multiplication could be done (the CUBLAS API is quite complex, and of course, I don’t know by heart what each function does exactly).
[QUOTE=NigelEssence]My test code has a single thread and multiple thread version of multA, but multi thread is slower than single thread.[/QUOTE]

[QUOTE=NigelEssence;127288]
It may be that any further discussions should be private, as nobody else seems very interested![/QUOTE]

It depends. Any discussion that particularly refers to the code probably does not belong here, but other things (or alternative implementations) may be interesting for others as well. (Maybe I’ll split this thread somehow, but … we’ll see)

Aha. This is what I need.

c - Scaling the rows of a matrix with CUDA - Stack Overflow

There is even a JCublas2 implementation!

This is not standard BLAS, but an extension.

Some of the other extensions are also worth looking at. I am solving a set of (non-linear) least squares problems, so cublasDgelsBatched is of interest, but may not be directly applicable, given that my equations are non-linear and I have various checks for conditioning issues. For production code, robustness is important.

Let me play with this…

*** Edit ***

This thread has covered various topics:

  • pure Java performance and threading
  • Cublas
  • hardware and best cards for maths computation
  • benchmarks
  • probably a few others!

At least some of these topics (e.g. what card should I buy) should be of general interest.

Initial tests show that dgmm is very promising. The core of my code is:


            // Allocate device memory once: A (input), B (scaled copy of A),
            // C (result) and W (diagonal weights)
            cudaMalloc(d_A, nRows * nCols * Sizeof.DOUBLE);
            cudaMalloc(d_B, nRows * nCols * Sizeof.DOUBLE);
            cudaMalloc(d_C, nCols * nCols * Sizeof.DOUBLE);
            cudaMalloc(d_W, nRows * Sizeof.DOUBLE);
            // Transfer the large A matrix only once, outside the loop
            cublasSetVector(nRows * nCols, Sizeof.DOUBLE, Pointer.to(dh_A), 1, d_A, 1);
            current = System.currentTimeMillis();
            for (int j = 0; j < nIts; j++) {
                // Only the (small) weight vector changes in each iteration
                cublasSetVector(nRows, Sizeof.DOUBLE, Pointer.to(dh_W), 1, d_W, 1);
                // B = A scaled by the diagonal matrix W
                cublasDdgmm(handle, CUBLAS_SIDE_RIGHT, nRows, nCols, d_A, nRows, d_W, 1, d_B, nRows);
                // C = B^T * B, upper triangle only
                cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_T, nCols, nRows, pAlpha, d_B, nRows, pBeta, d_C, nCols);
                // Copy back only the small nCols x nCols result
                cublasGetVector(nCols * nCols, Sizeof.DOUBLE, d_C, 1, Pointer.to(dh_C), 1);
            }
 

I use the same A matrix with different diagonal W matrices. So data transfer overhead is significantly reduced. Also, I can multiply by the diagonal W matrix in CUDA, not Java, which again significantly speeds it up.

Took a bit of fiddling to get all the arguments correct.

It’s not yet entirely clear to me (or … maybe even less clear than in the beginning) what the actual task is, i.e. what the actual input, computation and output are.

So you have an input matrix B, with a size of nRows x nCols, with nRows ~= 150000 and nCols ~= 200.
This matrix is symmetric? (Or at least, only the upper triangular part is relevant?)
From that, you want to compute A = W*B, with W being a diagonal matrix. (And only a diagonal matrix - not a banded diagonal one!?)
Then you want to compute AtA.

If this is correct, there are certainly different options:

  • Using CUBLAS, trying to "emulate" the desired operations as well as possible
  • Writing an own kernel that does exactly this

The latter is probably out of scope. Even though one could tweak and optimize something here and there (meaning that no unnecessary FLOPs would have to be performed, and one would have full control over the memory layout etc.), it’s practically impossible to just "write down" such a (non-trivial) kernel and achieve a higher performance than the CUBLAS operations, whose algorithms have been worked out by dozens of researchers, implemented and optimized by the NVIDIA engineers, and time-tested by thousands of users. (Apart from that, obviously, the effort of implementing such a kernel is far higher than just calling a few CUBLAS functions…)

So the goal is likely to assemble this process from existing CUBLAS operations. You now mentioned DGMM for the diagonal multiplication. For a moment, I also thought that TRMM might be worth a look, but I did not yet completely assemble an example here.

But for example, regarding the choice between GEMM and SYRK: You seem to have focussed on the FLOPS. But … wouldn’t it make more sense to really focus on the execution time here? I mean, GEMM might be better optimized (because it is probably THE most important BLAS operation). But if only the triangular part of the result is relevant, a lot of FLOPs may be wasted. To oversimplify:
A GEMM producing a 1000x1000 result has to perform roughly 2*1000*1000*k floating point operations,
a SYRK producing the same 1000x1000 result only needs about half of that,
so even if the SYRK runs at a lower GFLOPS rate, it may still be faster, and the (relevant parts of the) result may be the same for your case.

Did I misunderstand something here?

The basic idea (or misunderstanding) is shown in a dump of the benchmark results (a CSV file):


benchmark;cols;rows ;iterations;GFLOPS    ;avg.ms  ;MB/s HtoD ;MB/s DtoH ;
    Sgemm; 200;40000;        25; 802.33435; 3.98836;2844.98462;2804.74512;
    Dgemm; 200;40000;        25; 105.98647;30.19253;2854.37769;2802.38354;
    Ssyrk; 200;40000;        25; 225.85835; 7.11951;2871.66162; 884.11902;
    Dsyrk; 200;40000;        25;  44.61617;36.04075;2872.58740;1235.73608;
    Sgemm; 250;25600;        25; 760.29749; 4.20888;2845.09985;2819.34521;
    Dgemm; 250;25600;        25; 112.95744;28.32926;2875.03857;2810.98950;
    Ssyrk; 250;25600;        25; 329.69144; 4.87244;2922.26318;1176.23572;
    Dsyrk; 250;25600;        25;  55.78395;28.79682;2870.90747;1605.80969;
    Sgemm; 300;17777;        25;1565.13940; 2.04446;2821.00854;2796.62524;
    Dgemm; 300;17777;        25; 109.72440;29.16270;2856.53857;2755.17993;
    Ssyrk; 300;17777;        25; 513.75580; 3.12456;2869.19873;1443.54846;
    Dsyrk; 300;17777;        25;  50.56953;31.74368;2855.68579;1924.65723;
    Sgemm; 350;13061;        25;1180.79785; 2.70999;2785.58569;2813.89746;
    Dgemm; 350;13061;        25; 117.19537;27.30436;2865.63818;2797.50708;
    Ssyrk; 350;13061;        25; 486.22025; 3.30003;2873.30835;1736.23242;
    Dsyrk; 350;13061;        25;  66.06841;24.28610;2876.65332;2129.87158;
    Sgemm; 400;10000;        25;2536.29443; 1.26168;2802.30249;2819.45850;
    Dgemm; 400;10000;        25; 112.71124;28.39114;2852.60962;2779.98804;
    Ssyrk; 400;10000;        25; 710.73889; 2.25681;2874.33618;1887.67529;
    Dsyrk; 400;10000;        25;  81.22978;19.74645;2820.72217;2191.51318;
    Sgemm; 450; 7901;        25;1457.62415; 2.19529;2773.25684;2791.66992;
    Dgemm; 450; 7901;        25; 108.37018;29.52754;2829.51001;2756.75366;
    Ssyrk; 450; 7901;        25; 710.18970; 2.25786;2846.58838;2141.39526;
    Dsyrk; 450; 7901;        25;  55.13606;29.08274;2852.13672;2474.80933;
    Sgemm; 500; 6400;        25;1232.43445; 2.59649;2813.35986;2781.80371;
    Dgemm; 500; 6400;        25; 114.09147;28.04767;2861.03076;2781.13184;
    Ssyrk; 500; 6400;        25; 663.95697; 2.41461;2845.44092;2244.54663;
    Dsyrk; 500; 6400;        25;  66.39425;24.14667;2854.35156;2603.55127;
    Sgemm; 550; 5289;        25;1984.29993; 1.61258;2848.46362;2744.95410;
    Dgemm; 550; 5289;        25; 110.48096;28.96286;2833.02368;2779.08862;
    Ssyrk; 550; 5289;        25;1257.37244; 1.27475;2862.79614;2395.69385;
    Dsyrk; 550; 5289;        25;  79.44069;20.17645;2825.38257;2407.81958;
    Sgemm; 600; 4444;        25;1533.64905; 2.08632;2768.87354;2708.94092;
    Dgemm; 600; 4444;        25; 115.01791;27.81897;2847.85425;2763.15088;
    Ssyrk; 600; 4444;        25; 955.81964; 1.67658;2829.78564;2535.89941;
    Dsyrk; 600; 4444;        25;  70.41306;22.75866;2804.63013;2427.75415;
    Sgemm; 650; 3786;        25;2531.01953; 1.26398;2756.39575;2737.03027;
    Dgemm; 650; 3786;        25; 112.36083;28.47229;2839.62378;2765.90137;
    Ssyrk; 650; 3786;        25;1685.35693; 0.95057;2856.33862;2566.91724;
    Dsyrk; 650; 3786;        25;  76.12366;21.04531;2848.05908;2470.48584;
    Sgemm; 700; 3265;        25;1907.63831; 1.67731;2823.50732;2738.86035;
    Dgemm; 700; 3265;        25; 113.17950;28.27102;2829.40015;2771.35522;
    Ssyrk; 700; 3265;        25;1252.22937; 1.27943;2865.71387;2615.56665;
    Dsyrk; 700; 3265;        25;  84.41212;18.97992;2838.72192;2696.95435;
    Sgemm; 750; 2844;        25;1623.76501; 1.97042;2749.96118;2710.34106;
    Dgemm; 750; 2844;        25; 114.34930;27.98006;2821.87891;2738.67407;
    Ssyrk; 750; 2844;        25;1015.52844; 1.57739;2760.49390;2388.93359;
    Dsyrk; 750; 2844;        25;  94.40007;16.96909;2834.42554;2615.80029;
    Sgemm; 800; 2500;        25;2291.23022; 1.39663;2737.64063;2734.63452;
    Dgemm; 800; 2500;        25; 117.17906;27.30863;2818.83154;2749.87524;
    Ssyrk; 800; 2500;        25;1655.47302; 0.96770;2874.49463;2480.85498;
    Dsyrk; 800; 2500;        25;  87.40066;18.32938;2835.83667;2587.52539;
    Sgemm; 850; 2214;        25;1752.56665; 1.82545;2805.66431;2685.37183;
    Dgemm; 850; 2214;        25; 110.22385;29.02484;2824.90283;2712.55127;
    Ssyrk; 850; 2214;        25;1169.42554; 1.36947;2892.54102;2549.69531;
    Dsyrk; 850; 2214;        25;  94.48727;16.94934;2832.06689;2612.65015;
    Sgemm; 900; 1975;        25;2443.75952; 1.30925;2820.10010;2711.89038;
    Dgemm; 900; 1975;        25; 114.12160;28.03588;2766.00464;2721.42651;
    Ssyrk; 900; 1975;        25;1917.46143; 0.83523;2907.13843;2520.58643;
    Dsyrk; 900; 1975;        25;  88.46268;18.10399;2875.62451;2587.71631;
    Sgemm; 950; 1772;        25;1802.35022; 1.77461;2781.79761;2707.88208;
    Dgemm; 950; 1772;        25; 109.08527;29.32073;2838.15259;2731.16895;
    Ssyrk; 950; 1772;        25;1477.81348; 1.08330;2789.51001;2697.66040;
    Dsyrk; 950; 1772;        25;  95.61117;16.74400;2877.52661;2680.67358;
    Sgemm;1000; 1600;        25;1700.23975; 1.88209;2734.23877;2707.88770;
    Dgemm;1000; 1600;        25; 114.58681;27.92643;2810.29224;2717.72656;
    Ssyrk;1000; 1600;        25;1204.80701; 1.32934;2777.56689;2751.87939;
    Dsyrk;1000; 1600;        25;  91.26176;17.54952;2802.32520;2690.42798;

For the last case:


    Dgemm;1000; 1600;        25; 114.58681;27.92643;2810.29224;2717.72656;
    Dsyrk;1000; 1600;        25;  91.26176;17.54952;2802.32520;2690.42798;

The GFLOPS of GEMM are higher, but it still takes more time than the same operation with SYRK - there is simply less work to do…
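As a quick sanity check with these two rows (assuming the GFLOPS column is simply the operation count divided by the kernel time):

Dgemm: 2 * 1600 * 1000 * 1000 ≈ 3.2e9 FLOPs → 3.2e9 / 114.6e9 FLOPS ≈ 27.9 ms (matches avg.ms)
Dsyrk: 1600 * 1000 * 1001 ≈ 1.6e9 FLOPs → 1.6e9 / 91.3e9 FLOPS ≈ 17.5 ms (matches avg.ms)

So the SYRK does about half the work and, despite the lower GFLOPS rate, still finishes in less time.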

I think we are coming to an end of this discussion! Yes, your summary of what I am trying to do is roughly correct, I will send you some papers for the background. No, I am not interested in GFLOPS, I am interested in time. Yes, I am only interested in using existing libraries, not in coding CUDA kernels. Yes, SYRK should take roughly half the time of GEMM, as it does half the arithmetic, but the data transfer is similar, so it takes more than half the time. Yes, data transfer time is important, although I haven’t yet quantified data transfer time v. processing time.

But if you compare your times for

Dgemm; 200;40000; 25; 105.98647;30.19253;2854.37769;2802.38354;

and

Dgemm; 400;10000; 25; 112.71124;28.39114;2852.60962;2779.98804;

the number of processing operations is the same (200 x 200 x 40000 x 2 vs. 400 x 400 x 10000 x 2) but the amount of data transferred is different (200 x 40000 x 8 bytes vs. 400 x 10000 x 8 bytes), but I’m not sure what your CSV columns are exactly, so I don’t know whether your figures confirm this or not. It appears that the data transfer in MB/s and the GFLOPS are similar, so from that the total time can be determined. My guess is:

benchmark;cols;rows ;iterations;GFLOPS    ;avg.ms  ;MB/s HtoD ;MB/s DtoH ;
    Dgemm; 200;40000;        25; 105.98647;30.19253;2854.37769;2802.38354;
    Dgemm; 400;10000;        25; 112.71124;28.39114;2852.60962;2779.98804;

Single iteration (secs):  processing  data transfer  total

    Dgemm; 200;40000;         0.3         0.05        0.35
    Dgemm; 400;10000;         0.3         0.1         0.4

but I am probably completely wrong (by factors of 25 and/or 2), and I guess you have the timing figures? As you said some time back, extracting data transfer and processing times from my original benchmark would be useful.

That was one of the reasons for the question. It may be important to know when which data has to be transferred where, in order to estimate the overall effort. Again, a bit oversimplified: it does not make sense to compute the timing for

  • Host → Device copy
  • Scale with diagonal matrix
  • Device → Host copy
  • Host → Device copy
  • Gemm
  • Device → Host copy

when the actual process (that has to compete with the pure Java timings) in the end should only be

  • Host → Device copy
  • Scale with diagonal matrix
  • Gemm
  • Device → Host copy

Maybe the papers will bring some insights here.

But if you compare your times for

I’m not sure what your CSV columns are exactly, so I don’t know whether your figures confirm this or not. It appears that the data transfer in MB/s and the GFLOPS are similar, so from that the total time can be determined.

but I am probably completely wrong (by factors of 25 and/or 2), and I guess you have the timing figures? As you said some time back, extracting data transfer and processing times from my original benchmark would be useful.

The columns do not (yet) contain the total time - and also not the time for the individual memory copies. The GFLOPs and the MB/s columns may be what is interesting when someone wants to compare the hardware. The MB/s should in most cases be similar for "reasonably large" memory chunks, but also see the Bandwidth Test at http://jcuda.org/samples/JCudaBandwidthTest.java .
In the last table, I added the "avg.ms" column, which is the average time for a single execution of the operation. I’ll add some more timings (for the memory transfers, and the total (average) time), because these are likely more interesting when comparing the algorithmic approaches.

We’re getting there!

In the actual algorithm, it is more like:

  • Host → Device copy of the large matrix A, 200 x 150,000 (CUDA)
  FOR i = 1 to 5 DO
    • do some stuff (Java)
    • construct the W weighting diagonal matrix (1 x 150,000) (Java)
    • scale A with the diagonal matrix W using dgmm (400 x 150,000 operations) (CUDA)
    • construct the upper triangular matrix using syrk (150,000 x 400 x 400 / 2 operations) (CUDA)
    • Device → Host copy of the upper triangular result (400 x 400)
    • factorise the upper triangular matrix (Java)
    • back substitute (Java)
    • do some stuff (Java)
  ENDDO

so being able to move the large host-to-device matrix copy outside the loop will make a big difference.

Let me go implement this in my main code.

(Not sure when I’ll have a chance to invest more time here, so just a short note: I noticed some factorization snippets in the code that you provided, and now you again mentioned the factorization - did you see the JCudaMatrixCgSample20141023.zip and the JCublasMatrixInvert.java on the samples page? They may contain some snippets that may be useful here, although I certainly would have to refresh my knowledge about what they are doing internally - just a small pointer, maybe it’s interesting for you)

It’s OK - contrary to ‘common wisdom’, which you will find all over the place in books and articles, factorisation is not necessarily the bottleneck. In this particular case it is far from the bottleneck. For 150,000 x 200, constructing the matrix takes 2 x 150,000 x 200 x 200 / 2 = 6,000,000,000 operations; factorising it takes (I think) about 200 x 200 x 200 / 6 ≈ 1,300,000 operations.

You often see this misconception.

Maybe one of the topics is ‘how do you optimise code’. The process is:

While you still have time and patience and enthusiasm {
  • find the bottleneck through experimentation and measurement
  • optimise the bottleneck
}

Note that profilers can be VERY misleading; some Java compilers seem to inline code. I always use explicit code to gather timings - it is the only reliable way (unless anybody else knows better). My IDE is NetBeans. Similarly, you have to be careful about what libraries like JCuda are doing - don't assume that when a method returns, all the work has been done.
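(As a sketch of what I mean by explicit timing - simplified, and assuming the JCuda runtime API is available for the synchronization calls:)

import static jcuda.runtime.JCuda.cudaDeviceSynchronize;

// Crude but explicit timing helper: synchronize before starting and before
// stopping the clock, so that asynchronous GPU work is fully accounted for.
// Usage (hypothetical): double ms = GpuTimer.timeMs(() -> { /* CUBLAS calls */ });
public class GpuTimer
{
    public static double timeMs(Runnable gpuWork)
    {
        cudaDeviceSynchronize();      // make sure nothing is still pending
        long before = System.nanoTime();
        gpuWork.run();                // e.g. a sequence of CUBLAS calls
        cudaDeviceSynchronize();      // wait until the device has finished
        long after = System.nanoTime();
        return (after - before) * 1e-6;
    }
}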

Of course (as you all know from books like Programming Pearls, and articles by Knuth and others), think long and hard about the data structures you are going to use.

In my 40 years of software development, I can hardly recall any occasion when my initial guess at the bottleneck has been correct. Sometimes in Java it is an obscure synchronisation which is the problem, deep inside some library method. I had this once with the Jide charting library (which is excellent), and they were kind enough to supply me with a revised version which removed the problem.

PS: the JCuda libraries implement some CUBLAS extensions which have some factorisation and back-substitution methods.

See cuBLAS :: CUDA Toolkit Documentation