[QUOTE=NigelEssence]It’s OK. Contrary to "common wisdom", which you will find all over the place in books and articles, factorisation is not necessarily the bottleneck. In this particular case it is far from the bottleneck. For 150,000 x 200, constructing the matrix takes 2 x 150,000 x 200 x 200 / 2 = 6,000,000,000 operations; factorising takes (I think) 200 x 200 x 200 / 6 = 1,300,000 operations.
[/quote]
OK, sorry, I see, the factorization only applies to the (small) result matrix, so there’s probably not so much to gain.
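Just to double-check the arithmetic from the quote (the constants are taken directly from the quoted post; "build" is constructing the small normal-equations matrix, "factor" a Cholesky-style factorization of the 200 x 200 result):

```java
public class OpCount {
    public static void main(String[] args) {
        long m = 150_000, n = 200;
        // building the n x n result from an m x n matrix: ~2*m*n^2/2 multiply-adds
        long build = 2 * m * n * n / 2;
        // factorizing the n x n result: ~n^3/6 operations
        long factor = n * n * n / 6;
        System.out.println(build);          // 6000000000
        System.out.println(factor);         // 1333333 (the ~1.3 million from the quote)
        System.out.println(build / factor); // 4500 - construction dominates by far
    }
}
```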
Note that profilers can be VERY misleading; some Java compilers seem to inline code. I always use explicit code to gather timings - it is the only reliable way (unless anybody else knows better). My IDE is NetBeans. Similarly, you have to be careful about what libraries like JCuda are doing - don’t assume that when a method returns, all the work has been done.
The latter is part of the API spec. There are explicit ...Async functions, and - with CUDA 5, I think - they added support for streams to the runtime libraries. So in order to obtain timing information, one always has to use some form of explicit synchronization.
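The effect is easy to reproduce even without a GPU. A minimal plain-Java sketch (the class name and the use of an `ExecutorService` are my own stand-ins for an asynchronous API): `submit()` returns almost immediately, and only `get()` guarantees that the work has actually finished - timing code that does not synchronize measures essentially nothing:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncTiming {

    // some deliberately slow computation: sum of 0..n-1
    static long slowSum(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        long t0 = System.nanoTime();
        // submit() returns right away - the work is NOT done yet,
        // just like an ...Async call in a native library
        Future<Long> result = pool.submit(() -> slowSum(100_000_000));
        long afterSubmit = System.nanoTime() - t0;

        long sum = result.get(); // the explicit synchronization point
        long afterGet = System.nanoTime() - t0;

        System.out.println("sum = " + sum);
        System.out.println("after submit: " + afterSubmit + " ns");
        System.out.println("after get():  " + afterGet + " ns");
        pool.shutdown();
    }
}
```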
But the Java part confuses me. The Java compiler itself does basically no optimization at all. You can look at the bytecode of a class with `javap -c MyClass.class`.
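For example, compiling a trivial (made-up) class and running `javap -c MyClass` shows that javac keeps everything as written - the accessor call stays a plain `invokevirtual`; any inlining happens later, in HotSpot:

```java
public class MyClass {
    private final int value;

    public MyClass(int value) {
        this.value = value;
    }

    // javac compiles this to a plain getfield/ireturn sequence
    public int getValue() {
        return value;
    }

    public static void main(String[] args) {
        // in the bytecode this remains an invokevirtual call -
        // javac does not inline it
        System.out.println(new MyClass(42).getValue());
    }
}
```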
But of course, the HotSpot Just-In-Time compiler does massive optimizations (inlining and unrolling being the "least crazy" ones here). This can lead to quite unexpected results, and care has to be taken that timings in artificial benchmarks are not distorted by some JIT optimizations (see "Why is 'while (i++ < n) {}' significantly slower than 'while (++i < n) {}'?" on Stack Overflow, or more recently, "Java integer ++i not changing the value" on Stack Overflow). Things like this are to some extent covered by benchmarking frameworks like Caliper or JMH. (But even a "simple" microbenchmark can give rough indications of the performance, when some basic rules are kept in mind.)

Are you using any profilers, beyond the manual `currentTimeMillis` timing? For a while, I thought that JVisualVM was the only "real" option here (unless you’d like to spend big $$$s for such a profiler). But recently, Oracle has made Java Mission Control freely available, and it looks really interesting. I have not yet used it extensively, but the profiler seems to produce much more detailed and reliable results than JVisualVM (it might be a wrong impression, but I’ll definitely test it more thoroughly).
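To illustrate two of those basic rules (a sketch only - class and method names are mine, and JMH remains the proper tool): warm the method up so that HotSpot has compiled it before measurement starts, and write the result to a volatile sink so the JIT cannot dead-code-eliminate the loop:

```java
public class MicroBench {

    // volatile sink: consuming the result prevents the JIT from
    // removing the "useless" computation entirely
    static volatile long sink;

    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += (long) i * i;
        return sum;
    }

    public static void main(String[] args) {
        int reps = 10_000;

        // warm-up: give HotSpot a chance to JIT-compile work()
        // before any timing takes place
        for (int i = 0; i < reps; i++) sink = work(1_000);

        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) sink = work(1_000);
        long nanos = System.nanoTime() - t0;

        System.out.println("avg ns per call: " + nanos / reps);
    }
}
```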
However, the confusing part: You mentioned
[quote]I always use explicit code…[/quote]
I hope this does not mean that you are running this with `-Xint`, are you? This, I think, also distorts the results massively. Nobody is using a JVM that runs purely in interpreted mode…
[quote]Of course (as you all know from books like Programming Pearls, and articles by Knuth and others), think long and hard about the data structures you are going to use.[/quote]
Sure. Inlining or unrolling may bring a few % here and there, but choosing the right data structures and algorithms can make the real difference - even beyond obvious O(n^2)-vs-O(n log n) choices. (Nevertheless, the "optimization loop" that you mentioned is somewhat rewarding, and even if a different approach may sometimes be more beneficial, it can be fun to see how to squeeze out the last % of a specific implementation.)
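A tiny made-up example of how far the data-structure choice alone can go, independent of any low-level tuning: answering the same membership query with an ArrayList (a scan of up to n elements) versus a HashSet (an O(1) hash lookup on average) - same answer, very different cost:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Membership {
    public static void main(String[] args) {
        int n = 100_000;
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < n; i++) list.add(i);
        Set<Integer> set = new HashSet<>(list);

        long t0 = System.nanoTime();
        boolean inList = list.contains(n - 1); // worst case: full O(n) scan
        long listNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        boolean inSet = set.contains(n - 1);   // single hash lookup
        long setNs = System.nanoTime() - t0;

        System.out.println("found: " + inList + "/" + inSet
                + ", list: " + listNs + " ns, set: " + setNs + " ns");
    }
}
```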