I think that Gary made his points clear in his response to your issue, particularly regarding Aparapi and its role and goals, so I won't say too much in this regard.
But concerning the more general, long-term statements that you (and he) alluded to in this discussion, I tend to agree with him. I know that there is some sort of a "controversy" in a larger scope. Slightly exaggerated: On one side, there are the HPC guys, who want to know each and every bit in their computer, and want to push the actual FLOPS from 82.4% to 83.2% of the theoretical peak using some artificial benchmark (or maybe even "serious" large-scale number-crunching applications). On the other side, there are the managers who read some headlines about "... SOA, Cloud Computing, Big Data and ... hey, what's that? 'Parallel Programming'? That sounds cool. Let Joe Developer do this for our company".
The point is that parallel programming has become mainstream, now that every cell phone has a quad-core CPU. But it is (and probably always has been) a fact that the development of "tools" (that is, programming languages) in a broader sense has a much higher inertia than hardware development (although it might turn out that history strikes back, and the ideas that shaped COBOL, Fortran and (especially (things like (the (functional) language))) LISP may have a revival ;)).
In contrast to what you proposed, I think that architectural details of the hardware design HAVE to be hidden to enable this broader application of parallel programming, for several reasons. The most important ones are IMHO 1. the heterogeneity of the devices and 2. the productivity of the programmer.
It is no longer the case that one company buys one CRAY workstation and writes its own proprietary, highly optimized and perfectly tailored in-house software that will serve as the basis for the business model for 10 or 20 years. Software today has to run in parallel and should exploit ALL available processing resources, on any device, no matter whether it's a cluster of 4 NVIDIA GPUs, the 2 cores of a cell phone, or any mixture thereof. This applies to software in general, but most strikingly to a "software" like the Java Virtual Machine!
While you're certainly right that a lack of knowledge about the target architecture may lead to poor performance, I'm sure that the solution for this is NOT to expose the details of the architecture. Instead, the programming models have to be changed in order to allow an abstraction, and basically "force" the programmer to use parallel patterns (and in the best case, give him the feeling that he is not forced to do something in a particular way, but instead has the freedom to do so ;)).
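Just to make this concrete (a minimal sketch using Java 8 streams; the class name ParallelPattern is made up for the example): the programmer states the pattern, and the execution strategy is left entirely open:

import java.util.stream.IntStream;

public class ParallelPattern {
    public static void main(String[] args) {
        int[] input = IntStream.range(0, 10_000).toArray();

        // A "map" pattern: no loop indices, no explicit threads. The code
        // only states what happens per element; the runtime is free to
        // split the work across however many cores the device has.
        int[] squared = IntStream.of(input)
                                 .parallel()
                                 .map(x -> x * x)
                                 .toArray();

        System.out.println(squared[3]); // prints 9
    }
}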
The translation from the high-level, abstract, parallel language constructs should then be left to the compiler. Even a senior programmer who knows his stuff can't keep pace with the hardware developments. I think you cannot expect someone to be a productive application developer and at the same time know the details of the (all?) target architectures. The latter is the job of those who are providing the compilers. THEY should do this, and they should do it RIGHT. (And that's only one of the reasons why I think that JITed languages like Java will soon outperform any hand-optimized C implementation for most applications, except maybe some specific HPC applications.)
As an example: How would you implement a reduction in CUDA? Well, maybe you already learned a lot about memory coalescing, shared memory and all the stuff that is mentioned in the CUDA best practices guide. But even if you are a highly trained CUDA professional: I'd bet that you would never-ever implement it like it is described on slide 35 of the corresponding NVIDIA whitepaper: http://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf . I mean,
seriously?
[spoiler]
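// The last warp is reduced fully unrolled; this relies on warp-synchronous
// execution within a single warp (hence the 'volatile', and no __syncthreads())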
template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
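// Each thread first accumulates several elements in a grid-sized stride,
// then the block reduces the partial sums in shared memory: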
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
extern __shared__ int sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + tid;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
__syncthreads();
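// Tree-like reduction in shared memory, fully unrolled via the blockSize template parameter: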
if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
if (tid < 32) warpReduce<blockSize>(sdata, tid);
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
[/spoiler]
I don't want to write something like this. What's all that 32, 64, 128 stuff? Yeah, it has to do with the warps and coalescing ... and for the next GPU generation, you might find out that this implementation is particularly slow or otherwise insufficient, and cannot be adjusted to the new architecture (at least, certainly not by someone who did NOT write the original implementation), so you have to start from scratch. Code like this should not appear in a "normal" application. In a "normal" application, this code should simply read as
Magic.reduce(inputArray, outputArray);
and choosing the particular implementation for a CUDA GPU with compute capability 2.0, fused MADD, a warp size of 32 and 1024 compute units should be done - by the JIT. At runtime. And on another device, another implementation (c/sh)ould be used.
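Just to sketch what I mean (the "Magic" class and its decision logic are made up for illustration, of course - no such API exists): the call site stays stable, and the runtime picks the implementation:

import java.util.Arrays;

public final class Magic {

    // The stable call site: the application programmer only states the pattern.
    public static void reduce(int[] input, int[] output) {
        // The runtime, not the programmer, picks the implementation:
        if (Runtime.getRuntime().availableProcessors() > 1) {
            output[0] = Arrays.stream(input).parallel().sum(); // multi-core path
        } else {
            int sum = 0;                                       // sequential fallback
            for (int x : input) sum += x;
            output[0] = sum;
        }
        // A real JIT could additionally check for an attached GPU here
        // (compute capability, warp size, number of compute units, ...)
        // and launch a device kernel like reduce6 above instead.
    }

    public static void main(String[] args) {
        int[] output = new int[1];
        reduce(new int[]{ 1, 2, 3, 4 }, output);
        System.out.println(output[0]); // prints 10
    }

    private Magic() {}
}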
You also proposed some "limited subset" of Java. But you can't enforce that. For example, you cannot prevent someone from doing a reduction of 2 arrays like in
for (int i = 0; i < n - 1; i++) {
    outputA[i+1] = outputA[i] + inputA[i];
    outputB[i+1] = outputB[i] + inputB[i];
}
And although this is still a very "simple" dummy example, it's clear that this is harder to analyze and tear apart (and parallelize, automatically OR manually) than if he had just written
reduce(outputA, inputA);
reduce(outputB, inputB);
I think that the functional concepts of Java 8 (Lambdas and Streams) will help to gradually change the style of how people write programs, and the teaching in universities will increasingly encourage people to really think in parallel constructs (and not focus on details that become obsolete with the next generation of GPUs).
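For example (a minimal sketch; nothing here is specific to any device):

import java.util.stream.IntStream;

public class StreamReduction {
    public static void main(String[] args) {
        int[] data = IntStream.rangeClosed(1, 100).toArray();

        // "reduce with +" is stated as a pattern: no indices, no threads,
        // no warp sizes. How it is executed is up to the runtime.
        int sum = IntStream.of(data).parallel().reduce(0, Integer::sum);

        System.out.println(sum); // prints 5050
    }
}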
BTW: I'll probably split this part of the discussion into another thread - it's not directly related to a deadlock. But ATM, I'm suffering from some severe sleep deprivation, so (excuse any typos and) I'll do this tomorrow.