Java, GPUs and parallel programming in general

alexd457 · 17. Januar 2014 um 09:20

EDIT: This thread contains the part of the discussion from http://forum.byte-welt.net/threads/11248-Deadlock that diverged from the orignal topic. The respective posts have been copied and split, but no text has been modified. The discussion started with general statements about JCuda and its alternatives:

…

I really appreciate you putting together this library. I used to meddle in C++ and CUDA, but haven’t touched them for years, instead working in C# and then Java. When I had a project that required CUDA, I tried going back to C++. Yet, things that I’ve come to think of as basic OOP were missing (esp since VS10 doesn’t support C++11). Lots of other minute rules and crud was everywhere. I felt like I was in bizarro world, and couldn’t write the code I wanted. Then, JCuda came along. It’s turned out to be really great. I can’t imagine why anyone would want to write CUDA in C++. Thank you!

I’ve also tried JavaCL and that seemed great too, but I don’t have a background in OpenCL, was confused by the availability of BLAS libraries, and read that I wouldn’t be able to debug the kernel. I did like, though, how it came with a Maven plugin for compiling kernels and creating Java wrappers. I wrote my own Maven plugin that takes care of compilation, but create wrappers manually. I also checked out Aparappi, and concluded that the authors had the wrong goals. The concept (compiling java to gpu bytecode) I think is good, but they didn’t share the emphasis on performance. They want to hide essential architectural details and make it “easy.” In fact, I think they (ie, AMD) want to pipe pure Java through the GPU, such as standard Java arrays-of-pointers-to-objects. Of course, that is ridiculous from a performance perspective (eg, no memory coallescing on any level).

Marco13 · 17. Januar 2014 um 10:40

I’ve also tried JavaCL and that seemed great too, but I don’t have a background in OpenCL, was confused by the availability of BLAS libraries, and read that I wouldn’t be able to debug the kernel. I did like, though, how it came with a Maven plugin for compiling kernels and creating Java wrappers. I wrote my own Maven plugin that takes care of compilation, but create wrappers manually. I also checked out Aparappi, and concluded that the authors had the wrong goals. The concept (compiling java to gpu bytecode) I think is good, but they didn’t share the emphasis on performance. They want to hide essential architectural details and make it „easy.“ In fact, I think they (ie, AMD) want to pipe pure Java through the GPU, such as standard Java arrays-of-pointers-to-objects. Of course, that is ridiculous from a performance perspective (eg, no memory coallescing on any level).

Sure, there are some alternatives available in this quickly evolving field. I think that in the long term, OpenCL will prevail. CUDA is still more popular for several reasons - apart from the strong and aggressive promotion by NVIDIA, of course - but in a direct comparison to OpenCL, I think that the disadvantage of being bound to GPUs and a certain vendor may let it seem less attractive for new/future projects. However, I also think that CUDA has one really invaluable asset: Namely the runtime libraries, CUBLAS, CUFFT etc. Something comparable to this is still lacking for OpenCL, although, of course, people are working on that.

Concerning Aparapi: I would not say that their goals have been wrong. They tried something really new and innovative, with a high risk of failing - but they did not fail (although I have not yet used Aparapi extensively or for larger projects, and thus can not objectively judge its practical applicability). They are also not focussing on the approach that you mentioned. Of course, this duality is one of the major hurdles for something like „Java on the GPU“: In Java, you usually have an „Array of Structures“, like a List<Person> persons, whereas in GPU programming, you’d generally prefer a „Structure of Arrays“ for performance reasons, like in

class Persons {
    float ages[] = new float[10];
    float heights[] = new float[10];
    float weights[] = new float[10];
}

They are aware of this duality, and present some hints of how to cope with this on Google Code Archive - Long-term storage for Google Code Project Hosting. . However, they recently also worked on what you described (mentioned in Google Code Archive - Long-term storage for Google Code Project Hosting. ), but are also pointing out the difficulties and performance impacts. I think that the goal of being able to handle Arrays of Objects was partially inspired by Rootbeer https://github.com/pcpratts/rootbeer1 , which did something like this for CUDA (in a form that I personally find at least adventurous - not implying any objective judgement).

In any case, the main developer of Aparapi (that is: Gary Frost from AMD) is also involved in the project Sumatra at OpenJDK: Project Sumatra , where (to oversimplify it a little) the concepts from Aparapi and Rootbeer should be included into the Just-In-Time-Compiler of the JVM. In the mailing list, they referred to Aparapi as sort of a ‚tracing bullet‘ for the development. But of course, a JIT compilation with all the infrastructure of the JIT and JVM in the background will offer VERY interesting new possibilities. It will certainly still take a while until this reaches the end-user, but looks very promising, IMHO.

alexd457 · 17. Januar 2014 um 11:20

Project Sumatra (and the Heterogenous System Architecture bytecode which enables it) is what I referred to. I’ve talked with Gary Frost in the Aparapi forums. http://code.google.com/p/aparapi/issues/detail?id=130 He made clear that he’s in favor of the “simplifying” approach and hiding the GPUs’ architecture. He would like to see ordinary, idiomatic Java code running on GPUs, which naturally implies things like Arrays of Objects, even garbage collection, etc. Of course, none of that maps well to GPUs, and never will. The reason is that although GPUs incorporate some fantastic ideas versus classical vector processors (ie, automatic divergence, scatter-gather), they are still vector processors and perform well only when given vector workloads. AMD wants people to forget this fact, and write very crappy code. This is bound to disappoint, not “enable”, developers, who would be much happier to write in a more limited subset of Java that more consistently gave good results.

Marco13 · 17. Januar 2014 um 17:32

I think that Gary made his points clear in the response to your issue, particularly referring to Aparapi and its role and goals, so I won’t say to much in this regard.

But concerning the more general, long-term statements that you (and he) alluded to in this discussion, I tend to agree with him. I know that there is some sort of a „controversy“ in a larger scope. Slightly exaggerated: On one side, there are the HPC guys, who want to know each and every bit in their computer, and want to push the actual FLOPS from 82.4% to 83.2% of the theoretical FLOPS using some artificial benchmark (or maybe even „serious“ large-scale number crunching applications). On the other side, there are the managers who read some headlines about „…SOA, Cloud Computing, Big Data and … hey, what’s that? ‚Parallel Programming‘? That sounds cool. Let Joe Developer do this for our company“.

The point is that parallel programming has become mainstream, now that every cell phone has a quad core CPU. But it is (and probably always has been) a fact that the development of „tools“ (that is, programming languages) in a broader sense has a much higher inertia than the hardware developments (although it might turn out that history strikes back, and the ideas that shaped COBOL, Fortran and (especially (things like (the (functional) language))) LISP may have a revival ;)).

In contrast to what you proposed, I think that architectural details of the hardware design HAVE to be hidden to enable this broader application of parallel programming. For several reasons. The most important ones are IMHO 1. The heterogenity of the devices and 2. the productivity of the programmer.

It is no longer the case that one company buys one CRAY workstation and writes its own, proprietary, highly optimized and perfectly tailored in-house software that will serve as the basis for the business model for 10 or 20 years. Software today has to run in parallel and should exploit ALL available processing resources, on any device, no matter whether it’s a cluster of 4 NVIDIA GPUs, the 2 cores of a cell phone, or any mixture thereof. This applies to software in general, but especially (strikingly) for a ‚software‘ like the Java Virtual Machine!

While you’re certainly right when you suggested that a lack of knowledge about the target architecture may lead to poor performance, I’m sure that the solution for this is NOT to expose the details of the architecture. Instead, the programming models have to be changed, in order to allow an abstraction, and basically „force“ the programmer to use parallel patterns (and in the best case, give him the feeling that he is not forced to do something in a particular way, but instead has the freedom to do so ;)).

The translation from the high-level, abstract, parallel language constructs should then be left to the compiler. Even a senior programmer who knows his stuff can’t keep pace with the hardware developments. I think you can not expect someone to be a productive application developer, and at the same time know the details of the (all?) target architectures. The latter is the job of those who are providing the compilers. THEY should do this, and they should do this RIGHT. (And that’s only one of the reasons why I think that JITed languages like Java will soon outperform any hand-optimized C implementation for most applications, except maybe some specific HPC applications).

As an example: How would you implement a reduction in CUDA? Well, maybe you already leared a lot about memory coalescing, shared memory and all the stuff that is mentioned in the CUDA best practices guide. But even if you are a highly-trained CUDA professional: I’d bet that you would never-ever implement it like it is described on slide 35 the corresponding NVIDIA whitepaper: http://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf . I mean,

seriously?
[spoiler]

template <unsigned int blockSize>
__device__ void warpReduce(volatileint *sdata, unsigned int tid) {
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
    if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
    if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
template <unsigned int blockSize>
__global__ voidreduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    while (i < n){sdata[tid] += g_idata** + g_idata[i+blockSize]; i += gridSize; }
    __syncthreads();
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
    if (tid < 32)warpReduce(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

[/spoiler]

I don’t want to write something like this. What’s all that 32, 64, 128 stuff? Yeah, it has to do with the warps and coalescing … and for the next GPU generation, you might find out that this implementation is particularly slow or otherwise insufficient, and can not be adjusted to the new architecture (at least, certainly not by someone who NOT wrote the original implementation), so you have to start from scratch. Code like this should not appear in a „normal“ application. In a „normal“ application, this code should should read as
Magic.reduce(inputArray, outputArray);
and choosing the particular implementation for a CUDA GPU with compute capability 2.0, fused MADD, a warp size of 32 and 1024 compute units should be done - by the JIT. At runtime. And on another device, another implementation (c/sh)ould be used.

You also proposed some „limited subset“ of Java. But you can’t enforce that. For example, you can not prevent someone from doing a reduction of 2 arrays like in

    outputA[i+1] = outputA**+inputA**;
    outputB[i+1] = outputB**+inputB**;
}

And although this is still a very „simple“ dummy example, it’s clear that this is harder to analyze and tear apart (and parallelize, automatically OR manually) than if he had just written

reduce(outputB, inputB);

I think that the functional concepts of Java 8 (Lambdas and Streams) will help to gradually change the style of how people write programs, and the teaching in universities will increasingly encourage people to really think in parallel constructs (and not focus on details that become obsolete with the next generation of GPUs).

BTW: I’ll probably split this part of the discussion into another thread - it’s not directly related to a deadlock But ATM, I’m suffering from some severe sleep deprivation, so (excuse any typos and) I’ll do this tomorrow.

alexd457 · 18. Januar 2014 um 04:13

While I am a proponent of simplification, elegance, and ease of use, there are “physical” limits to them. You can’t simplify beyond them.

For example, you can’t get rid of memcpy calls, which is something that even OpenCL tries to do. Yes, it’d be nice. Yes, there are devices (ie, integrated GPUs) that don’t need them. But external GPUs do need them, very much so. And, external GPUs will be what people who care about performance will use AND what will makes money for AMD. Trying to get rid of memcopies is counter-productive for everyone. And it’s not like it’s that hard to use them! Although, as an alternative, I propose to use “shadow buffers” that expose both a CPU and GPU memory region (even if they map the same memory), and support commands to synchronize them. On integrated GPUs, these commands become no-ops, while on external GPUs they are memcopies.

By the same token, you can’t hide things like warp topology, shared memory, register memory, the need to minimize divergence and maximize memory coallescing. They’re not “hard” like programming assembly (or C++) is hard. They’re not “hard” like manually managing memory is hard. They simply come with the turf. But most importantly, they are what let GPU processors be fast. They are the engineering tradeoff that let GPUs be fast vs CPUs.

You bring up the example of implementing an efficient reduction. What is the alternative? Can a compiler generate it automatically? If so, I’ll swallow my words! But that seems fantasy, for now. Should a programmer use a library? Sure! But what tools does the library author have? Or should they not use the same tools? The third alternative is to write a bad reduction, and blog about how GPUs are “over-hyped.” Is the third alternative desirable? What is the point?

Btw, what is most fn ironic regarding the JVM in the context of Project Sumatra and Java 8 Streams (which I also hate), is how Oracle has missed the lowest-lying fruit. Give us auto-vectorization! It is the epitome of easy-to-use, gives huge performance gains (even more than multi-core parallelism or perhaps GPUs), and is well-established and universally loved.

Marco13 · 18. Januar 2014 um 10:39

First, as an aside: From you description (also in Google Code Archive - Long-term storage for Google Code Project Hosting. ) it sounded like OpenCL and CUDA were quite different. But in fact, many concepts are rather similar. The focus on GPUs may be more prominent and more visible in CUDA, but in many cases, it is basically possible to make a 1:1 translation between CUDA and OpenCL (even of the kernels). Maybe you referred to the C++ support, which is available for OpenCL only in form of some „convenience wrappers“, but the C APIs are largely equivalent.

I also don’t understand what you mean when you say that OpenCL wants to get rid of memory copies: They refer to the memory as „cl_mem“, or „Buffers“, and in fact their handling has some similarities to the „shadow buffers“ that you proposed. One allocates memory for one context, but does not know on which device of this context the memory resides. I think that this might even impose some difficulties: When you perform a sequence of operations, the implementation may be forced to copy the memory between the devices several times.

By the same token, you can’t hide things like warp topology, shared memory, register memory, the need to minimize divergence and maximize memory coallescing. They’re not „hard“ like programming assembly (or C++) is hard. They’re not „hard“ like manually managing memory is hard. They simply come with the turf. But most importantly, they are what let GPU processors be fast. They are the engineering tradeoff that let GPUs be fast vs CPUs.

At this point, I basically agree with you. Namely that at the moment, a significant portion of the performance benefits of GPUs comes from the programs being tailored for them. This mainly refers to the data locality that you mentioned, and the knowledge about the internals. One could imagine that the performance of CPUs could be increased dramatically for certain applications IF the programmer had the possibility to manually manage the L1/L2 caches. But again, I think that this manual management comes at a (IMHO unacceptably) high cost in terms of flexibility and productivity. Every degree of „freedom“ in programming that is exposed from the hardware to the programmer increases the complexity and reduces the sustainability of code, because the code will be specific for one device.

You bring up the example of implementing an efficient reduction. What is the alternative? Can a compiler generate it automatically? If so, I’ll swallow my words! But that seems fantasy, for now. Should a programmer use a library? Sure! But what tools does the library author have? Or should they not use the same tools? The third alternative is to write a bad reduction, and blog about how GPUs are „over-hyped.“ Is the third alternative desirable? What is the point?

Btw, what is most fn ironic regarding the JVM in the context of Project Sumatra and Java 8 Streams (which I also hate), is how Oracle has missed the lowest-lying fruit. Give us auto-vectorization! It is the epitome of easy-to-use, gives huge performance gains (even more than multi-core parallelism or perhaps GPUs), and is well-established and universally loved.

I’m not a compiler expert, so everything that I say now should be taken with a grain of salt, but as far as I know, something like auto-vectorization is the most simple form of „optimization for parallelization“. However, a lot of research is done on the other fields. Concerning the reduction, for example: It is not clear from „what“ the compiler should be able to generate the reduction code that I posted above. Generating this code automatically from something like

for (int i=0; i<n-1; i++) outputA[i+1] = outputA**+inputA**;

is certainly close to impossible. I have been talking to some other developers, and from what I heard so far, the „automatic use of shared memory“ is the holy grail here. While it is possible to detect the applicability of shared memory for certain cases, there are always cases where this automatic detection fails, and the negative impact on performance in these cases is so large that it eats up all other performance gains. But of course, a more „library-style“ approach like in Library.reduce(out,in) offers these possibilties. And things like streams and lamdas will change the style of programming in a way that will allow a broader range of automatic optimizations to be applied.

In any case, it should be clear that exploiting highly specific characteristics of the hardware will always allow you to squeeze out the last bit of performance, whereas any more „high-level“ programming approach will probably always miss certain opportunities for optimization. I think that (especially for something like the Java Virtual Machine) versatility, flexibility and the possibility to adapt to future hardware is more important than a few percent of additional performance on todays hardware. Especially when squeezing out these few percent forces you to write code in a form that is SO specific that it will become useless even due to the slightest change in the hardware. The experts for JIT compilation and runtime-optimization are not sleeping, aware of the challenges, and working on solutions, and I’m sure that parallel programming HAS to become easier and less hardware-specific, and that it WILL become easier.

I have always been impressed by NESL: http://www.cs.cmu.edu/~scandal/nesl.html It’s hard to believe, but this is already 20 years old! I always thought that the PhD thesis of Guy Blelloch already answered many questions that we are facing today, in view of mainstream GPU programming. And one example on their website supports my points in the most obvious form: They implemented a parallel Quicksort, once in NESL and once with MPI. In NESL, this is 10 lines of easy readable code (looks like pseudocode, actually). In MPI, it is 1700 lines of code that are practically impossible to maintain (see Making Parallel Programming Easy and Portable ). Now the question is: How large should a performance benefit of the MPI version be in order to justify this effort? It’s always a trade-off. But when I can write 10 lines of simple code in 10 minutes and achieve 50% of the theoretical peak performance on any device, I’ll probably always prefer this in contrast to spending 2 weeks with writing 1700 lines of code in order to achieve 60% of the peak performance on one specific device…

But maybe, to some extent, it is just a matter of (subjective) priorities…

alexd457 · 21. Januar 2014 um 04:18

I was indeed comparing the OpenCL API to the CUDA so-called “Runtime API” (which is, of course, a new language, not a “C++ API”). I don’t think it makes sense comparing OpenCL to the Driver API, because noone should be using the Driver API, except to, eg, provide binding for a different language.

To be honest, I haven’t used OpenCL, and I assume there are ways to coordinate memory copies like in CUDA, but the intent of the API designers, at least, was to eliminate memory copies from the API (and inject them automatically at runtime). Of course, as you point out, this causes needless confusion and problems. I propose the opposite. The programmer would insert all the memory copies via the API, but the runtime could turn them into no-ops at runtime.


At this point, I basically agree with you. Namely that at the moment, a significant portion of the performance benefits of GPUs comes from the programs being tailored for them. This mainly refers to the data locality that you mentioned, and the knowledge about the internals. One could imagine that the performance of CPUs could be increased dramatically for certain applications IF the programmer had the possibility to manually manage the L1/L2 caches. But again, I think that this manual management comes at a (IMHO unacceptably) high cost in terms of flexibility and productivity. Every degree of "freedom" in programming that is exposed from the hardware to the programmer increases the complexity and reduces the sustainability of code, because the code will be specific for one device.

No, a CPU does not benefit the same way if the programmer manually manages memory locality (eg, if the caches are accessed like SRAM). There may be incremental increases, but not order-of-magnitude increases.

There is a fundamental difference between CPUs and GPUs. CPUs (ignoring SSE) are scalar devices. GPUs are vector devices. A vector device is more transistor-efficient, because the same amount of “overhead” transistors control many more “work” transistors. But for this to actually increase performance, the work has to be structured correctly, whether explicitly or implicitly.

For example, a memory bus needs a certain number of wires for addressing, commands, and status messages, in addition to the wires used for passing data. The overhead wires may amount to 50-100. If there are 32 data wires, you suffer from a lot of overhead. If there are 512 wires, the overhead is much smaller. But if the 512 wires are controlled by only one set of address/command wires, they can only read from one, contiguous, probably aligned, 512-bit memory location at a time. This is the reason for memory coallescing requirements.

GPUs have already made giant leaps over earlier vector architectures in terms of flexibility. In traditional SIMD, the programmer has to manually manage everything. SIMT was a huge breakthrough, letting the hardware handle divergence. The programmer didn’t have to hand-code the corner cases when not all the lanes of the vector perform the same operation (if the SIMD instructions even allowed predicates, which most did not!). SIMT performs one operation on one subset of vector elements, and another operation on the other, automatically, as determined by necessity at runtime. Yet the vector unit has to do multiple passes, because it cannot perform different operations on different elements at the same time (or it would need many more control transistors). Early GPGPUs (like the 8800) had very strict memory coallescing requirements, performing very inefficiently otherwise. Modern ones dispatch the minimum number of accesses necessary. Yet the memory unit still has to do multiple passes for non-coallesced memory, because it cannot address different memory locations at the same time (or it would need many more control wires). Perhaps further innovations are possible (such as replacing shared memory with automatic caches), but GPUs will always fundamentally be vector machines.

In fact, these innovations came with costs and overheads. Additional transistors were used to manage divergence, multiple memory requests, or caches. These overheads were incremental, somewhat slowing down the chips while greatly easing programming. These overheads are perfectly acceptable.

However, simplifying much further, ignoring the vector nature of these chips, leads to cliff-like falls in performance. We can’t let programmers write code that accesses arrays-of-pointers-to-structs (common in Java), and expect for that to be ok. It’s not “a few percent of lost performance on today’s hardware.” It’s orders of magnitude of lost performance on all future hardware.

I suppose it might be possible for the JIT to re-organize such an array into a struct-of-arrays in certain situations (certainly not all), but that would lead to even more unwritten rules and unpredictable performance cliffs.

What for? Is it actually difficult for a Java programmer to use a struct-of-arrays? Is it that difficult for a good Java programmer to grok warps and blocks? All that matters is that they can program in the familiar Java syntax. Mastering a few larger concepts is not hard. All parallel programming–nay, all programming–requires understanding concepts. It’s not like NESL doesn’t require understanding concepts.

*** Edit ***

Cuda is already a lot like nesl, in the sense that the efficient reduction code is a handful of lines, and not 1700 of MPI or SSE intrinsics.

Marco13 · 21. Januar 2014 um 09:42

First of all, another disclaimer: I’m not a CUDA expert, not a GPU expert, and even less a hardware expert. Of course, one acquires some knowledge when creating something like JCuda, but - as usual - this knowledge mainly consists of the insight: There are MANY things that I still do NOT know

[QUOTE=alexd457]I was indeed comparing the OpenCL API to the CUDA so-called „Runtime API“ (which is, of course, a new language, not a „C++ API“). I don’t think it makes sense comparing OpenCL to the Driver API, because noone should be using the Driver API, except to, eg, provide binding for a different language.
[/quote]

Well, even the runtime- and the driver API are nearly equal. The runtime API allows constructs like the <<>> launching syntax, and the driver API allows manual loading of modules and kernel execution, but apart from this single aspect, they are equal (which I already once sketched here). For JCuda (and probably other language bindings), this is somewhat unfortunate: People will tend to use the „„simpler““ runtime API, and then notice that they have to use the driver API to launch kernels. Fortunately, since CUDA 3.0, both APIs are at least interoperable.

For example, a memory bus needs a certain number of wires for addressing, commands, and status messages, …
The programmer didn’t have to hand-code the corner cases when not all the lanes of the vector perform the same operation (if the SIMD instructions even allowed predicates, which most did not!)…
Early GPGPUs (like the 8800) had very strict memory coallescing requirements, performing very inefficiently otherwise. Modern ones dispatch the minimum number of accesses necessary. Yet the memory unit still has to do multiple passes for non-coallesced memory, because it cannot address different memory locations at the same time (or it would need many more control wires). Perhaps further innovations are possible (such as replacing shared memory with automatic caches), but GPUs will always fundamentally be vector machines.

Although I know that these aspects are important to achieve peak performance, some of these things (like the hardware-specific memory wiring) are what I do not know in detail, and (admittedly), I do not want to have to know them: They are changing quickly. You already mentioned things like the predicated execution, and the relaxed coalescing requirements in post-8800-GPUs. Learning the 8800-requirements and painstakingly obeying them when writing code CAN only have made the code more complicated, and on newer GPUs, these efforts may turn out to be (nearly) useless. There will certainly be more of these „simplifications“ or abstractions on future GPUs. For example, the newest GPUs already have the option to simply declare (some of) the shared memory as a „cache“. This basically means „hey, GPU, I don’t want to take care of shared memory - you may use it and try to make the best out of it“.

So in basic agreement with your statement…

In fact, these innovations came with costs and overheads. Additional transistors were used to manage divergence, multiple memory requests, or caches. These overheads were incremental, somewhat slowing down the chips while greatly easing programming. These overheads are perfectly acceptable.
However, simplifying much further, ignoring the vector nature of these chips, leads to cliff-like falls in performance. We can’t let programmers write code that accesses arrays-of-pointers-to-structs (common in Java), and expect for that to be ok. It’s not „a few percent of lost performance on today’s hardware.“ It’s orders of magnitude of lost performance on all future hardware.

… I think that these innovations will continue, and this progress will manifest itself in form of easing the programming while keeping the overhead and performance penalties small, but of course, this will never cause the basic preconditions for data-parallel computations to vanish. Even when specific elements of the hardware architecture may be hidden (with negligible performance impact), ignoring the nature of the problem (and the form of its solution in terms of data structures and algorithms) can never be completely compensated by smarter compilers or less restraining hardware specs.

I suppose it might be possible for the JIT to re-organize such an array into a struct-of-arrays in certain situations (certainly not all), but that would lead to even more unwritten rules and unpredictable performance cliffs.

What for? Is it actually difficult for a Java programmer to use a struct-of-arrays? Is it that difficult for a good Java programmer to grok warps and blocks? All that matters is that they can program in the familiar Java syntax. Mastering a few larger concepts is not hard. All parallel programming–nay, all programming–requires understanding concepts. It’s not like NESL doesn’t require understanding concepts.
…
Cuda is already a lot like nesl, in the sense that the efficient reduction code is a handful of lines, and not 1700 of MPI or SSE intrinsics.

Well, I think we don’t really disagree on this point in general, but maybe only on how far such a „simplification“ is possible or reasonable. I think that the usual Java application developer should not have to care about hardware details, for many reasons that I already pointed out. Instead, he should be able to focus on solving the problem - but (and that’s the more important thing, IMHO) including the application of certain ‚parallel patterns‘, that can easily be mapped to any parallel hardware (and I think that the JIT compilation has a huge potential for solving the latter).

alexd457 · 21. Januar 2014 um 17:22

Funny thing about an L1 cache is that Fermi (the previous Nvidia architecture) had one, but Kepler (the current architecture) effectively removed it. Furthermore, in Fermi it had worse performance and stricter requirements for things like coallescing than shared memory. https://devtalk.nvidia.com/default/topic/540044/question-on-the-l1-caching-of-the-gk-110/

“Parallel patterns” are not universal. There are multi-core machines (where most language-designer attention has been focused recently, eg with Java 8 Streams), cluster machines (which don’t have shared memory and don’t map to anything that can be written in “plain Java”), and vector machines. Typical Java (ie, reference types living all over the heap) doesn’t map to vector machines either.

Marco13 · 22. Januar 2014 um 06:46

From the discussion in the NVIDIA forum, it seems that the differences between the caches on both architectures are rather subtle, and it seems to be hard (and for me, at the moment, frankly: impossible) to consider how they affect the actual code. I consider this as one confirmation of one of my points, namely that it’s hard to tweak the code for the quickly changing particularities of the hardware, but of course, this is in doubt just what I want it to mean for me

[QUOTE=alexd457]
„Parallel patterns“ are not universal. There are multi-core machines (where most language-designer attention has been focused recently, eg with Java 8 Streams), cluster machines (which don’t have shared memory and don’t map to anything that can be written in „plain Java“), and vector machines. Typical Java (ie, reference types living all over the heap) doesn’t map to vector machines either.[/QUOTE]

Things like multi-node clusters (or in general, non-SMA/UMA architectures) are - to my knowledge - primarily accessed using dedicated libraries like MPI, and have limited support from the programming language „cores“ (except maybe in languages like Erlang, but not in the „mainstream“ languages, at least). And in our previous discussion, I (silently) assumed that the strict separation between „multi-core-machines“ and „vector machines“ will vanish sooner or later. Of course, at the moment, they can be considered as being completely different, from the conceptual point of view, but as the number of CPU codes increases, and the CPU and GPU are brought closer together (APU, with the HSA etc), it will soon be possible to consider 128 CPU cores as „a vector machine“. And from the software side, I think that this will be exploited, and for this exploitation, the language features are a quite important aspect: As soon as the language supports things like Java 8 Streams, and the VM engineers simply can assume (or detect at runtime) that there are 128 CPU cores (or even a „real“ vector processor) they could - at least theoretically - instruct the VM to arrange something like a List<Person> in memory in a form that is suitable for vector computations (that is, as an array of structures). And I think that this can be done transparently for the programmer, thanks to the fact that the VM has full knowledge about the software (that is, whether this list is „large“, and whether there are many stream operations applied to this list), as well as the hardware (namely whether it is a single core mobile phone, or a „vector machine“ in the broader sense). Maybe I’m naive or have to high expectations in the … possibilites (or the abilities of the JVM engineers?), but don’t see why this should not at least theoretically be possible…

alexd457 · 22. Januar 2014 um 07:58

The differences between caches are not subtle. In Fermi it was a real L1 data cache. In Kepler, because the cache is not coherent across multiprocessors, the cache does not work for ordinary global memory accesses. Effectively, there is no L1 data cache in Kepler.

The difference between multi-core processors and vector processors will not vanish. Given the transistor budget to build a 100-core non-vector processor, it is possible to build (for the sake of argument) a 1000-ALU vector processor. The principle is to re-purpose a lot of the transistors that do control and peripheral logic for actual work.

I don’t know if it might be possible for the JIT to copy a List into a structure-of-arrays for fast vector computation. The JIT would need to know who else has references to, and might concurrently modify, the Person objects. Right now, the JIT does similar optimizations (moving an object that should be on the heap onto the stack), but only for variables that aren’t used outside a function. This is called escape analysis. http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html Escape analysis, because of its limitations, works statically (at JIT time). For a public variable, you need to know who is referencing at run-time, or insert locks. Locks might work. But the overhead of the run-time tracking, or the locks, and the copying itself is a problem.

And, again, why not just use primitive arrays in the first place?

I think that I should clarify my position thus: If optimizations can be introduced by the compiler completely transparently, then that’s of course a win. (Ie, auto-vectorization, perhaps even applied to GPUs.) But if a programmer is taking the time to learn and use a framework like Aparapi, and the authors are hiding or obfuscating important details “for the user’s sake”, and the performance drastically suffers or the programmer needs to learn a lot of unwritten rules, then it’s not a win. Noone uses frameworks like that.

*** Edit ***

Btw, you mention Streams often. Have you actually used them? They suck, big time. They execute serially at their own will (they often refuse to execute parallely if the collection is small, even though the work for each item is large), and when they execute in parallel, they don’t divide work efficiently. For non-parallel operations (ie, convenient manipulation of collections), the API is compromised (ie, lots of missing capabilities vs LINQ). The javadocs are confusing.

Marco13 · 23. Januar 2014 um 01:59

[QUOTE=alexd457]The differences between caches are not subtle. In Fermi it was a real L1 data cache. In Kepler, because the cache is not coherent across multiprocessors, the cache does not work for ordinary global memory accesses. Effectively, there is no L1 data cache in Kepler.
[/quote]
I thought I roughly understood it like the cache on Kepler being some sort of „extra storage for the case that he runs out of registers“, but again: These are details that I have not yet wrapped my head around (not least because I’m still using a 8800 - but I hope that I can change this soon…)

The difference between multi-core processors and vector processors will not vanish. Given the transistor budget to build a 100-core non-vector processor, it is possible to build (for the sake of argument) a 1000-ALU vector processor. The principle is to re-purpose a lot of the transistors that do control and peripheral logic for actual work.

I don’t know if it might be possible for the JIT to copy a List into a structure-of-arrays for fast vector computation. The JIT would need to know who else has references to, and might concurrently modify, the Person objects. Right now, the JIT does similar optimizations (moving an object that should be on the heap onto the stack), but only for variables that aren’t used outside a function. This is called escape analysis. Java ™ HotSpot Virtual Machine Performance Enhancements Escape analysis, because of its limitations, works statically (at JIT time). For a public variable, you need to know who is referencing at run-time, or insert locks. Locks might work. But the overhead of the run-time tracking, or the locks, and the copying itself is a problem.

And, again, why not just use primitive arrays in the first place?

Maybe I was also not clear in this point: Of course, there will still be vector processors, for the exact reasons that you mentioned (more bang - that is, transistors - for the buck). But I think that the compute units of vector processors will become more powerful (in terms of capabilities, referring to things that go into the same direction (but beyond) predicated execution etc), and at the same time, the number of cores of a multi-core processor will increase, so that they can „emulate“ a real vector processor. But of course, prediction is difficult, especially about the future, and I’m certainly not involved enough into the plans of chip designers to really make „profound“ assumptions about the future development.

I think that I should clarify my position thus: If optimizations can be introduced by the compiler completely transparently, then that’s of course a win. (Ie, auto-vectorization, perhaps even applied to GPUs.) But if a programmer is taking the time to learn and use a framework like Aparapi, and the authors are hiding or obfuscating important details „for the user’s sake“, and the performance drastically suffers or the programmer needs to learn a lot of unwritten rules, then it’s not a win. Noone uses frameworks like that.

Well, mainstream-GPU computing (or OpenCL, to put it that way) are comparatively new (although… some people might even consider Java as comparatively „new“ ;)). And with Aparapi, Gary Frost did real pioneering work. (I attribute it to him, since according to the change log, it basically seems like a one-man-project). It’s still under heavy development, and given the quick evolution of parallel programming, this is certainly challenging. I’m not sure how clear certain goals have been when the project started, but AFAIK there ARE people using it, and they ARE gaining considerable speedups. Of course, there will always be problems that can be solved more efficiently with OpenCL than with Aparapi. It’s a trade-off: One can choose whether he accepts a „small“ performance penalty, given a „large“ simplication of the code/development. How small/large they are depends on the application case. When it allows to achieve a good speedup (with low learning effort) for many „common“ application cases, that’s fine. Maybe it’s just not the right tool for what you intend to achieve? In any case, I think that it basically paved the way for Sumatra, and even if it will not become widely used de-facto standard library, it is IMHO an important library in many ways.

Btw, you mention Streams often. Have you actually used them? They suck, big time. They execute serially at their own will (they often refuse to execute parallely if the collection is small, even though the work for each item is large), and when they execute in parallel, they don’t divide work efficiently. For non-parallel operations (ie, convenient manipulation of collections), the API is compromised (ie, lots of missing capabilities vs LINQ). The javadocs are confusing.

Admittedly: No. I downloaded JDK8, but did not yet find the time to play with it. I’ve seen some technical conference presentations and read some articles, and at least based on that, I think that there’s a considerable potential, but I’m not yet familiar with any best practices or so. For example, when you say that „they often refuse to execute parallely if the collection is small“, then this could in the first place just mean that they are simply not intended to distribute the workload imposed by a 4-element-list among 4 processors, but rather to distribute 4000 elements on 4, 40, 400 or 4000 cores (vector-like…) but until now, that’s just a first guess