HIP (AMD's CUDA clone) support

(cross-posted from GitHub: HIP support · Issue #5 · jcuda/jcuda · GitHub)

AMD has recently pivoted in its GPGPU strategy. They have a new open-source software stack, ROCm, for their Fiji (R9 Fury) GPUs and future products, and have seemingly abandoned OpenCL.

https://radeonopencompute.github.io/

ROCm exposes a number of alternative APIs. They have the low-level ROCR (C host API and assembly kernel language) (sample), the high-level HC (C++ host API and C++ kernel language), and the CUDA-emulating HIP API (C host API and C++ kernel language).

Here is a comparison table of syntax between the various APIs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/docs/markdown/hip_terms.md

The HIP API is particularly relevant for this project. It’s basically a platform-independent CUDA Driver API that is compatible with both AMD’s stack and NVIDIA’s stack. This is potentially a very exciting development.

HIP page: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP
HIP blog: http://gpuopen.com/tag/hip/

According to AMD:

  • HIP is very thin and has little or no performance impact over coding directly in CUDA or hcc „HC“ mode.
  • HIP allows coding in a single-source C++ programming language including features such as templates, C++11 lambdas, classes, namespaces, and more.
  • HIP allows developers to use the „best“ development environment and tools on each target platform.
  • The „hipify“ tool automatically converts source from CUDA to HIP.
  • Developers can specialize for the platform (CUDA or hcc) to tune for performance or handle tricky cases.

It would be very cool if there was perhaps a JHIP library for Java that would allow me to write HIP code. To be honest, I have not tried HIP yet and can’t comment on how well it works in practice. However, I wanted to put it on your radar.

Note, I am NOT affiliated with AMD. I just like competition :slight_smile:

Although I heard rumours about „something like a CUDA implementation by AMD“, I didn’t have this on the radar until now (I’ve been a bit busy recently - usually, I also try to respond faster to opened issues).

So thanks for bringing this up here!

There already are surprisingly many resources in these repositories. It’ll take some time to sort this out.

From the very first glance at the API, it looks like they just replaced cuda with hip :wink: Although I also like competition (and open standards, in contrast to the very protective policy of NVIDIA), and had always kept an eye on things like the HSA, I wonder why AMD invests so much effort into „something new“, when there already are things like OpenCL and Vulkan. (Of course, one could guess that they want a piece of the CUDA cake, but maybe there’s more behind it).

So right now, I cannot say how likely it is that I’ll create a „JHIP“ library. I’m not sure how much effort I have to invest for this, and how much I can invest.

There are many details that have to be considered. E.g. you mentioned

HIP allows coding in a single-source C++ programming language including features such as templates, C++11 lambdas, classes, namespaces, and more.

This is, to some extent, already possible with CUDA, but the obvious problem is that there will always be a break when going into the Java world.

However, it definitely looks interesting, and I’ll try to have a closer look, and try to better sort my thoughts about this.

(Answer by alexd457, from the GitHub issue description:)

I will repost this on the forum so that others can participate in the discussion.

Since I’d like to start playing with HIP and want to stay in Java land, I’m going to initiate the porting effort and see how it goes. I hope you’ll be available in case I run into problems. One person from the HIP side has offered to help, so I think this project should go as smooth as butter.

One quick question: did you write all of the code by hand, or did you automate it?

I think AMD gave up on OpenCL because OpenCL is stupid (so to speak) and the abstractions that it envisioned have worked to hinder the programmer rather than help him. (For example, they thought they could avoid explicit mem copies :eyeroll: ) Their new effort is closer to the metal. (In fact, they’re abandoning chunks of HSA too, and for similar reasons, stating that they’d rather optimize their LLVM IR-to-binary compiler than the LLVM-to-HSAIL-to-binary path.) I think the driving force behind these decisions was that the vision for OpenCL (and HSA, I guess) was to be a standard that crossed whole classes of hardware, when in fact the only accelerator that anyone wants to write code for is the familiar dedicated GPU. Keep It Simple, Stupid.

(I did not yet have a closer look at HIP - as already mentioned above, it will take a few days, and I’m still figuring out how to schedule this with my other TODOs)

One quick question: did you write all of the code by hand, or did you automate it?

The answer is: Yes :wink: A brief history of JCuda: Back in 2008, I started with JNI-bindings for CUBLAS. The functions of CUBLAS are very simple and regular, and it was fairly trivial to write a „code generator“ for this - in fact, it initially was a horrible string-split-and-merge-mess :rolleyes: Later, I added JNI bindings for CUFFT. This only contained very few functions, and I wrote them manually. Then I wanted to generalize all this and offer a broader basis - namely „JCuda“. For several reasons(*) I did not consider any sort of auto-generation here. Long story short: The „core“ of JCuda, namely the runtime- and driver API bindings, are written and maintained by hand. For the runtime libraries (CUBLAS, CUFFT, CURAND, CUSPARSE, CUSOLVER), I’m using a code generator. (**)

I think AMD gave up on OpenCL because OpenCL is stupid (so to speak) and the abstractions that it envisioned have worked to hinder the programmer rather than help him. (For example, they thought they could avoid explicit mem copies :eyeroll: ) Their new effort is closer to the metal. (In fact, they’re abandoning chunks of HSA too, and for similar reasons, stating that they’d rather optimize their LLVM IR-to-binary compiler than the LLVM-to-HSAIL-to-binary path.) I think the driving force behind these decisions was that the vision for OpenCL (and HSA, I guess) was to be a standard that crossed whole classes of hardware, when in fact the only accelerator that anyone wants to write code for is the familiar dedicated GPU. Keep It Simple, Stupid.

I’m not so sure about this. From my understanding, the goal of HSA was not only to be a standard that connects the different classes of hardware, but also to blur the borders between these classes. One of the most important steps here would be to make sure that they can access the same memory, with the ultimate goal of making the explicit memory copies superfluous - which can at least be a justification for making them implicit in OpenCL. (Some opt-in would have been nice here, nevertheless).

I’m not sure how much this is based on (political/economic) strategies instead of technical foresight and ambitions. Of course, NVIDIA does not have an interest in OpenCL, because they want to push CUDA and sell their GPUs - preferably in their $$$ Quadro $$$ cards. AMD wants to sell their CPUs and GPUs, and they know that they are the only company that can sell both - so coupling them would be a nice move for them.

I personally think that blurring the borders between CPU and GPU could bring advantages - mainly, by using the same memory for both. And I’m a fan of open standards - also because they enable competition, without this sneaky part of copying the API and replacing „cuda“ with „hip“. But again, I’ll have to take a closer look at HIP before I can even begin to judge what it may all be about in this case.


(*) Why I did not consider a code generator for CUDA itself: The API of CUDA itself is rather complex, and I did not know it back then. I only knew that there are some subtleties which could hardly be taken into account by a code generator. For example, heavy use of void* pointers where not only the type is not known (int*?, float*?), but also the „kind“ of memory - namely, whether it’s host- or device memory.
Nowadays, I would nevertheless try to use a code generator. Admittedly, the process of manually updating the code is a bit tedious. And JavaCPP ( https://github.com/bytedeco/javacpp ) seems to have managed to create bindings for CUDA as well ( https://github.com/bytedeco/javacpp-presets/tree/master/cuda ) - although I have not really used this, it seems to work, basically…


(**) The code generator that I’m using is not publicly available. It also does not generate code that can directly be compiled, there are still some manual steps involved. A while ago, I tried to clean it up (to make it publishable) and reduce the number of manual steps (to make it more usable), but it’s still not as streamlined as JavaCPP or others.

Ok, I’ll proceed manually for now. HIP doesn’t have an equivalent of CUBLAS yet, so we’ll cross that bridge when we get there. (AMD has released hcBLAS, but it uses the C++ HC interface, not HIP, although I’ve read that it can work with HIP.)

I haven’t tried javacpp either, but I really like what you’ve done with JCUDA. Looks like instead of throwing exceptions, javacpp just returns error codes. How awful.

To get this right: You are just starting to create „JHIP“?

[QUOTE=alexd457;137218]
I haven’t tried javacpp either, but I really like what you’ve done with JCUDA. Looks like instead of throwing exceptions, javacpp just returns error codes. How awful.[/QUOTE]

That may be a side-effect of the auto-generation: It will translate what it finds, regardless of whether it makes sense (or is desired) or not.
But

  1. It might be possible to configure JavaCPP so that it does something like the exception checks in JCuda (I’m not sure how much configurability these JavaCPP „presets“ really offer in this regard)
  2. If I had to re-design JCuda now again „from scratch“, I would at least consider making a few things different. And among them would be the exception handling. Instead of having code like
int doSomething() {
    int result = doSomethingNative();
    if (result != SUCCESS && exceptionsEnabled) {
        throw new Exception(...);
    }
    return result;
}

one could at least consider having something like

class CheckingCudaImplementation implements CudaInterface {
    private CudaInterface delegate;

    @Override
    public int doSomething() {
        int result = delegate.doSomething();
        if (result != SUCCESS) {
            throw new Exception(...);
        }
        return result;
    }
}

But this goes hand in hand with a design decision that has consequences that go much further, namely: Should the methods all be static or not? (Short: It makes things far simpler, for the implementor and the user - but may limit the flexibility in some cases…)

Yes, I am just starting. The repo will be at https://github.com/almson/jhip. The plan is to basically copy the files from JCUDA, comment everything out, and uncomment things piece-by-piece while changing “cuda” to “hip”. This table shows which CUDA functions are currently supported: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/docs/markdown/CUDA_Runtime_API_functions_supported_by_HIP.md Seems like a short list. There is also a porting guide: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/docs/markdown/hip_porting_guide.md And a “hipify” tool that tries to automatically do the port, whose source code I’ll consult.

I will also expand the scope of a Maven plugin I’ve developed to automate the compilation of kernels. It will have a parameter which will select one or both of hcc and nvcc for kernel (cross-)compilation.

I’m happy to start a discussion of architecture, and hear your thoughts on what you think you should’ve done differently. This is definitely the best time for that. That said, I’m pretty happy with how JCUDA works now. Throwing exceptions instead of returning error codes is the right thing to do in Java code. CUDA failures are exactly what exceptions were designed for: rare events that indicate failure and are difficult to recover from. Some might argue that CudaException should be a checked exception, because a programmer cannot always prevent it (e.g., the device can run out of memory or the driver might crash). OTOH, every method would have to declare it, and it would be way too much needless nagging.

Regarding the use of static: this issue hinges crucially on the concept of a CUDA context. Generally, the context is a global per-thread state. In this case, using static methods makes perfect sense. They’re not idiomatic Java, but only because global state is not idiomatic Java. The alternative might be to try to make the context non-global. If we associate the context with an object, then all the methods would become instance methods. This may be a (very) good idea. It does leave some questions, though. Would it affect performance to call cuda/hipSetDevice frequently? Would methods check if their parameters belong to the correct context (or just let the call fail)? Is it too big of a divergence from CUDA’s documented behavior? (Note that since HIP doesn’t support Driver API manual context management, there should be far fewer problems.)
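
As a rough sketch (hypothetical names, nothing that exists yet), the instance-based variant could look roughly like this - the device is bound to the object, each call makes it current first, and errors become exceptions:

class HipDevice {
    /** Placeholder for the actual low-level bindings (hipSetDevice, hipMalloc, ...). */
    interface NativeApi {
        int setDevice(int deviceId);
        int malloc(long[] ptrOut, long byteSize);
    }

    private final NativeApi api;
    private final int deviceId;

    HipDevice(NativeApi api, int deviceId) {
        this.api = api;
        this.deviceId = deviceId;
    }

    /** Allocates device memory on this device and returns the raw address. */
    public long malloc(long byteSize) {
        check(api.setDevice(deviceId));   // make "our" device current before every call
        long[] ptr = new long[1];
        check(api.malloc(ptr, byteSize));
        return ptr[0];
    }

    private static void check(int status) {
        if (status != 0) {
            throw new RuntimeException("HIP error code " + status);
        }
    }
}

Whether the extra setDevice call per method is acceptable is exactly the performance question raised above.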

And I just started looking over the documentation, cloned HIP and browsed a little. Here are some thoughts (sorry, these are not very „structured“ yet)

First, a side note: You mentioned some Maven-Plugin for Kernel compilation - does this already aim at Java/JCuda? Is it publicly available?
(Remotely related: Bringing JCuda into Maven Central is one large issue/milestone, but there are still some issues, and I did not yet seriously tackle this)

Regarding HIP: I’d like to give it a try, but from what I have read so far, all this seems to aim at Linux - no path for Windows-Dummies like me.

Apart from that: The path that you described (starting with a commented-out JCuda, and assembling it step by step) might be safe in the sense that you’ll always have a „working“ state. But it’s certainly not elegant, and could cause some legacy stuff to sneak into jhip. I also ran my code generator over the HIP headers. It could save some time, because it already generates much of the „boilerplate“ code. But again: It still requires some tweaking. Also because there still is no plain C-API for HIP (even the basic header uses templates!), and because of TODOs like this one: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/include/nvcc_detail/hip_runtime_api.h#L31 (so to say: There is no „real HIP API header“ at all right now!).
However, if you think that this is a reasonable option as well, we could discuss some details (and exchange the generated code or project state in general) via mail (jcuda @ jcuda.org).

In general, one point to consider here is that HIP only „ports“ the CUDA Runtime API - and not the Driver API. This may reduce the overhead quite a lot. (The redundancy between Runtime and Driver API is ridiculous - but it’s a legacy issue, because up to CUDA 3.0, these had been completely separated APIs)

(BTW: jhip.org is already taken. I could offer jgpu.org as a host name, which I once registered, anticipating that there will be some generalization sooner or later :wink: I expected this to be a generalization of JCuda and JOCL, but maybe something like „org.jgpu.jhip“ (or „org.jgpu.hip“) would be fine, too)

There are always some points of which one thinks „Oh, this should have been solved differently“. This does not necessarily mean that I have a clear idea of HOW it should actually have been solved - it’s only the natural reaction to noticing that something is not working nicely :wink:

I have not yet systematically collected these points, but can try to mention some of them, off the top of my head:

  • The library loading and initialization: Right now, the only way to find out whether CUDA is available at all is something like

class JCudaAvailabilityTest
{
    public static boolean isCudaAvailable()
    {
        try
        {
            JCudaDriver.cuInit(0);
            return true;
        }
        catch (Throwable t)
        {
            t.printStackTrace();
            return false;
        }
    }
}

with severe constraints regarding the usage (e.g. one may NOT use any JCuda function before doing this test, because it would already bail out then). All this is still related to my initially naive way of just loading the native library (which, back then, also included the „CUDA Emulation Mode“, which has been removed in CUDA 3.0, I think). Technically, it would be easy to change this:

class JCuda {
    static boolean initialized = false;

    // Throws CudaException if CUDA is not available
    public static void init() {
        initialized = tryToLoadLibrary();
    }

    public static void cudaMalloc(...) {
        if (!initialized) throw new CudaException("Initialize first!");
        cudaMallocNative(...);
    }
...
}

But this cannot trivially be changed now, because it would break backward compatibility (it would be possible to make the „init()“ call optional, but there are still some caveats).

  • The Pointer class: It’s very basic, and my naive attempt of emulating a void* pointer 8 years ago. It does its job, and it’s „reasonable“, but one should at least consider revising or extending it. Other libraries (JNA, LWJGL, JavaCPP…) have more powerful pointer classes. This refers to multiple things, like typed pointers (like a Pointer<Integer>), but also to more powerful concepts like structs or function pointers. (Particularly the option of handling structs is an issue. LWJGL has gone very far here - far beyond what I can „comprehend“ in all details…)

  • The handling of functions on the native side: This goes into details of the native library loading and DLL management. For JOCL, I had to create an additional layer that calls all functions via function pointers, because the underlying implementations are only called via the ICD (Installable Client Driver) library, which dispatches from the main OpenCL.dll to the actual implementations. I’m not sure how relevant this will be for HIP, but in general, it allows libraries to be a bit more version-agnostic, because it’s then not necessary to link against a specific runtime library

  • The question of static vs. not-static:

Regarding the use of static: this issue hinges crucially on the concept of a CUDA context. Generally, the context is a global per-thread state. In this case, using static methods makes perfect sense. They’re not idiomatic Java, but only because global state is not idiomatic Java. The alternative might be to try to make the context non-global. If we associate the context with an object, then all the methods would become instance methods. This may be a (very) good idea. It does leave some questions, though. Would it affect performance to call cuda/hipSetDevice frequently? Would methods check if their parameters belong to the correct context (or just let the call fail)? Is it too big of a divergence from CUDA’s documented behavior? (Note that since HIP doesn’t support Driver API manual context management, there should be far fewer problems.)

There certainly are pros/cons for both options. The famous example is OpenGL, with JOGL using a „GL object“ vs. LWJGL offering static methods. For me, the point of not being „idiomatic Java“ is not so relevant here: JCuda is a 1:1 mapping of the CUDA API, which is not Java-like anyhow. (Originally, I intended JCuda as a low-level layer, and wanted to „wrap“ an object-oriented layer around it, but this imposes other challenges - IBM tried „CUDA4J“, and others tried OO layers for OpenCL; it’s not trivial.) And back when I started JCuda, I did not have a real understanding of the relationship between host threads and CUDA contexts (and some of these concepts have changed throughout the CUDA versions). So even if there was a „JCuda Object“ (or „JHip Object“), I’d still only consider it as an „entry point“ to the functions that otherwise behave exactly like the static ones. The more crucial questions for me are whether certain flexibilities and options are lost with the all-static approach. Some rough points:

  • The library loading again :wink: : One could consider something like
    JCuda jcuda = JCudaFactory.create(); // Throws on error
    to have a dedicated „entry point“
  • Wrapping, to optionally add Exception checks or logging or whatever:
    JCuda jcuda = new JCudaWithExceptions(new JCudaWithLogging(JCudaFactory.create()));
  • Mocking for Unit-Tests
  • Different backends. This, in fact, could be relevant for HIP:
// Which HIP do you want to use today?
//HIP hip = HIPs.createHcc();
HIP hip = HIPs.createNvcc();

(but maybe this is not possible/relevant at all - I’m still digging through the docs)

Thank you for laying out these points. It’s a lot to parse, and I won’t be ready to discuss them until I dig deeper. Today is my first day actually working with HIP. You’re right that HIP/ROCm is frustratingly Linux-only, and I haven’t seen any suggestions that it will be ported to Windows. I’ll ask. (For a long time, my development was also on Windows, and I appreciate that CUDA and JCUDA are flexible. In fact, NVIDIA tools seem to work much better there. NSight Eclipse Edition is a nightmare, and something as simple as adjusting fans or overclocking is a pain on Linux, especially on a headless node unless you buy into Tesla.)

Do you think that advanced features of the other Pointer classes are necessary? I’ve tried passing structs to CUDA kernels, and I did it by simply constructing a ByteBuffer with the correct contents. Worked well, and was simple to write and understand. I haven’t passed function pointers.
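
For reference, a minimal sketch of that approach (the struct layout here is just an invented example): build a direct buffer with native byte order whose contents match the struct the kernel expects, and the resulting Pointer can then be used as one of the kernel parameters (e.g. as one entry of the Pointer.to(...) that is passed to cuLaunchKernel):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import jcuda.Pointer;

public class StructParamSketch {
    // Corresponds to a hypothetical kernel-side struct { int count; float scale; }
    static Pointer makeStructParam(int count, float scale) {
        ByteBuffer struct = ByteBuffer.allocateDirect(8)
            .order(ByteOrder.nativeOrder());
        struct.putInt(0, count);     // offset 0: int count
        struct.putFloat(4, scale);   // offset 4: float scale
        return Pointer.to(struct);   // points to the struct contents
    }
}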

So are you against automatic context management? Is there any other global state in CUDA that you can think of that might be managed in a non-static object?

My Maven plugin is now on GitHub: https://github.com/almson/cuda-compiler-plugin

There is a famous quote by Linus Torvalds about NVIDIA (and the results of googling “Linus Torvalds NVIDIA” bring up something that is occasionally referred to as “…language”…). Who knows what’s happening behind the scenes. However, although things like overclocking are common, I’d still consider them as “special” in the sense that they are not essential for the software development process. Debugging and Profiling are more important, IMHO. I have not extensively worked with NSight yet, because unfortunately, the possibilities for debugging CUDA through JCuda are rather limited. (The Visual Profiler once worked via JCuda, but they broke this back in CUDA 5.0, and I didn’t dare to take another look at it since).

Regarding the pointers:

Again: I can’t just write down “THE BEST” solution here. I just noticed that there are some limitations, e.g. regarding structs, function pointers, pointers to arbitrary native memory, and pointers to pointers, and one might think about options for making this “better”. The LWJGL Pointer class obviously plays a more central role, because (or although (?)) it only wraps a plain address as a “long”. Many “advanced pointer features” are not strictly necessary, but could be beneficial. Typed pointers could bring a tad more type safety (for JNpp, I added a TypedPointer class, but this is only an extension of the Pointer class, and not really more powerful in that sense).
Pointers to Structs are a bit fiddly.
You mentioned the option of passing structs as ByteBuffers. This is reasonable … and currently the only option. It may have some caveats, regarding padding and alignment. And considering the options of

int count = structsByteBuffer.getInt(someOddOffset+4);
float weight = structsByteBuffer.getFloat(someEvenOdderOffset+12);

// vs

int count = structs[3].getCount();
float weight = structs[3].getWeight();

the latter would be more convenient and less error-prone. For JOCL, I created some experimental struct handling (see the Utilities section on jocl.org). But this still requires manual mapping and wrapping. Offering a “more direct” representation of raw data could have performance benefits (fewer memory copies), offer new options (considering memory that is shared between host and device), and simplify things for the user. (Of course, it would make things harder for the developer of the library, but that’s a usual trade-off).

Regarding the context management:

I’m not strictly against either of the two solutions. An “all static” approach is reasonable, and directly corresponds to the native way of doing it. I’m not sure whether an instance-based approach would have advantages that outweigh the increased complexity. (A polemic side note: I consider it a bit inconvenient that one has to pass the “GL” instance around in JOGL, and LWJGL is simpler in this regard). But again, these are just points to consider. I still have to dig further in the HIP docs. (Although I have some free time currently, it’s always filled with other stuff)

Thanks for the pointer to the compiler plugin, I’ll try to allocate some time to test it.

(BTW: It’s not urgent, just a side note for the case that you update e.g. your README the next time: When you refer to JCuda from jcuda.org, then it should be referred to as “JCuda” or “JCuda from jcuda.org”. Sorry for nitpicking, but there is a paper/publication that refers to some mysterious library called “JCUDA”, and another one called “jCUDA”. In fact, people are actually citing the JCUDA paper when they are actually using JCuda. Maybe some of the confusion can be avoided).

What do you mean that debugging of JCuda is limited? Visual Studio was a joy for debugging kernel code (circa CUDA 6.5/7.0), and the Visual Profiler doesn’t care if it’s looking at a JVM process either (used it recently on CUDA 7.5 on Linux and previously on Windows). Just do the regular steps to enable debugging, open the .cu file in VS, connect the Nsight debugger to the JVM, and set breakpoints in the IDE or use asm volatile ("brkpt;");. It worked beautifully (under Windows–under Linux it also kind of works, but extremely unreliably, although I don’t think it has anything to do with Java).

What is the use of typed pointers if kernel invocation is “dynamically typed”? Where would the type be checked?

Regarding structs: are you suggesting it may be possible to map the raw memory of Java objects into something directly accessible from C? That’s crazy talk.

This issue of structs is really an issue of serialization. There is nothing wrong with doing manual serialization, especially when the goal is interoperability. It’s straightforward. The chances that refactoring or other code changes will break things are smaller. There is more control, which makes solving interoperability headaches easier. It is practical to determine the memory layout on the native side (there are documentation and online resources on doing this), and to conform to such a contract on the Java side using a ByteBuffer. Other approaches to non-automagical serialization use annotations to good effect (like JAXB), but I’m not convinced the cost-benefit of this is worthwhile in this case. What we can do is formalize an interface similar to Serializable with a method writeStruct(ByteBuffer) (and maybe readStruct(ByteBuffer)) and expose an additional overload of Pointer.to. This way, one could write Pointer.to(myObject) or even Pointer.to(myObjectArray) (although we all know that struct-of-arrays is better than array-of-structs in CUDA).
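
A minimal sketch of what such an interface could look like (hypothetical names, just to make the idea concrete):

import java.nio.ByteBuffer;

interface NativeStruct {
    /** Writes this object's fields into the buffer, using the layout the kernel expects. */
    void writeStruct(ByteBuffer target);

    /** Reads the fields back, e.g. after a kernel has modified them. */
    void readStruct(ByteBuffer source);

    /** Size of the native struct in bytes, including any padding. */
    int structSizeInBytes();
}

The Pointer.to(myObject) overload would then allocate (or reuse) a direct buffer of structSizeInBytes() bytes, call writeStruct, and point to that buffer.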

What about this consideration: Should CUdeviceptr implement close and finalize?

I’ve fixed the stylization of JCuda and added a link.

Regarding the debugging: Admittedly, I never managed to really test this. I tried it once, but then found out that establishing this connection for NSight debugging only worked with this „Professional“ version of Visual Studio. I only have the community edition here, and think it doesn’t work with this one. (Did this change in 2015? I think I tried it with 2012).

The profiling: Yes, that’s what it should be like. But since CUDA 5.0, trying to run a JCuda application in the Visual Profiler caused a message
======== Warning: No CUDA application was profiled, exiting
to be printed. I checked this with CUDA 5.0 upwards, several times, and even posted about it on the NVIDIA Developer Forums (“Profiler error message when profiling JCuda application”, in the CUDA Programming and Performance category), but never received a response.

You said that you tried it „previously“ on Windows. Could it be that this was a PRE-5.0-version?

Sure, for the kernel invocations, everything boils down to some multi-level pointers that „have to be“ void. Similarly, things like cudaMalloc receive a void pointer. The advantages could mainly be in the runtime libraries, and possible libraries that are built on top of the low-level ones. But even there, it would mainly serve as some sort of self-documentation. (And it would still be necessary to offer the option of „casting“ pointers anyhow). Additionally, I have read a bit in other JNI libraries (JNA, LWJGL, Bridj…), and they tend to have more powerful Pointer/Memory abstractions - just something to consider.

Both directions are possible:

  1. On the Java side, you can allocate memory. This can be a (direct) ByteBuffer, or a plain Java array. You can pass this memory to a native method. There, you can directly access the Java memory. When it’s a direct buffer, it’s trivial, with GetDirectBufferAddress. When it’s an array, then you have to use GetPrimitiveArrayCritical. In both cases you have a raw void pointer, which you can brutally C-cast to your MyStruct* pointer.

(This is already done in JCuda. When you call cudaMemcpy(devicePointer, Pointer.to(someJavaArray)...);, then it will copy the data directly from the Java array to the device. When you know JOGL or LWJGL, you may know that they (like almost all other JNI-based libraries) heavily rely on direct buffers. However, when I started with JCuda, I tried to support Java arrays as far as possible - because that’s where data is usually stored in Java. Having to add another layer of copying, and forcing people to insert another JavaArray → DirectByteBuffer → (JNI) → DeviceMemory copying step did not seem desirable).

  2. You can allocate memory in C, e.g. with malloc, and return this C-memory to Java, as a (direct) ByteBuffer. This is also already done in JCuda, in the Pointer class, mainly to access page-locked host memory that was allocated with cudaMallocHost.
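
As a small sketch, both directions with the existing JCuda runtime API (error checking omitted; Pointer#getByteBuffer is used here as the accessor for the page-locked memory):

import java.nio.ByteBuffer;
import jcuda.Pointer;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaMemcpyKind;

public class MemoryDirectionsSketch {
    public static void main(String[] args) {
        // 1. Java-allocated memory: copy a plain Java array directly to the device
        float[] hostData = new float[1024];
        Pointer devicePointer = new Pointer();
        JCuda.cudaMalloc(devicePointer, hostData.length * 4L);
        JCuda.cudaMemcpy(devicePointer, Pointer.to(hostData),
            hostData.length * 4L, cudaMemcpyKind.cudaMemcpyHostToDevice);

        // 2. Natively allocated, page-locked memory, accessed from Java as a ByteBuffer
        Pointer hostPointer = new Pointer();
        JCuda.cudaMallocHost(hostPointer, 1024);
        ByteBuffer pinned = hostPointer.getByteBuffer(0, 1024);
        pinned.putFloat(0, 42.0f); // writes directly into the page-locked memory

        JCuda.cudaFreeHost(hostPointer);
        JCuda.cudaFree(devicePointer);
    }
}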

Leading to the question about structs:

This issue of structs is really an issue of serialization. There is nothing wrong with doing manual serialization, especially when the goal is interoperability. It’s straightforward. The chances that refactoring or other code changes will break things are smaller. There is more control, which makes solving interoperability headaches easier. It is practical to determine the memory layout on the native side (there are documentation and online resources on doing this), and to conform to such a contract on the Java side using a ByteBuffer. Other approaches to non-automagical serialization use annotations to good effect (like JAXB), but I’m not convinced the cost-benefit of this is worthwhile in this case. What we can do is formalize an interface similar to Serializable with a method writeStruct(ByteBuffer) (and maybe readStruct(ByteBuffer)) and expose an additional overload of Pointer.to. This way, one could write Pointer.to(myObject) or even Pointer.to(myObjectArray) (although we all know that struct-of-arrays is better than array-of-structs in CUDA).

These points are valid (and I already thought about this quite a bit). You might have seen the line
Buffers.writeToBuffer(particlesBuffer, particles);
in http://www.jocl.org/utilities/JOCLStructSample.java, which does exactly the serialization that you described: It takes the array of Java „struct objects“, and serializes them into the pre-allocated buffer (although it automatically detects the required memory layout). The advantage of a manual serialization (in terms of flexibility) is clear, but as you said: It’s a cost-benefit trade-off. One should also consider convenience, performance and flexibility.

Convenience here on the one hand refers to the users, who are generally Java programmers who would often prefer having the option to just write a straightforward Java class. This argument may be weakened by the fact that they’ll have to fiddle with some low-level details in the kernels anyhow. But still, the mapping between a native struct and „the same struct as a Java class“ could often be done automatically, given the appropriate infrastructure. On the other hand, convenience refers to us, the developers. Maybe you already had a look at Vulkan. They are using structs a lot. I mean, a lot as in „basically every (!) function call receives a struct“. You can see all the „Vk…“ structs in the LWJGL javadocs. And the mapping between Java classes and structs has to be done somehow. Manually writing all this would be a hassle. (I started „JVulkan“, but did not yet seriously tackle it. There are additional (though related) caveats).

Performance refers to the plain performance overhead that is imposed by serialization. When you have a struct/Java-class like the „Particle“ that is sketched in the jocl.org utilities, and create an array of, say, 100000 instances of these, then the process of

  • serializing the instances into a ByteBuffer
  • copying the ByteBuffer from the host to the device
  • performing a kernel call
  • copying the data from the device back to the ByteBuffer
  • de-serializing the data back into the structures
    will be prohibitively expensive, and eat up all performance gains (considering that in many CUDA applications, the host<->device copies alone already are a major bottleneck!)

Flexibility here is related to performance, and aims at ways for avoiding the memory copies. Imagine you had a Java interface like

interface Particle {
    float getPositionX();
    ..
    float getVelocityX();
    ...
}

and you could (automatically?!) create an instance of a class implementing this interface. And this implementation could simply pass the get/set-methods through to a ByteBuffer (maybe even with some sun.misc.Unsafe trickery). Then the serialization overhead would be gone, and the user could interact with simple, Java-idiomatic Particle instances.
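
As a minimal sketch (with an invented layout of two tightly packed floats per particle), such an implementation could be as simple as letting the accessors read from and write to the backing buffer:

import java.nio.ByteBuffer;

class BufferBackedParticle {
    private static final int STRIDE = 2 * Float.BYTES; // positionX, velocityX
    private final ByteBuffer buffer;
    private final int index;

    BufferBackedParticle(ByteBuffer buffer, int index) {
        this.buffer = buffer;
        this.index = index;
    }

    public float getPositionX() { return buffer.getFloat(index * STRIDE); }
    public void setPositionX(float x) { buffer.putFloat(index * STRIDE, x); }

    public float getVelocityX() { return buffer.getFloat(index * STRIDE + Float.BYTES); }
    public void setVelocityX(float v) { buffer.putFloat(index * STRIDE + Float.BYTES, v); }
}
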
Now, the crucial part:
This backing ByteBuffer could also be memory that is shared between the host and the device! The shared virtual memory in OpenCL ( http://developer.amd.com/community/blog/2014/10/24/opencl-2-shared-virtual-memory/ ) and the unified memory in CUDA ( https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/ ) are going in this direction. It would be possible to allocate data on the device, and directly use the device memory in Java (through a ByteBuffer that is hidden in the instances of the „struct class“).

Then, all memory copies would be gone.

Of course, this is idealistic. It has several caveats and it raises many questions. But when designing a new API from scratch, I think one should at least keep this in mind. (Things like this could hardly have been foreseen when CUDA 1.0 came out…)

What about this consideration: Should CUdeviceptr implement close and finalize?

I’m not sure what „close“ should do in this case. Should this free the memory? (You’re likely not aiming at implementing AutoCloseable, I guess).
Regarding finalize: You never know when „finalize“ is called. You might do something like

void foo() {
    Pointer pointer = allocateBytes(1000000000);
    // The pointer goes out of scope here, and COULD be finalized
}

but the JVM would only see the Pointer, which takes maybe 20 bytes. It would not know that this pointer is blocking the whole GPU memory. And the JVM might thus decide to not call finalize at all for this instance! So using „finalize“ for resource management rarely works, and certainly not for the case of GPU memory.

I don’t know if NSight debugging used to need the Pro version, but now it claims to support VS 2015 Community Edition. Visual Profiling was working on Windows as of 6.5. I don’t remember the exact procedure, but there might have been some annoying business with setting environment variables correctly. Currently on Linux there’s an option to “profile all applications” which is the easiest way to avoid frustration. It profiles any CUDA apps launched after the profiler is started (even JUnit or JMH tests).

*** Edit ***

I am very much opposed to this business of “avoiding memory copies” and making this “easy for Java programmers.” I’ll point to the history here. Project Sumatra (I hope you’re aware of it) -> abandoned failure. OpenCL -> ignored near-failure. The reason both APIs sucked is because they embraced this sort of thinking. In terms of performance, this is never a win. First of all, arrays-of-structs are not optimal to begin with. (This is true both in CUDA and in Java! Anyone who writes performance Java code and doesn’t store the important data in primitive arrays is doing it wrong.) Second, memcopying back and forth to the GPU is not optimal. Third, unified memory is not optimal (because it breaks memcpies into lots of little transactions). So you’re talking about performance of a kernel that’s slow no matter what. In this context, the serialization overhead will be small (because writing to a bytebuffer is fast). And what is the point of avoiding memory copies? Are they the bane of novice programmers? Do they make code difficult to think about? No. They’re simple, innocent things that are easy to write and make the code clear. They also replace explicit synchronization/memory barriers.

What users may want to do is to pass a small number of high-level objects to their kernels. For example, I have a Matrix class. In Java it has a pointer to device memory as well as fields for width, height, and stride. I want to pass it to a kernel so that inside the kernel I don’t have 4 separate parameters. In Java I have:


    public ByteBuffer toNativeStruct() {

        // We have to use a subclassing trick to get to the raw address
        long devicePointer = new Pointer(myCUdeviceptr) {
            public long getAddress() {
                return super.getNativePointer() + super.getByteOffset();
            }
        }.getAddress();

        // 'buffer' is assumed to be a pre-allocated direct ByteBuffer with native byte order
        buffer.clear();
        buffer.putLong(devicePointer);
        buffer.putInt(width);
        buffer.putInt(height);
        buffer.putInt(pitch);
        buffer.flip();

        return buffer;
    }

and in CUDA:


struct Matrix
{
public:
    float * __restrict__ const ptr;
    int const width;
    int const height;
    int const pitch;
    
      __device__ __forceinline__ float& 
    get (int y, int x) { return ptr[(long long)y * pitch + x]; }
};

__global__ void elementwise_multiply (Matrix C, Matrix A, Matrix B) { ... }

About close: Yes, it would free memory. It would be an implementation of AutoCloseable (what else?). finalize would be needed alongside it. No one uses finalize to replace close. Finalize is a safety net in case the programmer forgets to call close by mistake. Somebody needs to implement these two things. For example, I implement them in my class Matrix. The question is should everyone implement them on their own, or should JHip just do it and save people work and headache. This goes back to the question of whether JHip should be a simple 100% copy of the C API, or whether we should diverge. (In the diverged API, no one would call cuMemFree, which might seem odd and invalidate a lot of examples. That’s the downside.)

More interestingly, CUmodule also needs close and finalize.
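
To make the idea concrete, a minimal sketch of such a wrapper (hypothetical class, not part of JCuda or JHip): close() frees the memory, and finalize() only acts as the safety net for the case that close() was forgotten:

import jcuda.driver.CUdeviceptr;
import jcuda.driver.JCudaDriver;

class DeviceBuffer implements AutoCloseable {
    private final CUdeviceptr ptr = new CUdeviceptr();
    private boolean freed = false;

    DeviceBuffer(long byteSize) {
        JCudaDriver.cuMemAlloc(ptr, byteSize);
    }

    CUdeviceptr pointer() {
        return ptr;
    }

    @Override
    public void close() {
        if (!freed) {
            JCudaDriver.cuMemFree(ptr);
            freed = true;
        }
    }

    @Override
    protected void finalize() {
        if (!freed) {
            System.err.println("DeviceBuffer was not closed - check your code");
            close();
        }
    }
}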

OK, I’ll try Debugging with VS2015 CE then, and see whether the Visual Profiler works in the meantime. (I remember investigating this a bit, and can’t imagine that I overlooked something as simple as a command line option, but … who knows).

While I can’t disagree with the general statements and observations, I have to question the conclusions drawn from them. I think that the fact that CPU- and GPU memory have been separated is a legacy feature that stems from the „evolution“ of GPUs becoming „General Purpose (G)PUs“. Idealistically (and a bit oversimplified) : If one had to design the whole hardware world from scratch nowadays, one would certainly allow the CPU and the GPU to use the same memory. And although this does not happen directly, I think it is the direction in which things are going: SVM, Unified Memory and things like https://developer.nvidia.com/gpudirect show that this is the goal. Or to phrase it that way: Unified Memory in the current form is not optimal. But if it really was the same memory on hardware level, then there would be no transactions at all. Once more, a bit idealistic: If you used unified memory via the given API, then a change in hardware+driver could eliminate all memory copies - transparently for the application! If you use dedicated, manual serialization, then that’s what has to be done. No change in the hardware+driver could eliminate these steps, because they are explicitly written down in the code.

Similarly, the question of SOA vs. AOS: I also had this in mind when I mentioned the option of transparently letting an array of „Particle“ instances being backed by a ByteBuffer. It could equally be backed by multiple bytebuffers, as in

class Particle {
    int index;
    ByteBuffer xArray;    
    ByteBuffer yArray;
    float getX() { return xArray.getFloat(index * Float.BYTES); }
    float getY() { return yArray.getFloat(index * Float.BYTES); }
}

And this could also be done transparently, without a change in the public API on the Java side.

Regarding Sumatra: I am aware of it, and think they considered some SOA-vs-AOS aspects in the mailing lists, but am not sure about the last state of the discussion and implementation. I think that the option of letting the VM - or more precisely, the JIT - decide at runtime how to lay out the data for best performance on the respective target device at least sounds tempting, although it’s clear that this is a very ambitious goal. Particularly, one should consider that the JVM currently makes assumptions. It expects simple, idiomatic, object-oriented code. The JVM could not optimize such a ByteBuffer-based „Particle“ class in the same manner as a plain POJO.

One could now argue about priorities and application cases. Again, I don’t disagree: When you want peak performance now, you have to consider the options, do some tests, and implement whatever is fastest now - and in many cases, this may be a manual serialization. But I doubt that this will remain the best solution in the near future.

I see that this is a realistic application case. But I think it hardly makes sense to argue about what is the „best“ solution for this. It could lead to an overly specific discussion. Even when not asking whether the (many, small) matrices usually had the same width+height anyhow, there is an important question (that would also be relevant for something like the „Particle“ example), namely:
How much is done with these structures on Java/Host side?
When you have a large chunk of data, and do the classical

  • Copy input to device
  • Launch kernel
  • Copy results to host
    loop, then the requirements will be entirely different to the ones in a case where you want to perform extensive computations on this data on the device and on the host.

[QUOTE=alexd457;137614]
About close: Yes, it would free memory. It would be an implementation of AutoCloseable (what else?). finalize would be needed alongside it. No one uses finalize to replace close. Finalize is a safety net in case the programmer forgets to call close by mistake. Somebody needs to implement these two things. For example, I implement them in my class Matrix. The question is should everyone implement them on their own, or should JHip just do it and save people work and headache. This goes back to the question of whether JHip should be a simple 100% copy of the C API, or whether we should diverge. (In the diverged API, no one would call cuMemFree, which might seem odd and invalidate a lot of examples. That’s the downside.)

More interestingly, CUmodule also needs close and finalize.[/QUOTE]

I have to say that I’m STRONGLY opposed to giving finalize any semantics. If it freed memory in the case that the programmer forgot it, then it would not act as a „safety net“, but instead it would hide programming errors: The program might then work or fail completely randomly, depending on the GC. Unless you intended an implementation like

void finalize() {
    if (!this.wasFreed) {
        System.err.println("Something wrong, review your code!");
        System.exit(-666);
    }
}

I cannot imagine any advantage here.

In contrast to that, implementing AutoCloseable could be reasonable, because it would be non-intrusive: One could still do manual memory management, and it would be guaranteed to work. Alternatively, it could be used optionally to make certain code parts a tad simpler. But admittedly, I think that the simplification that could be achieved here would be negligible:

void withoutAutoCloseable()
{
    Pointer pointer = allocate();
    copyDataTo(pointer);
    launchKernel(pointer);
    copyDataFrom(pointer);
    release(pointer);
}

void withAutoCloseable()
{
    try (Pointer pointer = allocate()) {
        copyDataTo(pointer);
        launchKernel(pointer);
        copyDataFrom(pointer);
    }
}

and only applying to the cases where memory is allocated and freed in the same block. The latter is „rare“ considering that one should usually avoid frequent allocations+frees, and rather do this explicitly as well:

void youShouldNotDoThis() {
    for (int i=0; i<1000; i++) {
        withAutoCloseable();
    }
}

void youShouldDoThisInstead() {
    Pointer pointer = allocate();
    for (int i=0; i<1000; i++) {
        copyDataTo(pointer);
        launchKernel(pointer);
        copyDataFrom(pointer);
    }
    release(pointer);
}

I think that the fact that CPU- and GPU memory have been separated is a legacy feature

That’s apparently a commonly held opinion among software engineers, but it’s incorrect. The reason the memories are separate is because you get more performance doing it that way. There’s two reasons. First is that GPUs use a different kind of RAM (GDDR or HBM instead of DDR) that is connected in a different way (soldered to the board or interposer using 256 or 4096 wires). The GPU trades off latency for bandwidth and expandability for speed. CPU memory has an order of magnitude less bandwidth for a reason. Second is that shared memory requires maintaining cache coherency, which is expensive. This is never going to change. Even multi-CPU systems have memory chips that are dedicated to each CPU, and pretending that the system has a single pool of memory has performance costs.

[hr][/hr]
Your ideas regarding proxy objects backed by bytebuffers are honestly intriguing. As you mentioned, we’d probably lose many JVM optimizations. And it kind of violates KISS. But it would be interesting to try it out.

[hr][/hr]

I have to say that I’m STRONGLY opposed to giving finalize any semantics. If it freed memory in the case that the programmer forgot it, then it would not act as a „safety net“, but instead it would hide programming errors

My implementation of Matrix.finalize actually does log (but doesn’t call exit). I think it’s helpful. I agree with your logic that programming errors shouldn’t be hidden. And yet, the fact is that plain finalizers (which don’t log) are used. Take a look at java.io.FileInputStream or the more recent java.util.concurrent.ThreadPoolExecutor. I do see what you’re saying. Why do you think finalize shouldn’t even log?

*** Edit ***

Anyway, regarding my progress with JHip. My plan now is to try JavaCPP to avoid writing or compiling C, but build on top of it with a user-facing wrapper that checks for errors, swaps out datatypes, and makes the interface more like JCuda.

However, I hit a bit of a roadblock. HIP doesn’t expose cuModuleLoad or cuLaunchKernel. It basically isn’t designed for interop. I contacted the developers. I hope they fix this. Working around this (by doing the compatibility legwork ourselves) would be annoying.

[QUOTE=alexd457]That’s apparently a commonly held opinion among software engineers, but it’s incorrect. The reason the memories are separate is because you get more performance doing it that way. There’s two reasons. First is that GPUs use a different kind of RAM (GDDR or HBM instead of DDR) that is connected in a different way (soldered to the board or interposer using 256 or 4096 wires). The GPU trades off latency for bandwidth and expandability for speed. CPU memory has an order of magnitude less bandwidth for a reason. Second is that shared memory requires maintaining cache coherency, which is expensive. This is never going to change. Even multi-CPU systems have memory chips that are dedicated to each CPU, and pretending that the system has a single pool of memory has performance costs.
[/quote]

OK, admittedly, I’m not sooo familiar with what comes below the software level, although I can imagine that the memory usage patterns for CPU and GPU require different architectures (and the hardware guys are likely thinking about at least as many performance/flexibility-tradeoffs as the software guys ;-)).

A log would be ok-ish, for debugging. But I wouldn’t give it any semantics. Stating something like „The memory will be freed on finalization“ will lead to problems. Websearches for ThreadPoolExecutor finalize shutdown show that this already caused several problems and there are subtle caveats. I’m not a fan of blindly following the advice that „gurus“ give, but I think that one can at least understand the reasoning and consequences that Josh Bloch gives along his unambiguous, clear statement: „Avoid finalizers“, Item 7 in „Effective Java“ (also available as an excerpt on InformIT).

I don’t know enough details of JavaCPP to say anything about this. But I think the validity of this path also depends on this issue:

[QUOTE=alexd457;137626]
However, I hit a bit of a roadblock. HIP doesn’t expose cuModuleLoad or cuLaunchKernel. It basically isn’t designed for interop. I contacted the developers. I hope they fix this. Working around this (by doing the compatibility legwork ourselves) would be annoying.[/QUOTE]

I have seen the issue update, and am curious how they are going to tackle this. It seems like the „break“ that I mentioned above (when going into the Java world) is harder than initially anticipated. I had noticed this quote in „HIP to be Squared: An Introductory HIP Tutorial“ on GPUOpen:

We need to make one manual change to the signature of the HIP code. HIP passes execution configuration (the grid and threadBlock sizes) to the kernel through a standard C++ structure, as opposed to using custom “<<< >>>” language extensions used in CUDA. The hipify tool does not currently modify the kernel signature automatically, so you need to do this step by hand.

but did not notice that they don’t have a real option for programmatically loading a module at all.

The mapping of hipLaunchKernel for the NVCC path seems to be straightforward: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/include/nvcc_detail/hip_runtime.h#L32 . I did not yet fully analyze the HCC path at https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/include/hcc_detail/hip_runtime.h#L543

BTW: Your comment regarding the separated header files in „Java bindings“ (Issue #32 in the HIP repository) seems to be related to the TODO that I already pointed at in #8 of this thread: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/include/nvcc_detail/hip_runtime_api.h#L31 . It seems like pulling out the API into a common header is already planned, but not done yet…

Thanks for helping out in the github discussion. (Writing sometimes drains me even more than coding.) I have to say the response from the AMD team was… disappointing.

*** Edit ***

HIP has two parts to it. The part that is a compatibility layer for host code, and the part that is a compatibility layer for kernel code. On the host side we can implement the compatibility stuff in Java. I’m sure HCC has a way to load binary modules. Kernels would still need hip_runtime.h, though.

*** Edit ***

Thanks for pointing out the irony in ThreadPoolExecutor.finalize. If all threads have to exit before finalize runs, what useful work does finalize even do? I guess it calls the ThreadPoolExecutor.terminated() callback. That’s the most useless (and misleading) finalizer I’ve ever seen.

Yes, I also wasn’t sure how to interpret the answer that was given there. It could either mean “Yes, we’re already actively working on that” or “Heck, what do you want?”.
Let’s see how it turns out.
(It’s a bit strange that there was basically no activity for >1 month. This could be due to vacations, and doesn’t have to be a negative sign. (My repos also seem “inactive” occasionally - although they are a one-man-show, and this can hardly be compared). But the HCC/ROC repo shows some activity, at least)

The HCC equivalent to a cubin is hsaco. Here is an article describing how to load and execute an hsaco file: ROCm With Harmony: Combining OpenCL, HCC, and HSA in a Single Program - GPUOpen