The solution that you mentioned (basically creating two DLLs) could be a reasonable intermediate step. I guess it could boil down to something as simple as loading a jhip-nvcc.dll vs. jhip-hcc.dll, and it wouldn’t even be visible through the Java layer. However, this does not solve the issue of not being able to load kernels at all.
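To make the two-DLL idea a bit more concrete, here is a minimal sketch of how the Java side could pick one of the two libraries. Everything except the names jhip-nvcc and jhip-hcc is an assumption on my part: the class name `JHipLoader`, the `Backend` enum, and the `jhip.backend` system property are hypothetical, and a real implementation would probably detect the platform rather than read a property.

```java
import java.util.Locale;

// Hypothetical loader: nothing here is an existing JHip API.
class JHipLoader {

    // The two backends mentioned in the post.
    enum Backend { NVCC, HCC }

    // Maps a backend to the base name of its native library,
    // e.g. HCC -> "jhip-hcc" (resolved to jhip-hcc.dll on Windows).
    static String libraryNameFor(Backend backend) {
        return "jhip-" + backend.name().toLowerCase(Locale.ROOT);
    }

    // Assumption: the backend is selected via the system property
    // "jhip.backend", defaulting to "nvcc" when it is not set.
    static Backend detectBackend() {
        String requested = System.getProperty("jhip.backend", "nvcc");
        return Backend.valueOf(requested.toUpperCase(Locale.ROOT));
    }

    // Loads the selected native library. Callers in the Java layer
    // never see which of the two libraries was actually chosen.
    static void load() {
        System.loadLibrary(libraryNameFor(detectBackend()));
    }

    public static void main(String[] args) {
        System.out.println("Would load: " + libraryNameFor(detectBackend()));
    }
}
```

The point of the sketch is only that the choice can be confined to one static method; the rest of the Java API would be identical for both backends.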
I tried running JavaCPP on HCC. Unfortunately JavaCPP couldn’t parse the header. I didn’t pursue it further, because HCC seems overly complicated for what I need (and it probably won’t work anyway because it advertises itself as an extension of the C++ language). The next API to try would be ROCr/HSA. It’s the lowest-level, pure C, and allows loading kernels from a file.
But before I jumped further down the rabbit hole, I ran a couple of quick and dirty benchmarks. The performance of launching small kernels (in a queue) seems to be good, under 9 microseconds per kernel. However, the performance of hcBLAS is awful (much worse than clBLAS), especially for small matrices. Until AMD invests in a good, hand-written, assembly sgemm, it’s not going to be useful for me, unfortunately.
*** Edit ***
I don’t understand why WeakReference or Cleaner is better than finalize. What makes it cleaner and more deterministic? I’m surprised that a class like Cleaner is being put into the core libraries. It seems quite specialized, useful when someone wants to attach a finalizer (which we’ve come to agree are generally a bad idea) to someone else’s object. Perhaps there is some subtle reason why the reference mechanism is better than finalization, but I don’t think the video made that clear. Maybe the facility is primarily geared for caches, which have a need to do “third party finalization” of unused cached objects. OTOH, apparently DirectByteBuffer was changed to use Cleaner. I hope it wasn’t simply because Java doesn’t have suppressFinalize (https://msdn.microsoft.com/en-us/library/system.gc.suppressfinalize(v=vs.110).aspx)
I had seen your accidental comment about the JavaCPP parser. A while ago, I talked with Samuel Audet (the guy behind JavaCPP) about this as well, around this post: https://forum.byte-welt.net/byte-welt-projekte-projects/jcuda/16661-java-wrappers-cuda-javacpp.html#post119470 , but similarly, didn’t pursue this further (shifting priorities here and there…). Also, I did not yet find the time to dive deeper into the relationships of ROCr/HSA/HCC/HIP - which may be a reason why I am not sure what you mean by “trying” in this case, and have to ask the possibly stupid question: Are you now aiming at something like “JROCr”?
Regarding hcBLAS: Is BLAS something that you are particularly interested in? There’s some competition going on: In addition to CUBLAS, there are clBLAS, CLBlast, Magma, ViennaCL… The latter already is, roughly speaking, “an abstraction layer on a similar level as HIP”, but aiming solely at BLAS: It offers BLAS routines, and allows plugging in different backends.
(And BTW, for clBLAS and CLBlast, I created the corresponding JOCL libraries)
When you talk about “quick and dirty benchmarks”, I have to intervene (even though you are likely aware of that) : It’s difficult. Apart from the general, high-level questions about the setup and the devices and for which device the kernels may be optimized, there are many different, additional caveats. It may be clear that some BLASes may have a higher launch overhead for the individual calls but achieve a higher peak performance for larger matrices and so on. But since you mentioned SGEMM: Note that the BLAS implementors are usually aware of that, and often offer a dedicated “batched SGEMM” routine that works better for many small matrices.
Yes, the video is rather a discussion and does not explain some of the surrounding ideas. Weak references allow a cleanup that is slightly more controllable, in the sense that the references are put into a queue, and can be cleaned up manually. However, using this for automatic cleanups (and the concept of the Cleaner) is still criticized for exactly this reason: It still has some non-determinism.
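To illustrate what “put into a queue and cleaned up manually” could look like, here is a minimal sketch using only `java.lang.ref`. The class `NativeHandleTracker`, the `address` field, and `freeNative` are hypothetical placeholders (there is no real native call here). It uses a `PhantomReference` instead of a `WeakReference`, but the queue mechanism is the same.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker for native handles owned by Java objects.
class NativeHandleTracker {

    // Remembers the (made-up) native address after the owner is gone.
    static final class HandleRef extends PhantomReference<Object> {
        final long address;

        HandleRef(Object owner, long address, ReferenceQueue<Object> queue) {
            super(owner, queue);
            this.address = address;
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();

    // The reference objects themselves must stay strongly reachable,
    // otherwise they are collected and never show up in the queue.
    private final Set<HandleRef> pending = ConcurrentHashMap.newKeySet();

    int freedCount = 0;

    HandleRef track(Object owner, long address) {
        HandleRef ref = new HandleRef(owner, address, queue);
        pending.add(ref);
        return ref;
    }

    // Called manually, at a point the application chooses. WHEN a
    // reference arrives in the queue is still decided by the GC.
    int drain() {
        int processed = 0;
        Reference<?> r;
        while ((r = queue.poll()) != null) {
            HandleRef ref = (HandleRef) r;
            if (pending.remove(ref)) {
                freeNative(ref.address);
                processed++;
            }
        }
        return processed;
    }

    // Placeholder for a real native free call.
    void freeNative(long address) {
        freedCount++;
    }
}
```

The application controls the thread and the moment in which `drain()` runs, which is the “slightly more controllable” part. It does not control when the GC enqueues a reference, which is the remaining non-determinism.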
(BTW: The DirectByteBuffer always used the “Cleaner” internally. The key point is that they are now making the Cleaner public)
Both options are difficult (now talking about a “Pointer”, as an example) : Letting the user explicitly free the memory may cause arbitrarily nasty bugs. When you “free” a Pointer in one thread, and another thread still uses it, anything can happen. Leaving the cleanup to the GC (or just to a dedicated thread working off the reference queue) gives the responsibility to the VM, which may cause nondeterminism, and in the worst case, it could have the same negative effects that an invalid manual free-call has. Thus, I’d rather give the user the control over the allocation/freeing, unless there is a compelling reason not to do so.
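One way to combine both options is a pointer class with an explicit `close()` that also registers itself with `java.lang.ref.Cleaner` (Java 9+) as a fallback. This is only a sketch: `ManagedPointer`, the `address` value, and the counter standing in for the native free call are all hypothetical, and it is not how JCuda or JOCL handle pointers.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical pointer wrapper; the native free is only simulated.
class ManagedPointer implements AutoCloseable {

    private static final Cleaner CLEANER = Cleaner.create();

    // Counts simulated native free calls, for demonstration only.
    static final AtomicInteger FREE_CALLS = new AtomicInteger();

    // The cleanup action must not reference the ManagedPointer itself,
    // otherwise the pointer could never become unreachable.
    private static final class State implements Runnable {
        private final long address;

        State(long address) {
            this.address = address;
        }

        @Override
        public void run() {
            // Placeholder for a real native free(address) call.
            FREE_CALLS.incrementAndGet();
        }
    }

    private final long address;
    private final Cleaner.Cleanable cleanable;
    private volatile boolean closed;

    ManagedPointer(long address) {
        this.address = address;
        this.cleanable = CLEANER.register(this, new State(address));
    }

    long address() {
        if (closed) {
            throw new IllegalStateException("Pointer already freed");
        }
        return address;
    }

    // Explicit, deterministic free. Cleanable.clean() runs the action
    // at most once, so a second close() or a later GC does nothing.
    @Override
    public void close() {
        closed = true;
        cleanable.clean();
    }
}
```

With try-with-resources the user keeps control over when the memory is freed, and the Cleaner only acts if `close()` was forgotten. The `closed` flag turns some use-after-free cases into an exception, but it does not solve the race where one thread frees the pointer while another is in the middle of using the address.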
I’ll test how this compares to launching kernels directly with HSA (which I have working). HSA is extremely verbose and confusing, but it does offer some interesting options like a configurable memory model.
From a glance at the sample, the API looks similar to that of CUDA. (It does not seem to support runtime compilation, as it was introduced in CUDA recently with NVRTC (Runtime Compilation) - but one could arbitrarily continue the list of what CUDA can do and HIP can’t …). Being able to load modules at runtime is certainly a first important step. I’ll have to look at what a “.code” file is, though. Is it real binary, or something similar to PTX (some HSAIL or other intermediate language)?
You mentioned that you have “launching kernels working” - does this already refer to the Java layer (with or without JavaCPP)? I’d be curious to see the progress. I haven’t pursued it actively recently, because many other tasks (including the update to CUDA 8) currently have a higher priority. I’m not sure how much I could actively contribute, and (also because of some uncertainties regarding my future jobs) cannot make a strong commitment to something like JHip right now. But in a broader sense, the general idea of JHip is still prominently on my TODO list…
It’s been two years since the last discussion, but now ROCm hit an important milestone with OpenCL 2.0 support. I haven’t tried it yet (I need to wait for a working Arch Linux package, since this might mess up my existing X installation). Related to this discussion, it seems that AMD has picked up some steam with BLAS and other libraries related to their HIP stuff, so it might be that their platform got to the point that HIP Java bindings might be feasible. Have you followed any of these developments, and what do you think about all this?
I have to admit that I still haven’t caught up on the several dozen buzzwords AMD introduced. With Nvidia, it’s easy: CUDA, cuBLAS, cuDNN, etc. With AMD, there are ROCm, ROC this, ROC that, HIP, moreHIP, MIOpen, AMDGPU, and whatnot; some of those fully functional, some probably just a pile of experimental code…
Time is passing so quickly. There’s still this yellow post-it note at my desk, just saying “JHIP!”.
But I must admit that I currently simply cannot commit myself to another (“open-ended”) project like “JHip”. You already mentioned that AMD was quite busy, and catching up with that is not manageable for me right now.
I’ve got JOCL, JCuda, and a few dozen smaller or larger projects at https://github.com/javagl , but in the end, nobody is paying money for the interesting stuff, so I have to waste a considerable amount of my time with “working” on some ridiculous DevOps- and Frontend crap.
I’d be happy to hear about updates, or if someone seriously wants to tackle the Java+HIP challenge. But I won’t be able to play the leading role here.