Shared Virtual Memory in System (host) RAM

rbrodt · May 19, 2019 at 15:48

I am currently working on a project which requires memory allocation of a HUGE amount of data (on the order of 1 terabyte) in System memory.

From what I’ve read about SVM (here, for example) OpenCL 2.0 does support the capability of allocating SVM on the host side allowing access by both the kernel and the host, so theoretically it should be possible to use very large chunks of memory (up to the maximum available System memory on the host machine). This is, of course, assuming the GPU address bus is wide enough.

Am I correct in assuming that this is not currently supported in JOCL? I have found the deprecated method CL#allocateAligned(int size, int alignment) which, I’m guessing, is what is needed to be able to allocate System memory from Java. However, this method only accepts a 4-byte signed “int” for the “size” parameter, which limits the maximum amount of memory that can be allocated to 2GB. I understand that this is because of the limitation in the underlying DirectByteBuffer class, which also only accepts an “int” for “capacity”. After poking around in DirectByteBuffer I found that Unsafe#allocateMemory(long size), which is used to do the actual System memory allocation, does indeed support “long” values.

I am proposing to add an overloaded allocateAligned(long size, int alignment) and a corresponding allocateAlignedNative() method to accept an 8-byte signed value, and then providing a new class, e.g. HugeDirectByteBuffer which accepts a “long” for “capacity”. Does this sound like a reasonable thing to do?
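As a rough illustration of the int-versus-long point above, here is a minimal sketch of an off-heap block addressed with long offsets. It assumes that sun.misc.Unsafe can be obtained via reflection (it is JDK-internal, unsupported, and newer JDKs print warnings for it). The class name OffHeapBlock and its methods are hypothetical and not part of JOCL.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical helper (not part of JOCL): an off-heap block with long offsets.
final class OffHeapBlock implements AutoCloseable {
    private static final Unsafe UNSAFE = loadUnsafe();

    private final long address; // raw native address returned by Unsafe
    private final long size;    // size in bytes; may exceed Integer.MAX_VALUE
    private boolean closed;

    OffHeapBlock(long size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
        // Takes a long, unlike ByteBuffer.allocateDirect(int)
        this.address = UNSAFE.allocateMemory(size);
    }

    long size() {
        return size;
    }

    byte get(long offset) {
        checkBounds(offset);
        return UNSAFE.getByte(address + offset);
    }

    void put(long offset, byte value) {
        checkBounds(offset);
        UNSAFE.putByte(address + offset, value);
    }

    private void checkBounds(long offset) {
        if (closed || offset < 0 || offset >= size) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
    }

    @Override
    public void close() {
        // The garbage collector never frees this memory; it must be released explicitly.
        if (!closed) {
            closed = true;
            UNSAFE.freeMemory(address);
        }
    }

    private static Unsafe loadUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }
}
```

Unlike a direct ByteBuffer, such a block has no cleaner attached, so forgetting close() leaks the native memory.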

Marco13 · May 19, 2019 at 21:03

You seem to be doing some interesting stuff there.

The allocateAligned method had been deprecated pretty early, and IIRC it was only introduced due to some not-so-clear statements about alignment requirements in the first versions of the OpenCL spec. The deprecation statement should make clear that it should not be used. Usually, a normal ByteBuffer#allocateDirect should be fine for every case.

(A small disclaimer: I haven’t yet done extensive experiments with SVM. In fact, these subtle alignment requirements might become relevant if you want to handle structures (including things like a cl_float2 or so) with SVM that might need a larger alignment size. If you have any insights or can foresee where this will definitely be necessary, I’d be interested to hear that.)

Regarding the actual goal: If I understood this correctly, then the main goal is to allocate some memory - basically as a cl_mem - that is larger than 2GB. I think that Java will place more than one hurdle into your way there.

Some thoughts, just as discussion points until now:

Creating a HugeDirectByteBuffer class with some copy+paste from some existing *ByteBuffer implementation or so may not be so trivial. The direct byte buffers use some internal mechanisms that are close to the JVM and not immediately publicly available. This mainly refers to the cleanup (i.e. freeing the memory) in relation to the garbage collector. But I’d have to review the details here.
Are you going to access this HugeDirectByteBuffer from Java, and if so, how? I mean, you could change all the get(int) and put(int) method signatures with the corresponding long counterparts. But keep in mind that the ByteBuffer methods are so fast also because they are intrinsics - this couldn’t be the case for a custom class…
A quick websearch yields https://stackoverflow.com/q/24026918/3182664 , just as a pointer

Regarding a possible solution, again, a few thoughts:

It could be possible to divide the 1TB memory into chunks of ~2GB, unless you really need >2GB in a single kernel call (which sounds adventurous).
It could be possible to create something like a HugeDirectByteBuffer, but considering the caveats mentioned above, I’m not sure how reasonable this approach could be
Brainstorming: Maybe one doesn’t really need a HugeDirectByteBuffer class that mimics the real *ByteBuffer implementations. Maybe one could create a class that (similar to the sketch in the stackoverflow answer) simply provides a view on multiple direct ByteBuffer instances, which are in turn stored in a list. (This could alleviate some of the potential GC and performance issues)
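The "view on multiple direct ByteBuffer instances" idea could look roughly like the following sketch. The class name ChunkedByteBuffer, the chunk size parameter, and the get/put methods are all hypothetical. Only ByteBuffer.allocateDirect is a real JDK call here.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class: a long-indexed view over several direct ByteBuffers.
final class ChunkedByteBuffer {
    private final List<ByteBuffer> chunks = new ArrayList<>();
    private final int chunkSize;
    private final long capacity;

    ChunkedByteBuffer(long capacity, int chunkSize) {
        if (capacity < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("invalid capacity or chunk size");
        }
        this.capacity = capacity;
        this.chunkSize = chunkSize;
        long remaining = capacity;
        while (remaining > 0) {
            // The last chunk may be smaller than chunkSize
            int size = (int) Math.min(remaining, chunkSize);
            chunks.add(ByteBuffer.allocateDirect(size));
            remaining -= size;
        }
    }

    long capacity() {
        return capacity;
    }

    int chunkCount() {
        return chunks.size();
    }

    byte get(long index) {
        checkIndex(index);
        // Chunk number from the quotient, position inside the chunk from the remainder
        return chunks.get((int) (index / chunkSize)).get((int) (index % chunkSize));
    }

    void put(long index, byte value) {
        checkIndex(index);
        chunks.get((int) (index / chunkSize)).put((int) (index % chunkSize), value);
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= capacity) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

Each chunk is a separate allocation that the garbage collector cleans up as usual, but the native side then sees several blocks rather than one contiguous range.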

(and that’s something that I have to admit here: )

Yes, if I had to re-design JOCL and JCuda from scratch, I’d put waaay more effort into the Pointer class. Just wrapping a long address does not offer enough flexibility, particularly for things like SVM. The Pointer class from JNA is way more powerful.

rbrodt · May 19, 2019 at 21:23

Hey Marco, I really wasn’t expecting you to reply on a Sunday afternoon but thanks!

Just to further clarify, the ~1TB memory block will NOT be managed by the JVM - it is completely managed by our java application; allocated as System virtual memory at the start of the application and released on exit, so no GC involved here. I’m trying desperately not to give away any IP owned by the company I’m working for here on this public forum, so please bear with me on this…

We really need to be able to address the entire memory allocation as a contiguous block of memory without having to resort to „chunking“ the data into smaller 2GB blocks just to make DirectByteBuffer happy, for performance reasons both in the kernel and in the Java code.

BTW off topic: are you located in Germany by chance?

Marco13 · May 20, 2019 at 15:26

Considering that all my projects are spare time projects, responses on weekends and holidays are somewhat more likely.

I see that you cannot give away toooo much information here, but the technical question about how the memory is allocated and supposed to be used might be relevant here. I’ll sketch these questions here. If you think that an appropriate answer would put you at risk of violating IP, then you can always send a mail to jocl at jocl.org. (You may then still be violating IP, but I promise that I won’t tell anyone. Seriously: At some point, one has to get his stuff done…)

When you say that the memory is „not managed by the JVM, but by your Java application“, I’m not entirely sure what that means.

More generally:

If you want to use OpenCL SVM, you’d have to use clSVMAlloc (at least, I’m not aware of alternatives here, although they might exist). You can do this through JOCL, and will receive a Pointer. This pointer can directly be passed to OpenCL kernels and used on the device. Using pointer.getByteBuffer(offset, size) you can create ByteBuffer instances that expose parts of this data to Java, after the buffer has been mapped (so note that this could also expose ByteBuffer instances for 2GB „slices“ of the 1TB data)
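To make the "2GB slices" remark concrete, here is a small sketch of the offset arithmetic only. It does not call JOCL. The class SliceWindows and its Window type are hypothetical, and the comment merely marks where a call like pointer.getByteBuffer(offset, size) would be made for each window.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a large byte range into windows that fit an int-sized ByteBuffer.
final class SliceWindows {

    // One window of the large allocation: byte offset (long) and size (int)
    static final class Window {
        final long offset;
        final int size;

        Window(long offset, int size) {
            this.offset = offset;
            this.size = size;
        }
    }

    static List<Window> split(long totalBytes, int maxWindowBytes) {
        if (totalBytes < 0 || maxWindowBytes <= 0) {
            throw new IllegalArgumentException("invalid sizes");
        }
        List<Window> windows = new ArrayList<>();
        long offset = 0;
        while (offset < totalBytes) {
            int size = (int) Math.min(totalBytes - offset, maxWindowBytes);
            // A ByteBuffer for this window would be requested here,
            // e.g. via pointer.getByteBuffer(offset, size) on the SVM pointer.
            windows.add(new Window(offset, size));
            offset += size;
        }
        return windows;
    }
}
```

With a total of 1L << 40 bytes and windows of 1 << 30 bytes, this yields 1024 windows, so the Java side would juggle many views even though the device sees a single allocation.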

If this does not cover the application case, then the most crucial questions (for me, right now) are

How are you allocating the 1TB of memory? (Using some custom JNI or so?)
How are you intending to use this memory in OpenCL? (Unless it is OpenCL SVM memory, you’d still have to create the corresponding cl_mem objects, as far as I can tell)
How are you intending to use this memory in Java? (There’s no built-in way for, say, creating a ByteBuffer from a long address, implicitly promising to the VM that this is valid, accessible memory)
Is there some Unsafe trickery involved in all this?

All this boils down to the question of what would have to be done in JOCL in order to support the respective use case. From what you said until now, it sounds like exposing some sort of native malloc through the JOCL API would not be sufficient, regardless of whether it’s exposed as a long address or a LargeByteBuffer…

Yes, I’m in Germany, in the Darmstadt area.

rbrodt · May 20, 2019 at 18:05

Ah ha, I think I see where the confusion is. I’m trying to do something like this:

In short, I need to allocate System memory (using alignedMalloc) so that I can use the host’s entire address space, instead of being limited by the GPU’s CL_DEVICE_MAX_MEM_ALLOC_SIZE (in my case, 7GB). According to the above article, this should be possible. This has nothing to do with clSVMAlloc(), and I don’t need clEnqueueSVMMap() and clEnqueueSVMUnmap() since our device supports fine-grained access.

The only way I can see of implementing this is to use the Java Unsafe class to access this System memory. At this stage, I’m not even concerned about endianness or other host/device differences because the product is a hardware+software solution, so we get to define the hardware and I have the liberty to „code down to the bare metal“ (using Java + JNI) just to make this work.

Anyway, at this point I guess I’ll just have to jump in with both feet and start hacking up my own version of JOCL (sorry!) Thanks for your insight, and I’ll let you know what happens…

Oh, I was born in Klein Auheim (near Hanau) which is just „around the corner“ from you, but I’m now living in beautiful Colorado.

Marco13 · May 20, 2019 at 18:31

I see. Sorry, I wasn’t even really aware of this feature. (There hadn’t been much interest in the SVM functionality in particular until now).

So it really seems to boil down to the allocateAligned and freeAligned methods.

(This could be implemented with aligned_alloc (see cppreference.com), because this is now part of C11 and C++17 - it wasn’t back then when the method was originally implemented.)

But in contrast to the current allocateAligned, the method should/could

receive the size as a long
not return a ByteBuffer, but a Pointer instead - analogously to the clSVMAlloc method.

This pointer could then be used to create ByteBuffer views on the required part of the data, so that it may be accessed from Java.
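The shape of such a long-sized, aligned allocation can be sketched in plain Java. This is only a stand-in: it uses sun.misc.Unsafe (JDK-internal, unsupported) instead of a native aligned_alloc, it returns raw long addresses rather than a JOCL Pointer, and the class AlignedAllocation with its fields and methods is hypothetical. It also assumes the alignment is a power of two.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical stand-in for a native allocateAligned(long, int); not JOCL code.
final class AlignedAllocation {
    private static final Unsafe UNSAFE = loadUnsafe();

    final long rawAddress;     // address returned by the allocator, needed for freeing
    final long alignedAddress; // first address >= rawAddress that is a multiple of the alignment
    final long size;           // usable bytes starting at alignedAddress

    private AlignedAllocation(long rawAddress, long alignedAddress, long size) {
        this.rawAddress = rawAddress;
        this.alignedAddress = alignedAddress;
        this.size = size;
    }

    static AlignedAllocation allocateAligned(long size, int alignment) {
        if (size <= 0 || alignment <= 0 || Integer.bitCount(alignment) != 1) {
            throw new IllegalArgumentException("size must be positive, alignment a power of two");
        }
        // Over-allocate so that an aligned start address always fits inside the block
        long raw = UNSAFE.allocateMemory(size + alignment - 1);
        // Round up to the next multiple of the alignment
        long aligned = (raw + alignment - 1) & ~((long) alignment - 1);
        return new AlignedAllocation(raw, aligned, size);
    }

    void free() {
        // Always free the raw address, never the aligned one
        UNSAFE.freeMemory(rawAddress);
    }

    private static Unsafe loadUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }
}
```

A native implementation would not need the over-allocation trick, and wrapping the resulting address in a Pointer would still have to happen on the JNI side.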

From what we discussed now, I think this could make sense and could help so that people wouldn’t have to drill their own JNI path into the JVM just for a single malloc/free call.

Do you think this could be a viable solution?

In fact, I was born and went to school in Hanau (and lived in Hammersbach, 20km from Klein Auheim). That’s the „small world“ they’re always talking about.

rbrodt · May 20, 2019 at 18:48

Yes, I think this will work! Then all I’d really need to do is add a new allocateAligned that accepts a long, rebuild the native libraries, package the whole thing up and off we go.

Thanks for your help, and I’ll let you know if this works.

Yes, small world indeed. I still have (less adventurous) family there. What do you do for your „day job“ if I may ask?

Marco13 · May 21, 2019 at 11:15

If you could try out whether it works for your particular case, that would be great. I could add this functionality and do some “artificial” tests, as a branch in the repo, but could hardly do this before the next weekend. It should not be sooo much work, though, because the relevant building blocks should be there - mainly that of not returning a ByteBuffer, but a Pointer that was created with createJavaPointerObject.

If you encounter any difficulties with building the natives, let me know. The Build Instructions should cover the most important steps, but sometimes, JNI may be fiddly. Are you on Windows or on Linux?

My current day job is (officially, mainly) “IT Consultant” (Freiberufler, i.e. freelancer) but I don’t have so many contracts/projects there right now: I currently also have a (part-time) job in a research project at the Hochschule Darmstadt (this will only be a few more months, though). I could only try to guess what you’re doing in Colorado, but won’t dare to do this, assuming that you don’t want to reveal any (company) details…

rbrodt · May 21, 2019 at 12:04

Thanks for the support - I will try to get something working by the end of the week and will share my findings with you. Would you mind if I communicated via private messages instead of this public forum? I may have additional questions as I get further along.

Ha! It’s no secret; all you have to do is search for me on LinkedIn if you want to know what I’ve been up to. Mostly I’m retired now but I do love a good challenge, and I just couldn’t resist this one.

Marco13 · May 21, 2019 at 15:08

Sure, you can send private messages here, or mails to jocl at jocl.org.

I’m not so active on LinkedIn, and your profile doesn’t say much about what you’re currently actually doing, but … neither does mine - I’ve sent you one of these requests…