JCudaDriver.cuMemHostRegister() use case question

calavera · 14 September 2012 at 07:44

Hi,
This question is related to my previous one here.
I’m trying to allocate a large memory chunk using a native method written in C++ and
fetch the memory address as a long value via JNI.
Once I have the address in my Java code, I would like to pin it using JCudaDriver.cuMemHostRegister().
Here is the code:

```
package runners;

import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;

import java.nio.ByteOrder;
import java.nio.FloatBuffer;

import jcuda.NativePointerObject;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.JCudaDriver;
import tools.CudaTools;

public class NativeFuncTest {

	/**
	 * @param args
	 */
	static{
		System.loadLibrary("../lib/NativeToJavaTest");
	}
	private native long getAllocation();
	public static void main(String[] args) {
		
		NativeFuncTest nt = new NativeFuncTest();
		nt.test();		

	}
	
	public void test(){
		initGPU();
		long adr = getAllocation();
		NativePointerObject ptrObj = new NativePointerObject(adr) {};
		Pointer ptr = Pointer.to(ptrObj);
		JCudaDriver.cuMemHostRegister(ptr, 10*Sizeof.FLOAT, 0);
		FloatBuffer flBuf = ptr.getByteBuffer(0, 10*Sizeof.FLOAT).order(ByteOrder.nativeOrder())
				.asFloatBuffer();
		float[] expecteds = new float[10];
		for (int i = 0; i < expecteds.length; i++) {
			expecteds[i] = 42;
		}
		flBuf.put(expecteds);
		float[] actuals = new float[10];
		flBuf.get(actuals);
		
		for (int i = 0; i < actuals.length; i++) {
			System.out.println(actuals[i]);
		}
	}
	
	private void initGPU(){
		// Enable exceptions and omit all subsequent error checks
		JCudaDriver.setExceptionsEnabled(true);
		
		// Initialize the device and create device context
		cuInit(0);
		CUdevice device = new CUdevice();
		cuDeviceGet(device, 0);
		CUcontext context = new CUcontext();
		cuCtxCreate(context, 0, device);
	}

}
```

Now, cuMemHostRegister() call throws an exception saying:

Exception in thread "main" java.lang.IllegalArgumentException: Pointer must point to a direct buffer or native memory
	at jcuda.driver.JCudaDriver.cuMemHostRegisterNative(Native Method)
	at jcuda.driver.JCudaDriver.cuMemHostRegister(JCudaDriver.java:3297)
	at runners.NativeFuncTest.test(NativeFuncTest.java:41)
	at runners.NativeFuncTest.main(NativeFuncTest.java:32)


My question is: How should I use JCudaDriver.cuMemHostRegister() with JCuda in order to pin this piece of allocated memory?
Thanks

EDIT: Clarification
EDIT2: This is the native code I'm using:
```
#include "stdafx.h"
#include "runners_NativeFuncTest.h"
#include <jni.h>
#include <stdio.h>

JNIEXPORT jlong JNICALL Java_runners_NativeFuncTest_getAllocation
  (JNIEnv * env, jobject ob){
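	  // Cast to jlong (always 64 bit); a plain "long" is only 32 bit on Windows and would truncate a 64-bit address
	  return ((jlong) VirtualAlloc(NULL, 1024 * 1024 * 4, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE));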
}
```

Marco13 · 14 September 2012 at 09:32

Hello

I’ve only had a short glance at the code so far, but noticed

NativePointerObject ptrObj = new NativePointerObject(adr) {};
Pointer ptr = Pointer.to(ptrObj);

First of all: It was not intended that a NativePointerObject be created like this. And more importantly: With the second line, you are creating a pointer to a pointer. (This pointer-to-pointer does not point to direct/native memory, and in any case it is not what cuMemHostRegister expects).

It might possibly(!) work with
JCudaDriver.cuMemHostRegister(ptrObj ,…
but I have not tested it. Is it correct that this question is solely related to an attempt to resolve the other problem ( http://forum.byte-welt.de/showthread.php?t=4082 )?

In any case, I’ll have a closer look at both questions at the beginning of next week.

bye
Marco

calavera · 14 September 2012 at 09:50

Yes, that is my attempt to solve the problem of limited allocation.
The idea is not mine but comes from the Stack Overflow thread I posted earlier this week.

Thanks again

Marco13 · 14 September 2012 at 13:16

OK, one point mentioned there as a possible reason (a 32-bit JRE) would be “nice” (because it could be solved easily) but can hardly be the case: this would already have caused problems when loading the native libs.

One interesting point might be to see what happens in a purely native (C/C++) program when trying to allocate a larger memory block. (ATM I can’t test this, but maybe next week.)

Marco13 · 17 September 2012 at 02:10

Although this creation of the NativePointerObject is NOT part of the official API : Did you get it running when using this pointer directly?

I just tried a test similar to that of the other thread, using ByteBuffer.allocateDirect and trying to register the buffer with cuMemHostRegister, but it also only worked up to 1 GB.

I think the only possible option is to allocate a larger block, and only register/unregister a specific chunk (of less than 1 GB) for the operation that you want to perform. (EDIT > But I think you don’t necessarily need your own JNI code for that - probably, ByteBuffer.allocateDirect should serve the same purpose < EDIT)

```
import static jcuda.driver.JCudaDriver.*;

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.Arrays;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.JCudaDriver;

public class LargeMemoryAllocTest2
{
    public static void main(String[] args)
    {
        // Enable exceptions and omit all subsequent error checks
        JCudaDriver.setExceptionsEnabled(true);
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        int M = 1024 * 1024;
        for (int numElements = 50 * M; numElements <= 500 * M; numElements += 50 * M)
        {
            try
            {
                System.out.println("Allocating " + numElements + " elements");
                System.out.println("    Native: " + ((numElements * 4L) / M) + " MB");
                System.out.println("    Java  : " + ((numElements * 4L * 2) / M) + " MB");
                float[] expecteds = new float[numElements];
                float[] actuals = new float[numElements];
                Arrays.fill(expecteds, 3.33f);

                // Allocate a direct buffer and page-lock (register) its memory
                ByteBuffer bb = ByteBuffer.allocateDirect(numElements * Sizeof.FLOAT);
                Pointer host = Pointer.to(bb);
                JCudaDriver.cuMemHostRegister(host, numElements * Sizeof.FLOAT, 0);

                // Write the data through the buffer and read it back
                FloatBuffer fb = bb.order(ByteOrder.nativeOrder()).asFloatBuffer();
                fb.position(0);
                fb.put(expecteds, 0, numElements);
                fb.position(0);
                fb.get(actuals, 0, numElements);
                boolean equal = Arrays.equals(expecteds, actuals);
                System.out.println("Equal? " + equal);

                // Release the page-lock again; the buffer itself is freed by the JVM
                JCudaDriver.cuMemHostUnregister(host);
            }
            catch (Exception e)
            {
                e.printStackTrace();
                return;
            }
        }
    }
}
```
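
The idea of registering only a chunk of a larger block could be sketched in plain Java roughly as follows. This is a minimal sketch, not tested with JCuda: the class `ChunkedBuffer` and its method `split` are hypothetical names, and the assumption is that each returned slice would then be passed to `Pointer.to` and `cuMemHostRegister` in place of the whole buffer.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits one large buffer into windows that could be pinned one at a time. */
class ChunkedBuffer {

    /** Returns slices of 'buffer' that share its memory, each at most maxChunkBytes long. */
    static List<ByteBuffer> split(ByteBuffer buffer, int maxChunkBytes) {
        if (maxChunkBytes <= 0) {
            throw new IllegalArgumentException("maxChunkBytes must be positive");
        }
        List<ByteBuffer> chunks = new ArrayList<>();
        int capacity = buffer.capacity();
        // A long offset avoids int overflow for buffers close to 2 GB
        for (long offset = 0; offset < capacity; offset += maxChunkBytes) {
            int length = (int) Math.min(maxChunkBytes, capacity - offset);
            ByteBuffer window = buffer.duplicate(); // same memory, independent position/limit
            window.clear();                         // position = 0, limit = capacity
            window.position((int) offset);
            window.limit((int) offset + length);
            // slice() covers exactly [offset, offset + length); keep the original byte order
            chunks.add(window.slice().order(buffer.order()));
        }
        return chunks;
    }

    public static void main(String[] args) {
        ByteBuffer big = ByteBuffer.allocateDirect(10 * 1024);
        List<ByteBuffer> chunks = split(big, 4096);
        for (int i = 0; i < chunks.size(); i++) {
            // Assumption: here one would register chunks.get(i), use it, and unregister it again
            System.out.println("chunk " + i + ": " + chunks.get(i).capacity() + " bytes");
        }
    }
}
```

With a real workload, `maxChunkBytes` would be chosen below the size at which registration starts to fail on the given system (about 1 GB in the test above).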

calavera · 17 September 2012 at 03:37

[QUOTE=Marco13]Although this creation of the NativePointerObject is NOT part of the official API : Did you get it running when using this pointer directly?
[/QUOTE]

No, I can’t use the

cuMemHostRegister(NativePointerObject, long, int);

like this. Is there another way of forcing the address into the Pointer object?
I’m trying to allocate memory with the native code (C/C++) now.

Marco13 · 17 September 2012 at 13:11

Well, it depends - if it is REALLY ONLY a test to see whether it works, one could do some nasty reflection hacks (setting the private “nativePointer” field manually -_- ).

In the last code you posted, you also tried to allocate only 4 MB - are you sure that it is possible to allocate a block larger than ~1 GB with “VirtualAlloc”? And IF this is possible: Don’t you think that you would still have to pass it to cuMemHostRegister in 1 GB chunks?

calavera · 18 September 2012 at 03:11

[QUOTE=Marco13]
In the last code you posted, you also tried to allocate only 4 MB - are you sure that it is possible to allocate a block larger than ~1 GB with “VirtualAlloc”? And IF this is possible: Don’t you think that you would still have to pass it to cuMemHostRegister in 1 GB chunks?[/QUOTE]
I only tried 4 MB because I wanted to get a positive result from the allocation first. Had I succeeded, I would have tried to pin more memory.
Basically, I wanted to try this “VirtualAlloc” strategy because of this post, where someone claims that he was able to pin 20-something gigabytes of memory using this procedure. But so far I have been unable to reproduce this result using C++ code.

Marco13 · 18 September 2012 at 03:31

The guess by the original poster, that it might be related to memory fragmentation, is certainly valid and might be one reason for being unable to allocate a large memory block. Also, the answers there talk about the possibility of “other dependencies” which might influence whether it works or not. So for me it’s hard to tell why it does not work… I’m also not familiar with the exact mechanisms of the internal memory management, for example, what exactly it means that memory is “mapped into the GPU address space” - doesn’t this necessarily mean that the size of the memory block which can be mapped is also limited by the free device memory? Of course, a larger area could possibly be allocated, but only a smaller region could be mapped - at least to my understanding… How much device memory do you have?

calavera · 18 September 2012 at 03:34

2 GB; it’s a GeForce GTX 680.

Marco13 · 18 September 2012 at 05:01

OK, maybe I can try to educate myself concerning the question of what this “mapping” actually means (internally), but my assumption (or gut feeling?) is that you cannot map a block into device memory space that is larger than the largest free block of device memory.

But to summarize and clear things up a little:

- You need a large block of host memory.
- This block has to be page-locked. (Why?) So you cannot use ByteBuffer.allocateDirect, but have to use your own JNI method.
- You want to map this large block into device memory space. But this may not be possible. So could there be the option of mapping only smaller parts of the large block into device memory space, depending on which part of the block is currently required?

calavera · 18 September 2012 at 05:43

[QUOTE=Marco13]
But to summarize and clear things up a little:

- You need a large block of host memory.
- This block has to be page-locked. (Why?) So you cannot use ByteBuffer.allocateDirect, but have to use your own JNI method.
- You want to map this large block into device memory space. But this may not be possible. So could there be the option of mapping only smaller parts of the large block into device memory space, depending on which part of the block is currently required?[/QUOTE]
No, I don’t need one large block of pinned host memory. Actually I need a bunch of smaller blocks of pinned host memory. The reason I used one block is that I had implemented my program with smaller blocks allocated with cuMemHostAlloc(), and I got “out of memory” exceptions when I tried to allocate them. So I tried to recreate the error with one large block to see when it breaks. I get the same error if my smaller blocks add up to more than 686 MB of RAM (the block that crosses the limit triggers the exception).
The blocks have to be page-locked because I want to use streams to overlap the uploads and downloads of these blocks, since the calculations on them can be performed independently. Also, I want to increase memcpy bandwidth.
I tried to use my own JNI method only because of the suggestion that I had seen. I would rather not use my own native methods.
Is there another way to pin smaller blocks that can add up to more memory?

BTW, I really appreciate your help with this

Marco13 · 18 September 2012 at 06:19

Again, I’m on thin ice here, because I don’t know all the internal mechanisms.

I’m still not so sure how the allocation via a native function could solve your problem, because you would anyhow need to make this memory accessible to Java (and not only to JCuda) in order to read/write it on Java side. The only option there would be to return it to Java as a (direct) ByteBuffer.
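
To illustrate what reading and writing through a direct ByteBuffer looks like on the Java side, here is a minimal sketch. It uses `ByteBuffer.allocateDirect` as a stand-in for natively allocated memory (a JNI method could wrap its own allocation with the JNI function `NewDirectByteBuffer` instead); the class name `DirectBufferRoundTrip` is hypothetical. Note the `rewind()` between writing and reading, since `put` advances the buffer position.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.Arrays;

/** Hypothetical example: float data written to and read back from a direct buffer. */
class DirectBufferRoundTrip {

    /** Writes 'values' into a fresh direct buffer and returns what is read back from it. */
    static float[] roundTrip(float[] values) {
        ByteBuffer bytes = ByteBuffer.allocateDirect(values.length * Float.BYTES)
                .order(ByteOrder.nativeOrder()); // native order, as native code would see it
        FloatBuffer floats = bytes.asFloatBuffer();
        floats.put(values);  // advances the position to the end of the written data
        floats.rewind();     // reset the position; otherwise get() throws BufferUnderflowException
        float[] result = new float[values.length];
        floats.get(result);
        return result;
    }

    public static void main(String[] args) {
        float[] expecteds = new float[10];
        Arrays.fill(expecteds, 42f);
        float[] actuals = roundTrip(expecteds);
        System.out.println("Equal? " + Arrays.equals(expecteds, actuals));
    }
}
```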

But one thought I had in order to avoid the JNI was the following: You could use ByteBuffer.allocateDirect to allocate a large memory block, or multiple small ones. This memory is NOT page-locked by default. But each of these blocks (or chunks of the larger block) could be made page-locked using cuMemHostRegister. Once such a chunk has been page-locked, the appropriate pointer can be used in asynchronous operations. Afterwards, the chunk can be un-registered again.

So the key point is that I don’t see why all the memory has to be page-locked all the time. You could page-lock it just for the time of the asynchronous operation, since allocating or otherwise keeping a large block of memory page-locked seems to be one of the main problems here…
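
The "page-lock only for the duration of the operation" pattern could be sketched like this. It is only an illustration under assumptions: `HostPinner` and `ScopedPinning` are hypothetical names and not part of JCuda. In a real program `pin` would presumably call `cuMemHostRegister` and `unpin` would call `cuMemHostUnregister`, and the work passed in would have to wait for its asynchronous copies to finish (for example by synchronizing the stream) before returning.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction over page-locking; not part of JCuda. */
interface HostPinner {
    void pin(ByteBuffer chunk);   // assumption: would wrap cuMemHostRegister
    void unpin(ByteBuffer chunk); // assumption: would wrap cuMemHostUnregister
}

class ScopedPinning {

    /** Pins 'chunk', runs 'work', and unpins again even if 'work' throws. */
    static void withPinned(HostPinner pinner, ByteBuffer chunk, Runnable work) {
        pinner.pin(chunk);
        try {
            // 'work' must have completed all transfers on 'chunk' when it returns
            work.run();
        } finally {
            pinner.unpin(chunk);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        // Stand-in pinner that only records the calls instead of touching CUDA
        HostPinner recording = new HostPinner() {
            public void pin(ByteBuffer chunk) { log.add("pin " + chunk.capacity()); }
            public void unpin(ByteBuffer chunk) { log.add("unpin " + chunk.capacity()); }
        };
        ByteBuffer chunk = ByteBuffer.allocateDirect(4096);
        withPinned(recording, chunk, () -> log.add("copy and compute"));
        System.out.println(log);
    }
}
```

Keeping the pin/unpin pair in one place means that only the chunks currently in flight count against whatever limit the driver imposes on page-locked memory.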

But important: I do NOT know whether (or how) it could influence the JVM when the memory of a ByteBuffer becomes page-locked. My intuition says that it should not care, and this is solely in the responsibility of the operating system, but maybe I’m just too naive here.

And concerning the strategy itself: Do you think it might be a feasible approach? Otherwise, maybe I still did not understand some aspects of your intended use case…