I’m adding GPU computation to an existing Java program.
I need fast host-to-device transfers of arrays that are sometimes relatively large. To use streams, I have to use pinned memory. The problem is that when I try to allocate more than about 600 MB of pinned host memory, I get a “CUDA_ERROR_OUT_OF_MEMORY” exception.
This is the code I used to test the size of the available pinned memory:
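```java
import static jcuda.driver.JCudaDriver.*;

import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.Arrays;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.JCudaDriver;

public class PinnedMemoryTest {
    public static void main(String[] args) {
        // Init GPU
        JCudaDriver.setExceptionsEnabled(true);

        // Initialize the device and create device context
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        Pointer p = new Pointer();
        int Kb = 1024;
        int Mb = 1024 * Kb;
        int Gb = 1024 * Mb;
        int sequenceSize = 172 * Mb; // number of floats; times 4 for the size in bytes
        long byteSize = (long) sequenceSize * Sizeof.FLOAT;
        float[] expecteds = new float[sequenceSize];
        float[] actuals = new float[sequenceSize];
        Arrays.fill(expecteds, 3.33f);

        boolean allocated = false;
        try {
            // Allocate pinned (page-locked) host memory
            cuMemAllocHost(p, byteSize);
            allocated = true;
            // Write the data into the pinned buffer and read it back
            FloatBuffer fb = p.getByteBuffer(0, byteSize)
                    .order(ByteOrder.nativeOrder())
                    .asFloatBuffer();
            fb.position(0);
            fb.put(expecteds);
            fb.position(0);
            fb.get(actuals);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Free the pinned memory only if the allocation succeeded
            if (allocated) {
                cuMemFreeHost(p);
            }
        }
    }
}
```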
I'm aware that the OS can prevent me from using too much pinned memory, since it is non-pageable. However, I have 48 GB of physical memory (45 GB free), and I need a way to make the OS give me more of it. Is there a way to do this (elegantly, if possible)?
The OS is 64-bit Windows 7 Professional SP1.
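As a sanity check on the numbers in the test program, the following sketch works out how many bytes it actually requests. It is plain Java with no JCuda dependency, and the names `PinnedSizeCheck` and `bytesForFloats` are made up for illustration. It assumes 4 bytes per float, which matches the "times 4 for float" comment in the test.

```java
public class PinnedSizeCheck {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;
    static final int FLOAT_BYTES = Float.BYTES; // 4 bytes per float

    /** Bytes needed for the given number of floats, computed in long arithmetic. */
    static long bytesForFloats(long count) {
        return count * FLOAT_BYTES;
    }

    public static void main(String[] args) {
        long elements = 172 * MB;              // 180,355,072 floats, as in the test
        long bytes = bytesForFloats(elements); // 721,420,288 bytes
        System.out.println(bytes + " bytes = " + (bytes / MB) + " MiB");
        // prints "721420288 bytes = 688 MiB"

        // The same product in int arithmetic wraps around at 2^31 bytes.
        int elements512 = 512 * 1024 * 1024;   // 2^29 floats
        int wrapped = elements512 * 4;         // overflows to Integer.MIN_VALUE
        System.out.println("int result for 512 Mi floats: " + wrapped);
        System.out.println("long result for 512 Mi floats: " + bytesForFloats(elements512));
    }
}
```

So the test asks for a single pinned block of 688 MiB, which is consistent with the failure appearing somewhere above roughly 600 MB. The second half only illustrates why the byte count is safer held in a `long` once the arrays get larger; at 172 Mi floats the `int` product still fits.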
There was recently another thread about an operation involving large memory allocations ( http://forum.byte-welt.de/showthread.php?p=17931#post17931 ). I still think that there is a limit on the maximum size of a single allocation, but I could not find anything about it in the documentation.
The situation may be different here anyway. I’ll have a closer look at this at the beginning of next week.
I just ran a test on a 24GB (Win7 64) machine, using a slightly modified program (see below). It was able to allocate 1GB of native (host) memory before bailing out.
When I have the chance, I’ll try to run another test with a C program, and see whether it’s possible to allocate larger blocks there.
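For anyone who wants to repeat this kind of measurement, here is a rough sketch of how a probe for the largest single allocation might be structured. It is not the modified program mentioned in this thread. The allocator is hidden behind a `LongPredicate` so that the search logic can be run without a GPU, and the names `PinnedProbe` and `largestAllocatable` are hypothetical. The sketch assumes that if a given size can be allocated, every smaller size can be allocated too.

```java
import java.util.function.LongPredicate;

public class PinnedProbe {
    /**
     * Binary search for the largest multiple of stepBytes, not exceeding
     * maxBytes, for which tryAlloc reports success. Returns 0 if no
     * non-zero size succeeds.
     */
    static long largestAllocatable(LongPredicate tryAlloc, long maxBytes, long stepBytes) {
        long lo = 0;                    // largest step count known (or assumed) to work
        long hi = maxBytes / stepBytes; // upper bound, in steps
        while (lo < hi) {
            long mid = lo + (hi - lo + 1) / 2; // round up so the loop always progresses
            if (tryAlloc.test(mid * stepBytes)) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo * stepBytes;
    }

    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        // Stand-in allocator for demonstration: pretends that blocks of up to
        // 1000 MiB succeed. With JCuda, the predicate would instead call
        // cuMemAllocHost(p, size) inside a try/catch, release the block with
        // cuMemFreeHost(p) on success, and return whether the call succeeded.
        LongPredicate fakeAllocator = size -> size <= 1000 * MB;
        long best = largestAllocatable(fakeAllocator, 4096 * MB, MB);
        System.out.println("Largest successful block: " + (best / MB) + " MiB");
    }
}
```

With the stand-in allocator this reports 1000 MiB. Against a real driver the result would depend on the machine, the driver version and whatever else has memory pinned at the time, so it is only a snapshot.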