First of all: Yes, it should be possible to re-use pointers that have been allocated to point to memory regions that are larger than the memory that is actually required.
(I once considered creating some slightly more object-oriented wrappers for JCuda, and of course, something like this (c/sh/w)ould include some sort of „GPUBuffer“ class that does this transparently - similar to an „ArrayList“).
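To illustrate the idea: here is a minimal sketch of what such a „GPUBuffer" could look like. The class name, the method names and the growth policy are just my assumptions for illustration, not an existing JCuda API:

import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaMalloc;
import jcuda.Pointer;

// Sketch only: re-uses one device allocation, growing it like an ArrayList
public class GPUBuffer
{
    private final Pointer pointer = new Pointer();
    private long capacity = 0;

    // Make sure the device allocation can hold at least the given
    // number of bytes, re-allocating only when it has to grow
    public void ensureCapacity(long requiredBytes)
    {
        if (requiredBytes <= capacity)
        {
            return; // Existing allocation is large enough: re-use it
        }
        if (capacity > 0)
        {
            cudaFree(pointer);
        }
        // Grow by ~50%, to amortize future re-allocations
        capacity = Math.max(requiredBytes, capacity + capacity / 2);
        cudaMalloc(pointer, capacity);
    }

    public Pointer getPointer()
    {
        return pointer;
    }

    public void release()
    {
        if (capacity > 0)
        {
            cudaFree(pointer);
            capacity = 0;
        }
    }
}

The point is only that ensureCapacity re-allocates solely when the buffer has to grow, so repeated use with varying sizes quickly settles at the maximum size that was ever requested.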
For memory optimization in general, there are obvious cases: e.g., you should usually not write
for (int i = 0; i < large; i++)
{
    Pointer pointer = new Pointer();
    cudaMalloc(pointer, size);
    workWith(pointer);
    cudaFree(pointer);
}
when
Pointer pointer = new Pointer();
cudaMalloc(pointer, size);
for (int i = 0; i < large; i++)
{
    workWith(pointer);
}
cudaFree(pointer);
will do the same. And in case the „size" is not constant, it is probably a good idea to consider allocating the maximum required size only once, at the beginning; see the sketch below. (Note that I said that „considering" it is good, not that „doing" it is good: there may always be cases where this is not appropriate.)
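For example, when the sizes are known beforehand, such an „allocate the maximum once" approach could look like this. (The „sizes" array and the two-argument workWith call are placeholders, analogous to the snippets above.)

// Sketch: allocate the maximum required size once, then re-use it
long maxSize = 0;
for (long size : sizes)
{
    maxSize = Math.max(maxSize, size);
}
Pointer pointer = new Pointer();
cudaMalloc(pointer, maxSize);
for (long size : sizes)
{
    // Each call only uses the first "size" bytes of the allocation
    workWith(pointer, size);
}
cudaFree(pointer);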
However, the time for the allocations should usually not be the bottleneck, and in particular, it should be nowhere near as high as you described. Regarding the site that you linked to: note the information that is summarized under „A Note About These Measurements":
A Note About These Measurements:
Unless otherwise noted, the data shown on this site were measured on a machine (Barracuda10) with […] the following software configuration:
Ubuntu 7.10 (64-bit)
NVIDIA driver version 177.67
**CUDA Toolkit version 2.0**
**CUDA SDK version 2.0 Beta2**
(emphasis by me).
Oh, I remember these days. We were young. CUDA was new. Allocation was slow (obviously).
I just scribbled down another test…
import static jcuda.runtime.JCuda.cudaDeviceSynchronize;
import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaMalloc;
import static jcuda.runtime.JCuda.cudaMemcpy;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToHost;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

import java.util.Locale;

import jcuda.Pointer;
import jcuda.runtime.JCuda;

public class AllocationBenchmark
{
    public static void main(String[] args)
    {
        JCuda.setExceptionsEnabled(true);
        int runs = 20;
        for (int size = 1; size <= 1 << 28; size <<= 1)
        {
            runTest(size, runs);
        }
    }

    private static void runTest(int size, int runs)
    {
        Pointer pointer = new Pointer();
        byte data[] = new byte[size];
        long before = 0;
        long after = 0;
        long totalAllocNs = 0;
        long totalFreeNs = 0;
        for (int i = 0; i < runs; i++)
        {
            // Time the allocation
            before = System.nanoTime();
            cudaMalloc(pointer, size);
            cudaDeviceSynchronize();
            after = System.nanoTime();
            totalAllocNs += (after - before);

            // Copy data to the device and back, just to actually
            // use the allocated memory in between
            cudaMemcpy(pointer, Pointer.to(data),
                size, cudaMemcpyHostToDevice);
            cudaDeviceSynchronize();
            cudaMemcpy(Pointer.to(data), pointer,
                size, cudaMemcpyDeviceToHost);
            cudaDeviceSynchronize();

            // Time the deallocation
            before = System.nanoTime();
            cudaFree(pointer);
            cudaDeviceSynchronize();
            after = System.nanoTime();
            totalFreeNs += (after - before);
        }
        // The total times over all runs, in milliseconds
        double totalAllocMs = totalAllocNs / 1e6;
        double totalFreeMs = totalFreeNs / 1e6;
        System.out.printf(Locale.ENGLISH,
            "Size %14d alloc %12.4f ms free %12.4f ms%n",
            size, totalAllocMs, totalFreeMs);
    }
}
and the timings are along the lines of
...
Size 33554432 alloc 5.9246 ms free 12.7367 ms
Size 67108864 alloc 6.0273 ms free 15.9438 ms
Size 134217728 alloc 6.7702 ms free 32.1162 ms
Size 268435456 alloc 8.3200 ms free 60.5944 ms
(on a GTX 970 with CUDA 7.5)
Although I'm a bit surprised to see that freeing is slower than allocating, the times are far from 1 second…
[ot]
(I wonder whether the fact that I just wrote this down is an indication that my „benchmarking library" is a sledgehammer used for cracking a nut, but… I'll probably continue to fiddle around with this sledgehammer; maybe it will become useful one day.)
[/ot]