I bet many of you already knew this, but still:
On Stack Overflow I saw this nice idea for allocating a 2D array of pointers on the device when every row has the same length.
Instead of allocating device memory for each row in a loop, you can allocate one block covering the whole matrix (width * height) in a single call and then derive the row pointers by offsetting that base pointer.
Pseudocode:
Instead of:
int height, width; // set elsewhere
int rowPointerSize = width * Sizeof.BYTE;
Pointer[] rowPointers = new Pointer[height];
// One cudaMalloc call per row
for (int i = 0; i < height; i++)
{
    rowPointers[i] = new Pointer();
    JCuda.cudaMalloc(rowPointers[i], rowPointerSize);
}
this saves multiple cudaMalloc calls:
int height, width; // set elsewhere
// One allocation holding width * height elements
Pointer rowsPointer = new Pointer();
JCuda.cudaMalloc(rowsPointer, (long) width * height * Sizeof.BYTE);
Pointer[] rowPointers = new Pointer[height];
int rowPointerSize = width * Sizeof.BYTE;
// Derive each row pointer by offsetting the base pointer
for (int i = 0; i < height; i++)
    rowPointers[i] = rowsPointer.withByteOffset((long) rowPointerSize * i);
Finally, copy the array of row pointers to the device:
Pointer matPointer = new Pointer();
final int matPointerSize = height * Sizeof.POINTER;
// Allocate device memory for the pointer table itself
JCuda.cudaMalloc(matPointer, matPointerSize);
JCuda.cudaMemcpy(matPointer, Pointer.to(rowPointers), matPointerSize, cudaMemcpyKind.cudaMemcpyHostToDevice);
Could this also be used to reduce the number of cudaMemcpy calls, i.e. copy the whole 1D data back from rowsPointer in one call and build the 2D result on the host from it?
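For the host side of that idea, here is a minimal sketch of how I imagine rebuilding the rows from one flat, row-major array. The class name RowSplitter and the method toRows are made up for illustration. The device copy is left out so it runs without a GPU: I am assuming the flat array would have been filled by a single cudaMemcpy from rowsPointer with cudaMemcpyDeviceToHost.

```java
import java.util.Arrays;

public class RowSplitter {
    // Splits a flat, row-major buffer into `height` rows of `width` bytes each.
    // Hypothetical helper: `flat` stands in for the data copied back from
    // rowsPointer in a single device-to-host copy (an assumption, not shown here).
    static byte[][] toRows(byte[] flat, int height, int width) {
        if (flat.length != height * width) {
            throw new IllegalArgumentException(
                "expected " + (height * width) + " bytes, got " + flat.length);
        }
        byte[][] rows = new byte[height][];
        for (int i = 0; i < height; i++) {
            // Row i occupies the half-open range [i * width, (i + 1) * width)
            rows[i] = Arrays.copyOfRange(flat, i * width, (i + 1) * width);
        }
        return rows;
    }

    public static void main(String[] args) {
        byte[] flat = {1, 2, 3, 4, 5, 6};
        byte[][] rows = toRows(flat, 2, 3);
        System.out.println(Arrays.deepToString(rows)); // prints [[1, 2, 3], [4, 5, 6]]
    }
}
```

This trades height device-to-host copies for one copy plus height cheap host-side array copies. Whether that is actually faster would have to be measured.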
I don't know how big the performance boost would be. I don't know how expensive the calls themselves are, apart from the work they execute, so the gain probably depends on the number of calls saved.