Improved 2D Pointer allocation for constant length. MemCpy ?

I bet many already know this, but still:

On Stack Overflow I saw a nice idea for allocating a 2D pointer on the device when every row has the same length.

Instead of allocating device memory for each row in a loop, a single block of the whole size (width * height) can be allocated in one call, and the row pointers can then be derived by offsetting that base pointer.

Pseudocode:

Instead of:

int height, width;

int rowPointerSize = width * Sizeof.BYTE;
Pointer[] rowPointers = new Pointer[height];

// One cudaMalloc call per row
for(int i = 0; i < height; i++)
{
	rowPointers[i] = new Pointer();
	JCuda.cudaMalloc(rowPointers[i], rowPointerSize);
}

This saves multiple cudaMalloc calls:

int height, width;

// Pointer holding width * height elements
Pointer rowsPointer = new Pointer();
JCuda.cudaMalloc(rowsPointer, width * height * Sizeof.BYTE);

Pointer[] rowPointers = new Pointer[height];
int rowPointerSize = width * Sizeof.BYTE;

// Derive the row pointers as offsets into the single block
for(int i = 0; i < height; i++)
	rowPointers[i] = rowsPointer.withByteOffset(rowPointerSize * i);

Finally:

// Device array that holds the row pointers
Pointer matPointer = new Pointer();
final int matPointerSize = height * Sizeof.POINTER;
JCuda.cudaMalloc(matPointer, matPointerSize);
JCuda.cudaMemcpy(matPointer, Pointer.to(rowPointers), matPointerSize, cudaMemcpyHostToDevice);

Could this also be used to reduce the number of memcpy calls, i.e. copy the whole 1D data from rowsPointer back to the host in one call and build the 2D result on the host from it?
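To make the host side of that idea concrete, here is a rough sketch in plain Java. It assumes the whole block has already been copied into a flat byte[] by a single device-to-host copy (not shown here). The class FlatToRows and the method toRows are hypothetical names for illustration, not part of JCuda.

```java
import java.util.Arrays;

// Hypothetical helper, not part of JCuda: splits a flat host buffer
// (as it would arrive from one device-to-host copy) into rows.
class FlatToRows {
    static byte[][] toRows(byte[] flat, int width, int height) {
        if (flat.length != width * height) {
            throw new IllegalArgumentException("flat.length must equal width * height");
        }
        byte[][] rows = new byte[height][];
        for (int i = 0; i < height; i++) {
            // Row i starts at offset i * width in the flat block
            rows[i] = Arrays.copyOfRange(flat, i * width, (i + 1) * width);
        }
        return rows;
    }

    public static void main(String[] args) {
        byte[] flat = {0, 1, 2, 3, 4, 5}; // stands in for the copied device data
        byte[][] rows = toRows(flat, 3, 2);
        System.out.println(Arrays.deepToString(rows));
    }
}
```

This replaces one copy call per row with one copy call in total plus cheap host-side array copies.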

I don’t know how big the performance boost would be. It depends on the number of calls saved and on how expensive the calls themselves are, apart from the work they actually perform.

Yes, this can be useful in some cases.

In many cases, it’s beneficial (and not overly complicated) to use a 1D array instead of a 2D array:

int **array2D = ... // Complicated allocation
array2D[x][y] = 123; // Easy access

vs.

int *array1D = ... // Easy allocation
array1D[x+y*width] = 123; // Still relatively easy access
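The same 1D indexing can be sketched on the Java side. This is only an illustration: FlatGrid is a hypothetical wrapper class, and it assumes that rows are stored one after another, so element (x, y) sits at index x + y * width.

```java
// Hypothetical wrapper, for illustration only: a 2D view over a 1D array.
class FlatGrid {
    final int width;
    final int height;
    final int[] data;

    FlatGrid(int width, int height) {
        this.width = width;
        this.height = height;
        this.data = new int[width * height]; // single allocation
    }

    // Maps a 2D coordinate to its position in the flat array
    int index(int x, int y) {
        return x + y * width;
    }

    void set(int x, int y, int value) {
        data[index(x, y)] = value;
    }

    int get(int x, int y) {
        return data[index(x, y)];
    }
}
```

A flat array like data is also what a single host-to-device copy would take as its source.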

But of course, if one wants a “real” 2D array, then the “row pointers” can be computed as you pointed out in your example.

There may be cases where it makes a difference. For example, when you have code like

Pointer temp = rowPointers[i];
rowPointers[i] = rowPointers[j];
rowPointers[j] = temp;

(i.e. when two row pointers are swapped), then this swap is not reflected in the single allocated memory block: the rows stay in their original order there, and only the pointer array changes. In most cases, however, something like this is not necessary. The example from the website, which uses an array of pointers, was also intended to show how such 2D pointer handling may be done, since it can be confusing where the pointers reside: on the host or on the device.
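The effect of such a swap can be simulated on the host in plain Java. In this sketch, integer offsets into a flat array stand in for the device row pointers. RowSwapDemo and its methods are hypothetical names, not part of JCuda.

```java
import java.util.Arrays;

// Hypothetical host-side simulation: offsets play the role of row pointers.
class RowSwapDemo {
    // Start offset of each row inside the flat block
    static int[] rowOffsets(int width, int height) {
        int[] offsets = new int[height];
        for (int i = 0; i < height; i++) {
            offsets[i] = i * width;
        }
        return offsets;
    }

    // Swaps two "row pointers" without touching the block itself
    static void swap(int[] offsets, int i, int j) {
        int temp = offsets[i];
        offsets[i] = offsets[j];
        offsets[j] = temp;
    }

    // Reads row i through the offset table
    static byte[] row(byte[] flat, int[] offsets, int i, int width) {
        return Arrays.copyOfRange(flat, offsets[i], offsets[i] + width);
    }
}
```

After swap(offsets, 0, 1), reading row 0 through the offset table yields what was previously row 1, while the flat block keeps its original layout. Code that reads the block directly therefore still sees the rows in their old order.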