JCuda: copy multidimensional array from device to host

I’ve been working with JCuda for some months now, and I can’t copy a multidimensional array from device memory to host memory. The odd thing is that I have no problem doing so in the opposite direction (I can invoke my kernel with multidimensional arrays, and everything works with the correct values).

In short, I store the results of my kernel in a two-dimensional array of shorts, where the first dimension is the number of threads, so that each thread can write to a different location.

Here is an example:

CUdeviceptr pointer_dev = new CUdeviceptr();
cuMemAlloc(pointer_dev, Sizeof.POINTER); // in this case, as an example, it's an array with one element (one thread), but it doesn't matter

// Invoke kernel with pointer_dev as parameter. Now it should contain some results

CUdeviceptr[] arrayPtr = new CUdeviceptr[1]; // It will point to the result
arrayPtr[0] = new CUdeviceptr();
short[] resultArray = new short[3]; // an array of 3 shorts were allocated in the kernel

cuMemAlloc(arrayPtr[0], 3 * Sizeof.SHORT);
cuMemcpyDtoH(Pointer.to(arrayPtr), pointer_dev, Sizeof.POINTER); // It seems, using the debugger, that the value of arrayPtr[0] isn't changed here!
cuMemcpyDtoH(Pointer.to(resultArray), arrayPtr[0], 3 * Sizeof.SHORT); // Not the expected values in resultArray, probably because of the previous instruction

What am I doing wrong?
(Sorry for my bad English :))


Admittedly, memory allocation inside of kernels is a feature that requires a compute capability >= 2.0, and my main development PC still only has a 1.3 card. So I may not be able to test exactly this case right now, but I sometimes have access to a PC with CC >= 2.0, so I might check it later.

The general question about copying multi-dimensional arrays has also been touched on in http://forum.byte-welt.net/threads/4072-Passing-array-of-strings-to-GPU-device-using-jcuda?p=18211&viewfull=1#post18211 (method “processMultiplePointers”), but I think the crucial difference here is that there the pointers are allocated on the host side, whereas you try to allocate them on the device side.
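For reference, the host-side variant from that thread can be sketched roughly like this. This is untested here; numThreads, rowLength, rows, and table are made-up names, and the usual context/module setup and kernel launch are omitted:

```java
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUdeviceptr;
import static jcuda.driver.JCudaDriver.*;

// Hypothetical sketch: a "2D array" built as an array of device pointers,
// with ALL allocations done on the host side (no in-kernel malloc).
int numThreads = 4; // rows, one per thread (example value)
int rowLength  = 3; // shorts per row (example value)

// 1. Allocate each row on the device, from the host side
CUdeviceptr[] rows = new CUdeviceptr[numThreads];
for (int i = 0; i < numThreads; i++) {
    rows[i] = new CUdeviceptr();
    cuMemAlloc(rows[i], rowLength * Sizeof.SHORT);
}

// 2. Allocate the pointer table on the device and copy the row pointers into it
CUdeviceptr table = new CUdeviceptr();
cuMemAlloc(table, numThreads * Sizeof.POINTER);
cuMemcpyHtoD(table, Pointer.to(rows), numThreads * Sizeof.POINTER);

// 3. Launch the kernel with "table" as the short** parameter.
//    The kernel only writes pointer[thread][i] = ...; it does NOT call malloc.

// 4. Copy each row back, reusing the host-side copies of the row pointers
short[][] result = new short[numThreads][rowLength];
for (int i = 0; i < numThreads; i++) {
    cuMemcpyDtoH(Pointer.to(result[i]), rows[i], rowLength * Sizeof.SHORT);
}
```

The key point of this pattern is that the host never has to read pointer values back from the device: it already knows all the row pointers, because it allocated them itself.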

But from a first glance at the code snippet, it’s not entirely clear (to me) what your intention is, mainly concerning the number of pointer indirections. Can you post the kernel (or a simplified version thereof) which shows how the parameter “pointer_dev” is used and how the malloc-call is performed?


Thanks for the answer.

I use a GPU with compute capability >= 3.0
Here is the snippet of the kernel code that uses that variable:

__global__ void kernel(..., short **pointer) {

  int thread = threadIdx.x + blockDim.x * blockIdx.x;

  // ...

  pointer[thread] = (short*) malloc(3 * sizeof(short)); // the dimension is not always 3, but for now I put 3 as an example
  for (int i = 0; i < 3; i++) {
    alignment[thread][i] = 1; // not always 1, just an example. The real value is determined by the kernel
  }

  // ...
}

Small mistake: “alignment” is the original name of the variable in my code. It should be “pointer” in this snippet.

OK, I’m afraid that something similar to this might in fact have worked with the previous version (0.5.0a), but it will not work with the current version (0.5.0b): in this version, another bug related to kernel parameter pointers was fixed, which now prevents the native pointers from being written back. I already expected something like this, but did not consider the allocation of memory inside kernels (I definitely need a card with CC > 1.3 -_-). I have a rough idea how this might be fixed and will try to tackle it later today, but it might require some experimenting and testing on a CC >= 2.0 machine. I’ll post any updates here, of course.

From what I’ve tried with 0.5.0a I get CUDA_ERROR_INVALID_VALUE when I try to copy the value with the function call:

cuMemcpyDtoH(Pointer.to(resultArray), arrayPtr[0], 3 * Sizeof.SHORT);

Apparently there is a problem with the pointer previously copied into arrayPtr[0]. I tried the call even with ByteCount = Sizeof.SHORT (without the “3 *”), but nothing changed; it seems that the pointer doesn’t point to a valid memory space.
I need this to work, otherwise I’m stuck with my thesis project :\

A short update: I’ve investigated this a little. It is obviously related to the change from the previous version. But I also ran into the CUDA_ERROR_INVALID_VALUE issue when using the previous version. I’ve been messing around with lots of debug output and could not find the reason for this, and even considered trying this in plain CUDA/C … but then I did a web search and immediately found
http://stackoverflow.com/questions/9414020/invalid-argument-error-when-copying-data-from-device-to-host/9578057#9578057 :

Although this refers to CUDA 4.1, I’d have to scan the release notes to see whether this is actually already possible in CUDA 5.0 (and, if it is possible, whether it might be bound to a specific compute capability). For now I assume that it is not (yet) possible in CUDA 5.0 either. If you have doubts, I can read the release notes and try to build an example in plain CUDA/C.
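According to that Stack Overflow answer, memory allocated with in-kernel malloc lives on the device heap and cannot be passed to the host-side memcpy functions at all, which would explain the CUDA_ERROR_INVALID_VALUE. If that restriction still holds, one conceivable workaround is a small "gather" kernel that copies the device-malloc'ed rows into one flat buffer that was allocated with cuMemAlloc, so the host needs only a single cuMemcpyDtoH. This is just a hypothetical sketch (the kernel name "gather" and the variable names are made up, and launch setup is omitted):

```java
// Kernel source as a Java string, as it is usually embedded in JCuda programs.
// It runs entirely on the device, so it may legally read the heap-allocated rows.
String gatherKernelSource =
    "extern \"C\" __global__ void gather(short **pointer, short *flat, int rowLength)" + "\n" +
    "{" + "\n" +
    "    int thread = threadIdx.x + blockDim.x * blockIdx.x;" + "\n" +
    "    for (int i = 0; i < rowLength; i++)" + "\n" +
    "    {" + "\n" +
    "        flat[thread * rowLength + i] = pointer[thread][i];" + "\n" +
    "    }" + "\n" +
    "}" + "\n";

// Host side (sketch): allocate "flat" with cuMemAlloc (numThreads * rowLength
// * Sizeof.SHORT), launch "gather", then read everything back in one call:
//   cuMemcpyDtoH(Pointer.to(resultArray), flat, numThreads * rowLength * Sizeof.SHORT);
```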

Nevertheless, your question made me aware of the fact that the fix for the bug in 0.5.0a caused a new bug in 0.5.0b, so I’ll definitely have to review this ASAP. Thanks for that.

Although I’m not really a CUDA expert: If you describe what you intended to do, maybe we can find a solution (but you might also want to have a closer look at the Stackoverflow Thread, maybe it already contains some hints).

Thanks for the help :)
I’ll let you know if I can find any workaround to the problem.