CUDA Arrays with JCuda


#1

Hi all,

I’m a newbie, looking for examples on how to create 2D arrays, fill with host data, copy to the device, and copy back. Unfortunately in the Programming Guide I only saw examples for allocating 2D arrays on the device, and it was unclear how to fill a newly created array with host data, and then how to copy the array to the device.

I can create linear arrays with pointers to each row, e.g.,

[B][SIZE=2][SIZE=2][LEFT][B][COLOR=#7f0055][SIZE=2][COLOR=#7f0055] for[/B][/SIZE]([B][SIZE=2]int[/B][/SIZE] i = 0; i < height; i++) {
JCudaDriver.[I]cuMemcpyHtoD[/I](pointers[i],Pointer.[I]to/I, width* Sizeof.[I][SIZE=2]FLOAT[/I][/SIZE]);
}
JCudaDriver.[I]cuMemcpyHtoD[/I](pointer, Pointer.[I]to/I,height * Sizeof.[I][SIZE=2]POINTER[/I][/SIZE]);[/LEFT]

[/B][/COLOR][/SIZE][/COLOR][/SIZE]
But i’d like to use the CUDA_ARRAY_DESCRIPTOR with CUDA_MEMCPY2D class formulations. I create the java data array and the cuda array:

float[][] data = …
[LEFT]
CUDA_ARRAY_DESCRIPTOR desc = [B][SIZE=2]new[/B][/SIZE] CUDA_ARRAY_DESCRIPTOR();
desc.[SIZE=2]Format[/SIZE] = CUarray_format.[I][SIZE=2]CU_AD_FORMAT_FLOAT[/I][/SIZE];
desc.[SIZE=2]NumChannels[/SIZE] = 1;
desc.[SIZE=2]Width[/SIZE] = width;
desc.[SIZE=2]Height[/SIZE] = height;
CUarray cuArray = [B][SIZE=2]new[/B][/SIZE] CUarray(); [/LEFT]
[I]cuArrayCreate[/I](cuArray, desc);

After I create the cuda array, I’m assumming (maybe wrongly) that this cuArray exists on the host CPU. Do I then use the CUDA_MEMCPY2D to copy data from data[][] to the cuArray, and how would I do this? And would I then use a cuMemcpyHtoD() to copy the cuArray to the device, or how do I get the cuArray to the device?

I know this is more of a CUDA than a jCuda question, but I couldn’t find any CUDA examples using the cuMemcpy2D(CUDA_MEMCPY2D pCopy) formulation.

Any suggestions and/or examples would be greatly appreciated.


#2

Hello

The question is partially specific for JCuda in so far that it involves a Java 2D array: data[][]. The main problem with these arrays is that they are not necessarily stored as a continguous block in memory.

Despite the strong relationship between pointers and arrays in C, the same problem may occur there as well: When creating a “2D array” in C like this
float array[3][3];
it can be considered as being roughly equivalent to a float[9] array (although at the moment I’m not sure if the C specification really asserts that it will be a continguous memory block). But it is also possible to create an “2D array” as an array of pointers

float **array = (float**)malloc(3*sizeof(float*));
for (int i=0; i<3; i++)
{
    array[i] = (float*)malloc(3*sizeof(float));
}

This may also be seen as a “2D array” and accessed like the first one…
array[i][j] = 123.456f;
This closer resembles the semantics of a “2D array” in Java. But such an array can not be copied from the host to the device using the usual CUDA functions (not even with the Memcpy2D functions), because the array does not store 9 float values, but 3 pointers to floats.

The CUDA functions require the array to be stored as a continguous block. So the only ways to copy a “2D array” from Java to CUDA is to store it as an 1D array, or alternatively, of course, to copy each row separately, as in your first code block.

When stored as a 1D array, the memcpy2D functions and structures may be used as in this example:

import jcuda.*;
import jcuda.driver.*;


public class JCudaDriverArrayTest
{
    public static void main(String args[])
    {
        // Initialize the driver and create a context for the first device.
        JCudaDriver.cuInit(0);
        CUcontext pctx = new CUcontext();
        CUdevice dev = new CUdevice();
        JCudaDriver.cuDeviceGet(dev, 0);
        JCudaDriver.cuCtxCreate(pctx, 0, dev);

        // Prepare the input and output arrays on the host
        int width = 3;
        int height = 3;
        float input[] = new float[width*height];
        for (int i=0; i<width*height; i++)
        {
            input[i] = i;
        }
        float output[] = new float[width*height];
        
        // Create the 2D array on the device 
        CUarray array = new CUarray();
        CUDA_ARRAY_DESCRIPTOR ad = new CUDA_ARRAY_DESCRIPTOR();
        ad.Format = CUarray_format.CU_AD_FORMAT_FLOAT;
        ad.Width = width;
        ad.Height = height;
        ad.NumChannels = 1;
        JCudaDriver.cuArrayCreate(array, ad);
        
        // Copy the host input to the 2D array  
        CUDA_MEMCPY2D copyHD = new CUDA_MEMCPY2D();
        copyHD.srcMemoryType = CUmemorytype.CU_MEMORYTYPE_HOST;
        copyHD.srcHost = Pointer.to(input);
        copyHD.srcPitch = width * Sizeof.FLOAT;
        copyHD.dstMemoryType = CUmemorytype.CU_MEMORYTYPE_ARRAY;
        copyHD.dstArray = array;
        copyHD.WidthInBytes = width * Sizeof.FLOAT;
        copyHD.Height = height;
        JCudaDriver.cuMemcpy2D(copyHD);

        // Do kernel invocations using the array here
        // ...
        
        // Copy the 2D array to the host output  
        CUDA_MEMCPY2D copyDH = new CUDA_MEMCPY2D();
        copyDH.srcMemoryType = CUmemorytype.CU_MEMORYTYPE_ARRAY;
        copyDH.srcArray = array;
        copyDH.dstMemoryType = CUmemorytype.CU_MEMORYTYPE_HOST;
        copyDH.dstHost = Pointer.to(output);
        copyDH.dstPitch = width * Sizeof.FLOAT;
        copyDH.WidthInBytes = width * Sizeof.FLOAT;
        copyDH.Height = height;
        JCudaDriver.cuMemcpy2D(copyDH);

        boolean passed = true;
        for (int i=0; i<width*height; i++)
        {
            System.out.println(output[i]+" ");
            if (input[i] != output[i])
            {
                passed = false;
                break;
            }
        }
        System.out.println("Test "+(passed?"PASSED":"FAILED"));

        // Clean up.
        JCudaDriver.cuArrayDestroy(array);
    }
}

(Note that the “JCudaTextureSample” from the JCuda samples page also involves some 2D- and 3D memcopies)

BTW: When you intend to copy device memory into an array, you’ll have to look closely at the specification and usage examples of cudaMallocPitch() and cudaMalloc3D() to ensure that the alignment requirements for the memory are met.

bye
Marco


#3

Hi Marco,

Thanks for the detailed reply, that cleared up much of my confusion on how the arrays are used. I hadn’t realized how they’re tied to textures when accessing them from the kernel, so looking over the texture documentation and the jCuda texture sample has helped too.

I’m still a little confused on how to properly access the array on the kernel. Say I have a data[width][height] array, that I linearlize to data[width*height] using an indexing scheme. I then allocate the cuda array and copy it to the kernel using the code you provided.

I’m assumming to access the data, I need to bind it to a texture reference, and then read it using tex2D(texture,x,y). But, how does the text2D() method know the indexing scheme from the linearlization? Is this inferred from the data, and if so, how do I make sure it’s not inferred wrong? I wasn’t sure how to view the Bucky.raw data in the example to see how the data is read in and set up for the 32x32x32 3D array, so that part is still a bit fuzzy. Is there a particular indexing scheme that needs used when turning a 2D array into a linear array prior to the memcpy2D?

Any suggestions on setting up the linear array and doing the kernel invocation properly will be much appreciated.

Thanks again


#4

Hello

First of all: I’m not a CUDA expert. Most of my CUDA experience consists of … simply porting some samples from CUDA to JCuda :o Thus, I don’t know in detail how the arrays and textures are handled internally, and even many of the parts of the Programming Guide will only become clearer for me once I really find the time to focus on using CUDA for my own development and creating own kernels…

But, how does the text2D() method know the indexing scheme from the linearlization?

Maybe the confusion resulted from the fact that this part was omitted in the example. I think there are just two ways of accessing data via a texture reference: The data may either be a linear block of device data, or an array. When it is given as an array, then I assume that the information that is required for proper indexing is stored together the array itself, namely the information that is given in the CUDA_ARRAY_DESCRIPTOR: The data type, the width, the height etc. When the data is a linear block of device memory, then the essentially the same information is given during the call to “cuTexRefSetAddress2D”, which also requires an CUDA_ARRAY_DESCRIPTOR to be passed in.
So in both cases the relevant information can be associated with the texture reference, and used for proper indexing inside the kernel.

I wasn’t sure how to view the Bucky.raw data in the example to see how the data is read in and set up for the 32x32x32 3D array

I’m not sure what you mean. The Bucky.raw contains 323232 bytes (hence its size of exactly 32KB :D). These are the “density values”. A CUDA array of this specific size is created, and the data is copied to this array at the beginning of the “initCuda” method. The size of the array is specified in the CUDA_ARRAY3D_DESCRIPTOR and the CUDA_MEMCPY3D.

I have to admit that I found it … tricky to get this sample running: Originally, it was written using the Runtime API, which obviously makes the setup of textures and their parameters much easier. But I liked this example when I saw it in the SDK, and for me it was worth the effort :slight_smile:

bye
Marco


#5

Hi Marco,

Thanks again for the reply. I was able to get the tex2D() working with a 2D array – it was just a simple matter of remapping it the same way I originally linearized the array (i.e., me not thinking the problem through originally).

e.g., in kernel.cu:


texture<[B][SIZE=2]float[/SIZE][/B], 2, cudaReadModeElementType> tex; [SIZE=2]// 2D texture[/SIZE]
[LEFT][B][SIZE=2]extern[/SIZE][/B][SIZE=2]"C"[/SIZE] __global__ [B][SIZE=2]void[/SIZE][/B] [B]run[/B]([B][SIZE=2]float[/SIZE][/B] *result) {[/LEFT]

result[blockDim.y * threadIdx.x + threadIdx.y] = tex2D(tex, threadIdx.x, threadIdx.y);[LEFT]}[/LEFT]


[LEFT]In the main method, in addition to the array code you supplied, I just had to add in a section to initialize the texture reference:[/LEFT]


[LEFT][SIZE=2]// Set up texture reference [/SIZE]

[LEFT]CUtexref tex = [B][SIZE=2]new[/SIZE][/B] CUtexref();[/LEFT]

JCudaDriver.[I]cuModuleGetTexRef[/I](tex, module, [SIZE=2]"tex"[/SIZE]);

[LEFT]JCudaDriver.[I]cuTexRefSetFilterMode[/I](tex,CUfilter_mode.[I][SIZE=2]CU_TR_FILTER_MODE_POINT[/SIZE][/I]);
[LEFT]JCudaDriver.[I]cuTexRefSetAddressMode[/I](tex, 0,CUaddress_mode.[I][SIZE=2]CU_TR_ADDRESS_MODE_CLAMP[/SIZE][/I]);
JCudaDriver.[I]cuTexRefSetAddressMode[/I](tex, 1,CUaddress_mode.[I][SIZE=2]CU_TR_ADDRESS_MODE_CLAMP[/SIZE][/I]);
JCudaDriver.[I]cuTexRefSetFormat[/I](tex,CUarray_format.[I][SIZE=2]CU_AD_FORMAT_FLOAT[/SIZE][/I], 1);
JCudaDriver.[I]cuTexRefSetArray[/I](tex, dataArray,JCudaDriver.[I][SIZE=2]CU_TRSA_OVERRIDE_FORMAT[/SIZE][/I]);[/LEFT]
[/LEFT]


[/LEFT]


[LEFT]Thanks again for the help![/LEFT]


#6

OK then, I think I should extend the example by 1D and 3D textures and a simple kernel, to have a small but self-contained example of how to use textures in general. This might be helpful for others (and for me, by the way :wink: )


#7

Hello Marco,
On a related question, I think I am having some problems with cuMemCpy2d in JCuda.
I allocated a pitched device memory using the following code:

long[] pitch = new long[] {0};
cuMemAllocPitch(devAverage, pitch, imageWidth * Sizeof.FLOAT * 4, imageHeight, Sizeof.FLOAT * 4);```

In the kernel I am doing some calculations and I am writing to the elements of this memory using float4 values. I believe the kernel is working properly as I debugged it with Nsight and everything seems to be in order.
However, when I want to fetch my results using the following code

```float[] average = new float[imageWidth * imageHeight * 4];

CUDA_MEMCPY2D copyParam = new CUDA_MEMCPY2D();
copyParam.srcDevice = devAverage;
copyParam.srcPitch = pitch[0];
copyParam.srcMemoryType = CUmemorytype.CU_MEMORYTYPE_DEVICE;

copyParam.dstHost = Pointer.to(average);
copyParam.dstPitch = imageWidth * 4 * Sizeof.FLOAT;
copyParam.WidthInBytes = imageWidth * Sizeof.FLOAT * 4;
copyParam.Height = imageHeight;
copyParam.dstMemoryType = CUmemorytype.CU_MEMORYTYPE_HOST;


cuMemcpy2D(copyParam);```

I am not getting anything in the float[] array. Am I setting some wrong parameters here or should I set extra parameters as well? To be honest, your wrapper for cudaMemCpy2d is very cumbersome to use and I find the documentation very confusing. Of course it may only be me :(  :D

Thanks in advance for your help :-)

#8

Hello

In the code that you posted, everything seems to be OK for me. I created a simple test case, which basically does the same as in your code:

import static jcuda.driver.JCudaDriver.*;
import jcuda.*;
import jcuda.driver.*;
import jcuda.utils.KernelLauncher;

/**
 * A test for the CUDA_MEMCPY2D structure
 */
public class TestMemcpy2D
{
    /**
     * Entry point of this test
     * 
     * @param args Not used
     */
    public static void main(String[] args)
    {
        // Enable exceptions and omit further error tests
        JCudaDriver.setExceptionsEnabled(true);

        // Initialize a KernelLauncer for the test kernel
        KernelLauncher kernelLauncher = 
            KernelLauncher.create("TestMemcpy2D.cu", "testMemcpy2D");

        int imageSizeX = 100;
        int imageSizeY = 100;

        // Allocate pitched memory for the output image that should
        // consist of float4 values
        CUdeviceptr deviceOutput = new CUdeviceptr();
        long[] pitch = new long[] {0};
        cuMemAllocPitch(deviceOutput, pitch, 
            imageSizeX * Sizeof.FLOAT * 4, imageSizeY, Sizeof.FLOAT * 4);

        // Compute the pitch in number of elements - that is, the
        // size of the pitch in bytes, divided by the element size
        int pitchInElements = (int)(pitch[0] / (Sizeof.FLOAT * 4));
        
        // Set up and call the kernel
        int blockSizeX = 16;
        int blockSizeY = 16;
        int gridSizeX = (int)Math.ceil((double)imageSizeX / blockSizeX);
        int gridSizeY = (int)Math.ceil((double)imageSizeY / blockSizeY);
        kernelLauncher.
            setGridSize(gridSizeX, gridSizeY).
            setBlockSize(blockSizeX, blockSizeY, 1).
            call(imageSizeX, pitchInElements, imageSizeY, deviceOutput);
        
        
        // Set up the CUDA_MEMCPY structure and copy the
        // pitched device memory back to the host
        
        CUDA_MEMCPY2D memcpy2D = new CUDA_MEMCPY2D();
        memcpy2D.srcDevice = deviceOutput;
        memcpy2D.srcPitch = pitch[0];
        memcpy2D.srcMemoryType = CUmemorytype.CU_MEMORYTYPE_DEVICE;
         
        float[] hostOutput = new float[imageSizeX * imageSizeY * 4];
        memcpy2D.dstHost = Pointer.to(hostOutput);
        memcpy2D.dstPitch = imageSizeX * 4 * Sizeof.FLOAT;
        memcpy2D.WidthInBytes = imageSizeX * Sizeof.FLOAT * 4;
        memcpy2D.Height = imageSizeY;
        memcpy2D.dstMemoryType = CUmemorytype.CU_MEMORYTYPE_HOST;
         
        cuMemcpy2D(memcpy2D);

        // Print (parts of) the contents of the host memory
        int printedSizeX = imageSizeX;
        int printedSizeY = imageSizeY;
        printedSizeX = Math.min(4, imageSizeX);
        printedSizeY = Math.min(10, imageSizeY);
        for (int y=0; y<printedSizeY; y++)
        {
            for (int x=0; x<printedSizeX; x++)
            {
                int index = (x + y * imageSizeX) * 4;
                float fx = hostOutput[index + 0];
                float fy = hostOutput[index + 1];
                float fz = hostOutput[index + 2];
                float fw = hostOutput[index + 3];
                System.out.print("("+fx+","+fy+","+fz+","+fw+")");
                if (x < printedSizeX - 1)
                {
                    System.out.print(", ");
                }
                else if (printedSizeX < imageSizeX)
                {
                    System.out.print("...");
                }
            }
            System.out.println();
        }
        if (printedSizeY < imageSizeY)
        {
            System.out.print("...");
        }
    }
}
// Kernel for the TestMemcpy2D class: A kernel that receives a memory area 
// (a dummy image), and fills it with float4 values, where each float4 
// contains the x- and y coordinate of the thread that is responsible
// for the respective pixel.
// 'pitchInElements' is the pitch that was returned by cuMemAllocPitch
// divided by the element size (Sizeof.FLOAT * 4)
extern "C"
__global__ void testMemcpy2D(
    int imageWidth, 
    int pitchInElements,
    int imageHeight, float4* output)
{
    int tx=threadIdx.x+blockIdx.x*blockDim.x;
    int ty=threadIdx.y+blockIdx.y*blockDim.y;
    if (tx<imageWidth && ty<imageHeight)
    {
        output[tx+ty*pitchInElements]=make_float4(tx, ty, tx, ty);
    }
}

It…

  • allocates the pitched 2D array
  • passes it to the kernel
    – the kernel fills the array with float4 values that basically contain (tx,ty,tx,ty) where tx/ty are the (global) thread indices
  • initializes the CUDA_MEMCPY2D structure exactly how you also did it
  • copies the data from the device to the host
  • prints (parts of) the contents of the host array
    And this seems to work properly here. Can you confirm that for this test case?

I agree that the CUDA_MEMCPY2D structure (and the 3D one even more) is cumbersome. But I insist that it is not my wrapper that is cumbersome, but the structure itself :stuck_out_tongue_winking_eye: The wrapper is just that: A wrapper. I already mentioned elsewhere that JCuda in its current form originally was mainly intended as the “backend” for an Object-Oriented abstraction layer, but I’m far from being able to create such an OO-layer in view of the high planning- and maintenance effort…

bye
Marco


#9

Hello Marco,

Thank you for the answer :slight_smile:
I am ashamed to say that after testing your code, I realized that there was a slight issue in my kernel code that was causing the error. One can never be too certain about the C-side of the code.
Nevertheless, now everything about the cuMemCpy2d is clear for me and I hope your answer will help other people as well.

Regarding the cumbersomeness, I do realize it and I know that you have been doing your best to keep the 1-1 relation between the API functions and the wrapper. I am forever thankful as your wrapper has made my life much easier and I can only imagine the amount of time that I needed if JCuda never existed!
Me and my professor cannot thank you enough. :slight_smile:


#10

Good to hear that this issue is resolved now. And good to hear that you find JCuda useful :slight_smile:

I know that in general, it is not advisable to transliterate an API from one language to another. Especially in this case: Some aspects of CUDA are very different from how they would be implemented in a real Java Library. And I know that it can be tedious that one always has to wrap float[] arrays into Pointers and specify the data size as “array.length*Sizeof.FLOAT”, while this could easily be done automatically and more conventiently by a method that does simply not accept a Pointer and a size, but the float[] array directly. But when I started JCuda (it actually “emerged” from JCublas and JCufft…) there have been several reasons for me to implement it as a 1:1 mapping, despite the hassle that this may cause. Primarily, the one that I already mentioned: I wanted to map everything from CUDA via JNI to Java as-it-is, in order to have full control about what could be offered by the OO-abstraction layer (last but not least because I had no idea of how an appropriate abstraction layer would look like - I simply could not estimate what might be needed on Java side and what not). I would do some things differently if I had to start it from scratch today, but - now it is what it is, and I certainly won’t throw it away and start anew :wink: