2D array/texture reference with float4

Hi all,

Similar to the previous post on arrays and texture references, I’m trying to get a 2D array working with a texture reference, but where each element is a float4. The program compiles and runs, but my output is not the same as the input. I’m not sure if I am copying the data into the 2D array incorrectly or not, but it looks like something is wrong with my indexing scheme, although I can’t figure it out. The code worked fine with a regular 2D array where numChannels=1, but when I added in the float4 data to the input array (as the last array dimension) and set numChannels=4, the indexing and/or copying is wrong. I couldn’t find any useful examples of this, but i’m guessing it’s something with how i’m setting up my input array and copying it over to a 2D cuda array. Maybe I have to use a 3D array copy if I add in the extra float4 dimension? Here’s the code and output if anyone has any suggestions:

Main method:


[LEFT][SIZE=2]// Initialize the driver and create a context for the first device. [/LEFT]
[/SIZE][LEFT]JCudaDriver.**setExceptionsEnabled**(**[SIZE=2]true**[/SIZE]);
JCudaDriver.**cuInit**(0); 
CUcontext pctx = **[SIZE=2]new**[/SIZE] CUcontext(); 
CUdevice dev = **[SIZE=2]new**[/SIZE] CUdevice(); 
JCudaDriver.**cuDeviceGet**(dev, 0); 
JCudaDriver.**cuCtxCreate**(pctx, 0, dev); 
String cubinFileName = JCudaUtils.**prepareCubinFile**([SIZE=2]"cudaFiles/TestArrayTexture_float4.cu"[/SIZE]);
CUmodule module = **[SIZE=2]new**[/SIZE] CUmodule();
JCudaDriver.**cuModuleLoad**(module, cubinFileName);
CUfunction function = **[SIZE=2]new**[/SIZE] CUfunction();
JCudaDriver.**cuModuleGetFunction**(function, module, [SIZE=2]"run"[/SIZE]);

[SIZE=2]//ROW MAJOR ORDER: linear index = row*NUMCOLS + column[/LEFT]
[/SIZE][LEFT][SIZE=2]//NOTE: x refers to column, y refers to rows; so arrays should be of form array[y][x] / array**[j] / array[row][column] where i=y=row, j=x=column[/LEFT]
[/SIZE][LEFT]
**[SIZE=2]int**[/SIZE] blocks = 1;
**[SIZE=2]int**[/SIZE] height = 3;
**[SIZE=2]int**[/SIZE] width = 2; 
**[SIZE=2]int**[/SIZE] channels = 4;
**[SIZE=2]int**[/SIZE] threadsPerBlock = width*height*channels; 
**[SIZE=2]int**[/SIZE] size = blocks*threadsPerBlock*channels;
**[SIZE=2]float**[/SIZE] output[] = **[SIZE=2]new**[/SIZE] **[SIZE=2]float**[/SIZE][height*width*channels];
**[SIZE=2]float**[/SIZE] input[][][] = **[SIZE=2]new**[/SIZE] **[SIZE=2]float**[/SIZE][height][width][channels];
**[SIZE=2]float**[/SIZE][] input1D = **[SIZE=2]new**[/SIZE] **[SIZE=2]float**[/SIZE][height*width*channels]; 
**[SIZE=2]int**[/SIZE] counter = 0; [/LEFT]
[LEFT]**[SIZE=2]for**[/SIZE] (**[SIZE=2]int**[/SIZE] i=0; i<height; i++){
**[SIZE=2]for**[/SIZE] (**[SIZE=2]int**[/SIZE] j=0; j<width; j++){
**[SIZE=2]for**[/SIZE] (**[SIZE=2]int**[/SIZE] k=0; k<channels; k++){
input**[j][k] = counter++;
input1D[k + j*channels + i*width*channels] = input**[j][k];
}
}
}

 

System.**[SIZE=2]out**[/SIZE].println([SIZE=2]"
Linear array: "[/SIZE] + Arrays.**toString**(input1D));

 
[SIZE=2]//INitialize array[/LEFT]
[/SIZE][LEFT]CUarray dataArray = **[SIZE=2]new**[/SIZE] CUarray(); 
CUDA_ARRAY_DESCRIPTOR desc = **[SIZE=2]new**[/SIZE] CUDA_ARRAY_DESCRIPTOR(); 
desc.[SIZE=2]Format[/SIZE] = CUarray_format.**[SIZE=2]CU_AD_FORMAT_FLOAT**[/SIZE]; 
desc.[SIZE=2]NumChannels[/SIZE] = 4; 
desc.[SIZE=2]Width[/SIZE] = width; 
desc.[SIZE=2]Height[/SIZE] = height; 
**cuArrayCreate**(dataArray, desc);

[SIZE=2]// Copy the volume data data to the 2D array[/LEFT]
[/SIZE][LEFT]CUDA_MEMCPY2D copy = **[SIZE=2]new**[/SIZE] CUDA_MEMCPY2D();
copy.[SIZE=2]srcMemoryType[/SIZE] = CUmemorytype.**[SIZE=2]CU_MEMORYTYPE_HOST**[/SIZE];
copy.[SIZE=2]srcHost[/SIZE] = Pointer.**to**(input1D);
copy.[SIZE=2]srcPitch[/SIZE] = width*Sizeof.**[SIZE=2]FLOAT**[/SIZE];
copy.[SIZE=2]dstMemoryType[/SIZE] = CUmemorytype.**[SIZE=2]CU_MEMORYTYPE_ARRAY**[/SIZE];
copy.[SIZE=2]dstArray[/SIZE] = dataArray;
copy.[SIZE=2]dstPitch[/SIZE] = width;
copy.[SIZE=2]WidthInBytes[/SIZE] = width*Sizeof.**[SIZE=2]FLOAT**[/SIZE];
copy.[SIZE=2]Height[/SIZE] = height;
JCudaDriver.**cuMemcpy2D**(copy);

[SIZE=2]// Set up texture reference [/LEFT]
[/SIZE][LEFT]CUtexref tex = **[SIZE=2]new**[/SIZE] CUtexref();
JCudaDriver.**cuModuleGetTexRef**(tex, module, [SIZE=2]"tex"[/SIZE]);
JCudaDriver.**cuTexRefSetFilterMode**(tex,CUfilter_mode.**[SIZE=2]CU_TR_FILTER_MODE_POINT**[/SIZE]);
JCudaDriver.**cuTexRefSetAddressMode**(tex, 0,CUaddress_mode.**[SIZE=2]CU_TR_ADDRESS_MODE_CLAMP**[/SIZE]);
JCudaDriver.**cuTexRefSetAddressMode**(tex, 1,CUaddress_mode.**[SIZE=2]CU_TR_ADDRESS_MODE_CLAMP**[/SIZE]);
JCudaDriver.**cuTexRefSetFormat**(tex,CUarray_format.**[SIZE=2]CU_AD_FORMAT_FLOAT**[/SIZE], 4);
JCudaDriver.**cuTexRefSetArray**(tex, dataArray,JCudaDriver.**[SIZE=2]CU_TRSA_OVERRIDE_FORMAT**[/SIZE]);

[SIZE=2]// Set up and call function[/LEFT]
[/SIZE][LEFT]JCudaDriver.**cuFuncSetBlockShape**(function, width, height, 1);

CUdeviceptr outputPtr = **[SIZE=2]new**[/SIZE] CUdeviceptr();
**cuMemAlloc**(outputPtr, size * Sizeof.**[SIZE=2]FLOAT**[/SIZE]);
Pointer dOut = Pointer.**to**(outputPtr);

**[SIZE=2]int**[/SIZE] offset = 0;
offset = JCudaDriver.**align**(offset, Sizeof.**[SIZE=2]POINTER**[/SIZE]);
JCudaDriver.**cuParamSetv**(function, offset, dOut, Sizeof.**[SIZE=2]POINTER**[/SIZE]);
offset += Sizeof.**[SIZE=2]POINTER**[/SIZE];

JCudaDriver.**cuParamSetSize**(function, offset);
JCudaDriver.**cuLaunch**(function);
JCudaDriver.**cuCtxSynchronize**();

[SIZE=2]// Copy the result from the device to the host[/LEFT]
[/SIZE][LEFT]JCudaDriver.**cuMemcpyDtoH**(Pointer.**to**(output), outputPtr, size * Sizeof.**[SIZE=2]FLOAT**[/SIZE]);
System.**[SIZE=2]out**[/SIZE].println([SIZE=2]"Result:	"[/SIZE]+Arrays.**toString**(output));

**[SIZE=2]boolean**[/SIZE] passed = **[SIZE=2]true**[/SIZE]; 
**[SIZE=2]for**[/SIZE] (**[SIZE=2]int**[/SIZE] i=0; i<width*height*channels; i++){ 
**[SIZE=2]if**[/SIZE] (input1D** != output**){ 
passed = **[SIZE=2]false**[/SIZE]; 
**[SIZE=2]break**[/SIZE]; 
} 
} 
System.**[SIZE=2]out**[/SIZE].println([SIZE=2]"Test "[/SIZE]+(passed?[SIZE=2]"PASSED"[/SIZE]:[SIZE=2]"FAILED"[/SIZE])); 

[SIZE=2]// Clean up. [/LEFT]
[/SIZE][LEFT]JCudaDriver.**cuArrayDestroy**(dataArray); 
**cuMemFree**(outputPtr);[/LEFT]
}

Kernel code:


[LEFT]texture<float4, 2, cudaReadModeElementType> tex; [SIZE=2]// 2D texture[/LEFT]
[/SIZE][LEFT][/LEFT]
**[SIZE=2][LEFT]extern**[/SIZE] [SIZE=2]"C"[/SIZE] __global__ **[SIZE=2]void**[/SIZE] **run**(**[SIZE=2]float**[/SIZE] *result) {

**[SIZE=2]unsigned**[/SIZE] **[SIZE=2]int**[/SIZE] rowId = threadIdx.x;
**[SIZE=2]unsigned**[/SIZE] **[SIZE=2]int**[/SIZE] columnId = threadIdx.y;

float4 data = tex2D(tex, threadIdx.x, threadIdx.y);

**[SIZE=2]const**[/SIZE] **[SIZE=2]unsigned**[/SIZE] **[SIZE=2]int**[/SIZE] id0 = 0 + threadIdx.x*4 + threadIdx.y * blockDim.x*4; [SIZE=2][/LEFT]
[/SIZE][LEFT]**[SIZE=2]const**[/SIZE] **[SIZE=2]unsigned**[/SIZE] **[SIZE=2]int**[/SIZE] id1 = 1 + threadIdx.x*4 + threadIdx.y * blockDim.x*4;
**[SIZE=2]const**[/SIZE] **[SIZE=2]unsigned**[/SIZE] **[SIZE=2]int**[/SIZE] id2 = 2 + threadIdx.x*4 + threadIdx.y * blockDim.x*4;
**[SIZE=2]const**[/SIZE] **[SIZE=2]unsigned**[/SIZE] **[SIZE=2]int**[/SIZE] id3 = 3 + threadIdx.x*4 + threadIdx.y * blockDim.x*4;

result[id0] = data.x;
result[id1] = data.y;
result[id2] = data.z;
result[id3] = data.w;[/LEFT]
}

Output:


[LEFT]Linear array: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0]
Result: [0.0, 1.0, 2.0, 3.0, 5.0489216E-29, 3.4124506E-24, 7.952046E-34, 9.89932E-32, 2.0, 3.0, 6.0, 7.0, 9.845339E-38, 2.2811966E-36, 1.9690824E-37, 1312.315, 4.0, 5.0, 8.0, 3.673718E-38, 4.473459E-19, 9.403955E-38, 5.075095E-38, 7.61948E-40][/LEFT]
Test FAILED

[LEFT]Thanks
[/LEFT]

Hello

I remember similar issues when porting the volume render example: One array which should have been a 1D array with 4 channels had to be treated as a 2D array to get it working. Thus, in your example, it might work when daclaring the 2D array of float4s as a 3D array, but of course, this technique would find its limit when trying to handle a 3D array with >1 channels - and in any case, such “workarounds” should not be necessary.
Unfortunately I can not test this at the moment, but when I’m back at my home PC (probably by next weekend) I’ll investigate this further and try to find a more general solution and provide an example, if possible

bye
Marco

Hello

I have uploaded an example that shows how to access 1D, 2D and 3D textures of float and float4 values via texture references. http://jcuda.org/samples/samples.html#JCudaDriverTextureTest.

I think one (maybe the main) error in the code you posted was that


copy.srcPitch = width*Sizeof.FLOAT;
...
copy.WidthInBytes = width*Sizeof.FLOAT;

should have been


copy.srcPitch = width*Sizeof.FLOAT * channels;
...
copy.WidthInBytes = width*Sizeof.FLOAT * channels;

since float4 values should be read.

bye
Marco

Marco,

Thank you for uploading the different float4 examples – they will be very useful!

The main error was with the CUDA_MEMCPY2D where I needed to include " *channels "; after I put that in there, it worked fine.

Thanks!