Create 3D array, calculate and view result

Hi.

I'm trying to create a 3D array and then have the kernel iterate over it and do some calculations.

It's a cube of 10x10x10 positions.

As a template I'm using Marco13's very good 2D example, which I'm trying to convert to 3D.

As of now the kernel only calculates roughly every 100th value for some reason. I have tried different index expressions in a[] (see the kernel below),
but I'm not sure which one is correct. I need to get indices in the range 0-999, but I can't figure out how to compute them when all three get_global_id() values need to be combined.

The kernel looks like this:

__kernel void sampleKernel(__global float *a)
{
   int gidX = get_global_id(0);
   int gidY = get_global_id(1);
   int gidZ = get_global_id(2);
   a[???] *= 2;  //just doubles the values to get the example to work
}

Creating the 3D array (the code that fills it with data is excluded from this post):
int arrayXLength = 10;
int arrayYLength = 10;
int arrayZLength = 10;
float array[][][] = new float[arrayXLength][arrayYLength][arrayZLength];

I'm allocating memory like this:
cl_mem mem = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * arrayXLength * arrayYLength * arrayZLength, null, null);

I'm executing the kernel like this:
clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(mem));
clEnqueueNDRangeKernel(commandQueue, kernel, 2, null, new long[]{arrayXLength + ??? + ??? + ???}, null, 0, null, null);

I'm writing to and reading from the buffer as shown below. I guess there are some issues here with the loops and the array lengths :)


	private static void writeBuffer3D(cl_command_queue commandQueue, cl_mem buffer, float array[][][])
	{
		long byteOffset = 0;
		for (int y = 0; y < array.length; y++)
		{
			for (int x = 0; x < array.length; x++)
			{	
				int bytes = array[x][y].length * Sizeof.cl_float;			
				clEnqueueWriteBuffer(commandQueue, buffer, CL_TRUE, byteOffset, bytes, Pointer.to(array[y][x]), 0, null, null);
				byteOffset += bytes;	
			}
		}
	}

	private static void readBuffer3D(cl_command_queue commandQueue, cl_mem buffer, float array[][][])
	{
		long byteOffset = 0;		
		for (int y = 0; y < array.length; y++)
		{
			for (int x = 0; x < array.length; x++)
			{			
				int bytes = array[x][y].length * Sizeof.cl_float;	
				clEnqueueReadBuffer(commandQueue, buffer, CL_TRUE, byteOffset,  bytes,Pointer.to(array[y][x]), 0, null, null);
				byteOffset += bytes;
			}
		}
	}

Does anyone have some good tips on how I can solve this?

Thanks

//Fredrik

Similar questions are asked occasionally, here and in other places, for JOCL and JCuda. There was a recent example, which also links to a Stack Overflow answer. Both refer to JCuda, but the issues are the same. I should probably create some FAQ where I explain this and other points in more detail…

First, you should consider storing your data as a 1D array on the Java side as well. This would allow you to copy the data as one big “chunk” in one run. Imagine an array like float array[][][] = new float[1000][1000][2]; : You would have to run through the first 2 dimensions (1000*1000) with two nested for-loops, and then issue 1 million calls to clEnqueueWriteBuffer, each copying only 2 float values. Not only would the performance be horrible, it’s also error-prone, because you have to manually fiddle around with the offsets to make sure that the 2 floats from array[123][453][k] end up at the right position in the cl_mem buffer. (The copying code that you posted is not correct in this sense. If you still want to use “real” multidimensional arrays on the Java side, I could post an example for this, but I’d not recommend it.)
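As a minimal sketch of this suggestion, an existing float[][][] could be flattened once on the Java side, so that a single clEnqueueWriteBuffer call is enough. The class name Flatten3D and its methods are made up for illustration and are not part of JOCL. The sketch assumes a non-empty, rectangular array and a layout where z is the fastest-changing coordinate:

```java
// Hypothetical helper (not part of JOCL): converts between a float[][][]
// and one flat float[] that can be copied to a cl_mem buffer in one call.
class Flatten3D {

    // Copies array[x][y][z] to flat[z + y * sizeZ + x * sizeY * sizeZ].
    // Assumes a non-empty, rectangular array.
    static float[] flatten(float[][][] array) {
        int sizeX = array.length;
        int sizeY = array[0].length;
        int sizeZ = array[0][0].length;
        float[] flat = new float[sizeX * sizeY * sizeZ];
        int offset = 0;
        for (int x = 0; x < sizeX; x++) {
            for (int y = 0; y < sizeY; y++) {
                // The innermost row is contiguous, so it is copied as a whole
                System.arraycopy(array[x][y], 0, flat, offset, sizeZ);
                offset += sizeZ;
            }
        }
        return flat;
    }

    // Copies the flat data back into an existing 3D array of matching size.
    static void unflatten(float[] flat, float[][][] array) {
        int offset = 0;
        for (float[][] plane : array) {
            for (float[] row : plane) {
                System.arraycopy(flat, offset, row, 0, row.length);
                offset += row.length;
            }
        }
    }
}
```

The flat array can then be passed to one write call via Pointer.to(flat), and unflatten can be used after one read call if the 3D form is still needed afterwards.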

(BTW: Depending on what exactly you want to achieve, and on how much work you will do with this array on the Java side, you could consider wrapping this 1D array in a class that “looks like” a 3D array for Java, to make it a bit more convenient. Roughly something like

class Array3D {
    private final float array[];
    private final int sizeX, sizeY, sizeZ;
    ...
    void set(int x, int y, int z, float value) { ... }
    float get(int x, int y, int z) { ... }
}

but of course, one could go very far here in terms of abstraction and functionality).
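One possible way to fill in such a wrapper is sketched below. Everything beyond the names in the outline above is an assumption: the constructor, the private index method, the data() accessor and the choice of z as the fastest-changing coordinate. Bounds checking is left out to keep it short.

```java
// Sketch of a 1D array that "looks like" a 3D array.
// The layout (z changes fastest) is one possible choice, not the only one.
class Array3D {
    private final float[] array;
    private final int sizeX, sizeY, sizeZ;

    Array3D(int sizeX, int sizeY, int sizeZ) {
        this.sizeX = sizeX;
        this.sizeY = sizeY;
        this.sizeZ = sizeZ;
        this.array = new float[sizeX * sizeY * sizeZ];
    }

    // Maps the 3D coordinates to the index in the 1D array
    private int index(int x, int y, int z) {
        return z + (y * sizeZ) + (x * sizeY * sizeZ);
    }

    void set(int x, int y, int z, float value) {
        array[index(x, y, z)] = value;
    }

    float get(int x, int y, int z) {
        return array[index(x, y, z)];
    }

    // The backing array, e.g. for copying to or from a buffer in one call
    float[] data() {
        return array;
    }

    int getSizeX() { return sizeX; }
    int getSizeY() { return sizeY; }
    int getSizeZ() { return sizeZ; }
}
```

Exposing the backing array through data() keeps the copy to the device a single call, while the rest of the Java code can keep using (x, y, z) coordinates.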

The index computation is then the same on the Java side and inside the kernel: computing the 1D index from the 3D coordinates works identically in both places. The crucial difference is whether the data is laid out lexicographically or colexicographically, more commonly known as row-major vs. column-major order (although these terms intuitively only refer to 2D data).

Basically, the question is whether, for the 2D case, the elements of a matrix are enumerated as


// Row-major
0 1 2 
3 4 5

or


// Column-major
0 2 4 
1 3 5
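As a small illustration, the two enumerations above can be reproduced with the following sketch for a matrix with 2 rows and 3 columns. The class and method names are made up for this example:

```java
// Illustrative only: index computation for a 2D matrix in both layouts.
class MajorOrderDemo {

    // Row-major: the elements of one row are adjacent in memory
    static int rowMajor(int row, int col, int cols) {
        return row * cols + col;
    }

    // Column-major: the elements of one column are adjacent in memory
    static int columnMajor(int row, int col, int rows) {
        return col * rows + row;
    }

    public static void main(String[] args) {
        int rows = 2;
        int cols = 3;
        System.out.println("// Row-major");
        for (int r = 0; r < rows; r++) {
            StringBuilder line = new StringBuilder();
            for (int c = 0; c < cols; c++) {
                line.append(rowMajor(r, c, cols)).append(' ');
            }
            System.out.println(line.toString().trim());
        }
        System.out.println("// Column-major");
        for (int r = 0; r < rows; r++) {
            StringBuilder line = new StringBuilder();
            for (int c = 0; c < cols; c++) {
                line.append(columnMajor(r, c, rows)).append(' ');
            }
            System.out.println(line.toString().trim());
        }
    }
}
```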

This may heavily affect performance (particularly on the Java side). In general, when you have loops like

for (int x=0; x<sizeX; x++) {
    for (int y=0; y<sizeY; y++) {
        for (int z=0; z<sizeZ; z++) {
            // do something with array at (x,y,z)
        }
    }
}

then you should make sure that the innermost loop takes the smallest steps through memory.

This can be achieved as follows:

// Coordinates obtained with get_global_id, or with for-loops on Java side:
int x = ...;
int y = ...;
int z = ...;

// Compute the 1D index:
int index = z + (y * sizeZ) + (x * sizeY * sizeZ);

// Access the array with this 1D index
array[index] = ...;  
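A quick way to convince yourself that this index computation matches the loop order shown above is the following sketch. It checks that looping over x (outer), y and z (inner) visits the 1D indices 0, 1, 2, ... in order. The class and method names are only for illustration:

```java
// Illustrative check: with z as the innermost loop, the computed
// 1D index advances by exactly 1 in every step.
class LoopOrderCheck {

    static int index(int x, int y, int z, int sizeY, int sizeZ) {
        return z + (y * sizeZ) + (x * sizeY * sizeZ);
    }

    // Returns true if the nested loops visit the indices 0, 1, 2, ...
    // without gaps and cover the whole array.
    static boolean isContiguous(int sizeX, int sizeY, int sizeZ) {
        int expected = 0;
        for (int x = 0; x < sizeX; x++) {
            for (int y = 0; y < sizeY; y++) {
                for (int z = 0; z < sizeZ; z++) {
                    if (index(x, y, z, sizeY, sizeZ) != expected) {
                        return false;
                    }
                    expected++;
                }
            }
        }
        return expected == sizeX * sizeY * sizeZ;
    }
}
```

For the 10x10x10 cube from the question, this covers exactly the indices 0 to 999.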

I’d recommend passing the array sizes along the 3 dimensions to the kernel as well. (Some of this information could be derived inside the kernel, but that is a bit fragile, because you might later add some padding or so.) So the kernel code could roughly be

__kernel void sampleKernel(__global float *a, int sizeX, int sizeY, int sizeZ)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int z = get_global_id(2);
    if (x >= sizeX || y >= sizeY || z >= sizeZ) 
    {
        return; 
    }
    int index = z + (y * sizeZ) + (x * sizeY * sizeZ);
    a[index] *= 2; 
}

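(Note: get_global_id actually returns size_t, so x, y and z should strictly be of type size_t (the work sizes are long values on the Java side). I used int to keep it similar to the code that you posted. The size arguments have to be int or another fixed-size type anyway, because OpenCL does not allow size_t as the type of a kernel argument.)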

(BTW: From a given 1D index, you can also compute the 3 indices for the 3D array, if you need to)
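A minimal sketch of this reverse computation, assuming the same layout as above (z changes fastest). The class and method names are made up for illustration:

```java
// Illustrative conversion between 3D coordinates and the 1D index.
class IndexConversion {

    static int toIndex(int x, int y, int z, int sizeY, int sizeZ) {
        return z + (y * sizeZ) + (x * sizeY * sizeZ);
    }

    // Returns the coordinates as an array {x, y, z}
    static int[] toCoordinates(int index, int sizeY, int sizeZ) {
        int z = index % sizeZ;
        int y = (index / sizeZ) % sizeY;
        int x = index / (sizeY * sizeZ);
        return new int[] { x, y, z };
    }
}
```

Note that sizeX is not needed in either direction. It only limits the valid range of x.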

Then, the remaining question is the invocation of the kernel. Make sure to pass 3 as the work_dim parameter to the call (you seemed to use 2 there). The global_work_size is then simply an array containing the sizes of the array, along the respective dimensions:

clEnqueueNDRangeKernel(commandQueue, kernel, 
    3, null, new long[]{sizeX, sizeY, sizeZ}, null, 0, null, null);
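To verify what comes back from the device, it can help to have a plain-Java reference that does the same thing as the kernel. The following is only a sketch: it emulates the 3D range sequentially on the host, and the class and method names are made up. It also shows what the bounds check is good for, namely that a global work size larger than the array does no harm:

```java
// Host-side stand-in for sampleKernel, for comparing results only.
class KernelReference {

    // Does what a single work-item of the kernel does
    static void workItem(float[] a, int x, int y, int z,
            int sizeX, int sizeY, int sizeZ) {
        if (x >= sizeX || y >= sizeY || z >= sizeZ) {
            return;
        }
        int index = z + (y * sizeZ) + (x * sizeY * sizeZ);
        a[index] *= 2;
    }

    // Runs all work-items of a 3D global work size, one after another
    static void run(float[] a, long[] globalWorkSize,
            int sizeX, int sizeY, int sizeZ) {
        for (int x = 0; x < globalWorkSize[0]; x++) {
            for (int y = 0; y < globalWorkSize[1]; y++) {
                for (int z = 0; z < globalWorkSize[2]; z++) {
                    workItem(a, x, y, z, sizeX, sizeY, sizeZ);
                }
            }
        }
    }
}
```

After reading the buffer back, the result can be compared element by element with the array produced by run.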

Hi.

You are making some very good suggestions here :)

Going with 1D in both the kernel and Java seems a lot easier. Then I can just implement a 3D accessor method in Java,
like getData(int x, int y, int z)

Many thanks

//Fredrik

Hi. Just got back from work.

I just created the method based on your solution:

	float getData(int x, int y, int z)
	{
		if(x > 9 || y > 9 || z > 9)
		{
			throw new IllegalArgumentException("Dimension is 0-9 in each 3D direction");
		}
		// Same index computation as in the kernel: z changes fastest
		return array[z + (y * arrayZLength) + (x * arrayYLength * arrayZLength)];
	}

Not much more is needed to convert from 2D to 3D array usage (in my case).

Apart from non-OpenCL-related issues, this should be fine.

(I’d write it like this

private static void check(String name, int index, int size) 
{
    if (index < 0 || index >= size)
    {
        throw new IndexOutOfBoundsException(
            "The "+name+" coordinate must be non-negative and smaller than "+size+", but is "+index);
    }
}

private float getData(float[] array, int x, int y, int z)
{
    check("x", x, arrayXLength);
    check("y", y, arrayYLength);
    check("z", z, arrayZLength);
    ....
}

because this is the right thing to check, an Exception should be of the right type, and the error message should help the receiver to figure out what he did wrong, but … well)
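Put together, a complete version could look roughly like the sketch below. What goes in place of the omitted part is an assumption: it uses the same index computation as the kernel, and the class name and the fixed sizes of 10 are only there to make the example self-contained.

```java
// Sketch of a checked accessor for the flat array; sizes are assumed to be 10.
class CheckedAccess {
    static final int arrayXLength = 10;
    static final int arrayYLength = 10;
    static final int arrayZLength = 10;

    private static void check(String name, int index, int size) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException(
                "The " + name + " coordinate must be non-negative and smaller than "
                + size + ", but is " + index);
        }
    }

    static float getData(float[] array, int x, int y, int z) {
        check("x", x, arrayXLength);
        check("y", y, arrayYLength);
        check("z", z, arrayZLength);
        // Assumed layout: same as in the kernel, z changes fastest
        return array[z + (y * arrayZLength) + (x * arrayYLength * arrayZLength)];
    }
}
```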