Pass array of vectors as kernel args

I’m trying to build a simple gpu-accelerated raytracer using OpenCL and JOCL. I’ve implemented the raytracing algorithm in java and tested it, works fine so far. Now I want to port it to OpenCL C. My code heavily makes use of the vector functions in OpenCL on float3’s, so I would like to use them directly. However, I can’t figure out how to pass a vector to the kernel. I’ve read that you should pass them as arrays, and that float3 also needs an array of 4 elements, but nothing seems to help.

This is my (testing) kernel:

__kernel void raycast(__global const float3* vec) {
        const int gid = get_global_id(0);
        printf("Vec: %d\n", vec[gid]);
        printf("Components: [%d|%d|%d|%d]", vec[gid].x, vec[gid].y, vec[gid].z);

I create the pointer to the vector array:

Pointer elementPointer = float[] { 1f, 2f, 3f, 0f });
Pointer pointer = Pointer[] { elementPointer });

Then I create a buffer for it:

cl_mem mem = CL.clCreateBuffer(context, CL.CL_MEM_READ_ONLY | CL.CL_MEM_COPY_HOST_PTR,
                               Sizeof.POINTER, pointer, null);

Finally, setting the kernel args:

CL.clSetKernelArg(kernel, 0, Sizeof.cl_mem,;

Executing this results in the following output (it is consistent across multiple executions):

Vec: 826450672
Components: [0|1042830430|0]

I’ve also tried using the elementPointer directly and specifying the size Sizeof.cl_float3, but no difference.

You are using %d as the format for the printf call. But %d is for integer values. For floating point values, it has to be %f. Also, you’re trying to print four numbers, while there are only three.

So the line should be

printf("Components: [%f|%f|%f]", vec[gid].x, vec[gid].y, vec[gid].z);

(And you should remove this printf("Vec: %d\n", vec[gid]);, because this is wrong for the same reason, and you cannot print a full vector like that…)

I assume that you may have spent some time with „trial and error“ because you received unexpected output, and this might be the reason why you got some pointer-indirections in the host code wrong: The pointer that is passed to clCreateBuffer should be the elementPointer directly (and not a pointer to that pointer). Of course, the size for the memory copy then has to be updated accordingly: The size should be Sizeof.cl_float * numberOfFloatValues.

Quickly combined into an example (just quickly written down to show the main points) :

__kernel void raycast(__global const float3* vec) {
    const int gid = get_global_id(0);
    printf("Global index %d, components: [%f|%f|%f]\n", 
        gid, vec[gid].x, vec[gid].y, vec[gid].z);

package org.jocl.test;

import static org.jocl.CL.*;

import org.jocl.*;

import java.nio.file.Files;
import java.nio.file.Paths;

public class JOCLRaycastTest
    private static final int platformIndex = 0;
    private static final long deviceType = CL_DEVICE_TYPE_ALL;
    private static final int deviceIndex = 0;
    private static cl_context context;
    private static cl_command_queue commandQueue;
    private static cl_kernel kernel;
    public static void main(String args[])
        // Default initialization
        int numVectors = 5;
        // This is 4 even for 3D vectors, because cl_float3=cl_float4!
        int numComponentsPerVector = 4;
        // Create the array and fill it with example data
        float[] array = new float[numVectors * numComponentsPerVector];
        for (int i = 0; i < array.length; i++)
            array[i] = i;
        int totalSizeOfData = Sizeof.cl_float * array.length;
        // Create the memory object for the array
        Pointer elementPointer =;
        cl_mem mem = clCreateBuffer(context, 
            totalSizeOfData, elementPointer, null);
        clSetKernelArg(kernel, 0, Sizeof.cl_mem,;
        // Set the work-item dimensions
        long global_work_size[] = new long[]{ numVectors };
        long local_work_size[] = new long[]{ numVectors };
        // Execute the kernel
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
            global_work_size, local_work_size, 0, null, null);
    private static void defaultInitialization()
        String programSource = readFile("");

        // Obtain the number of platforms
        int numPlatformsArray[] = new int[1];
        clGetPlatformIDs(0, null, numPlatformsArray);
        int numPlatforms = numPlatformsArray[0];

        // Obtain a platform ID
        cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_platform_id platform = platforms[platformIndex];

        // Initialize the context properties
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);
        // Obtain the number of devices for the platform
        int numDevicesArray[] = new int[1];
        clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
        int numDevices = numDevicesArray[0];
        // Obtain a device ID 
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
        cl_device_id device = devices[deviceIndex];

        // Create a context for the selected device
        context = clCreateContext(
            contextProperties, 1, new cl_device_id[]{device}, 
            null, null, null);
        // Create a command-queue for the selected device
        cl_queue_properties properties = new cl_queue_properties();
        commandQueue = clCreateCommandQueueWithProperties(
            context, device, properties, null);
        // Create the program from the source code
        cl_program program = clCreateProgramWithSource(context,
            1, new String[]{ programSource }, null, null);
        // Build the program
        clBuildProgram(program, 0, null, null, null, null);
        // Create the kernel
        kernel = clCreateKernel(program, "raycast", null);
    private static String readFile(String fileName)
            return new String(Files.readAllBytes(Paths.get(fileName)));
        catch (IOException e)
            return null;

The output should be

Global index 0, components: [0.000000|1.000000|2.000000]
Global index 1, components: [4.000000|5.000000|6.000000]
Global index 2, components: [8.000000|9.000000|10.000000]
Global index 3, components: [12.000000|13.000000|14.000000]
Global index 4, components: [16.000000|17.000000|18.000000]


You might notice that some values appear to be missing in the output. It goes 0,1,2, 4,5,6 .... As you mentioned, this is because cl_float3 is the same as cl_float4 in OpenCL. This means that the array does not contain numVectors*3, but numVectors*4 floating point values (and some of them are simply not used).

It may be a matter of taste what someone finds more confusing:

  • having to deal with 4xn-sized arrays, even though it says cl_float3 OR
  • using cl_float4 even though one only needs cl_float3 in the kernel

If you want to, you could also simply replace all float3's with float4s (also on the host side), and write the kernel as

__kernel void raycast(__global const float4* vec) {
    const int gid = get_global_id(0);
    printf("Global index %d, components: [%f|%f|%f|%f]\n", 
        gid, vec[gid].x, vec[gid].y, vec[gid].z, vec[gid].w);

which would print

Global index 0, components: [0.000000|1.000000|2.000000|3.000000]
Global index 1, components: [4.000000|5.000000|6.000000|7.000000]
Global index 2, components: [8.000000|9.000000|10.000000|11.000000]
Global index 3, components: [12.000000|13.000000|14.000000|15.000000]
Global index 4, components: [16.000000|17.000000|18.000000|19.000000]

Thanks a lot, that fixed it! The idea of passing all float3s in a single array really did it for me. Also, sorry for getting the printf so wrong, I never used anything C-like before and I’m never using it in Java.

I see, you already fixed the pointer size from your initial message to Sizeof.cl_float instead of Sizeof.float3 times the array size.

Yes, it could either be Sizeof.cl_float3 * numVectors, or Sizeof.cl_float * array.length, where array.length==(numVectors * numComponentsPerVector). I mixed it up first, but fixed it quickly.

(The fact that cl_float3==cl_float4 can be confusing, when accidentally making the (wrong) asssumption that a cl_float3 was just 3 float components…)