How to use clCreateSubBuffer?

Hi Marco,

Can you please give me an example of how to use org.jocl.CL.clCreateSubBuffer?

Thank You!
Roland.

Hello,

I’ll post an example on Sunday or Monday. I think I created one when I tested the new OpenCL 1.1 functions.

bye
Marco

OK, I did not test this function extensively :o otherwise I would have noticed that there are two minor issues: The first is that the creation flags should be a ‘long’ value (but this is trivial to fix in the next release). The other is that creating the ‘buffer_create_info’ structure can be a hassle when trying to do it in the most generic way. The ‘buffer_create_info’ is a pointer to a structure that contains two size_t values:


typedef struct _cl_buffer_region {
    size_t origin;
    size_t size;
} cl_buffer_region;

At the moment, this is simply treated as a usual pointer. That means that one has to check on the Java side what the size of a size_t actually is. I have attached an example of how this may be done (I’ll have to verify it on a 64bit machine), but I’ll try to find a simpler and more intuitive solution for handling this.

/*
 * JOCL - Java bindings for OpenCL
 * 
 * Copyright 2012 Marco Hutter - http://www.jocl.org/
 */
package org.jocl.samples;

import static org.jocl.CL.*;

import java.nio.*;
import java.util.*;

import org.jocl.*;


/**
 * A sample demonstrating how to create sub-buffers
 * that have been introduced with OpenCL 1.1.
 */
public class JOCLSubBufferSample
{
    private static cl_context context;
    private static cl_command_queue commandQueue;

    /**
     * The entry point of this sample
     * 
     * @param args Not used
     */
    public static void main(String args[])
    {
        simpleInitialization();
        
        // Create an array with 8 elements and consecutive values
        int fullSize = 8;
        float fullArray[] = new float[fullSize];
        for (int i=0; i<fullSize; i++)
        {
            fullArray[i] = i;
        }
        System.out.println("Full input array  : "+Arrays.toString(fullArray));
        
        // Create a buffer for the full array
        cl_mem fullMem = clCreateBuffer(context, 
            CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
            Sizeof.cl_float * fullSize, Pointer.to(fullArray), null);

        // Create a sub-buffer
        int subOffset = 2;
        int subSize = 4;
        cl_mem subMem = clCreateSubBuffer(fullMem, 
            (int)CL_MEM_READ_WRITE, CL_BUFFER_CREATE_TYPE_REGION, 
            createInfo(subOffset, subSize, Sizeof.cl_float), null);

        // Create an array for the sub-buffer, and copy the data
        // from the sub-buffer to the array
        float subArray[] = new float[subSize];
        clEnqueueReadBuffer(commandQueue, subMem, true, 
            0, subSize * Sizeof.cl_float, Pointer.to(subArray), 
            0, null, null);
        
        System.out.println("Read sub-array    : "+Arrays.toString(subArray));

        // Modify the data in the sub-array, and copy it back
        // into the sub-buffer
        subArray[0] = -5;
        subArray[1] = -4;
        subArray[2] = -3;
        subArray[3] = -2;
        clEnqueueWriteBuffer(commandQueue, subMem, true, 
            0, subSize * Sizeof.cl_float, Pointer.to(subArray), 
            0, null, null);

        System.out.println("Modified sub-array: "+Arrays.toString(subArray));
        
        // Read the full buffer back into the array 
        clEnqueueReadBuffer(commandQueue, fullMem, true, 
            0, fullSize * Sizeof.cl_float, Pointer.to(fullArray), 
            0, null, null);
        
        System.out.println("Full result array : "+Arrays.toString(fullArray));
        
    }
    
    /**
     * Create a pointer to a 'buffer_create_info' struct for the
     * {@link CL#clCreateSubBuffer(cl_mem, int, int, Pointer, int[])}
     * call
     * 
     * @param offset The sub-buffer offset, in number of elements
     * @param size The sub-buffer size, in number of elements
     * @param elementSize The size of the buffer elements (e.g. Sizeof.cl_float)
     * @return The pointer to the buffer creation info
     */
    private static Pointer createInfo(long offset, long size, int elementSize)
    {
        // The 'buffer_create_info' is a struct with two size_t
        // values on native side. This is emulated with a 
        // byte buffer of the appropriate size
        ByteBuffer createInfo = 
            ByteBuffer.allocate(2 * Sizeof.size_t).order(
                ByteOrder.nativeOrder());
        if (Sizeof.size_t == Sizeof.cl_int)
        {
            createInfo.putInt(0, (int)offset * elementSize); 
            createInfo.putInt(Sizeof.size_t , (int)size * elementSize);
        }
        else
        {
            createInfo.putLong(0, offset * elementSize); 
            createInfo.putLong(Sizeof.size_t , size * elementSize);
        }
        return Pointer.to(createInfo);
    }
    
    /**
     * Simple OpenCL initialization of the context and command queue
     */
    private static void simpleInitialization()
    {
        // The platform, device type and device number
        // that will be used
        final int platformIndex = 0;
        final long deviceType = CL_DEVICE_TYPE_ALL;
        final int deviceIndex = 0;

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Obtain the number of platforms
        int numPlatformsArray[] = new int[1];
        clGetPlatformIDs(0, null, numPlatformsArray);
        int numPlatforms = numPlatformsArray[0];

        // Obtain a platform ID
        cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_platform_id platform = platforms[platformIndex];

        // Initialize the context properties
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);
        
        // Obtain the number of devices for the platform
        int numDevicesArray[] = new int[1];
        clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
        int numDevices = numDevicesArray[0];
        
        // Obtain a device ID 
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
        cl_device_id device = devices[deviceIndex];

        // Create a context for the selected device
        context = clCreateContext(
            contextProperties, 1, new cl_device_id[]{device}, 
            null, null, null);
        
        // Create a command-queue
        commandQueue = 
            clCreateCommandQueue(context, devices[0], 0, null);
    }
    
    
}

Hello Marco,

Your example helped me a lot! Thank You!

Roland.

OK, the “createInfo” method was… -_- ehrm… could be simplified to

    private static Pointer createInfo(long offset, long size, int elementSize)
    {
        if (Sizeof.size_t == Sizeof.cl_int)
        {
            return Pointer.to(new int[] { 
                (int)offset * elementSize,
                (int)size * elementSize
            });
        }
        else
        {
            return Pointer.to(new long[] { 
                offset * elementSize,
                size * elementSize
            });
        }
    }

but this is still not so nice (and also not yet tested on 64bit). Maybe I’ll create a “cl_buffer_region” class that would allow to pass this info to the native side without having to check for the size_t size on Java side.
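A “cl_buffer_region” class as envisioned above might look roughly like the following sketch. This is not the actual JOCL class, just an illustration of the idea: the class stores the two size_t values and packs them into a native-order byte buffer of the appropriate width, so that callers no longer have to check Sizeof.size_t themselves. The class name and the hard-coded size_t width are assumptions for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BufferRegion
{
    // Assumed size of a native size_t, in bytes (4 on 32bit, 8 on 64bit).
    // In a real binding, this would come from the native side.
    private static final int SIZEOF_SIZE_T = 8;

    private final long origin;
    private final long size;

    public BufferRegion(long origin, long size)
    {
        this.origin = origin;
        this.size = size;
    }

    /** Pack origin and size as two native-order size_t values */
    public ByteBuffer toByteBuffer()
    {
        ByteBuffer b = ByteBuffer.allocate(2 * SIZEOF_SIZE_T)
            .order(ByteOrder.nativeOrder());
        if (SIZEOF_SIZE_T == 4)
        {
            b.putInt(0, (int)origin);
            b.putInt(4, (int)size);
        }
        else
        {
            b.putLong(0, origin);
            b.putLong(8, size);
        }
        return b;
    }

    public static void main(String[] args)
    {
        ByteBuffer b = new BufferRegion(8, 16).toByteBuffer();
        System.out.println(b.capacity());  // 2 * 8 bytes
        System.out.println(b.getLong(0));  // the packed 'origin'
        System.out.println(b.getLong(8));  // the packed 'size'
    }
}
```

The caller would then pass `Pointer.to(region.toByteBuffer())` instead of building the buffer manually.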

Sorry for reviving an old thread. This doesn’t seem to work on certain devices, such as AMD: it produces a CL_MISALIGNED_SUB_BUFFER_OFFSET error. It works well with Intel GPUs, though.

This might be due to not querying the required alignment with

long [] align = new long[1];
CL.clGetDeviceInfo(device, CL.CL_DEVICE_MEM_BASE_ADDR_ALIGN, Sizeof.size_t, Pointer.to(align), null);

But I’m not sure how this value should be applied to the offset and size calculations. Might there be a way?

@joedizzle

Not sure about that. I once had OpenCL for AMD CPUs set up here, but right now I only have it for NVIDIA GPUs. Time flies…

What does it print for you when you print align[0] as obtained with your code snippet?

(For me it prints 4096, which looks a bit odd: Whatever the alignment of the main buffer is, the offset of the sub-buffer in the sample is certainly not aligned to 4096 bytes, so there’s some guesswork involved about what is actually causing this error…)

After scouring the internet, I found that this mostly happens with AMD drivers; I believe on GPUs, not sure about CPUs. In my case align[0] gives 2048. This value is in bits, which translates to 256 bytes. For an array of int values, the clCreateSubBuffer offset therefore has to be a multiple of 64 elements (since a cl_int is 4 bytes). Starting the offset at 64, 128, 256, … makes the program work. But this loses the arbitrary offsets that we are accustomed to from pointer-based slicing of arrays. I believe NVIDIA devices have no issues with this.
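The arithmetic above can be sketched in plain Java. The helper name is made up, and the 2048-bit alignment is just the value reported in this case, not a general rule:

```java
public class SubBufferAlignment
{
    // Round a byte offset up to the next multiple of the device alignment
    static long alignUp(long offsetBytes, long alignBytes)
    {
        return ((offsetBytes + alignBytes - 1) / alignBytes) * alignBytes;
    }

    public static void main(String[] args)
    {
        long alignBits  = 2048;           // CL_DEVICE_MEM_BASE_ADDR_ALIGN, reported in bits
        long alignBytes = alignBits / 8;  // 256 bytes
        long alignInts  = alignBytes / 4; // 64 elements, since a cl_int is 4 bytes

        System.out.println(alignBytes);
        System.out.println(alignInts);

        // An arbitrary element offset like 100 would have to be rounded
        // up to the next aligned element index:
        long byteOffset = alignUp(100 * 4, alignBytes);
        System.out.println(byteOffset / 4);
    }
}
```

So on such a device, a sub-buffer that was meant to start at element 100 could at best start at element 128.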

This discussion can give a glimpse of the issue… clCreateSubBuffer Error 13 only on AMD

Such a bummer that I now need to pass an offset value to the kernel instead.
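That workaround could look roughly like the following sketch (not from this thread; the kernel name and signature are made up for illustration): instead of creating a sub-buffer, the full buffer is passed together with a start offset, and the kernel indexes relative to that offset.

```java
public class OffsetKernelSource
{
    // Hypothetical kernel: the 'offset' argument replaces the sub-buffer origin
    static final String SOURCE =
        "__kernel void addOne(__global float *data, int offset)\n" +
        "{\n" +
        "    int gid = get_global_id(0);\n" +
        "    data[offset + gid] += 1.0f;\n" +
        "}\n";

    public static void main(String[] args)
    {
        System.out.print(SOURCE);
    }
}
```

The offset itself would then be set with an ordinary `clSetKernelArg` call, so no alignment requirement applies to it.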

So, I’ll probably have to re-read parts of the documentation here.

When you say that the CL_DEVICE_MEM_BASE_ADDR_ALIGN is 256 bytes for you, then it’s not clear why it works with 64 or 128 bytes as well - i.e. I’m not sure where this alignment requirement actually comes into play.

There’s probably a deep technical reason for this requirement, and … usually, when there’s an obscure requirement in GPU computing, then it’s likely justified with *jazz hands* something about ‘performance’. Using an offset inside the kernel might then cause some sort of “misaligned access” to happen. But these are just guesses for now.

Sorry, I think my articulation was bad. What I mean is that align[0] gives 2048 bits for my AMD GPU, which translates to 256 bytes. Therefore, for an array of ints, the accepted offset indices are array[64], array[128], array[256], …, which correspond to sub-buffers created with byte offsets of 64 * 4, 128 * 4, 256 * 4, … This only happens with AMD drivers.

Quite mind-boggling. Anyway, now I’m appreciating why shared virtual memory was introduced in OpenCL 2.0.

usually, when there’s an obscure requirement in GPU computing, then it’s likely justified with *jazz hands* something about ‘performance’.

This is so true. All the reasons I’m finding is because of this.