How to re-use global memory between kernel invocations

Hi,

My use case is neural-net related. I'm new to OpenCL and JOCL, so apologies if this is a very basic question.

I have a large sample data set that is copied from the host to GPU global memory; the sample data does not change between kernel invocations. For each training cycle I need to invoke the kernel with a pointer to the sample data and a pointer to an array of weights in global memory that is specific to that training cycle. After each cycle the host copies back the results, updates the weights, and then needs to re-invoke the kernel with the new weights. My question is: how do I write the new weights to the global memory allocation that was originally created with clCreateBuffer()?

Thanks, Joe

Hi

From what you have described so far, I don't see why this shouldn't be possible simply with clEnqueueWriteBuffer. At least, the flow so far sounds roughly like this:

// Initially, create the weights memory object. With a 'null' host
// pointer, its contents are undefined until something is written to it.
float weights[] = new float[n];
cl_mem weightsMem = clCreateBuffer(context, CL_MEM_READ_ONLY, 
    weights.length * Sizeof.cl_float, null, null);

// Copy the initial weights from the host to the memory object
// (a blocking write, so the host array may be reused right away):
clEnqueueWriteBuffer(commandQueue, weightsMem, 
    true, 0, weights.length * Sizeof.cl_float, 
    Pointer.to(weights), 0, null, null);

// The cycles...
for (int i=0; i<numCycles; i++)
{
    // Use the weights memory object in the kernel:
    clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(weightsMem));
    clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, 
        new long[]{ n }, null, 0, null, null);

    // Modify the weights in host memory
    weights[0] += 123.456f;

    // Copy the updated weights from the host to the memory object,
    // so that the next kernel invocation sees the new values:
    clEnqueueWriteBuffer(commandQueue, weightsMem, 
        true, 0, weights.length * Sizeof.cl_float, 
        Pointer.to(weights), 0, null, null);
}
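
Since you mentioned that the host copies back the results after each cycle: that part works analogously, with clEnqueueReadBuffer. A minimal sketch - the "resultsMem" buffer and the "results" host array are just placeholder names here for whatever your kernel writes its output into:

// After the clEnqueueNDRangeKernel call of one cycle, read the results
// back into a host array ("resultsMem" and "results" are placeholders):
clEnqueueReadBuffer(commandQueue, resultsMem, 
    true, 0, results.length * Sizeof.cl_float, 
    Pointer.to(results), 0, null, null);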

Did I miss something here?

Thanks Marco, yes this is it. I was confused by an example I had used that only passed the host pointer in the clCreateBuffer() call.

One further question: what effect does clCreateBuffer(…, null, null) actually have? For instance, I'm using multiple GPUs with a shared context, so the allocation on a particular GPU cannot be performed by clCreateBuffer(), since the call does not know which GPU it will allocate into. Is this rather setting up a specification for an allocation that will be performed later? If so, what does clCreateBuffer(…, CL_MEM_COPY_HOST_PTR, …, ptr, null) actually do, since it cannot perform the copy to the GPU?

Thanks, Joe

[QUOTE=devmonkey]
One further question: what effect does clCreateBuffer(…, null, null) actually have? For instance, I'm using multiple GPUs with a shared context, so the allocation on a particular GPU cannot be performed by clCreateBuffer(), since the call does not know which GPU it will allocate into. Is this rather setting up a specification for an allocation that will be performed later? If so, what does clCreateBuffer(…, CL_MEM_COPY_HOST_PTR, …, ptr, null) actually do, since it cannot perform the copy to the GPU?[/QUOTE]

For the nitty-gritty details about the meaning of the allocation flags, the ultimate reference is the OpenCL specification. Properly interpreting these flags (and their combinations) is a bit more involved. Of course, web searches bring up some results here as well - there are tables that offer guidelines for the different flag combinations (although they usually do not yet include the newest OpenCL 2.0 flags).
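
To connect this with your CL_MEM_COPY_HOST_PTR question: conceptually, creating the buffer with CL_MEM_COPY_HOST_PTR should lead to the same buffer contents as creating an uninitialized buffer and then filling it with clEnqueueWriteBuffer; mainly the point in time at which the copy happens (and where the data physically ends up) is left to the implementation. A rough sketch, where "sampleData" is just a placeholder for your host-side sample array:

// Variant A: let the implementation copy the host data when
// the buffer is created:
cl_mem sampleMemA = clCreateBuffer(context, 
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 
    sampleData.length * Sizeof.cl_float, Pointer.to(sampleData), null);

// Variant B: create an uninitialized buffer and fill it explicitly:
cl_mem sampleMemB = clCreateBuffer(context, CL_MEM_READ_ONLY, 
    sampleData.length * Sizeof.cl_float, null, null);
clEnqueueWriteBuffer(commandQueue, sampleMemB, 
    true, 0, sampleData.length * Sizeof.cl_float, 
    Pointer.to(sampleData), 0, null, null);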

Concerning the more general question of where the memory is allocated: you don't know. Yes, that's a bit odd, but OpenCL tries to abstract away from the hardware, and from the question of where the memory actually resides. When you create a buffer on a context with multiple devices, it essentially has to behave as if the memory was allocated on all devices (and, in fact, it might be). For more fine-grained control, clEnqueueMigrateMemObjects was added in OpenCL 1.2.
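
As a rough sketch of what such an explicit migration could look like in JOCL (the queue and buffer names are placeholders): a buffer can be migrated to the device that is associated with a particular command queue, ideally before the kernels on that device access it.

// Migrate the sample data buffer to the device associated with
// "commandQueueGpu1" (a placeholder name). With flags=0, the
// destination is the device of the given command queue:
clEnqueueMigrateMemObjects(commandQueueGpu1, 1, 
    new cl_mem[]{ sampleDataMem }, 0, 0, null, null);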

I did some basic tests for this when I had access to a 2-GPU machine a few years ago, but it was not my own machine (it was a special one at my office at work), so I did not have the chance to do a more detailed analysis. But IIRC, I was able to observe effects that indicated that some sort of "synchronization" took place (i.e. timing effects that could only be explained by the OpenCL implementation copying modified buffers back and forth between the two GPUs). Due to this limited experience, I cannot give well-founded advice about "best practices" in the context of larger applications.