One further question: what effect does clCreateBuffer(…, null, null) actually have? For instance, I'm using multiple GPUs with a shared context, and the GPU allocation cannot be performed by clCreateBuffer(), since the call does not know which GPU it will allocate into. Is this rather setting up a specification for an allocation that will be performed later? If so, what does clCreateBuffer(…, CL_MEM_COPY_HOST_PTR, …, ptr, null) actually do, as it cannot perform the copy to the GPU?[/QUOTE]
For the nitty-gritty details about the meaning of the allocation flags, the ultimate reference is the specification. Properly interpreting these flags (and their combinations) is a bit involved. Of course, web searches bring up some results here - e.g. the table here offers some guidelines (but it does not cover the newest (OpenCL 2.0) flags).
Concerning the more general question of where the memory is allocated: you don't know. Yes, that's a bit odd, but OpenCL tries to abstract away from the hardware, and thus from the question of where the memory is actually allocated. When you create a buffer on a context with multiple devices, it has to behave as if the memory was allocated on all devices (and in fact, it might be). For more fine-grained control, clEnqueueMigrateMemObjects was added in OpenCL 1.2.
I did some basic tests for this when I had access to a 2-GPU machine a few years ago, but it was not my own machine (it was a special one at my office at work), so I did not have the chance to do a more detailed analysis. But IIRC, I was able to observe effects that indicated that some sort of "synchronization" took place (i.e. I noticed timing effects that could only be explained by the OpenCL implementation copying modified buffers back and forth between the two GPUs). Due to this limited experience, I cannot give profound advice about "best practices" here in the context of larger applications.