CL_KERNEL_PRIVATE_MEM_SIZE always returns 0

I have a kernel that works fine. When I query CL_KERNEL_PRIVATE_MEM_SIZE for it, the result is 0, even though the kernel clearly consumes private memory. Any idea what I am doing wrong?

__kernel void P01(__global float *buffer)
{
    int X = get_global_id(0);  /* private variable, one per work-item */
    __local int a[100];        /* local memory, shared by the work-group */
    a[13] = 13;
    buffer[X] = X;
}
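
For reference, the query itself looks roughly like this (the kernel and device objects are created beforehand; error handling shortened):

#include <stdio.h>
#include <CL/cl.h>

/* Sketch: print the private memory size that the implementation
   reports for a given kernel on a given device. */
static void printPrivateMemSize(cl_kernel kernel, cl_device_id device)
{
    cl_ulong privateMemSize = 0;
    cl_int err = clGetKernelWorkGroupInfo(kernel, device,
        CL_KERNEL_PRIVATE_MEM_SIZE,
        sizeof(privateMemSize), &privateMemSize, NULL);
    if (err == CL_SUCCESS)
        printf("CL_KERNEL_PRIVATE_MEM_SIZE: %llu bytes\n",
            (unsigned long long)privateMemSize);
}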

What the spec says about CL_KERNEL_PRIVATE_MEM_SIZE is a bit vague. There is a discussion related to that specific parameter at https://community.amd.com/thread/153716 , but from the information that I have gathered so far:

  • private memory is something so deeply internal and specific to the device that a developer usually should not be concerned with it (except in extreme cases)
  • local memory and private memory are sometimes confused

Given that you’re using __local in your kernel, I think this might also be the case here: the array a is in local memory, not private memory. Can you provide more details about what you’re trying to accomplish?
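
To illustrate the distinction, here is a small, hypothetical kernel variant that touches both address spaces. Note that the compiler may keep the private array entirely in registers, in which case the reported private memory size can still be 0:

__kernel void MemKinds(__global float *buffer)
{
    int x = get_global_id(0);       /* private: one copy per work-item */

    __local int shared[100];        /* local: one copy per work-group */
    if (get_local_id(0) == 0)
        shared[0] = 13;
    barrier(CLK_LOCAL_MEM_FENCE);

    float scratch[16];              /* private: per-work-item array; may
                                       live in registers or spill */
    for (int i = 0; i < 16; i++)
        scratch[i] = (float)(x + i);

    buffer[x] = scratch[x % 16] + (float)shared[0];
}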

Hi Marco.
I wanted to explore the GPU and tune my kernel for performance. The settings of the various parameters like work group size, memory usage (of the different types), etc. are critical. So I wanted to check CL_KERNEL_PRIVATE_MEM_SIZE too and see what it reports for the kernel I am currently testing. I guess I should first learn somewhat more about the memory architecture and its impact.
Let’s close this here. I am going to continue my work on a new machine with an NVIDIA GPU, and I intend to move to CUDA instead of OpenCL anyway.

From my understanding, the private memory usage is a very low-level detail that is hard to influence directly anyway.

But of course, things like the work group size, memory usage and register usage are important for certain performance optimizations. For example, one important tool for deep CUDA/NVIDIA tuning is the occupancy calculator at https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html . Conversely, OpenCL has a built-in abstraction: when you pass NULL as the local work size to clEnqueueNDRangeKernel, the implementation should pick the work-group size that is “most suitable” for the kernel and the device.
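
As a concrete sketch (the queue and kernel objects are assumed to exist already):

/* Sketch: enqueue with a NULL local work size so that the
   implementation picks a suitable work-group size itself. */
size_t globalWorkSize = 1024;
cl_int err = clEnqueueNDRangeKernel(queue, kernel,
    1,                /* work_dim */
    NULL,             /* global work offset */
    &globalWorkSize,  /* global work size */
    NULL,             /* local work size: let the implementation choose */
    0, NULL, NULL);   /* no event wait list, no returned event */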

There are some extensions like https://www.khronos.org/registry/OpenCL/extensions/nv/cl_nv_compiler_options.txt that you can use for certain optimizations, but only if you know that you’re on an NVIDIA platform. OpenCL is supposed to be as device- and vendor-independent as possible, and manually optimizing for a certain device runs counter to that idea.
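
For example (program and device are assumed to be set up; the flags are only accepted on NVIDIA platforms that expose cl_nv_compiler_options):

/* Sketch: vendor-specific build options from cl_nv_compiler_options.
   -cl-nv-verbose asks the compiler to emit register usage information
   into the build log, -cl-nv-maxrregcount limits registers per thread. */
const char *options = "-cl-nv-verbose -cl-nv-maxrregcount=32";
cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);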

Got it. I will soon have an NVIDIA Quadro RTX 3000 GPU instead of an AMD Radeon R9 375 at my disposal. I guess that will remove any urge for more performance for a while anyway.
Thanks for the tips.