From my understanding, private memory is a very low-level detail that is hard to influence directly anyway.
But of course, things like the work-group size, memory usage, and register usage are important for certain performance optimizations. For example, one important tool for deep CUDA/NVIDIA tuning is the https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html . OpenCL, in contrast, offers a built-in abstraction: when you pass NULL as the local work size, the implementation is supposed to choose the work-group size that is "most suitable" for the kernel and the device.
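As a rough illustration of both points, the host code below queries the per-kernel resource figures via `clGetKernelWorkGroupInfo` and then launches with a NULL local work size. It is only a sketch: it assumes you already have a valid command queue, kernel, and device, and omits error checking for brevity.

```c
#include <stdio.h>
#include <CL/cl.h>

/* Sketch: query per-kernel resource usage and launch with an
 * implementation-chosen local work size. Assumes queue, kernel,
 * and device were created elsewhere; error handling omitted. */
void run_with_auto_local_size(cl_command_queue queue, cl_kernel kernel,
                              cl_device_id device, size_t global_size)
{
    size_t wg_size = 0;
    cl_ulong local_mem = 0, private_mem = 0;

    /* What the compiler determined for this kernel on this device */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(wg_size), &wg_size, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_mem), &local_mem, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_mem), &private_mem, NULL);
    printf("max work-group size %zu, local mem %llu B, private mem %llu B\n",
           wg_size, (unsigned long long)local_mem,
           (unsigned long long)private_mem);

    /* local_work_size == NULL: the implementation picks a
     * "most suitable" work-group size for kernel and device */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, NULL);
}
```

Note that `CL_KERNEL_PRIVATE_MEM_SIZE` is exactly the kind of low-level figure mentioned above: you can read it, but there is no portable knob to change it.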
There are some extensions, like https://www.khronos.org/registry/OpenCL/extensions/nv/cl_nv_compiler_options.txt , that you can use for certain optimizations if and only if you know that you are on an NVIDIA platform. But OpenCL is supposed to be as device- and vendor-independent as possible, and manually optimizing for a certain device runs counter to that idea.