Allocating memory within kernels

A quick CUDA question.
When we declare and initialize a variable within a CUDA kernel, it uses GPU memory. That means there must be some limit on the memory usage of a kernel.
What I want to know is: when we use a __shared__ int variable, one copy of it is allocated per block in the grid. What happens if we simply use a plain int in the kernel? Would memory be allocated for each thread in each block?
I need a brief explanation. Thanks…

Eisenhiem

Hello

Simple variables that are used inside a kernel are usually stored in registers, so each thread gets its own private copy of them. Of course, the number of registers is limited - as far as I know, both per thread AND per block. For the precise numbers, one would have to look up the specs of the particular GPU. But as an example, with arbitrarily chosen numbers to make clear what this means (a short sketch contrasting plain and __shared__ variables follows the list):

  • You may have 8192 registers per block
  • You may have 32 registers per thread
  • You may have 512 threads per block

That would mean that you could

  • EITHER have blocks with 512 threads, but each thread could then only use 8192/512 = 16 registers
  • OR have blocks with 256 threads, where each thread could then use 8192/256 = 32 registers
  • OR have blocks with 128 threads, where each thread could still only use 32 registers, because now the per-thread limit applies before the per-block budget is exhausted (8192/128 = 64, but 32 is the maximum per thread)
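
To connect this back to the question, here is a minimal sketch (the names are just made up for illustration) of the difference between a plain int and a __shared__ int:

__global__ void example(int *out)
{
    // Plain local variable: one private copy per thread,
    // normally held in a register
    int perThread = threadIdx.x;

    // __shared__ variable: one copy per block, visible to
    // all threads of that block
    __shared__ int perBlock;
    if (threadIdx.x == 0)
    {
        perBlock = blockIdx.x;
    }
    __syncthreads(); // make the write visible to the whole block

    out[blockIdx.x * blockDim.x + threadIdx.x] = perThread + perBlock;
}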

The “optimum” values (and much more) can, as far as I understood it, be computed with the CUDA Occupancy Calculator, but I have not yet investigated this in detail. I should do so, however…
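
By the way, instead of looking these limits up in spec sheets, they can also be queried at runtime. A small host-side sketch, using the standard CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the resource limits of device 0
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Registers per block    : %d\n", prop.regsPerBlock);
    printf("Max threads per block  : %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}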

In any case, this means that you should not use an excessive number of local variables. For example, a kernel like


__global__ void kernel(...)
{
    int var00;
    int var01;
    ... // up to...
    int var99;

    // more code here...
}

would perform badly due to its high number of local variables. When there are not enough registers for all the variables, the excess ones are spilled to so-called “local” memory, which physically resides in slow global memory, and reading and writing these variables becomes horribly slow - this is frequently discussed in the NVIDIA forums, in threads with titles like “register pressure” or “reducing register count”. (The actual number of registers a kernel uses can be printed by compiling with nvcc -Xptxas -v; the PTX file itself only contains virtual registers, so it gives a rough hint at best.)
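
If register pressure turns out to be the problem, there are two standard knobs, sketched here (the kernel body is only a placeholder):

// Compile with: nvcc -Xptxas -v example.cu
//   ptxas then reports how many registers each thread actually uses.
// Alternatively, nvcc -maxrregcount=32 caps the register count for
// all kernels in the file (the excess is spilled to local memory).

// __launch_bounds__ tells the compiler the maximum block size,
// so it can budget the registers accordingly.
__global__ void __launch_bounds__(256) capped(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f; // placeholder body
}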

But again, the disclaimer: I’m not much of a CUDA expert, and this is only how I understood it. More… “profound” and detailed information can probably be found in the NVIDIA forums - and in any case, properly taking all these things into account when designing “good” kernels can be challenging.

bye
Marco

Hmmm… thanks.
Got it. I’ll find some other way to tackle the problem, instead of using large arrays etc.

Thx,
Eisenhiem

Well, there are several options: Shared memory is (in the best case) nearly as fast as registers. Textures/Images could be an option: they are cached, and may be comparatively fast depending on the memory access pattern. Plain global memory is rather slow, and constant memory, although cached, only really helps when all threads read the same address.
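
As a small sketch of the shared-memory option - a block-wide sum that uses shared memory as fast, block-local scratch space (launched with BLOCK_SIZE threads per block; the input length is assumed to be a multiple of BLOCK_SIZE):

#define BLOCK_SIZE 256

__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buffer[BLOCK_SIZE];

    int tid = threadIdx.x;
    buffer[tid] = in[blockIdx.x * BLOCK_SIZE + tid];
    __syncthreads();

    // Tree reduction within the block
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride /= 2)
    {
        if (tid < stride)
        {
            buffer[tid] += buffer[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0)
    {
        out[blockIdx.x] = buffer[0];
    }
}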