CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES and memory allocation

I get this error.

What I do:

```
CUdevice device = new CUdevice();
JCudaDriver.cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
JCudaDriver.cuCtxCreate(context, 0, device);
long malloc_limit = 4096L * 1024L * 1024L;
JCudaDriver.cuCtxSetLimit(cudaLimitMallocHeapSize, malloc_limit);
```
(I call this twice in the same program.)
The second time, it yields the error.
My guess: it runs out of memory because the previous allocation is never freed.
Is there a way to check whether the memory has already been allocated, and if so, how much? It's a huge program with plenty of unknown variables, and I would like to be able to do this dynamically.
Meaning, deallocate the memory if possible.
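(A minimal sketch of the kind of check I mean, assuming a current context already exists; cuMemGetInfo at least reports the overall free and total device memory:)

```
// Sketch: query the overall free/total device memory of the current
// context. This does not say who allocated what, but it shows how
// much is left at this point.
long[] free = { 0 };
long[] total = { 0 };
JCudaDriver.cuMemGetInfo(free, total);
System.out.println("Free: " + free[0] + " / total: " + total[0] + " bytes");
```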

One other thing, about this line: JCudaDriver.cuCtxSetLimit(cudaLimitMallocHeapSize, malloc_limit);
As I understand it, it just preallocates the maximum size that can be allocated from within the kernels. So I guess the space that is left is what I can use for cuMemcpyHtoD? Is there a method that reserves the size for that? Or is everything included in cuCtxSetLimit?
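(To illustrate, a rough sketch of the kind of allocation I mean; hostData and n are placeholders, not from my actual program:)

```
// Rough sketch with placeholder names (hostData, n): cuMemAlloc
// explicitly reserves device memory, cuMemcpyHtoD copies into it.
float[] hostData = new float[n];
CUdeviceptr devicePointer = new CUdeviceptr();
JCudaDriver.cuMemAlloc(devicePointer, n * Sizeof.FLOAT);
JCudaDriver.cuMemcpyHtoD(devicePointer, Pointer.to(hostData),
    n * Sizeof.FLOAT);
// ... launch a kernel that uses devicePointer ...
JCudaDriver.cuMemFree(devicePointer);
```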

*** Edit ***

Edit: I partially solved this with cuCtxDestroy.
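(Roughly like this, a sketch of the create-use-destroy cycle; the actual GPU work is omitted:)

```
CUdevice device = new CUdevice();
JCudaDriver.cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
JCudaDriver.cuCtxCreate(context, 0, device);
try
{
    // ... do the GPU work for this run (omitted) ...
}
finally
{
    // Destroying the context releases everything that was allocated
    // in it, including the device heap reserved via cuCtxSetLimit.
    JCudaDriver.cuCtxDestroy(context);
}
```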

I would still like some clarification about what memory I am allocating, and whether I can split the memory between what I copy over from host RAM and what I malloc in VRAM.

Admittedly, I have never used this particular feature (and never felt the need for it), and I'll have to take a closer look at what it is actually supposed to accomplish.

A VERY quick and pragmatic test:

```
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUlimit;
import jcuda.driver.JCudaDriver;

public class DeviceLimitTest
{
    public static void main(String[] args)
    {
        JCudaDriver.setExceptionsEnabled(true);
        JCudaDriver.cuInit(0);
        create();
        create();
        create();
        create();
    }

    private static void create()
    {
        CUdevice device = new CUdevice();
        JCudaDriver.cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        JCudaDriver.cuCtxCreate(context, 0, device);
        long malloc_limit = 4096L * 1024L * 1024L;

        long[] size = { 0 };
        JCudaDriver.cuCtxGetLimit(size, CUlimit.CU_LIMIT_MALLOC_HEAP_SIZE);
        System.out.println("Before " + size[0]);

        JCudaDriver.cuCtxSetLimit(CUlimit.CU_LIMIT_MALLOC_HEAP_SIZE, malloc_limit);
        JCudaDriver.cuCtxGetLimit(size, CUlimit.CU_LIMIT_MALLOC_HEAP_SIZE);
        System.out.println("After  " + size[0]);
    }
}
```

Does not cause an error.

(Side note: For clarity, I'd use CUlimit.CU_LIMIT_MALLOC_HEAP_SIZE. Although the constants have the same value in the runtime and the driver API, cuCtxSetLimit is a driver function, so the constant from the driver API should be used.)
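For example, the call from the first post would then read (only the constant changes):

```
// Driver API call with the driver API constant (preferred). The runtime
// constant cudaLimitMallocHeapSize happens to have the same value, but
// mixing the two APIs is best avoided.
JCudaDriver.cuCtxSetLimit(CUlimit.CU_LIMIT_MALLOC_HEAP_SIZE, malloc_limit);
```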

I also didn’t understand your question: with cuMemAlloc, you are always allocating from the GPU RAM. You can also allocate host memory with the CUDA API, but this should not be influenced by the “CUlimit.CU_LIMIT_MALLOC_HEAP_SIZE” setting…
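(For completeness, a small sketch of such a host allocation, with a placeholder size n; this memory lives in host RAM, not in VRAM:)

```
// Sketch: page-locked ("pinned") host memory, allocated via the CUDA API.
// It resides in host RAM and is not affected by CU_LIMIT_MALLOC_HEAP_SIZE.
Pointer hostPointer = new Pointer();
JCudaDriver.cuMemAllocHost(hostPointer, n * Sizeof.FLOAT);
// ... use it, e.g. as the source of a cuMemcpyHtoD ...
JCudaDriver.cuMemFreeHost(hostPointer);
```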

I get that error if I set it to 1024 instead of 4096 and call it 4 times instead of two, so I really think that's the problem. (I’m allocating a lot of arrays. Gigabytes.) Without setting the limit, I get errors.
Note 1: I read somewhere that 4096 is a limit for some reason. In fact, I have 6 GB of VRAM, but after 4 calls it errors.

Clarification: with cuMemAlloc I’m allocating the memory for the arrays I create in the CUDA code (int* array = new int[len]). Since that memory is allocated, my guess is that it can’t just be used by something else.
That is, I think that such memory can’t be used when I call JCudaDriver.cuMemAlloc and JCudaDriver.cuMemcpyHtoD. Am I wrong?

Edit: I realised I should specify one other thing: the error occurs at the line JCudaDriver.cuCtxSynchronize().

Sorry, it’s still not entirely clear: do you call this for different devices? What would be the point of calling this multiple times (with different values) for the same device?

But more importantly: memory that was allocated in a kernel cannot be used from outside. At least, that’s what NVIDIA’s Chief GPU Technologist, Mark Harris, said here: cuda - How to copy the memory allocated in device function back to main memory - Stack Overflow. And I didn’t hear that this has changed. (I just tried it out, and it didn’t seem to work, but maybe I have overlooked something?)

There also once was a limit of 4GB for the heap size, but this should no longer exist (although I can’t try it out).

And a side note: when the exception is thrown at cuCtxSynchronize, this might only mean that one of the previous API calls caused an error, due to the asynchronous nature of some API calls. But a guess/gut feeling is that there’s simply a memory allocation going on in a kernel that is just too large. (You might narrow this down, for testing, by adding a cuCtxSynchronize before and after the last kernel launch, as sketched below.)
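(A sketch of what I mean, with kernelFunction, kernelParameters and the launch configuration standing in for the values from your program:)

```
// Sketch: bracket the suspicious launch with cuCtxSynchronize.
JCudaDriver.cuCtxSynchronize(); // flushes errors from earlier, async calls
JCudaDriver.cuLaunchKernel(kernelFunction,
    gridSizeX, gridSizeY, 1,   // grid dimension
    blockSizeX, 1, 1,          // block dimension
    0, null,                   // shared memory size and stream
    kernelParameters, null);   // kernel- and extra parameters
JCudaDriver.cuCtxSynchronize(); // an error HERE points at this kernel
```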

Uh… I will try to be clearer. My program runs itself over and over from a main function. Since it's a long way from there to the point where I initialise the GPU, I don't want to pass things all the way through; instead, I create and destroy instances when needed.

I call it for a single device. And I am not trying to access device memory allocated inside a kernel from the outside! What I do is:

```
JCudaDriver.cuMemAlloc(costResPointer, resSize * Sizeof.FLOAT);
Pointer kernelParameters = Pointer.to(
    Pointer.to(costResPointer)
);
JCudaDriver.cuLaunchKernel(cudaOpt.kernelFunction,
    gridSizeX, gridSizeY, 1,        // Grid dimension
    blockSizeX + 1, 1, 1,           // Block dimension
    0, null,                        // Shared memory size and stream
    kernelParameters, null          // Kernel- and extra parameters
);
JCudaDriver.cuCtxSynchronize();
float[] res = new float[resSize];
JCudaDriver.cuMemcpyDtoH(Pointer.to(res), costResPointer, resSize * Sizeof.FLOAT);
JCudaDriver.cuMemFree(costResPointer);
```


and this works. But if I just call it several times, then it fails. (The number of calls is exactly the number needed to exceed the memory that is left on the GPU after the heap size has been reserved.)

My question was: I can do "int* array = new int[10]" inside a CUDA kernel (meaning, inside the .cu file; that allocation only happens when I run cuLaunchKernel). That allocates the memory for that array in VRAM. The maximum size of the space that I can use for those allocations is what I set with the line JCudaDriver.cuCtxSetLimit(cudaLimitMallocHeapSize, malloc_limit); (right? I'm not sure, I'm confused).

So, the question is: where do I set the limit for allocating memory with JCudaDriver.cuMemAlloc(costResPointer, resSize*Sizeof.FLOAT);? Is it still the same portion of memory that I reserved with cuCtxSetLimit? That is, does cuCtxSetLimit reserve the memory for int* array = new int[10], or does it reserve the memory for JCudaDriver.cuMemAlloc(costResPointer, resSize*Sizeof.FLOAT);? And why do I get errors when I don't set the limit with cuCtxSetLimit?

If it helps, here's what I think (it could be entirely wrong or not make sense):

cuCtxSetLimit is only for what I allocate INSIDE the CUDA files, i.e. what is dynamically allocated once the program is run on the GPU. What I allocate with JCudaDriver.cuMemAlloc (which is allocated BEFORE I run the CUDA files) is allocated in the other portion of memory.
Suppose I have 2 GB of VRAM. I want to reserve 1 GB because I need 1 GB of dynamic allocations inside the CUDA kernel. Then I use cuCtxSetLimit (1 GB) to reserve that. The gigabyte that is left can be used by JCudaDriver.cuMemAlloc. Am I entirely wrong? Am I partially wrong? Am I right?

First of all, a short disclaimer: I’m not a CUDA expert. Of course, I try to stay up to date and familiarize myself with the most important concepts, but I cannot make definite statements about each and every dark corner of the CUDA API. It would be great if I could recommend asking certain questions in the NVIDIA CUDA forums, but experience shows that the responsiveness there is often very limited. If the question is not specifically related to JCuda but to CUDA in general, and if it can be stated clearly, then e.g. stackoverflow might be an alternative.
But I’ll try to respond as well as I can here.

This description is a bit vague. The setup and teardown may be a bit costly, but maybe it’s appropriate for your use case.

And if I understood it correctly, then you are allocating memory in this kernel call, right? The question now is: Do you also free this memory somewhere?

AFAIK this is correct.

[QUOTE=N17;136208]
cuCtxSetLimit is only for what I allocate INSIDE the CUDA files. […] Suppose I have 2 GB of VRAM. […] The gigabyte that is left can be used by JCudaDriver.cuMemAlloc. Am I entirely wrong? Am I partially wrong? Am I right?[/QUOTE]

As far as I understood it, I think that this should be correct. According to the respective section in the programming guide:

The actual memory allocation for the heap occurs when a module is loaded into the context, either explicitly via the CUDA driver API (see Module), or implicitly via the CUDA runtime API (see CUDA C Runtime). If the memory allocation fails, the module load will generate a CUDA_ERROR_SHARED_OBJECT_INIT_FAILED error.

Heap size cannot be changed once a module load has occurred and it does not resize dynamically according to need.

Memory reserved for the device heap is in addition to memory allocated through host-side CUDA API calls such as cudaMalloc().

So the call to "cuCtxSetLimit" only "reserves" the memory for the allocations inside the kernel. So when you call cuCtxSetLimit with 1 GB, you could run a kernel with 1024 threads and do a "malloc(1MB)" in each of the threads. On a 2 GB card, (only) the remaining 1 GB will be available for cuMemAlloc calls.
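(A sketch to make this observable, assuming some "kernel.ptx" module is at hand; according to the quoted section of the programming guide, the heap is actually reserved when the first module is loaded, so cuMemGetInfo before and after the load should show the difference:)

```
long[] free = { 0 };
long[] total = { 0 };

// Set a 1 GB device heap limit, then watch the free memory drop
// once the first module is loaded into the context:
JCudaDriver.cuCtxSetLimit(CUlimit.CU_LIMIT_MALLOC_HEAP_SIZE,
    1024L * 1024L * 1024L);
JCudaDriver.cuMemGetInfo(free, total);
System.out.println("Free before module load: " + free[0]);

CUmodule module = new CUmodule();
JCudaDriver.cuModuleLoad(module, "kernel.ptx"); // placeholder file name
JCudaDriver.cuMemGetInfo(free, total);
System.out.println("Free after module load:  " + free[0]);
```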

What you posted agrees with what I expected, so I think this is solved! I just wanted confirmation.