A few questions about implementing an algorithm using JOCL

I've found out what isn't being released, but I don't know how to release it. The problem is that I have two methods: one for reading the output and one for clearing the memory. I've checked that the memory is consumed by the read buffers, but how do I release them?

This is how I read the outputs:

Pointer dst = Pointer.to(NRoutput);
// Read the output data
clEnqueueReadBuffer(commandQueue, outputObject[0], CL_TRUE, 0, n * Sizeof.cl_float, dst, 0, null, null);
Pointer dst1 = Pointer.to(M1output);
clEnqueueReadBuffer(commandQueue, outputObject[1], CL_TRUE, 0, n * Sizeof.cl_float, dst1, 0, null, null);
Pointer dst2 = Pointer.to(M2output);
clEnqueueReadBuffer(commandQueue, outputObject[2], CL_TRUE, 0, n * Sizeof.cl_float, dst2, 0, null, null);
Pointer dst3 = Pointer.to(M3output);
clEnqueueReadBuffer(commandQueue, outputObject[3], CL_TRUE, 0, n * Sizeof.cl_float, dst3, 0, null, null);
Pointer dst4 = Pointer.to(F1output);
clEnqueueReadBuffer(commandQueue, outputObject[4], CL_TRUE, 0, n * Sizeof.cl_float, dst4, 0, null, null);
Pointer dst5 = Pointer.to(F2output);
clEnqueueReadBuffer(commandQueue, outputObject[5], CL_TRUE, 0, n * Sizeof.cl_float, dst5, 0, null, null);
Pointer dst6 = Pointer.to(F3output);
clEnqueueReadBuffer(commandQueue, outputObject[6], CL_TRUE, 0, n * Sizeof.cl_float, dst6, 0, null, null);
Pointer dst7 = Pointer.to(MRoutput);
clEnqueueReadBuffer(commandQueue, outputObject[7], CL_TRUE, 0, n * Sizeof.cl_float, dst7, 0, null, null);

And this is how I release the output objects:

clReleaseMemObject(outputObject[0]);
clReleaseMemObject(outputObject[1]);
clReleaseMemObject(outputObject[2]);
clReleaseMemObject(outputObject[3]);
clReleaseMemObject(outputObject[4]);
clReleaseMemObject(outputObject[5]);
clReleaseMemObject(outputObject[6]);
clReleaseMemObject(outputObject[7]);

I know it's the read buffers that aren't being freed, because when they aren't used my app consumes only about 32 MB the whole time.

Can you post (or send as a PM) an example where you removed as much of the remaining code as possible? It doesn't matter whether the outputObject[i] objects contain real data, but the allocation/deallocation may be helpful to see the basic workflow, and to possibly reproduce this problem. In any case, I'll have another look at the source of clEnqueueReadBuffer to see what might go wrong there, but I have not experienced any problems with this function yet.

This is my class that uses OpenCL. The method calcNr is called in a loop; it calculates each part of the computation, then reads the output with the ReadOutput method and releases the memory with the clearMemory method. I have optimized the algorithm so I only have 2 output buffers now, but ReadOutput still leaves a really large amount of garbage uncollected. Somehow I was unable to PM this message to you…

import java.lang.Math.*;
import static org.jocl.CL.*;
import org.jocl.*;

public class ZefirMath {
    String MathKernel;
    int AccType;
    int dg, n;
    float NRz;
    cl_context context;
    cl_kernel kernel;
    cl_command_queue commandQueue;
    cl_mem memObjects[];
    cl_mem outputObject[];
    cl_program program;
    float NRoutput[];
    float Moutput[];

    float H1input[];
    float H2input[];
    float H3input[];
    float G1input[];
    float G2input[];
    float G3input[];

    Pointer Arg1;
    Pointer Arg2;
    Pointer Arg3;
    Pointer Arg4;
    Pointer Arg5;
    Pointer Arg6;

    Params BestNr;
    Params BestMass;

    public void init(int acc, int idg, float nr, float Fw, float E, float Ro, float d, float Mg, float D1, float D2, float D3)
    {
        AccType = acc;
        dg=idg;
        n=0;
        NRz=nr;
        setMathKernel(Fw, E, Ro, d, Mg, D1, D2, D3);
        createContext();
    }

    final public void createContext()
    {
        long numBytes[] = new long[1];
        cl_platform_id platforms[] = new cl_platform_id[1];
        clGetPlatformIDs(platforms.length, platforms, null);

        cl_context_properties contextProperties = new cl_context_properties();

        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);
        if(AccType==0)
        {
            context = clCreateContextFromType( contextProperties, CL_DEVICE_TYPE_ALL, null, null, null);
        }
        else if (AccType==1)
        {
            context = clCreateContextFromType( contextProperties, CL_DEVICE_TYPE_CPU, null, null, null);
        }
        else
        {
            context = clCreateContextFromType( contextProperties, CL_DEVICE_TYPE_GPU, null, null, null);
        }
        if (context == null)
            {
                System.out.println("Unable to create a context");
            }
        CL.setExceptionsEnabled(true);
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);
        int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0], Pointer.to(devices), null);
        commandQueue = clCreateCommandQueue(context, devices[0], 0, null);
        program = clCreateProgramWithSource(context, 1, new String[]{ MathKernel }, null, null);
        clBuildProgram(program, 0, null, null, null, null);
        kernel = clCreateKernel(program, "Integrator", null);
    }

    public void clearMemory()
    {
        clReleaseMemObject(memObjects[0]);
        clReleaseMemObject(memObjects[1]);
        clReleaseMemObject(memObjects[2]);
        clReleaseMemObject(memObjects[3]);
        clReleaseMemObject(memObjects[4]);
        clReleaseMemObject(memObjects[5]);
        clReleaseMemObject(outputObject[0]);
        clReleaseMemObject(outputObject[1]);
    }

    public void clearKernel()
    {
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);
    }

    Params ReadOutput(boolean bestmass)
    {
        Pointer dst = Pointer.to(NRoutput);
        clEnqueueReadBuffer(commandQueue, outputObject[0], CL_TRUE, 0, n * Sizeof.cl_float, dst, 0, null, null);
        Pointer dst1 = Pointer.to(Moutput);
        clEnqueueReadBuffer(commandQueue, outputObject[1], CL_TRUE, 0, n * Sizeof.cl_float, dst1, 0, null, null);

        clFinish(commandQueue);
        float bestNr=0;
        float bestM=99999;
        int bestId=0;
        if(bestmass)
            for(int i=0; i<n; i++)
            {
                if(Moutput[i] < bestM && NRoutput[i] > NRz)
                {
                    bestId=i;
                    bestNr=NRoutput[i];
                    bestM=Moutput[i];
                }
            }
        else
            for(int i=0; i<n; i++)
            {
                if(NRoutput[i] > bestNr)
                {
                    bestId=i;
                    bestNr=NRoutput[i];
                    bestM=Moutput[i];
                }
            }
        Params Best = new Params(NRoutput[bestId], 0, Moutput[bestId], //Nr, Mr, M
                H1input[bestId],H2input[bestId],H3input[bestId],      //H1,H2,H3
                G1input[bestId],G2input[bestId],G3input[bestId],      //G1,G2,G3
                0,0,0,   //M1,M2,M3
                0,0,0);  //F1,F2,F3
                
        return new Params();//Best;
    }

    Params Recalculate(float Fw, float E, float Ro, float d, float Mg,
                    float D1, float D2, float D3,
                    float H1, float H2, float H3,
                    float G1, float G2, float G3)
    { ... }

    final void setMathKernel(float Fw, float E, float Ro, float d, float Mg, float D1, float D2, float D3)
    { ... }

    Params calcNr(float H, float H1, float H2, float G1, float G2min, float G3min, float G2max, float G3max)
    {
        int j=0;
        float G2=G2min*10000;
        do
        {
            G2=G2+dg;
            float G3=G3min*10000;
            do
            {
                j++;
                G3=G3+dg;
            }
            while(G3<G3max*10000);
        }
        while(G2<G2max*10000);
        if(j!=0)
        {
            
        n = j;
        if(H1input == null)
        {
            H1input = new float[n];
            H2input = new float[n];
            H3input = new float[n];
            G1input = new float[n];
            G2input = new float[n];
            G3input = new float[n];
        
            Arg1 = Pointer.to(H1input);
            Arg2 = Pointer.to(H2input);
            Arg3 = Pointer.to(H3input);
            Arg4 = Pointer.to(G1input);
            Arg5 = Pointer.to(G2input);
            Arg6 = Pointer.to(G3input);
        }

        j=0;
        float H3=H-H1-H2;
        int h3=java.lang.Math.round((H3*10));
        H3=(float)h3/10;
        G2=G2min*10000;
        do
        {
            float G3=G3min*10000;
            do
            {
                H1input[j]=H1;
                H2input[j]=H2;
                H3input[j]=H3;
                G1input[j]=G1;
                G2input[j]=G2/10000;
                G3input[j]=G3/10000;
                G3=G3+dg;
                j++;
            }
            while(G3<G3max*10000);
            G2=G2+dg;
        }
        while(G2<G2max*10000);

        if(memObjects == null)
        {
            memObjects = new cl_mem[6];
        }
        memObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, Sizeof.cl_float * n, Arg1, null);
        memObjects[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, Sizeof.cl_float * n, Arg2, null);
        memObjects[2] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, Sizeof.cl_float * n, Arg3, null);
        memObjects[3] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, Sizeof.cl_float * n, Arg4, null);
        memObjects[4] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, Sizeof.cl_float * n, Arg5, null);
        memObjects[5] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, Sizeof.cl_float * n, Arg6, null);

        if(outputObject == null)
        {
            outputObject = new cl_mem[2];
        }
        outputObject[0] = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * n, null, null);
        outputObject[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * n, null, null);

        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(memObjects[0]));
        clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(memObjects[1]));
        clSetKernelArg(kernel, 2, Sizeof.cl_mem, Pointer.to(memObjects[2]));
        clSetKernelArg(kernel, 3, Sizeof.cl_mem, Pointer.to(memObjects[3]));
        clSetKernelArg(kernel, 4, Sizeof.cl_mem, Pointer.to(memObjects[4]));
        clSetKernelArg(kernel, 5, Sizeof.cl_mem, Pointer.to(memObjects[5]));
        clSetKernelArg(kernel, 6, Sizeof.cl_mem, Pointer.to(outputObject[0]));
        clSetKernelArg(kernel, 7, Sizeof.cl_mem, Pointer.to(outputObject[1]));

        long global_work_size[] = new long[]{n};
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, global_work_size, null, 0, null, null);
        clFinish(commandQueue);
        if(NRoutput == null)
        {
         NRoutput = new float[n];
         Moutput = new float[n];
        }
        BestNr = ReadOutput(false);
        BestMass = ReadOutput(true);
        clearMemory();
        return BestNr;
        }
        else
        return new Params();
    }
}

I also received the PMs. A compilable example to reproduce the problem (again, without any kernel invocations, just reduced to the memory management) would be more helpful, but I'll try to have a look at it.

Here's the sample you asked for that reproduces this memory consumption problem - comment out the calls to the ReadOutput method to see the difference in allocated memory.

import static org.jocl.CL.*;
import org.jocl.*;
import java.io.*;

public class Kernel {
    String MathKernel;
    int AccType;
    int dg, n;
    float NRz;
    cl_context context;
    cl_kernel kernel;
    cl_command_queue commandQueue;
    cl_mem memObjects[];
    cl_mem outputObject[];
    cl_program program;
    float NRoutput[];
    float Moutput[];

    public void init(int acc, int idg, float nr, float Fw, float E, float Ro, float d, float Mg, float D1, float D2, float D3)
    {
        AccType = acc;
        dg=idg;
        n=0;
        NRz=nr;
        createContext();
    }

    final public void createContext()
    {
        long numBytes[] = new long[1];

        // Obtain the platform IDs and initialize the context properties
        cl_platform_id platforms[] = new cl_platform_id[1];
        clGetPlatformIDs(platforms.length, platforms, null);

        cl_context_properties contextProperties = new cl_context_properties();

        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);
        if(AccType==0)
        {
            context = clCreateContextFromType( contextProperties, CL_DEVICE_TYPE_ALL, null, null, null);
        }
        else if (AccType==1)
        {
            context = clCreateContextFromType( contextProperties, CL_DEVICE_TYPE_CPU, null, null, null);
        }
        else
        {
            context = clCreateContextFromType( contextProperties, CL_DEVICE_TYPE_GPU, null, null, null);
        }
        if (context == null)
            {
                System.out.println("Unable to create a context");
            }
        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);
        // Get the list of GPU devices associated with the context
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);
        // Obtain the cl_device_id for the first device
        int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0], Pointer.to(devices), null);
        // Create a command-queue
        commandQueue = clCreateCommandQueue(context, devices[0], 0, null);
    }

    public void clearMemory()
    {
         // Release kernel, program, and memory objects
        clReleaseMemObject(outputObject[0]);
        clReleaseMemObject(outputObject[1]);
    }

    public void clearKernel()
    {
        // The kernel and program are never created in this reduced sample,
        // so only release the objects that actually exist
        if (kernel != null) clReleaseKernel(kernel);
        if (program != null) clReleaseProgram(program);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);
    }

    void ReadOutput(boolean bestmass)
    {
        Pointer dst = Pointer.to(NRoutput);
        clEnqueueReadBuffer(commandQueue, outputObject[0], CL_TRUE, 0, n * Sizeof.cl_float, dst, 0, null, null);
        Pointer dst1 = Pointer.to(Moutput);
        clEnqueueReadBuffer(commandQueue, outputObject[1], CL_TRUE, 0, n * Sizeof.cl_float, dst1, 0, null, null);

        clFinish(commandQueue);
    }

    void calcNr(float H, float H1, float H2, float G1, float G2min, float G3min, float G2max, float G3max)
    {
        n = 1000;
        NRoutput = new float[n];
        Moutput = new float[n];
        if (outputObject == null)
        {
            outputObject = new cl_mem[2];
        }
        outputObject[0] = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * n, null, null);
        outputObject[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * n, null, null);

        ReadOutput(false);
        ReadOutput(true);
        clearMemory();
    }

    public static void main(String args[])
    {
        Kernel kalkulator = new Kernel();
        kalkulator.init(0, 0, 130, (float)10000, (float)210000000, (float)7850, (float)0.10, (float)800, (float)0.711, (float)0.711, (float)0.711);
        //loop the memory consumer ;/
        for(int i=0; i<1000000; i++)
        {
            kalkulator.calcNr(5, 5, 5, (float)0.01, (float)0.01, (float)0.01, (float)0.01, (float)0.01);
        }
        //Wait before killing the process
        try {
        InputStreamReader isr = new InputStreamReader(System.in);
        BufferedReader bufReader = new BufferedReader(isr);
        System.out.println("Press a key to exit");
        bufReader.readLine();
        }
        catch (IOException e) {
        System.err.println("Error: " + e);
        }
        kalkulator.clearKernel();
    }
}

OK, I started and ran this example (WITH the ReadOutput method) and did not experience any problem. It took a while until it finished, but I traced the run with VisualVM, and it seems that the heap gets cleared regularly, and there are no "live" objects that survive for a large number of generations. I attached two screenshots (at least as proof for my statement :wink: ), but maybe you want to compare them to one of your test runs…

Can you give me information about this particular problem?

  • Operating System
  • CL implementation
  • JOCL version
  • Java JRE version

You mentioned that it worked on XP x64 SP2? Or was the allocation of 4GB the problem there?

The problem is that the process consumes more and more memory, while from within my app the Java heap memory is constant. It looks as though it's the OpenCL native memory that isn't being freed, rather than Java's memory.
My machines and test results:

  • Win7 x32, 2 GB - the app gets killed with a VC++ runtime error after allocating somewhere between 1 and 1.5 GB
  • WinXP x32, 4 GB - the app gets killed with a VC++ runtime error after allocating between 1.5 and 2 GB
  • WinXP x64, 8 GB - the app allocates more and more memory until it reaches 4 GB, and then it looks like the garbage gets collected

I check all of the allocated memory in the Task Manager, because this isn't a JVM heap problem.

Hm. "WinXP x32, 4 GB" is exactly what I use for testing here. Which OpenCL version are you using on this platform?
I'll investigate this further with AMD's Stream on the CPU by next week…

I'm mostly using AMD Stream in CPU mode - an NVIDIA Quadro FX 570 isn't as fast as a Core 2 Duo for my math, and the NVIDIA CUDA SDK combined with AMD Stream triggers a bug which somehow blocks the use of CPU emulation on all of my platforms (I posted this bug to AMD support last week). Even though I'm mostly not using GPU acceleration, I still like the opportunity to combine reasonably fast math with easy GUI creation in NetBeans and the other advantages of Java.

The example I sent doesn't consume memory as fast as when I first posted the problem, but you should still see the process constantly consuming memory.

This has not been forgotten, but sorry, I’ll have to postpone my tests until next week …

That's cool, no need to rush. After optimizing the code to read only two arrays instead of all 8, it doesn't consume memory fast enough to cause problems, and it can be left like this as a working but incomplete version. However, I'd like to know what I have done wrong before using JOCL again :slight_smile:

I’m not sure if you have done anything wrong at all - although I wish I could say this was the reason…

Specifically for your task: You might consider keeping the cl_mem objects and not creating/releasing them in each pass, if this is possible.
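As a rough illustration of that idea (only a sketch: it uses the field names from your ZefirMath class, but the helper methods and the maximum size maxN are hypothetical and not part of your original code), the buffers would be created once for the largest expected n and released only once at shutdown, so that calcNr no longer calls clCreateBuffer or clearMemory in every pass:

// Sketch: create the cl_mem objects once, sized for an assumed maximum n
void createBuffers(int maxN)
{
    memObjects = new cl_mem[6];
    for (int i = 0; i < memObjects.length; i++)
    {
        // No CL_MEM_COPY_HOST_PTR here; the data is written later, once per pass
        memObjects[i] = clCreateBuffer(context, CL_MEM_READ_ONLY, Sizeof.cl_float * maxN, null, null);
    }
    outputObject = new cl_mem[2];
    outputObject[0] = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * maxN, null, null);
    outputObject[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * maxN, null, null);
}

// Sketch: release the buffers only once, e.g. together with clearKernel
void releaseBuffers()
{
    for (cl_mem mem : memObjects)
    {
        clReleaseMemObject(mem);
    }
    clReleaseMemObject(outputObject[0]);
    clReleaseMemObject(outputObject[1]);
}

Each pass would then refill the existing buffers with clEnqueueWriteBuffer instead of creating new ones (see below).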

I simplified the program to only call


cl_mem mem = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * size, null, null);
clEnqueueReadBuffer(commandQueue, mem, CL_TRUE, 0, size * Sizeof.cl_float, dst, 0, null, null);
clFinish(commandQueue);
clReleaseMemObject(mem);

one million times, and saw the memory slowly increasing in the Task Manager (whereas in C it runs with constant memory usage). This increase seems fairly independent of the 'size' of the allocated memory object. So I thought that there must be an error in the clEnqueueReadBuffer method, because it is the only one that contains some aspects of the (fairly non-trivial) memory management for potentially non-blocking operations and JNI array handling. But I counterchecked it several times and did not find an error - which seems to be confirmed by the profiler runs I mentioned earlier.
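For reference, a complete stress test along these lines might look roughly like the following sketch (the class name, the device selection, the buffer size, and the iteration count are my own choices, not taken from the original test):

import static org.jocl.CL.*;
import org.jocl.*;

public class CreateReadReleaseTest {
    public static void main(String args[])
    {
        // Minimal setup: first platform, first device, one command queue
        CL.setExceptionsEnabled(true);
        cl_platform_id platforms[] = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);
        cl_device_id devices[] = new cl_device_id[1];
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 1, devices, null);
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);
        cl_context context = clCreateContext(contextProperties, 1, devices, null, null, null);
        cl_command_queue commandQueue = clCreateCommandQueue(context, devices[0], 0, null);

        int size = 1000;
        float hostData[] = new float[size];
        Pointer dst = Pointer.to(hostData);

        // Create, read, and release a buffer one million times while
        // watching the native memory usage of the process
        for (int i = 0; i < 1000000; i++)
        {
            cl_mem mem = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * size, null, null);
            clEnqueueReadBuffer(commandQueue, mem, CL_TRUE, 0, size * Sizeof.cl_float, dst, 0, null, null);
            clFinish(commandQueue);
            clReleaseMemObject(mem);
        }

        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);
    }
}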

Then I reduced it to


cl_mem mem = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * size, null, null);
clReleaseMemObject(mem);

and the memory still increased - much slower, but still by ~40MB for 1 million calls. So I thought there might be a general problem in the JNI code, although this seemed unlikely. But again, I did not find anything there.

Finally, I commented out all function calls in the native code of clCreateBuffer, so that it effectively only contained


cl_context nativeContext = (cl_context)env->GetLongField(context, ...);
cl_mem nativeMem = clCreateBuffer(nativeContext, x, y, NULL, NULL);
clReleaseMemObject(nativeMem);
return NULL;

(which should be a no-op, of course), and it still consumed 40 bytes per call. Nearly no JNI operations. No object creations. No local or global references. No memory allocations. Nothing that could even remotely be considered to cause a memory leak. But still consuming 40 bytes.

At least partially, it seems to depend on the OpenCL implementation: For the second case (creating and immediately releasing the cl_mem on the Java side), the increase of 40 bytes per call only happened with the NVIDIA implementation, whereas with the AMD implementation, the memory usage was constant. In the first case (including the clEnqueueReadBuffer), the memory consumption was about 320 bytes for the whole operation on the AMD platform, but more than twice as much on the NVIDIA platform.
There may be interdependencies between the JVM and the OpenCL implementation that may not be so obvious. And admittedly, I’m running out of ideas. Any hints or ideas about approaches for further investigations are welcome.

I don’t want to accept this as it is. But maybe I’ll have to. I just commented out only the cl*-Calls in the native method implementations of clCreateBuffer and clEnqueueReadBuffer, leaving the remaining code untouched, and ran the example again: In this case, the memory remains constant. So whatever the reason for the memory consumption is, it is probably beyond my control…

A small update concerning this issue: I ran a test which basically consisted only of

  • Creating a context
  • Creating and deleting 10 Million cl_mem objects in a loop

The test was run on Windows XP 32, with NVIDIA driver 263.06 and CUDA Toolkit 3.2_16, as well as with the AMD Stream SDK 2.2/2.3. The test was run with different libraries:

  • JOCL
  • JOCL from Jogamp
  • JavaCL
  • LWJGL
  • A small, minimalistic JNI lib, with only the minimum set of elementary JNI functions required for the test

As far as I could observe, all tests showed the same behavior: Using the NVIDIA platform, the required memory increased steadily and quickly (and this refers to the native memory - according to jVisualVM, the Java part got cleaned up properly by the GC), and eventually caused the JVM to crash. With the AMD platform, the required memory was nearly constant. At the moment, I have to assume that there may be memory leaks in the native OpenCL implementations which only show up when the OpenCL calls are done from Java - but regardless of which particular Java binding is used.
This issue will be investigated further; hopefully it can be resolved…

A small update from my side: I tested my app on Ubuntu amd64 with an NVIDIA card (420M, driver 260.19.06) and 0.1.4d, and it looks like it also has leaks, but it sometimes collects a little of the garbage.

Well, the main problem is not the "garbage" that can be collected: Everything that can be cleaned up by the garbage collector will be cleaned up (sooner or later, when the GC decides that it is appropriate). Everything should be fine on the Java side. (And if there was a memory leak on the Java side, or one that is specifically related to my JOCL JNI part, I'd try to fix it as soon as possible, of course!)

The problem in this case is that the memory is not leaking in the Java part, and not in the JNI part, but solely in the actual OpenCL implementation by NVIDIA, and only when the CL functions are called from Java. Keeping everything as it is, and simply commenting out the CL calls inside the JNI code, "fixed" the problem. (And BTW, in JCuda there is no such problem - it seems to be very specific to NVIDIA's OpenCL implementation…)

I’ve been talking to Michael Bien from Jogamp about this issue, he has some ideas about possible reasons and seems to be in contact with NVIDIA about that.

NVIDIA has just published CUDA 4.0. Hopefully their next step will be a public release of OpenCL 1.1, and hopefully some issues like this might be resolved there… (although I get the impression that they are not as … "committed" to OpenCL as they could be…)

OK, so to get it all in one place:
OpenCL calls from Java leave uncollected garbage in both NVIDIA GPU and Intel CPU contexts, so actually the only bug-free setup for JOCL would be an AMD GPU, and possibly an AMD CPU (I haven't tested OpenCL on AMD CPUs).

I won't claim that JOCL is bug-free :wink:

I only wanted to say that the vast majority of the leaked memory comes from the OpenCL implementation, and more importantly: that I cannot influence this in any way, since it is obviously a general problem when calling NVIDIA's OpenCL from Java. I'll have to do more specific tests for possible memory leaks in JOCL on AMD platforms, to make sure that there's nothing wrong with JOCL on this point, although I can only test the AMD platform with CPU devices.

In general, we can assume that NVIDIA will fix this sooner or later. If there are still (further) memory leaks or other bugs in JOCL, it’s my obligation to fix them.

I think I know how I should fix this (theoretically), but I cannot find a practical method in the Khronos manual.

If it's possible to reuse a buffer once it has been created, it would be enough for me to create a buffer with the maximum size and reuse it in each loop iteration (right now each loop creates a smaller buffer than the last one, so the data would fit).

Can you somehow show me how to overwrite the data in a cl_mem object/buffer? That could be a nice way of preventing the leaks, just by creating one standard buffer for the application.

In general, this might be a good approach, but unfortunately, the memory leak seems (at least to some extent) to really be related to the function calls - that is, even OpenCL functions that do not allocate new memory may cause memory to be consumed :frowning:

However, to reuse cl_mem objects, you could create them once (in some constructor, using clCreateBuffer), and later fill them with
clEnqueueWriteBuffer(commandQueue, mem, true, 0, size, Pointer.to(hostData), 0, null, null);
before each kernel call.
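
As a rough sketch of how that could look in your calcNr loop (assuming the buffers were created once with some maximum size and that each pass only uses the first n elements - the buffer and array names below are taken from your class above):

// Refill the first n elements of the pre-created input buffer with the new
// host data before launching the kernel; the buffer itself is never re-created
clEnqueueWriteBuffer(commandQueue, memObjects[0], CL_TRUE, 0, n * Sizeof.cl_float, Pointer.to(H1input), 0, null, null);

// ... set the kernel arguments and enqueue the kernel as before ...

// Read back only the first n elements of the reused output buffer
clEnqueueReadBuffer(commandQueue, outputObject[0], CL_TRUE, 0, n * Sizeof.cl_float, Pointer.to(NRoutput), 0, null, null);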