A few questions about preparing an algorithm utilizing JOCL

So first of all, hello guys, this is my first post here, hope we’ll get along :slight_smile:
I found this great JOCL lib useful for my project, however I’m still not prepared to use its full potential.
First of all, I think I should tell you about the hardware and the algorithm that I’m preparing. I have an Intel DC E6800 (2 cores / 2 threads) and an HD5670 (80 stream processors / 400 threads) in my box, however my company uses Quadro FX570 and FX1700 cards. The algorithm is a floating point calculation with 3 average integrals, searching for the optimal construction of a wind power plant’s tower with 6 parameters, and within the chosen parameter range I think there will be up to 50 million combinations.

My questions:

  1. The problem is that I’d like to know how to get info about the status of the computation’s progress while it’s running. If I cut the computation into pieces and calculate the percentage of progress between those pieces, I get delays on kernel build and argument transfers, and it’s so bad that even CPU mode runs ten to twenty times faster. Is there any possibility of sending a signal from each kernel instance to the main process after it finishes its calculation? I mean, I have one thread running the GUI and a second one computing the data, and I want, for example, to send a signal to the GUI thread that it should increment the count of finished combinations or something. Maybe there is some possibility of calculating it piece by piece without the need to recompile the computing kernel, only doing in and out operations?

  2. What’s up with CUDA and OpenCL? Should it work out of the box with the CUDA SDK? I sent my app to my boss for checking and he couldn’t run it even after installing the CUDA SDK on his notebook with an FX570M (something like a 8600M GT); he had to install the AMD Stream lib.

  3. What about a 32bit lib for Linux? Would be nice to have that (my Core Duo notebook doesn’t support x64 systems, however it isn’t so old that I should trash it).

I still have some time for this because my boss failed while preparing the integral formulas and something’s just not right :stuck_out_tongue:

Edit:

  1. I finally figured out that I don’t have to recompile the kernel for each set of parameters, and it rocks now :smiley:

PS: Couldn’t find the edit button for my previous post.

Hello,

It’s nice to hear that you find JOCL useful. But maybe I should emphasize the disclaimer once more: I insist that it’s not my fault when you find yourself accidentally constructing a helicopter instead of a wind power plant or something :wink:

I’ll try to answer your questions from the shortest to the longest:

3. What about a 32bit lib for Linux?

For previous versions I had also uploaded the 32bit Linux binaries, but I rarely have the chance to use a 32bit Linux machine. I’ll see what I can do there. However, there are makefiles included, and in the best case, the compilation should mainly consist of typing ‚make‘ in the source directory, possibly after minor adjustments of the paths in the main Makefile.

2. What’s up with CUDA and OpenCL?

It should work with the software provided by NVIDIA, but the relevant part is not the SDK, it’s the Developer Drivers and the CUDA Toolkit, which are available at http://www.nvidia.com/getcuda . Note, however, that the most recent version of the NVIDIA Toolkit does not yet support OpenCL 1.1.

Now about question 1, the progress reporting:

You found out the most important thing on your own: the kernel only has to be compiled once. The compilation may take a tremendous amount of time, especially for more complex kernels with lots of loops which are inlined by the compiler on the fly. Once it is compiled, the further programmatic setup should be pretty fast. You also only have to set up the arguments which change between the calls, like input and output memory objects. One thing to be careful with is the memory management: allocating and copying memory may easily become a bottleneck.
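Just as a rough sketch of that pattern (the names ‚Integrator‘, ‚inputMem‘, ‚chunkResult‘ etc. are only placeholders, and the usual static imports from org.jocl.CL are assumed): the program is built once, and inside the loop only the arguments are updated and the kernel is enqueued again:


// Build once - this is the expensive part
cl_program program = clCreateProgramWithSource(context, 1, new String[]{ kernelSource }, null, null);
clBuildProgram(program, 0, null, null, null, null);
cl_kernel kernel = clCreateKernel(program, "Integrator", null);

// Reuse the same kernel for every piece; only the arguments change
for (int chunk = 0; chunk < numChunks; chunk++)
{
    clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(inputMem[chunk]));
    clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(outputMem[chunk]));

    clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, new long[]{ chunkSize }, null, 0, null, null);
    clEnqueueReadBuffer(commandQueue, outputMem[chunk], CL_TRUE, 0, Sizeof.cl_float * resultSize, Pointer.to(chunkResult), 0, null, null);

    // ... here the GUI thread could be notified about the progress ...
}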

Some more general words:

In most cases, you should split the computation into smaller pieces anyhow: Newer versions of Windows have this „Timeout Detection and Recovery (TDR) Feature“: When the Graphics Card is busy for several seconds, Windows assumes that something went wrong and kills the process. This „feature“ can be disabled. More information about this can be found in the CUDA release notes, and in several of the NVIDIA forum threads. But even when this feature is disabled, one computation should not take too long: As long as the GPU is computing something, there will be NO interaction possible, since the Graphics Card is simply busy. (And this means really busy, like ‚the-mouse-cursor-does-not-move‘-busy…)

You might want to have a look at the „JOCLMandelbrot.java“ example on jocl.org - Samples (not the JOCLSimpleMandelbrot.java). Originally, this was only an internal experiment, and I put a lot of code into a single class to make it a standalone sample, so its structure is of course not optimal and it should not be considered a reference solution. But maybe it illustrates some basic ideas, like splitting a larger computation into smaller Tasks, putting the tasks into a taskQueue which is worked off by several „TaskProcessor“ threads, and how intermediate results may be processed by the GUI.
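The general idea of feeding such tasks to worker threads and pushing progress updates to the GUI might, very roughly, look like this (just a sketch, not the actual structure of the sample - ‚computeOnGpu‘ and ‚progressBar‘ are placeholders, and the usual java.util.concurrent and javax.swing imports are assumed):


// A queue of work chunks, processed by one or more worker threads
final BlockingQueue<int[]> taskQueue = new LinkedBlockingQueue<int[]>();
// ... fill the taskQueue with parameter ranges ...

Thread taskProcessor = new Thread(new Runnable()
{
    public void run()
    {
        int task[];
        while ((task = taskQueue.poll()) != null)
        {
            computeOnGpu(task); // enqueue the kernel and read the result for this chunk

            // Report the progress on the Event Dispatch Thread
            SwingUtilities.invokeLater(new Runnable()
            {
                public void run()
                {
                    progressBar.setValue(progressBar.getValue() + 1);
                }
            });
        }
    }
});
taskProcessor.start();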

Probably, there will be a tradeoff between the size of the sub-tasks and the speedup that can be achieved: Even when the setup for each kernel call is minimized, enqueuing a kernel is not for free, and it will most likely make no sense to enqueue many very small tasks (taking, for example, 50ms each) compared to enqueueing tasks that contain a considerable workload for the GPU, but might take a few seconds.

I’d be happy to hear about your progress with this task ( :slight_smile: )

bye
Marco

About the 32bit lib:
I have compiled it from the source - there was a bad use of elif on line 2213 giving a compilation error, so I changed it to else.
Anyway, I’m still getting java.lang.UnsatisfiedLinkError: no JOCL-linux-x86 in java.library.path when I run my maths.

Question 4 - How does memory allocation and input/output data work? If I create an array, does it get copied for each kernel instance, or does each kernel instance have access to the same array, which we access via the global id? My problem is: I want to have a small float array - up to 20 floats in it - and tell each kernel to check whether its data is better than what is currently in this float array - if yes -> overwrite it with the new, better data. I thought it would run smoother than having 20 float arrays with n floats in each, but reading the output data still takes ages (1500-2000 ms), and as I’m still doing it piece by piece, so as not to overflow the GPU and to spare memory, it’s not nice to have those two-second lags each time while the computation itself takes 1-2 ms…

**Anyway, I’m still getting java.lang.UnsatisfiedLinkError: no JOCL-linux-x86 in java.library.path when I run my maths.**

Hm - are you using the same project structure as for the other tests (i.e. is the .SO file in the same directory as during your tests on Windows) ?

**Question 4 - How does memory allocation and input/output data work? If I create an array, does it get copied for each kernel instance, or does each kernel instance have access to the same array, which we access via the global id?**

I’m not sure what you mean. The internal memory management is probably left to the OpenCL implementation. But when you create a buffer object and copy the data from the host into this buffer object, then there is only this ONE buffer object, and this single buffer object is accessible by all kernels that are created in the same OpenCL context.
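So, very roughly, the usage pattern might look like this (just a sketch with placeholder names - ‚n‘ is the number of work-items, and the kernel signature in the comment is only an illustration):


// ONE buffer, filled with the host data once
float hostData[] = new float[n];
cl_mem dataMem = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    Sizeof.cl_float * n, Pointer.to(hostData), null);

// The same buffer is passed to the kernel as an argument...
clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(dataMem));

// ...and inside the kernel, each work-item picks 'its' element via the global id:
// __kernel void example(__global float *data) { int gid = get_global_id(0); float x = data[gid]; ... }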

**My problem is: I want to have a small float array - up to 20 floats in it - and tell each kernel to check whether its data is better than what is currently in this float array - if yes -> overwrite it with the new, better data. I thought it would run smoother than having 20 float arrays with n floats in each … **

That is a really rough description, and I don’t know your problem structure and input/output data. It’s hard to judge whether this is the optimal way, especially since I’m not an OpenCL expert. Is the computation solely based on these 20 float values? And is it so complex that it’s worth enqueueing its own kernel for this? (That is, does this computation take, e.g., more than a few ms?) I have only made the experience that “classical” or “intuitive” approaches sometimes don’t lead to the optimal speedup, and it is literally an art to find good data-parallel solutions for many problems…

**… but still reading output data takes ages (1500-2000 ms) and as I’m still doing it piece by piece, so as not to overflow the GPU and to spare memory, it’s not nice to have those two-second lags each time while the computation itself takes 1-2 ms…**

I can only guess what’s wrong here - it should NOT take 2000ms to read such a small chunk of data. It might (!) be a misinterpretation of some manual timing: When you have something like


cl_mem memory = ...

long beforeKernel = System.nanoTime();
clEnqueueNDRangeKernel(q, kernelThatWritesIntoMemory, 1, null, globalWorkSize, null, 0, null, null);
long afterKernel = System.nanoTime();
long kernelTime = afterKernel-beforeKernel; // Not really the kernel time!!! See notes below!

long beforeRead = System.nanoTime();
clEnqueueReadBuffer(q, memory, CL_TRUE, 0, s, d, 0, null, null);
long afterRead = System.nanoTime();
long readTime = afterRead-beforeRead;

then the ‘kernelTime’ will only be the time that it takes to enqueue the Kernel, and not the time for the whole execution of the kernel (enqueueing a kernel is NOT blocking). So you might get results like
kernelTime = 2ms
readTime = 2000ms
although it is actually the kernel which takes the 2 seconds. For precise timing information, you might use the OpenCL events, or for small tests at least insert something like


long beforeKernel = System.nanoTime();
clEnqueueNDRangeKernel(queue, kernelThatWritesIntoMemory, 1, null, globalWorkSize, null, 0, null, null);
clFinish(queue); // <-- wait until the kernel has really finished
long afterKernel = System.nanoTime();
long kernelTime = afterKernel-beforeKernel;

to make sure you’re really measuring the kernel time in this case.
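Alternatively, a rough sketch of how the timing could be done with OpenCL events (assuming the command queue was created with the CL_QUEUE_PROFILING_ENABLE property - ‚device‘ and ‚globalWorkSize‘ are placeholders):


// The queue has to be created with profiling enabled
cl_command_queue queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, null);

cl_event event = new cl_event();
clEnqueueNDRangeKernel(queue, kernelThatWritesIntoMemory, 1, null, globalWorkSize, null, 0, null, event);
clWaitForEvents(1, new cl_event[]{ event });

// Query the start and end time of the kernel execution (in nanoseconds)
long start[] = new long[1];
long end[] = new long[1];
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, Sizeof.cl_ulong, Pointer.to(start), null);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, Sizeof.cl_ulong, Pointer.to(end), null);
System.out.println("Kernel time: " + (end[0] - start[0]) / 1e6 + " ms");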

bye
Marco

Yes, I’m putting the lib in the directory where the jar file is and running it from there.

My computing kernel:


void setMathKernel(float Fw, float E, float Ro, float d, float Mg, float D1, float D2, float D3)
    {
        MathKernel =
        "__kernel void "+
        "Integrator(__global float *H1, __global float *H2, __global float *H3,"+
        "__global float *G1, __global float *G2, __global float *G3, __global float *Output)"+
        "{"+
        "const float PI = 3.14159265359;"+
        "int gid = get_global_id(0);"+
                
        "float J1=(PI*(pow("+D1+", 4)-pow(("+D1+"-2*G1[gid]), 4)))/64;"+
        "float J2=(PI*(pow("+D2+", 4)-pow(("+D2+"-2*G2[gid]), 4)))/64;"+
        "float J3=(PI*(pow("+D3+", 4)-pow(("+D3+"-2*G3[gid]), 4)))/64;"+

        "float Alfa1=(("+Fw+"*pow(H1[gid], 2))/(2*"+E+"*J1))+(("+Fw+"*(H2[gid]+H3[gid]))*H1[gid]/("+E+"*J1));"+
        "float Alfa2=(("+Fw+"*pow(H2[gid], 2))/(2*"+E+"*J2))+(("+Fw+"*H3[gid]*H2[gid])/("+E+"*J2));"+

        "float F1=(("+Fw+"*pow(H1[gid], 3))/(3*"+E+"*J1))+((("+Fw+"*(H2[gid]+H3[gid]))*pow(H1[gid], 2))/(2*"+E+"*J1));"+
        "float F2=(("+Fw+"*pow(H2[gid], 3))/(3*"+E+"*J2))+((("+Fw+"*H3[gid])*pow(H2[gid], 2))/(2*"+E+"*J2))+H2[gid]*tan(Alfa1);"+
        "float F3=(("+Fw+"*pow(H3[gid], 3))/(3*"+E+"*J3))+H3[gid]*tan(Alfa1+Alfa2);"+

        "float Fc=F1+F2+F3;"+
        "float K="+Fw+"/Fc;"+

        "float M1=(0.25*"+Ro+"*H1[gid]*PI*(pow("+D1+", 2)-pow(("+D1+"-G1[gid]), 2)));"+
        "float M2=(0.25*"+Ro+"*H2[gid]*PI*(pow("+D2+", 2)-pow(("+D2+"-G2[gid]), 2)));"+
        "float M3=(0.25*"+Ro+"*H3[gid]*PI*(pow("+D3+", 2)-pow(("+D3+"-G3[gid]), 2)));"+

        "float Mr1=0;"+
        "float h=0;"+
        "while(h<H1[gid])"+
        "{"+
        "    h+="+d+";"+
        "    Mr1+=pow((("+Fw+"*pow(h, 3))/(3*"+E+"*J1))+(("+Fw+"*(H1[gid]+H2[gid]+H3[gid]-h)*pow(h,2))/(2*"+E+"*J1)),2);"+
        "}"+
        "Mr1*=(M1/pow(Fc, 2))*"+d+";"+

        "float Mr2=0;"+
        "h=H1[gid];"+
        "while(h<(H1[gid]+H2[gid]))"+
        "{"+
        "    h+="+d+";"+
        "    Mr2+=pow((F1+(h-H1[gid])*tan(Alfa1)+("+Fw+"*pow((h-H1[gid]), 3))/(3*"+E+"*J2))+(("+Fw+"*(H2[gid]+H3[gid]-h)*pow((h-H1[gid]),2))/(2*"+E+"*J2)),2);"+
        "}"+
        "Mr2*="+d+"*(M2/pow(Fc, 2));"+

        "float Mr3=0;"+
        "h=H1[gid]+H2[gid];"+
        "while(h<(H1[gid]+H2[gid]+H3[gid]))"+
        "{"+
        "    h+="+d+";"+
        "    Mr3+=pow((F1+F2+H2[gid]*tan(Alfa1)+(h-H1[gid]-H2[gid])*tan(Alfa1+Alfa2)+("+Fw+"*pow((h-H1[gid]-H2[gid]), 3))/(3*"+E+"*J3)+("+Fw+"*(H1[gid]+H2[gid]+H3[gid]-h)*pow((h-H1[gid]-H2[gid]),2))/(2*"+E+"*J3)),2);"+
        "}"+
        "Mr3*="+d+"*(M3/pow(Fc, 2));"+

        "float Mz="+Mg+"+Mr1+Mr2+Mr3;"+
        "float Nr=(30/PI)*sqrt(K/Mz);"+
        "if(Nr>Output[0])"+ //Here I update parameters of the best combination found yet
          "{"+
            "Output[0] = Nr;"+
            "Output[1] = H1[gid];"+
            "Output[2] = H2[gid];"+
            "Output[3] = H3[gid];"+
            "Output[4] = G1[gid];"+
            "Output[5] = G2[gid];"+
            "Output[6] = G3[gid];"+
            "Output[7] = M1;"+
            "Output[8] = M2;"+
            "Output[9] = M3;"+
            "Output[10] = F1;"+
            "Output[11] = F2;"+
            "Output[12] = F3;"+
          "}"+
        "}";
}

Two questions about this:
A: Why doesn’t the GPU context compile such a long kernel string? It doesn’t give me any errors, it just hangs. If I comment out, for example, the 6 lines of the parameter update in the last if statement, it goes through without any problems.

B: Thanks to you I measured the computing times properly, and the GPU still does the same job 10 times slower than my CPU. I’ve based my computation on the first sample from the page, and I don’t know whether that sample automatically uses all available threads on my GPU or not. It’s just impossible for my GPU to be worse at this than my CPU:

My CPU should have about 15 GFLOPS while the GPU should have something like 620 GFLOPS, so where did I screw up this kernel?

OK… This is quite complex…

There certainly are limitations on the kernel size, but I think they are not really specified and are, at the very least, implementation dependent. You might consider splitting the kernel and pulling out some parts into separate helper functions; maybe this helps…

The kernel code itself is far from optimal. I don’t know about the role of the ‚Fw‘, ‚Ro‘ etc. parameters which are inlined into the source code as constants. The code itself might become clearer if these were passed in as parameters to the kernel, or as "#define"d constants (#defines can be given as build options when building the program). That would at least allow you to write the code into a single file, so you would not have to define it as a String. But maybe this is just a subjective issue. And by the way: you should be careful with floating point constants. Some compilers will complain about constants like „0.25“ when they should be treated as ‚float‘ constants. The string „0.25“ represents a ‚double‘ value. For a float value, you should use ‚0.25f‘.
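For example, the constants mentioned above could be passed as #defines via the build options, roughly like this (the exact option string is only an illustration - note that the program has to be rebuilt whenever these values change, so this only makes sense for constants that are fixed for the whole run):


// Inject the constants as #defines instead of concatenating them into the source
String buildOptions = "-D Fw=" + Fw + "f -D E=" + E + "f -D Ro=" + Ro + "f";
clBuildProgram(program, 0, null, buildOptions, null, null);
// Inside the kernel source, Fw, E and Ro can then be used like ordinary identifiers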

Concerning possible optimizations of the kernel:

Some possible improvements are quite obvious, and could also be applied to the CPU code. A clever compiler might recognize some of these and optimize them on the fly. But you could at least try to help him, and see if you can achieve a speedup. If not, the code may at least become simpler :wink:

So these optimizations are only guesses - whether or not they bring an improvement has to be tested!

Something like
y = pow(x, 2)
will most likely be less efficient than a simple
y = x*x
Additionally, when you already have computed y = x*x, you could replace each
z = pow(x, 3)
with
z = y*x;
which might also be faster. This could be stated more generally: there are lots of (really lots of) common sub-terms. Just one small example:

float M1=(0.25*Ro*H1[gid]*PI*(pow(D1, 2)-pow((D1-G1[gid]), 2)));
float M2=(0.25*Ro*H2[gid]*PI*(pow(D2, 2)-pow((D2-G2[gid]), 2)));
float M3=(0.25*Ro*H3[gid]*PI*(pow(D3, 2)-pow((D3-G3[gid]), 2)));

could be written as something like

float constantA = 0.25*Ro*PI;
float M1=(constantA*H1[gid]*(pow(D1, 2)-pow((D1-G1[gid]), 2)));
float M2=(constantA*H2[gid]*(pow(D2, 2)-pow((D2-G2[gid]), 2)));
float M3=(constantA*H3[gid]*(pow(D3, 2)-pow((D3-G3[gid]), 2)));

and possibly replacing the 'pow’s yielding

float constantA = 0.25*Ro*PI;
float dmg1 = D1-G1[gid];
float dmg2 = D2-G2[gid];
float dmg3 = D3-G3[gid];
float M1=constantA*H1[gid]*(D1*D1-dmg1*dmg1);
float M2=constantA*H2[gid]*(D2*D2-dmg2*dmg2);
float M3=constantA*H3[gid]*(D3*D3-dmg3*dmg3);

Apart from that, there are some optimizations that are more specific for OpenCL (or GPUs in general) :

You should reduce the number of global memory accesses! Each access to global memory has an overhead (latency) of about 600 clock cycles (!). The memory locations H1[gid], H2[gid] … G2[gid] are literally accessed hundreds of times. You should at least pull out these accesses, like
float h1 = H1[gid];
float h2 = H2[gid];
...
float g2 = G2[gid];
And use these values later on, replacing something like
while(h<(H1[gid]+H2[gid]))
with
while(h<(h1+h2))

Again, a compiler might detect and internally optimize this, but you should not place your bet on that…

This can even be further optimized by using local memory, but this is advanced, and I’m not so much involved in the existing code (and OpenCL) to give more specific hints than: Look this up in the OpenCL programming guide :wink:

If possible, you should reduce the number of branches (loops). But this might be hard, depending on the exact semantics of the computation, so it should probably be done after all other optimization attempts.

One important point concerning the correctness of the kernel: Note that the whole kernel is computed by several hundred threads in parallel. So the update in the last few lines…

if(Nr>Output[0])
{
    Output[0] = Nr;
    Output[1] = H1[gid];
    ...
}

will definitely mess up the results, and yield a lot of garbage: One thread might update Output[0], and another thread might update it simultaneously. There is no guarantee that Output[0] will contain a valid number afterwards - not to mention that Output[0] might contain results from one work-item, and Output[1] might contain results from another…

This could somehow be solved using „atomic functions“ (see Section „6.11.11 Atomic Functions“ of the OpenCL spec), but this might cause the threads to process this block sequentially, destroying lots of the parallelism. One option could be

  • allocate an output area of globalSize*13 values
  • let all threads write their results blindly into this memory (regardless of the quality of the solution)
  • afterwards search these ‚globalSize‘ results on the host side to find the best one (a rough sketch of this follows below)
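Just to illustrate this idea, a minimal host-side sketch (assuming the kernel was changed so that each work-item writes its 13 values to Output[get_global_id(0)*13 + i] instead of the shared Output[0..12] - the argument index and ‚globalSize‘ are placeholders):


int valuesPerItem = 13;
int globalSize = ...; // number of combinations in this batch

// One slot of 13 floats per work-item, written unconditionally by the kernel
cl_mem resultsMem = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
    Sizeof.cl_float * valuesPerItem * globalSize, null, null);
clSetKernelArg(kernel, 6, Sizeof.cl_mem, Pointer.to(resultsMem));

clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, new long[]{ globalSize }, null, 0, null, null);

// Read everything back and search for the best result on the host
float results[] = new float[valuesPerItem * globalSize];
clEnqueueReadBuffer(commandQueue, resultsMem, CL_TRUE, 0,
    Sizeof.cl_float * results.length, Pointer.to(results), 0, null, null);

int bestIndex = 0;
for (int i = 1; i < globalSize; i++)
{
    if (results[i * valuesPerItem] > results[bestIndex * valuesPerItem])
    {
        bestIndex = i; // results[bestIndex*13 ... bestIndex*13+12] holds the best combination
    }
}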

Again: It’s an art (and I’m not an artist in this sense, so consider all this just as hints and not as advice…)

bye
Marco

I found out why it is so slow, but I still don’t know how to change it. The problem is that my code uses only the first stream processor of the 80 available for computing, so I have only 5 of the 400 threads available.

EDIT: I know that I should optimize the kernel source string, however I still don’t have the right formulas for this and something’s screwed up, so I won’t optimize it until then. It’s just that I wanted to know if there’s something that could be totally hanging my GPU’s SPs. I want to know how to make the kernel run on all available SPs.

I assume that you did not yet see my post, but considering this, most of my guesses and hints might have been (at least partially) unjustified.

How do you set up and call the kernel (which global/local work sizes etc.)?

How I create my context - I assume this is something I could have screwed up:


public void createContext()
    {
        long numBytes[] = new long[1];

        // Obtain the platform IDs and initialize the context properties
        cl_platform_id platforms[] = new cl_platform_id[1];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);
        // Create an OpenCL context on a CPU or GPU device, depending on AccType
        if(AccType==0)
        {
            context = clCreateContextFromType(contextProperties, CL_DEVICE_TYPE_CPU, null, null, null);
        }
        else
        {
            context = clCreateContextFromType(contextProperties, CL_DEVICE_TYPE_GPU, null, null, null);
        }
        if (context == null)
            {
                System.out.println("Unable to create a context");
            }
        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);
        // Get the list of GPU devices associated with the context
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);
        // Obtain the cl_device_id for the first device
        int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0], Pointer.to(devices), null);
        // Create a command-queue
        commandQueue = clCreateCommandQueue(context, devices[0], 0, null);

            //Checking out my computing units
            int computeUnits[] = new int[1];
            clGetDeviceInfo(devices[0], CL_DEVICE_MAX_COMPUTE_UNITS, Sizeof.cl_uint, Pointer.to(computeUnits), null);
            System.out.println("computeUnits: "+computeUnits[0]+" numDevices: "+numDevices);
            //this gives me 5 units and 1 device for gpu and 2 units and 1 device for cpu

        // Create the program from the source code
        program = clCreateProgramWithSource(context, 1, new String[]{ MathKernel }, null, null);

        memObjects = new cl_mem[6];
        outputObject = new cl_mem[1];
        outputObject[0] = clCreateBuffer(context, CL_MEM_READ_WRITE, Sizeof.cl_float * 13, null, null);

        // Build the program
        clBuildProgram(program, 0, null, null, null, null);

        // Create the kernel
        kernel = clCreateKernel(program, "Integrator", null);
    }

And this is how I execute the kernel:


        long global_work_size[] = new long[]{n};
        long local_work_size[] = new long[]{1};
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, global_work_size, local_work_size, 0, null, null);
        clFinish(commandQueue);

You may try passing in ‘null’ as the local_work_size: This should cause the OpenCL implementation to choose the “most appropriate” local work size.
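That is, roughly (with the same variables as in your snippet above):


long global_work_size[] = new long[]{n};
// Passing 'null' lets the implementation choose the local work size
clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, global_work_size, null, 0, null, null);
clFinish(commandQueue);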

BTW: How did you find out that only one processor was used?

[QUOTE=Marco13]You may try passing in ‚null‘ as the local_work_size: This should cause the OpenCL implementation to choose the „most appropriate“ local work size.

BTW: How did you find out that only one processor was used?[/QUOTE]

Bless you! That null did the trick, man :smiley:

About how I found out? I thought I had checked it, because the calculated times matched my idea, but now I know my idea was a bad one :P. Anyway, I was using only one stream processor because of this: „long local_work_size[] = new long[]{1};“ - I assumed it tells CL the list of SP IDs to use.

Thanks for your great help.

Btw: Not all of your optimizations work better than my original code. Adding new variables to the section calculating M1, M2 and M3 made it slower than the original.

EDIT:
„Note, however, that the most recent version of the NVIDIA Toolkit does not yet support OpenCL 1.1“
I’ve only used functions from the OpenCL 1.0 sample, I assume, so it won’t matter to me.
So I need the Dev drivers and the CUDA toolkit for this to work? Is there anything I need to additionally configure?

Well, introducing new variables in a kernel can cause it to become slower, because the GPU has only a limited number of registers, and when the local variables do not fit into the registers, it becomes slow. I’m not sure whether this could be alleviated by breaking the kernel into several smaller helper functions (but it is not unlikely that this will not help - the definite answer probably depends heavily on the OpenCL implementation and the hardware…)

[QUOTE=kacperpl1;13722]
So I need the Dev drivers and the CUDA toolkit for this to work? Is there anything I need to additionally configure?[/QUOTE]

Usually not. I have not tested JOCL with the latest (RC2) release, but I think it should work. As soon as NVIDIA updates to OpenCL 1.1, JOCL will be updated as well.

About this:

One important point concerning the correctness of the kernel: Note that the whole kernel is computed by several hundred threads in parallel. So the update in the last few lines…


if(Nr>Output[0])
{
    Output[0] = Nr;
    Output[1] = H1[gid];
    ...
}

will definitely mess up the results, and yield a lot of garbage: One thread might update Output[0], and another thread might update it simultaneously. There is no guarantee that Output[0] will contain a valid number afterwards - not to mention that Output[0] might contain results from one work-item, and Output[1] might contain results from another...

I know it shouldn’t work properly, however when I check the results there are no misplaced values - I’m checking whether they match the xls doc prepared by my boss :stuck_out_tongue:
Anyway, I’m not sure if it’s the best result, because I think there can be a situation like this:
Thread 1 has the best result
Thread 2 has a slightly worse result

  1. T1 checks if(Nr>Output[0])
  2. T2 checks if(Nr>Output[0])
  3. T1 writes his results
  4. T2 writes his results

I could have written a boolean for checking whether the resource is ready and looped while it’s not, but this would totally suck because of the read delays.

I understand that I should write an atomic operation for this and execute it instead of my if condition and its body, but the question is how?

First a short answer for the final question (“how?”): I don’t know.

The atomic functions are actually only intended to ensure data consistency for single 32bit values - coping with a block of 13 values which have to be written atomically based on a certain condition is probably more complicated. And I have to admit that I don’t have enough experience with atomic functions to say definitely how this could be done. It might be doable with some barrier(CLK_GLOBAL_MEM_FENCE), and possibly it might be necessary to use one atomic operation for each of the 13 values, but these are just guesses. All I can say is that it might be dangerous to rely on a few experimental results where everything seemed to work fine - this could easily be different on another GPU, for example.

Sorry that I can’t help you with this, but maybe one of the OpenCL experts in the AMD, NVIDIA or Khronos forums can give you more specific hints about how this could be solved with atomics…

As I mentioned, my first approach would be to separate the computation of these values from the process of finding the maximum, although I cannot say whether this would be the “best” solution.
Maybe one should also think about exploiting local memory, on the one hand for efficiency, and on the other hand to possibly alleviate this synchronization problem, but this would also require some more effort to be done properly.

bye
Marco

I have some memory issues with my app. I thought that the problem was the lazy garbage collector in Java, however I checked whether this still happens if I don’t allocate CL memory and execute the kernel, and it doesn’t => it’s an OpenCL issue. I think there is something wrong with clReleaseMemObject. It does release the pointer to the buffer array, but I think it leaves the objects in the buffer still allocated in memory, and my app is constantly consuming more memory. I hope you can give me an idea of what I could be doing wrong.
EDIT: I’ve got a hint - the memory was finally collected when the process had allocated 4GB ^^ on XP x64 SP2

I’m not sure what you refer to - this is probably related to http://forum.byte-welt.de/showthread.php?t=3147 ?

So did the memory finally get collected? Can you give more specific information about your profiling runs (i.e. whether there is a large (and increasing) number of “living” objects)?

Yeah, sorry, I forgot about that topic. It finally got collected on a machine with 8 gigs, however java.exe gets killed on systems with less memory when it reaches above 2GB, so it’s unpleasant. And yes, every loop I’m releasing the whole memory. It looks like the garbage collector will not wake up until a really massive amount of memory has been allocated.

There are several settings that affect the behavior of the GC, but I cannot imagine why it should be necessary to change them. Do you use some direct buffers, allocated with ByteBuffer.allocateDirect? I’m not sure how to find out the reason for this problem, since I assume that there is no “small” example which is similar to the real application and which helps to reproduce the error…?

I tried to reproduce this memory consumption with the sample I based my app on, but I couldn’t - at least not quickly. I’ll play with the sample more when I have some free time, and then maybe I’ll find the source of those leaks. And about direct allocation - no, I only use clCreateBuffer to allocate space for my input and output data.