Calculating Pi

Hi there,
I’m trying to calculate Pi with OpenCL by evaluating an integral numerically. Without doubles I lose a lot of precision, but if I try to use doubles in my program, it just doesn’t work…
I have read this thread and tried it (using the OpenCL extensions…):
http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=125571&enterthread=y

It seems that whenever I call clCreateBuffer with “Sizeof.cl_double”, or read the results back the same way, I always get wrong results… (as if the doubles were never initialized).
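For reference, the double version of the allocation looked roughly like this (just a sketch; the host array, the buffer size and the read size all use cl_double consistently):

        // Host-side output array of doubles instead of floats
        double dstArray[] = new double[n];
        Pointer dst = Pointer.to(dstArray);

        // Output buffer sized for n doubles
        cl_mem outMem = clCreateBuffer(context, CL_MEM_READ_WRITE,
            Sizeof.cl_double * n, null, null);

        // ... set kernel arguments and enqueue the kernel ...

        // Read back n doubles
        clEnqueueReadBuffer(commandQueue, outMem, CL_TRUE, 0,
            n * Sizeof.cl_double, dst, 0, null, null);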

My kernel code is the following, here using float (this one works):

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void pi(__global const int * iteraciones, __global float * salida)
{
    // pid is the id of this work-item
    int iter = iteraciones[0];
    uint pid = get_global_id(0);
    uint nth = get_global_size(0);
    float paso = 0, x = 0, sum = 0.0;
    int trozo = iteraciones[0] / nth;
    int resto = iteraciones[0] % nth;
    paso = 1.0 / (double)iter;
    for (int i = pid*trozo; i < (pid+1)*trozo; i++) {
        x = (i + 0.5) * paso;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    // if we are the last work-item, we also process the remainder
    if (pid == nth-1) {
        for (int i = pid*trozo; i < pid*trozo+resto; i++) {
            x = (i + 0.5) * paso;
            sum = sum + 4.0 / (1.0 + x * x);
        }
    }
    salida[pid] = sum * paso;
}

My memory allocation code:

        cl_mem memObjects[] = new cl_mem[2];
        memObjects[0] = clCreateBuffer(context,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            Sizeof.cl_int, srcA, null);
        memObjects[1] = clCreateBuffer(context,
            CL_MEM_READ_WRITE,
            Sizeof.cl_float * n, null, null);
My memory read code: 
        // Read the output data
        clEnqueueReadBuffer(commandQueue, memObjects[1], CL_TRUE, 0,
            n * Sizeof.cl_float, dst, 0, null, null);
Does this library have support for this kind of OpenCL extension?
Thanks in advance

Hello

Do you have a graphics card that supports double precision? According to the list on the AMD site, you’ll need one of these:
Radeon HD 5900
Radeon HD 5800
Radeon HD 4800
Mobility Radeon HD 4800
FirePro V8800
FirePro V8700
FirePro V7800
FireStream 9200

Since I’m currently using an NVIDIA card, I can hardly test the experimental(!) double support of AMD.
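By the way, you can check this at runtime by querying the extensions string of the device. A minimal sketch with JOCL (assuming ‘device’ is one of the cl_device_id values obtained from the context):

// Query the size of the extensions string, then the string itself
long size[] = new long[1];
clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, null, size);
byte buffer[] = new byte[(int) size[0]];
clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS,
    buffer.length, Pointer.to(buffer), null);
// The string is 0-terminated, so omit the last byte
String extensions = new String(buffer, 0, buffer.length - 1);
System.out.println("cl_khr_fp64 supported: "
    + extensions.contains("cl_khr_fp64"));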

It may be possible to emulate double precision using special single precision routines, as mentioned in the thread you linked. There are some (few!) QuadFloat functions on jocl.org, but only basic arithmetic, and no functions for converting these numbers to Strings - although such functions may be added in the future. In any case, these functions may be cumbersome to use, since there is no operator overloading in OpenCL. If you can provide a small example showing the arguments with which this kernel should be called, I can try to use QuadFloat for it, but I can not make any promises that this will work…

BTW: The kernel you posted is still using double values. In order to use float, you should add the suffix ‘f’ to all the constant values:


paso = 1.0f / (float)iter;
...
x = (i + 0.5f) * paso;
...

otherwise other OpenCL implementations (namely that of NVIDIA) might complain about the double values…

bye

Thanks for your reply,
I have an NVIDIA card (Quadro 770M) but… bad luck, it doesn’t support doubles at all :( (I thought my card supported them, but I checked the supported extensions just now - bad luck…). I will try QuadFloat tomorrow and post some results here if I can make the program work.
If you want to try, the parameters for the kernel are:

  • the number of iterations (more iterations, more Pi precision)
  • an output array whose size equals the global work size
Here is the complete Java code (like your small example):

package opencl;

/*
 * JOCL - Java bindings for OpenCL
 *
 * Copyright 2009 Marco Hutter - http://www.jocl.org/
 */


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import static org.jocl.CL.*;
import org.jocl.*;

/**
 * A small JOCL sample.
 */
public class Main
{
    /**
     * The source code of the OpenCL program to execute
     */
    private static String programSource =readFile("kernels/Picl.cl");

    /**
     * The entry point of this sample
     *
     * @param args Not used
     */
    public static void main(String args[])
    {
        long numBytes[] = new long[1];

        // Create input- and output data
        int n = 512;
        int[] srcIter = new int[1];
        float dstArray[] = new float[n];
        for (int i=0; i<n; i++)
        {
            dstArray[i] = 0;
        }
        srcIter[0]=90000000;
        Pointer srcA = Pointer.to(srcIter);
        Pointer dst = Pointer.to(dstArray);

        // Obtain the platform IDs and initialize the context properties
        System.out.println("Obtaining platform...");
        cl_platform_id platforms[] = new cl_platform_id[1];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

        // Create an OpenCL context on a GPU device
        cl_context context = clCreateContextFromType(
            contextProperties, CL_DEVICE_TYPE_GPU, null, null, null);
        if (context == null)
        {
            // If no context for a GPU device could be created,
            // try to create one for a CPU device.
            context = clCreateContextFromType(
                contextProperties, CL_DEVICE_TYPE_CPU, null, null, null);

            if (context == null)
            {
                System.out.println("Unable to create a context");
                return;
            }
        }

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Get the list of GPU devices associated with the context
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);

        // Obtain the cl_device_id for the first device
        int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0],
            Pointer.to(devices), null);

        // Create a command-queue
        cl_command_queue commandQueue =
            clCreateCommandQueue(context, devices[0], 0, null);

        // Allocate the memory objects for the input- and output data
        cl_mem memObjects[] = new cl_mem[2];
        memObjects[0] = clCreateBuffer(context,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            Sizeof.cl_int, srcA, null);
        memObjects[1] = clCreateBuffer(context,
            CL_MEM_READ_WRITE,
            Sizeof.cl_float * n, null, null);

        // Create the program from the source code
        cl_program program = clCreateProgramWithSource(context,
            1, new String[]{ programSource }, null, null);

        // Build the program
        clBuildProgram(program, 0, null, null, null, null);

        // Create the kernel
        cl_kernel kernel = clCreateKernel(program, "pi", null);

        // Set the arguments for the kernel
        clSetKernelArg(kernel, 0,
            Sizeof.cl_mem, Pointer.to(memObjects[0]));
        clSetKernelArg(kernel, 1,
            Sizeof.cl_mem, Pointer.to(memObjects[1]));

        // Set the work-item dimensions
        long global_work_size[] = new long[]{n};
        long local_work_size[] = new long[]{1};

        // Execute the kernel
        long before=System.nanoTime();
         cl_event kernelEvent1 = new cl_event();
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
            global_work_size, local_work_size, 0, null, kernelEvent1);
          // Wait for the events, i.e. until the kernels have completed
        System.out.println("Waiting for events...");
        CL.clWaitForEvents(1, new cl_event[]{kernelEvent1});
        long after=System.nanoTime();
        double kernelTime = (double) (after - before) / 1e9;
        System.out.println("kernel time(secs): "+ kernelTime);

        // Read the output data
        clEnqueueReadBuffer(commandQueue, memObjects[1], CL_TRUE, 0,
            n * Sizeof.cl_float, dst, 0, null, null);
         float pi=0;
         for (int i = 0; i < dstArray.length; i++) {
            pi += dstArray[i];
        }
        System.out.println(pi);
        // Release kernel, program, and memory objects
        clReleaseMemObject(memObjects[0]);
        clReleaseMemObject(memObjects[1]);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);
    }
        private static String readFile(String fileName) {
        try {
            BufferedReader br = new BufferedReader(new FileReader(fileName));
            // we use a StringBuilder since Strings are immutable in Java
            StringBuilder sb = new StringBuilder();
            String line = null;
            while (true) {
                line = br.readLine();
                if (line == null) {
                    break;
                }
                sb.append(line + "
");
            }
            return sb.toString();
        } catch (IOException e) {
            e.printStackTrace();
            return "";
        }
    }

}

Ah, sorry, for some reason I assumed that you had an AMD card.

I’ll give the QuadFloats a try as soon as possible.

Hello

I have ported the example you posted so that it uses QuadFloat. Besides some slight modifications (my Spanish is not the best, so I renamed some variables ;) ) I also added a test that compares the results to a plain Java computation.

In fact, using QuadFloat is very slow. Maybe I should also port the “DoubleFloat” functions, which should be much more efficient but still give a noticeably higher precision than plain floats. It shows, however, that the computation with QuadFloat seems to give more correct decimal places than plain float - but since everything has to be written back into a float array and then summed up into a float, the precision is not nearly as high as when a double is used in Java for the array and the sum.
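To illustrate the idea behind such “DoubleFloat” numbers: a value is stored as an unevaluated sum of two floats, and the rounding error of each operation is captured explicitly and carried along. A minimal Java sketch of the core building block (a hypothetical helper, just to show the principle - not the actual jocl.org functions):

// Knuth's two-sum: computes s and e so that a + b == s + e exactly,
// where s is the rounded float sum and e is the rounding error
static float[] twoSum(float a, float b)
{
    float s = a + b;
    float v = s - a;
    float e = (a - (s - v)) + (b - v);
    return new float[] { s, e };
}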

Anyway, here’s the code:

The kernel


__kernel void pi(int intervals, __global float *solution)
{
    uint pid = get_global_id(0);
    uint globalSize = get_global_size(0);

    float4 intervalWidth = qfAssign(0.0f);
    float4 x = qfAssign(0.0f);
    float4 sum = qfAssign(0.0f);
    float4 xx = qfAssign(0.0f);
    float4 tmp = qfAssign(0.0f);

    int elements = intervals / globalSize;
    int rest = intervals % globalSize;
    int length = elements;
    if (pid == globalSize - 1)
    {
        length += rest;
    }

    intervalWidth = qfDiv(qfAssign(1), qfAssign(intervals));
    for (int i = pid * elements; i < pid * elements + length; i++)
    {
        //x = (i + 0.5f) * intervalWidth;
        qfMul(&x, qfAssign(i+0.5f), intervalWidth);

        // sum += 4.0f / (1.0f + x * x)
        qfMul(&xx, x, x);
        qfAdd(&tmp, xx, qfAssign(1));
        tmp = qfDiv(qfAssign(4), tmp);
        qfAdd(&sum, sum, tmp);
    }
    //solution[pid] = sum * intervalWidth;
    qfMul(&tmp, sum, intervalWidth);
    solution[pid] = tmp.x;
}

The main program:

package test;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import static org.jocl.CL.*;
import org.jocl.*;

public class PITest
{
    private static String programSource0 = readFile("kernels/QuadFloat.cl");
    private static String programSource1 = readFile("kernels/Picl.cl");

    public static void main(String args[])
    {
        // Create input- and output data
        int n = 512;
        float dstArray[] = new float[n];
        for (int i = 0; i < n; i++)
        {
            dstArray[i] = 0;
        }
        Pointer dst = Pointer.to(dstArray);
        int intervals = 10000000;

        // Obtain the platform IDs and initialize the context properties
        cl_platform_id platforms[] = new cl_platform_id[1];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

        // Create an OpenCL context on a GPU device
        cl_context context = clCreateContextFromType(contextProperties,
            CL_DEVICE_TYPE_GPU, null, null, null);
        if (context == null)
        {
            // If no context for a GPU device could be created,
            // try to create one for a CPU device.
            context = clCreateContextFromType(contextProperties,
                CL_DEVICE_TYPE_CPU, null, null, null);

            if (context == null)
            {
                System.out.println("Unable to create a context");
                return;
            }
        }

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Get the list of GPU devices associated with the context
        // and obtain the cl_device_id for the first device
        long numBytes[] = new long[1];
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);
        int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 
            numBytes[0], Pointer.to(devices), null);

        // Create a command-queue, program and kernel
        cl_command_queue commandQueue = clCreateCommandQueue(context,
            devices[0], 0, null);
        cl_program program = clCreateProgramWithSource(context, 2,
            new String[] { programSource0, programSource1 }, null, null);
        clBuildProgram(program, 0, null, null, null, null);
        cl_kernel kernel = clCreateKernel(program, "pi", null);

        
        // Allocate the memory objects for the input- and output data
        cl_mem memObject = clCreateBuffer(context, CL_MEM_READ_WRITE,
            Sizeof.cl_float * n, null, null);
        
        // Set the arguments for the kernel
        clSetKernelArg(kernel, 0, Sizeof.cl_int, 
            Pointer.to(new int[] { intervals }));
        clSetKernelArg(kernel, 1, Sizeof.cl_mem, 
            Pointer.to(memObject));

        // Set the work-item dimensions and execute the kernel
        long global_work_size[] = new long[] { n };
        long before = System.nanoTime();
        cl_event kernelEvent1 = new cl_event();
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, global_work_size,
            null, 0, null, kernelEvent1);

        // Wait for the events, i.e. until the kernels have completed
        System.out.println("Waiting for events...");
        clWaitForEvents(1, new cl_event[] { kernelEvent1 });
        long after = System.nanoTime();
        double kernelTime = (double) (after - before) / 1e9;
        System.out.println("kernel time(secs): " + kernelTime);

        // Read the output data
        clEnqueueReadBuffer(commandQueue, memObject, CL_TRUE, 0, 
            n * Sizeof.cl_float, dst, 0, null, null);

        // Compute the result
        float pi = 0;
        for (int i = 0; i < dstArray.length; i++)
        {
            pi += dstArray[i];
        }
        System.out.printf("OpenCL: %.20f
", pi);

        // Compute the result using Java (only float)
        pi = 0;
        before = System.nanoTime();
        test(intervals, dstArray, n);
        after = System.nanoTime();
        double javaTime = (double) (after - before) / 1e9;
        System.out.println("java   time(secs): " + javaTime);
        for (int i = 0; i < dstArray.length; i++)
        {
            pi += dstArray[i];
        }
        System.out.printf("Java  : %.20f
", pi);

        
        // Release kernel, program, and memory objects
        clReleaseMemObject(memObject);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);
    }

    private static String readFile(String fileName)
    {
        try
        {
            BufferedReader br = new BufferedReader(new FileReader(fileName));
            // we use a StringBuilder since Strings are immutable in Java
            StringBuilder sb = new StringBuilder();
            String line = null;
            while (true)
            {
                line = br.readLine();
                if (line == null)
                {
                    break;
                }
                sb.append(line + "
");
            }
            return sb.toString();
        }
        catch (IOException e)
        {
            e.printStackTrace();
            return "";
        }
    }

    private static void test(int intervals, float solution[], int get_global_size)
    {
        for (int pid = 0; pid < get_global_size; pid++)
        {
            // uint pid = get_global_id(0);
            // uint globalSize = get_global_size(0);
            int globalSize = get_global_size;
            
            float intervalWidth = 0;
            float x = 0;
            float sum = 0.0f;
            int elements = intervals / globalSize;
            int rest = intervals % globalSize;
            int length = elements;
            if (pid == globalSize - 1)
            {
                length += rest;
            }
            intervalWidth = 1.0f / (float) intervals;
            for (int i = pid * elements; i < pid * elements + length; i++)
            {
                x = (i + 0.5f) * intervalWidth;
                sum = sum + 4.0f / (1.0f + x * x);

            }
            solution[pid] = sum * intervalWidth;
        }
    }
}

It also requires the http://jocl.org/utilities/QuadFloat.cl file…

Thanks very much, I will give it a try.
I’m trying to get more FLOPS out of my version. A simple calculation: 60M iterations * 6 flops / 0.6 s = 600 MFLOPS. My GPU theoretically delivers 25 GFLOPS in single precision, so I think my implementation is too slow (this kind of operation doesn’t have dependencies or anything weird…). If I can improve the speed in the next few days, I will post the results here.
Thanks again. :)

So when it’s all about achieving the maximum number of FLOPS, and not about achieving a higher precision, there certainly are some approaches. The first would IMHO be to ensure that all blocks have the same size (I’m not sure whether the computation of the ‘resto’ is correct and makes sense… maybe I’ll find the time to take a closer look at this…).
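One simple host-side option for getting equal block sizes (a sketch, hypothetical variable names): round the number of intervals up to the next multiple of the work size, so that the remainder handling in the kernel disappears entirely - the few extra intervals only increase the precision slightly:

int n = 512;                  // global work size
int requested = 90000000;     // desired number of intervals
// round up to the next multiple of n
int intervals = ((requested + n - 1) / n) * n;
int elementsPerWorkItem = intervals / n;  // identical for all work-items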

Hi,
I’m thinking about removing the for loop and doing the iterations via global_work_size and local_work_size to improve the speed, but first I need to study some CUDA manuals, because I have to learn more about this (working with dimensions… NDRange…).
In the meantime, I have improved the kernel by doing the reduction of the array inside it:

//#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void pi(__global const int * iterNumb, __global float * outArray, __global float * pi)
{
    int iter = iterNumb[0];
    uint pid = get_global_id(0);
    uint nth = get_global_size(0);
    float paso = 0.0f, x = 0.0f, sum = 0.0f;
    int step = iterNumb[0] / nth;
    int rest = iterNumb[0] % nth;
    paso = 1.0f / iter;
    // each work-item processes its own chunk
    //#pragma unroll
    for (int i = pid*step; i < (pid+1)*step; i++) {
        x = (i + 0.5f) * paso;
        sum = sum + 4.0f / (1.0f + x * x);
    }
    // the last work-item also processes the remainder
    if (pid == nth-1) {
        for (int i = pid*step; i < pid*step+rest; i++) {
            x = (i + 0.5f) * paso;
            sum = sum + 4.0f / (1.0f + x * x);
        }
    }
    // all work-items write their result to the outArray,
    // then we wait until all of them have finished
    outArray[pid] = sum * paso;
    barrier(CLK_GLOBAL_MEM_FENCE);

    // parallel reduction on the GPU
    for (int s = get_global_size(0)/2; s > 0; s = s/2)
    {
        if (pid < s) outArray[pid] += outArray[pid+s];
        barrier(CLK_GLOBAL_MEM_FENCE);
    }

    if (pid == 0) pi[0] = outArray[0];
}

Yes, doing the reduction inside the kernel is definitely a more sophisticated approach.

A reduction itself is very simple (if not trivial), but doing it well on the GPU may not be so easy. There are several examples of a reduction with OpenCL on the web (for example from NVIDIA, although I currently can’t download that one). As far as I remember, some of them contain tricky optimizations, including the use of local memory.
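For reference, this is the basic halving scheme that these reduction kernels implement, written out in plain Java (assuming a power-of-two length):

// In each pass, element i accumulates element i+s,
// and the active range is halved
static float reduce(float data[])
{
    for (int s = data.length / 2; s > 0; s /= 2)
    {
        for (int i = 0; i < s; i++)
        {
            data[i] += data[i + s];
        }
    }
    return data[0];
}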

I think you could either…

  • do the computation of the vector in your kernel and apply one of the existing reduction kernels to the result, or
  • combine the actual computation and the reduction in a single kernel, as you did now.

The latter may be promising, because it might be possible to do all the computations in local memory, which would probably beat any approach using global memory by an order of magnitude. However, I’m not (yet) so deeply involved in actual OpenCL programming, so I’d also have to think a while about how this could be done efficiently…

Hi,
I tried to do everything in shared memory, but the reduction is wrong… (I tried a lot of things while reading the examples in the NVIDIA SDK). I think I haven’t understood the memory model correctly…

Anyway, here is my code (if I write to global memory “outArray”, the numbers are OK, but in shared memory “sdata” ???):

//#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void pi(__global const int * iterNumb, __local float * sdata, __global float * pi, __global float * outArray,
    __global const int * stepp, __global const int * restt)
{
    unsigned int iter = iterNumb[0];
    unsigned pid = get_global_id(0);
    unsigned tid = get_local_id(0);
    unsigned nth = get_global_size(0);
    float x = 0.0f, sum = 0.0f;
    sdata[tid] = 0.0f;
    // the modulo operation is slow on GPUs, so we avoid it here
    unsigned int step = stepp[0];
    unsigned int rest = restt[0];
    float paso = 1.0f / iter;
    // each work-item processes its own chunk
//#pragma unroll
    for (int i = pid*step; i < (pid+1)*step; i++) {
        x = (i + 0.5f) * paso;
        sum = sum + 4.0f / (1.0f + x * x);
    }
    // the last work-item also processes the remainder
    if (pid == nth-1) {
        for (int i = pid*step; i < pid*step+rest; i++) {
            x = (i + 0.5f) * paso;
            sum = sum + 4.0f / (1.0f + x * x);
        }
    }
    // we write our results to shared memory to speed up the parallel reduction
    sdata[pid] = sum * paso;
    outArray[pid] = sdata[pid];
    // wait until all work-items have finished writing to local memory
    barrier(CLK_LOCAL_MEM_FENCE);

    // parallel reduction on the GPU (sequential addressing, no bank conflicts)
    for (unsigned int s = nth/2; s > 0; s >>= 1) // bit shift is cheaper than division
    {
        if (pid < s) {
            sdata[pid] += sdata[pid+s];
            outArray[pid] += outArray[pid+s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // write the result back to global memory
    if (pid == 0) pi[0] = sdata[0];
}

Hi

I adjusted the program accordingly, although I’m not 100% sure if this is all right.

BTW: You’ve been passing single integer values to the kernel using cl_mem objects. For example, you declared your kernel as


__kernel void pi(__global const int * iterNumb ...)
{
    int iter = iterNumb[0];
    ...

And I assume that you have done something like this on the Java side:

int iterNumb[] = new int[1];
iterNumb[0] = 12345;
cl_mem iterNumbMem = clCreateBuffer(context, CL_MEM_READ_WRITE, 1 * Sizeof.cl_int, null, null);
clEnqueueWriteBuffer(commandQueue, iterNumbMem, true, 0, Sizeof.cl_int, Pointer.to(iterNumb), 0, null, null);
clSetKernelArg(kernel, 0, 1 * Sizeof.cl_mem, Pointer.to(iterNumbMem));

But single primitive values may be passed directly, which is much easier. Just declare your kernel as


__kernel void pi(int iter, ...)

Then you can do the following on the Java side:

int iter = 12345;
clSetKernelArg(kernel, 0, Sizeof.cl_int, Pointer.to(new int[]{iter}));

Here’s the adjusted program and the respective kernel:

package test;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import static org.jocl.CL.*;
import org.jocl.*;

public class PITest2
{
    private static String programSource0 = readFile("kernels/QuadFloat.cl");
    private static String programSource1 = readFile("kernels/Picl2.cl");

    public static void main(String args[])
    {
        // Create input- and output data
        int n = 512;
        int intervals = 10000000;

        // Obtain the platform IDs and initialize the context properties
        cl_platform_id platforms[] = new cl_platform_id[1];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

        // Create an OpenCL context on a GPU device
        cl_context context = clCreateContextFromType(contextProperties,
            CL_DEVICE_TYPE_GPU, null, null, null);
        if (context == null)
        {
            // If no context for a GPU device could be created,
            // try to create one for a CPU device.
            context = clCreateContextFromType(contextProperties,
                CL_DEVICE_TYPE_CPU, null, null, null);

            if (context == null)
            {
                System.out.println("Unable to create a context");
                return;
            }
        }

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Get the list of GPU devices associated with the context
        // and obtain the cl_device_id for the first device
        long numBytes[] = new long[1];
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);
        int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 
            numBytes[0], Pointer.to(devices), null);

        // Create a command-queue, program and kernel
        cl_command_queue commandQueue = clCreateCommandQueue(context,
            devices[0], 0, null);
        cl_program program = clCreateProgramWithSource(context, 2,
            new String[] { programSource0, programSource1 }, null, null);
        clBuildProgram(program, 0, null, null, null, null);
        cl_kernel kernel = clCreateKernel(program, "pi", null);

        
        // Allocate the memory objects for the input- and output data
        cl_mem piMem = clCreateBuffer(context, CL_MEM_READ_WRITE,
            Sizeof.cl_float, null, null);

        long global_work_size[] = new long[] { n };

        // Set the arguments for the kernel
        clSetKernelArg(kernel, 0, Sizeof.cl_int, 
            Pointer.to(new int[] { intervals }));
        clSetKernelArg(kernel, 1, Sizeof.cl_mem, 
            Pointer.to(piMem));
        clSetKernelArg(kernel, 2, n * Sizeof.cl_float, null);

        
        long before = System.nanoTime();
        cl_event kernelEvent1 = new cl_event();
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, global_work_size,
            null, 0, null, kernelEvent1);

        // Wait for the events, i.e. until the kernels have completed
        System.out.println("Waiting for events...");
        clWaitForEvents(1, new cl_event[] { kernelEvent1 });
        long after = System.nanoTime();
        double kernelTime = (double) (after - before) / 1e9;
        System.out.println("kernel time(secs): " + kernelTime);

        // Read the output data
        float result[] = new float[1];
        clEnqueueReadBuffer(commandQueue, piMem, CL_TRUE, 0, 
            Sizeof.cl_float, Pointer.to(result), 0, null, null);

        System.out.printf("OpenCL: %.20f
", result[0]);

        // Compute the result using Java (only float)
        float pi = 0;
        float dstArray[] = new float[n];
        before = System.nanoTime();
        test(intervals, dstArray, n);
        after = System.nanoTime();
        double javaTime = (double) (after - before) / 1e9;
        System.out.println("java   time(secs): " + javaTime);
        for (int i = 0; i < dstArray.length; i++)
        {
            pi += dstArray[i];
        }
        System.out.printf("Java  : %.20f
", pi);

        
        // Release kernel, program, and memory objects
        clReleaseMemObject(piMem);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);
    }

    private static String readFile(String fileName)
    {
        try
        {
            BufferedReader br = new BufferedReader(new FileReader(fileName));
            // we use a StringBuilder since Strings are immutable in Java
            StringBuilder sb = new StringBuilder();
            String line = null;
            while (true)
            {
                line = br.readLine();
                if (line == null)
                {
                    break;
                }
                sb.append(line + "
");
            }
            return sb.toString();
        }
        catch (IOException e)
        {
            e.printStackTrace();
            return "";
        }
    }

    private static void test(int intervals, float solution[], int get_global_size)
    {
        for (int pid = 0; pid < get_global_size; pid++)
        {
            // uint pid = get_global_id(0);
            // uint globalSize = get_global_size(0);
            int globalSize = get_global_size;
            
            float intervalWidth = 0;
            float x = 0;
            float sum = 0.0f;
            int elements = intervals / globalSize;
            int rest = intervals % globalSize;
            int length = elements;
            if (pid == globalSize - 1)
            {
                length += rest;
            }
            intervalWidth = 1.0f / (float) intervals;
            for (int i = pid * elements; i < pid * elements + length; i++)
            {
                x = (i + 0.5f) * intervalWidth;
                sum = sum + 4.0f / (1.0f + x * x);

            }
            solution[pid] = sum * intervalWidth;
        }
    }
}




Kernel “Picl2.cl”

// OpenCL extension for double precision
//#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void pi(unsigned int iter, __global float * pi, __local float * sdata)
{
    unsigned pid = get_global_id(0);
    unsigned nth =  get_global_size(0);
    float  x = 0.0f, sum = 0.0f;
    sdata[pid] = 0.0f;
    int step = iter / nth;
    int rest = iter % nth;
    float paso = 1.0f / iter;

    // each work-item processes its own chunk
    //#pragma unroll
    for (int i = pid*step; i < (pid+1)*step; i++) {
        x = (i + 0.5f) * paso;
        sum = sum + 4.0f / (1.0f + x * x);

    }
    // the last work-item also processes the remainder
    if(pid == nth-1){
        for (int i = pid*step; i < pid*step+rest; i++) {
            x = (i + 0.5f) * paso;
            sum = sum + 4.0f / (1.0f + x * x);

        }
    }
    // we write our results to shared memory to speed up the parallel reduction
    sdata[pid]=sum * paso;
    // wait until all work-items have finished writing to local memory
    barrier(CLK_LOCAL_MEM_FENCE);

    // parallel reduction on the GPU (sequential addressing, no bank conflicts)
    for(unsigned int s=nth/2;s>0;s>>=1) // bit shift is cheaper than division
    {
        if(pid<s) {
            sdata[pid]+=sdata[pid+s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // write the result back to global memory
    if(pid==0) pi[0]=sdata[0];
}