Small difference in performance between SP and DP

Hi.

I just did some benchmarking and compared SP with DP. The difference was not as large as expected: only about 50% slower with DP.

I calculated sin(gid)+cos(gid)+sqrt(gid) 27’000’000 times.

With SP it took 204 ms and with DP around 300 ms. With the CPU (single thread) it took 24 seconds.

I was expecting a 10-fold difference.

Anyone got an idea why the difference is so small?

My card has 0.7 TFLOPS DP and 3.5 TFLOPS SP.

//Fredrik

It’s hard to say what may be the reasons for this.

How exactly did you do the benchmark? That is, which time did you measure, and how? Note that in many cases, the memory copies may be the most expensive part.

Hi Marco13

The benchmark includes only the clEnqueueNDRangeKernel call.

This is interesting, but I'm very happy anyway since the performance is very good :slight_smile:

Note: my card is 3.5 TFLOPS SP and 0.2 TFLOPS DP (not 0.7)

//Fredrik

Maybe you could post your benchmark (as a small, compilable example) so that others can

  1. test if they achieve similar results
  2. point out what might be wrong

Did you use the results of these calculations? They might be optimized away by the compiler. Also, these operations are rather fast. The overhead might eat most of the time, and the overhead for SP and DP should not be much different.
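If the CPU comparison is just a plain loop over these functions, something like this (only a sketch, the variable names are made up) would make sure the results are actually used:

// Sketch: accumulate the results so that the JIT cannot remove the
// computation as dead code ('n' stands for the number of calculations)
double sum = 0;
for (int i = 0; i < n; i++)
{
    sum += Math.sin(i) + Math.cos(i) + Math.sqrt(i);
}
System.out.println("Checksum (printed only so the loop is not optimized away): " + sum);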

Hi all

For those who would like to compare :slight_smile:

My PC: AMD 8350 (CPU), 2x R9 380 4GB (GPU) and 16GB RAM.

The first version is the DP code and the second is the SP version.

For me, the result for DP is:

**GPU init + buffer create was 245 ms
Time to execute kernel with 27000000 calculations; 244 ms

Control data [0,0,0]: 2.381773290676036
Control data [9,0,0]: 1.7791850202025574
Control data [0,9,0]: 51.99337081388339
Control data [0,0,1]: 301.40362615629846
Control data [9,1,9]: 899.8805224704302

Time to execute on single thread CPU 27000000 calculations; 24620 ms**

and for the SP version:

**GPU init + buffer create was 192 ms
Time to execute kernel with 27000000 calculations; 120 ms

Control data [0,0,0]: 2.3817732
Control data [9,0,0]: 1.779185
Control data [0,9,0]: 51.99337
Control data [0,0,1]: 301.4036
Control data [9,1,9]: 899.8805**

DP code

import org.jocl.*;
import static org.jocl.CL.*;

public class Driver
{	
	private static int arrayXLength = 300;
	private static int arrayYLength = 300;
	private static int arrayZLength = 300;
	private static boolean COMPARE_WITH_CPU = true;
	
	private static String programSource =
			"__kernel void sampleKernel(__global const int *in, __global double *out)"+
					"{"+
					"    int gid = get_global_id(0);"+		
					"    out[gid] = sin((double)in[gid]) + cos((double)in[gid]) + sqrt((double)in[gid]);"+ 
					"}";

	private static cl_context context;
	private static cl_command_queue commandQueue;
	private static cl_kernel kernel;
	private static cl_program program;

	public static void main(String args[])
	{
		cl_platform_id platforms[] = new cl_platform_id[1];
		clGetPlatformIDs(platforms.length, platforms, null);
		cl_context_properties contextProperties = new cl_context_properties();
		contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

		CL.setExceptionsEnabled(true);

		context = clCreateContextFromType(contextProperties, CL_DEVICE_TYPE_GPU, null, null, null);

		long numBytes[] = new long[1];
		clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);

		int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
		cl_device_id devices[] = new cl_device_id[numDevices];
		clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0], Pointer.to(devices), null);

		commandQueue =	clCreateCommandQueue(context, devices[0], 0, null);

		program = clCreateProgramWithSource(context, 1, new String[]{ programSource }, null, null);

		clBuildProgram(program, 0, null, null, null, null);

		kernel = clCreateKernel(program, "sampleKernel", null);

		int n = arrayXLength * arrayYLength * arrayZLength;

		int array[] = new int[n];
		initArray(array);

		double dstArray[] = new double[n];
		Pointer dst = Pointer.to(dstArray);

		Pointer ptArray = Pointer.to(array);

		long startInit = System.currentTimeMillis();		

		//Create memory buffers on the device
		cl_mem mem = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, Sizeof.cl_int * n, ptArray, null);
		cl_mem memResult = clCreateBuffer(context, CL_MEM_WRITE_ONLY, Sizeof.cl_double * n, null, null); // result buffer, only written by the kernel

		clEnqueueWriteBuffer(commandQueue, mem, CL_TRUE, 0, Sizeof.cl_int * n, ptArray,0,null,null);

		// Set the arguments for the kernel
		clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(mem));
		clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(memResult));

		long endInit = System.currentTimeMillis();		
		System.out.println("
GPU init + buffer create was " + (endInit-startInit) + " ms");

		long global_work_size[] = new long[]{n};

		long local_work_size[] = new long[]{256};

		long startTime = System.currentTimeMillis();

		// Execute the kernel
		clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, global_work_size, local_work_size, 0, null, null);

		// Read the memory on device and store it into host mem dst variable
		clEnqueueReadBuffer(commandQueue, memResult, CL_TRUE, 0, n * Sizeof.cl_double, dst, 0, null, null);

		long endTime = System.currentTimeMillis();
		System.out.println("Time to execute kernel  with " + n + " calculations; " + (endTime-startTime) + " ms");

		System.out.print("
Control data [0,0,0]: "+ getData(dstArray, 0, 0, 0));
		System.out.print("
Control data [9,0,0]:  "+ getData(dstArray, 9, 0, 0));
		System.out.print("
Control data [0,9,0]: "+ getData(dstArray, 0, 9, 0));
		System.out.print("
Control data [0,0,1]: "+ getData(dstArray, 0, 0, 1));
		System.out.print("
Control data [9,1,9]: "+ getData(dstArray, 9, 1, 9));

		clReleaseKernel(kernel);
		clReleaseProgram(program);
		clReleaseMemObject(mem);
		clReleaseMemObject(memResult);
		clReleaseCommandQueue(commandQueue);
		clReleaseContext(context);

		if(COMPARE_WITH_CPU)
		{		
			startTime = System.currentTimeMillis();
			for(int i = 0 ; i < n ; i++)
			{
				float tmp = (float) (Math.sin(i) + Math.cos(i) + Math.sqrt(i));	
			}
			
			endTime = System.currentTimeMillis();
			System.out.println("

Time to execute on single thread CPU " + n + " calculations; " + (endTime-startTime) + " ms");
		}
	}

	private static double getData(double[] array, int x, int y, int z)
	{
		return array[x + (y * arrayXLength) + (z * arrayYLength * arrayZLength)];		
	}

	private static void initArray(int[] array) 
	{	
		for(int x = 0 ; x < arrayXLength*arrayYLength*arrayZLength ; x++)
		{
			array[x] = x+1; 	
		}		
	}
}

and the SP version

import org.jocl.*;
import static org.jocl.CL.*;

public class Driver
{	
	private static int arrayXLength = 300;
	private static int arrayYLength = 300;
	private static int arrayZLength = 300;
	private static boolean COMPARE_WITH_CPU = true;
	
	private static String programSource =
			"__kernel void sampleKernel(__global const int *in, __global float *out)"+
					"{"+
					"    int gid = get_global_id(0);"+		
					"    out[gid] = sin((float)in[gid]) + cos((float)in[gid]) + sqrt((float)in[gid]);"+ 
					"}";

	private static cl_context context;
	private static cl_command_queue commandQueue;
	private static cl_kernel kernel;
	private static cl_program program;

	public static void main(String args[])
	{
		cl_platform_id platforms[] = new cl_platform_id[1];
		clGetPlatformIDs(platforms.length, platforms, null);
		cl_context_properties contextProperties = new cl_context_properties();
		contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

		CL.setExceptionsEnabled(true);

		context = clCreateContextFromType(contextProperties, CL_DEVICE_TYPE_GPU, null, null, null);

		long numBytes[] = new long[1];
		clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);

		int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
		cl_device_id devices[] = new cl_device_id[numDevices];
		clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0], Pointer.to(devices), null);

		commandQueue =	clCreateCommandQueue(context, devices[0], 0, null);

		program = clCreateProgramWithSource(context, 1, new String[]{ programSource }, null, null);

		clBuildProgram(program, 0, null, null, null, null);

		kernel = clCreateKernel(program, "sampleKernel", null);

 		int n = arrayXLength * arrayYLength * arrayZLength;

		int array[] = new int[n];
		initArray(array);

		float dstArray[] = new float[n];
		Pointer dst = Pointer.to(dstArray);

		Pointer ptArray = Pointer.to(array);

		long startInit = System.currentTimeMillis();		

		//Create memory buffers on the device
		cl_mem mem = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, Sizeof.cl_int * n, ptArray, null);
		cl_mem memResult = clCreateBuffer(context, CL_MEM_WRITE_ONLY, Sizeof.cl_float * n, null, null); // result buffer, only written by the kernel

		clEnqueueWriteBuffer(commandQueue, mem, CL_TRUE, 0, Sizeof.cl_int * n, ptArray,0,null,null);

		// Set the arguments for the kernel
		clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(mem));
		clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(memResult));

		long endInit = System.currentTimeMillis();		
		System.out.println("
GPU init + buffer create was " + (endInit-startInit) + " ms");

		long global_work_size[] = new long[]{n};

		long local_work_size[] = new long[]{256};

		long startTime = System.currentTimeMillis();

		// Execute the kernel
		clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, global_work_size, local_work_size, 0, null, null);

		// Read the memory on device and store it into host mem dst variable
		clEnqueueReadBuffer(commandQueue, memResult, CL_TRUE, 0, n * Sizeof.cl_float, dst, 0, null, null);

		long endTime = System.currentTimeMillis();
		System.out.println("Time to execute kernel  with " + n + " calculations; " + (endTime-startTime) + " ms");

		System.out.print("
Control data [0,0,0]: "+ getData(dstArray, 0, 0, 0));
		System.out.print("
Control data [9,0,0]: "+ getData(dstArray, 9, 0, 0));
		System.out.print("
Control data [0,9,0]: "+ getData(dstArray, 0, 9, 0));
		System.out.print("
Control data [0,0,1]: "+ getData(dstArray, 0, 0, 1));
		System.out.print("
Control data [9,1,9]: "+ getData(dstArray, 9, 1, 9));

		// Release kernel, program, and memory objects
		clReleaseKernel(kernel);
		clReleaseProgram(program);
		clReleaseMemObject(mem);
		clReleaseMemObject(memResult);
		clReleaseCommandQueue(commandQueue);
		clReleaseContext(context);

		if(COMPARE_WITH_CPU)
		{		
			startTime = System.currentTimeMillis();
			for(int i = 0 ; i < n ; i++)
			{
				float tmp = (float) (Math.sin(i) + Math.cos(i) + Math.sqrt(i));	
			}
			
			endTime = System.currentTimeMillis();
			System.out.println("

Time to execute on single thread CPU " + n + " calculations; " + (endTime-startTime) + " ms");
		}
	}

	private static float getData(float[] array, int x, int y, int z)
	{
		return array[x + (y * arrayXLength) + (z * arrayYLength * arrayZLength)];		
	}

	private static void initArray(int[] array) 
	{	
		for(int x = 0 ; x < arrayXLength*arrayYLength*arrayZLength ; x++)
		{
			array[x] = x+1; 	
		}		
	}
}

From quickly skimming over the code (I’ll test it when I’m back at my main development PC), you indeed seem to include the memory transfer time:

long startTime = System.currentTimeMillis();
 
// Execute the kernel
clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, global_work_size, local_work_size, 0, null, null);
 
// Read the memory on device and store it into host mem dst variable
clEnqueueReadBuffer(commandQueue, memResult, CL_TRUE, 0, n * Sizeof.cl_double, dst, 0, null, null);
 
long endTime = System.currentTimeMillis();

...

In order to really only measure the execution time of the kernel, you could do something like this:


// Make sure that there are no pending operations left
clFinish(commandQueue);

long startTime = System.currentTimeMillis();
 
// Execute the kernel
clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, global_work_size, local_work_size, 0, null, null);

// Wait until the kernel execution was completed
clFinish(commandQueue);

long endTime = System.currentTimeMillis();
 
// Read the memory on device and store it into host mem dst variable
clEnqueueReadBuffer(commandQueue, memResult, CL_TRUE, 0, n * Sizeof.cl_double, dst, 0, null, null);

... 

Alternatively, you could use cl_events to precisely measure the execution time, but in this simple case, the time measurement with currentTimeMillis (or maybe System.nanoTime) should be very close to the “real” time.
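For completeness, a sketch of the event-based timing. Note that this assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, which your current code does not do:

// Sketch only: requires a command queue created with profiling enabled, e.g.
// commandQueue = clCreateCommandQueue(context, devices[0], CL_QUEUE_PROFILING_ENABLE, null);
cl_event kernelEvent = new cl_event();
clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, global_work_size, local_work_size, 0, null, kernelEvent);

// Wait for the kernel and query the device-side timestamps (in nanoseconds)
clWaitForEvents(1, new cl_event[]{ kernelEvent });
long start[] = new long[1];
long end[] = new long[1];
clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_START, Sizeof.cl_ulong, Pointer.to(start), null);
clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_END, Sizeof.cl_ulong, Pointer.to(end), null);
System.out.println("Kernel time: " + (end[0] - start[0]) / 1e6 + " ms");
clReleaseEvent(kernelEvent);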

Hi.

Yes, I missed that part, so I should not include the buffer handling in the kernel measurement :slight_smile:

I will fix it and try again.

I modified the code so it now uses both my graphics cards. I split the input data in half so each device handles 300x300x150 calculations.

The result was very good; OpenCL seems to scale well, close to 100%.

As can be seen below, 125 - 94 = 31 ms kernel execution time for 27000000 calculations.

Test result with 2x GPU DP (with the same measurement logic as the version at the top of the page):

**Allocated global memory for each device - input data: 108.0 MB
Allocated global memory for each device - result data: 54.0 MB

GPU init + create both buffers 1124 ms (takes longer since I'm doing it serially)
Time to execute both kernels and read both buffers from device 125 ms (much faster since the kernels execute in parallel)
Time to just read both buffers from both devices 94 ms

Result from GPU1 [0,0,0]: 2.381773290676036
Result from GPU1 [9,0,0]: 1.7791850202025574
Result from GPU1 [0,9,0]: 51.99337081388339
Result from GPU1 [0,0,1]: 213.54739128521754
Result from GPU1 [9,1,9]: 638.0197402979363

Result from GPU2 [0,0,0]: 2.381773290676036
Result from GPU2 [9,0,0]: 1.7791850202025574
Result from GPU2 [0,9,0]: 51.99337081388339
Result from GPU2 [0,0,1]: 213.54739128521754
Result from GPU2 [9,1,9]: 638.0197402979363**
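
Roughly, the splitting looks like this (just a sketch reusing the variable names from the DP code above, not exactly what I ran; it assumes both devices are in the same context, and error handling and releases are left out):

// Sketch: one command queue per device, each working on half of the data
// (assumes devices[0] and devices[1] are the two GPUs in the same context)
int half = n / 2;
cl_command_queue queues[] = new cl_command_queue[2];
cl_mem inBufs[] = new cl_mem[2];
cl_mem outBufs[] = new cl_mem[2];
for (int d = 0; d < 2; d++)
{
	queues[d] = clCreateCommandQueue(context, devices[d], 0, null);
	inBufs[d] = clCreateBuffer(context, CL_MEM_READ_ONLY, Sizeof.cl_int * half, null, null);
	outBufs[d] = clCreateBuffer(context, CL_MEM_WRITE_ONLY, Sizeof.cl_double * half, null, null);

	// Copy one half of the input array to each device (non-blocking)
	clEnqueueWriteBuffer(queues[d], inBufs[d], CL_FALSE, 0, Sizeof.cl_int * half,
		Pointer.to(array).withByteOffset((long) d * half * Sizeof.cl_int), 0, null, null);
}

// Enqueue the kernel on both queues; the arguments are captured at enqueue time,
// so both kernels can run concurrently, each on its own device
long halfWorkSize[] = new long[]{ half };
for (int d = 0; d < 2; d++)
{
	clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(inBufs[d]));
	clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(outBufs[d]));
	clEnqueueNDRangeKernel(queues[d], kernel, 1, null, halfWorkSize, null, 0, null, null);
}

// Read both halves of the result back into the host array (blocking)
for (int d = 0; d < 2; d++)
{
	clEnqueueReadBuffer(queues[d], outBufs[d], CL_TRUE, 0, Sizeof.cl_double * half,
		Pointer.to(dstArray).withByteOffset((long) d * half * Sizeof.cl_double), 0, null, null);
}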