Correct usage of clFinish?

Hi all.

The documentation for clFinish says:
**clFinish does not return until all queued commands in command_queue have been processed and completed. clFinish is also a synchronization point.**

Please see the following code. For 125’000’000 calculations, the kernel takes **582 ms** to execute with clFinish().

When I **don't** use clFinish(), the kernel executes in 14 ms, and the results of both runs are still identical.

Am I using clFinish() wrong?

		long startTime = System.currentTimeMillis();

		// Enqueue the kernel for execution
		clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, global_work_size, local_work_size, 0, null, null);

		// Wait until all queued commands have completed (removed for the second run)
		clFinish(commandQueue);

		long endTime = System.currentTimeMillis();
		System.out.println("Time to execute kernel " + n + " calculations; " + (endTime-startTime) + " ms");

		long startReadBuffer = System.currentTimeMillis();
		// Blocking read (CL_TRUE): copy the result buffer from the device into the host variable dst
		clEnqueueReadBuffer(commandQueue, memResult, CL_TRUE, 0, n * Sizeof.cl_double, dst, 0, null, null);
		long endReadBuffer = System.currentTimeMillis();

		System.out.println("Read data from device; " + (endReadBuffer-startReadBuffer) + " ms");

Result with clFinish():
Allocated global memory for input data: 500.0 MB
Allocated global memory for result data: 1000.0 MB

GPU init + buffer create was 1562 ms
Time to execute kernel 125000000 calculations; 618 ms
Read data from device; 489 ms

Data in pos: [0,0,0]: 2.381773290676036
Data in pos: [9,0,0]: 1.7791850202025574
Data in pos: [0,9,0]: 67.25440184324135
Data in pos: [0,0,1]: 500.1776124768833
Data in pos: [9,1,9]: 1499.4575887415901
Data in pos: [214,127,210]: 7251.412959082049

Result without clFinish():
Allocated global memory for input data: 500.0 MB
Allocated global memory for result data: 1000.0 MB

GPU init + buffer create was 1549 ms
Time to execute kernel 125000000 calculations; 14 ms
Read data from device; 1019 ms

Data in pos: [0,0,0]: 2.381773290676036
Data in pos: [9,0,0]: 1.7791850202025574
Data in pos: [0,9,0]: 67.25440184324135
Data in pos: [0,0,1]: 500.1776124768833
Data in pos: [9,1,9]: 1499.4575887415901
Data in pos: [214,127,210]: 7251.412959082049

Thanks

//Fredrik

*** Edit ***

Hi.

I just found out :)

If clFinish() is not used, the kernel execution time is added to the clEnqueueReadBuffer() time instead, since clEnqueueReadBuffer() waits until the kernel is complete.

Correct?

//Fredrik

[QUOTE=Fredrik]If clFinish() is not used, the kernel execution time is added to the clEnqueueReadBuffer() time instead, since clEnqueueReadBuffer() waits until the kernel is complete.

Correct?

[/QUOTE]

Yes, basically that’s the point. In fact, I think that OpenCL implementations do (or are allowed to do) some “magic tracking” here, so that the clEnqueueReadBuffer call will only block if it involves a cl_mem that is still in use by another computation. But in any case, the ~600 ms of kernel execution have simply been deferred to the clEnqueueReadBuffer call: kernel launch plus read take 618 + 489 = 1107 ms with clFinish() and 14 + 1019 = 1033 ms without it, so the total is roughly the same.
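
To see the effect without a GPU, here is a minimal sketch in plain Java. It is only an analogy: a single-threaded ExecutorService stands in for an in-order command queue, Thread.sleep stands in for the kernel and the transfer, and Future.get() plays the role of clFinish() or a blocking read. The class name DeferredTimingSketch, the helper names and the 300 ms / 100 ms durations are made up for illustration and are not part of OpenCL or JOCL.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Analogy only: one worker thread stands in for an in-order command queue.
// None of these names belong to OpenCL or JOCL.
class DeferredTimingSketch {

    // Simulates device work by sleeping for the given number of milliseconds.
    static void busy(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns { ms measured around the "kernel launch", ms measured around the "blocking read" }.
    static long[] measure(boolean callFinish) throws Exception {
        ExecutorService queue = Executors.newSingleThreadExecutor();
        try {
            long start = System.nanoTime();

            // "Enqueue the kernel": submit() returns immediately, the work runs in the background.
            Future<?> kernel = queue.submit(() -> busy(300));
            if (callFinish) {
                kernel.get(); // stand-in for clFinish(): wait until the queued work is done
            }
            long afterKernel = System.nanoTime();

            // "Blocking read": queued behind the kernel, and get() waits for its completion.
            queue.submit(() -> busy(100)).get();
            long afterRead = System.nanoTime();

            return new long[] {
                (afterKernel - start) / 1_000_000,
                (afterRead - afterKernel) / 1_000_000
            };
        } finally {
            queue.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        for (boolean callFinish : new boolean[] {true, false}) {
            long[] t = measure(callFinish);
            System.out.println("finish=" + callFinish
                    + ": kernel " + t[0] + " ms, read " + t[1] + " ms");
        }
    }
}
```

Without the wait, the first measurement only covers the time needed to submit the task, and the second one absorbs the simulated kernel time, which is the same shift as in the numbers above. To time the kernel itself, the measurement has to include a wait for its completion.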

Yes, it seems so.

Thanks for confirming :)