Computing execution time

Marco, you sent me an example of how events can be used. My question is about comparing the host side against the device side. I use CUDA events for the device part, as in the example you sent.

In C CUDA programs, I use the following code to measure the time of serial code on the host:

#include <windows.h>
#include <stdio.h>

// Globals used by the timer functions
double PCFreq = 0.0;      // performance counter ticks per millisecond
__int64 CounterStart = 0; // counter value when StartCounter() was called

void StartCounter()
{
	LARGE_INTEGER li;
	if (!QueryPerformanceFrequency(&li))
		printf("QueryPerformanceFrequency failed!\n");

	PCFreq = double(li.QuadPart) / 1000.0;

	QueryPerformanceCounter(&li);
	CounterStart = li.QuadPart;
}

double GetCounter()
{
	LARGE_INTEGER li;
	QueryPerformanceCounter(&li);
	return double(li.QuadPart - CounterStart) / PCFreq;
}



StartCounter();
sumMatrixOnHost(h_A, h_B, hostRef, nx, ny);
printf("sumMatrixOnCPU %f ms\n", GetCounter());

Is QueryPerformanceFrequency available in Java? If not, are there other ways to estimate the time of a serial host function when using JCuda? Thanks for your great efforts.

One option would be the following:

long startNs = System.nanoTime();
sumMatrixOnHost(...);
long endNs = System.nanoTime();
long durationMs = (endNs - startNs) / 1e6;
System.out.println("That took " + durationMs + " milliseconds");

But of course, measuring the execution time of a Java program in this way hardly makes sense: the just-in-time compiler (JIT) will distort the results. There are some ways to alleviate that problem. Usually, you should at least run the task (sumMatrixOnHost here) multiple times, with different input sizes, make sure that the result of the computation is actually used (to prevent it from being optimized away), compute the average time over several runs, and start the JVM with -verbose:gc to see whether garbage collection might distort the results. It’s complicated.
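Just to sketch what such a measurement loop could look like (the array sizes and the three-argument sumMatrixOnHost signature here are only placeholders, not your actual code):

public class NaiveBenchmark
{
    public static void main(String[] args)
    {
        int n = 1024 * 1024; // placeholder size
        float[] a = new float[n];
        float[] b = new float[n];
        float[] result = new float[n];

        // Warm-up runs, so that the JIT has a chance to compile the hot code
        for (int i = 0; i < 10; i++)
        {
            sumMatrixOnHost(a, b, result);
        }

        // Timed runs: average over several repetitions, and keep a checksum
        // of the results so that the computation cannot be optimized away
        int runs = 20;
        double totalMs = 0;
        double checksum = 0;
        for (int i = 0; i < runs; i++)
        {
            long startNs = System.nanoTime();
            sumMatrixOnHost(a, b, result);
            long endNs = System.nanoTime();
            totalMs += (endNs - startNs) / 1e6;
            checksum += result[i];
        }
        System.out.println("Average: " + (totalMs / runs)
            + " ms, checksum " + checksum);
    }

    // Placeholder for the actual host computation
    private static void sumMatrixOnHost(float[] a, float[] b, float[] result)
    {
        for (int i = 0; i < result.length; i++)
        {
            result[i] = a[i] + b[i];
        }
    }
}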

So, how can I accurately compare the two times: the serial code against the same code in parallel?

All the books I have read about CUDA report times in the way I sent you.
For JCuda, how do I estimate the time accurately? I have finished implementing the kernels for a set of tokens (part of the ontology) and their serial counterparts. I will apply them to the whole ontology, which contains thousands or hundreds of thousands of elements. I need the timing difference, please.

For the CUDA/JCuda part, you can use the events to compute the execution time of the kernel.
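In JCuda, that looks roughly like the following (a sketch using the runtime API and the default stream; the kernel launch itself is only indicated by a placeholder comment):

import jcuda.runtime.JCuda;
import jcuda.runtime.cudaEvent_t;

public class KernelTiming
{
    public static void main(String[] args)
    {
        JCuda.setExceptionsEnabled(true);

        // Create the start and stop events
        cudaEvent_t start = new cudaEvent_t();
        cudaEvent_t stop = new cudaEvent_t();
        JCuda.cudaEventCreate(start);
        JCuda.cudaEventCreate(stop);

        // Record the start event, launch the kernel, record the stop event
        // (null means the default stream)
        JCuda.cudaEventRecord(start, null);
        // ... launch the kernel here ...
        JCuda.cudaEventRecord(stop, null);

        // Wait for the stop event and query the elapsed time in milliseconds
        JCuda.cudaEventSynchronize(stop);
        float elapsedMs[] = { 0.0f };
        JCuda.cudaEventElapsedTime(elapsedMs, start, stop);
        System.out.println("Kernel took " + elapsedMs[0] + " ms");

        JCuda.cudaEventDestroy(start);
        JCuda.cudaEventDestroy(stop);
    }
}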

For the Host/Java part… I could point you to the MicroBenchmarks page of the OpenJDK Wiki. You could spend time learning JMH (openjdk/jmh on GitHub, https://openjdk.java.net/projects/code-tools/jmh/) and setting up a proper benchmark. You could read about the JIT and garbage collection. You could analyze your data, and generate different data sets for the comparison. You could do research about that topic, in order to make a performance claim that is profound and useful for others.
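Just to give an impression of what that involves: a minimal JMH benchmark could look roughly like this (assuming the jmh-core and jmh-generator-annprocess dependencies are on the classpath; the matrix size and the body of sumMatrixOnHost are placeholders):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class SumMatrixBenchmark
{
    private float[] a;
    private float[] b;
    private float[] result;

    @Setup
    public void setup()
    {
        int n = 1024 * 1024; // placeholder size
        a = new float[n];
        b = new float[n];
        result = new float[n];
    }

    @Benchmark
    public float[] sumMatrixOnHost()
    {
        for (int i = 0; i < result.length; i++)
        {
            result[i] = a[i] + b[i];
        }
        return result; // returning the result prevents dead-code elimination
    }
}

Returning the result from the benchmark method is the usual way to keep the JIT from removing the computation; JMH then takes care of warm-up iterations, forking, and averaging.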

Or you could just use the function that I showed you. It’s probably „good enough“.

Marco, when I apply the timer you sent

long startNs = System.nanoTime();
sumMatrixOnHost(...);
long endNs = System.nanoTime();
long durationMs = (endNs - startNs) / 1e6;
System.out.println("That took " + durationMs + " milliseconds");

it gives 0, why?

I wrote long durationMs, but it should have been double durationMs. Here is a complete example:

public class ExecutionTime
{
    public static void main(String[] args)
    {
        long startNs = System.nanoTime();
        sumMatrixOnHost();
        long endNs = System.nanoTime();
        double durationMs = (endNs - startNs) / 1e6;
        System.out.println("That took " + durationMs + " milliseconds");
    }

    private static void sumMatrixOnHost()
    {
        for (int i = 0; i < 10; i++)
        {
            System.out.println("CPU goes brrrr...");
        }
    }
}

But note that in order to handle the possible optimization by the JIT, you’ll have to do „something“ with the result of the sumMatrixOnHost function.
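For example (a sketch; in your real code, sumMatrixOnHost would return or fill the actual result matrix), you could print a checksum of the result after the timing:

public class ExecutionTimeWithResult
{
    public static void main(String[] args)
    {
        long startNs = System.nanoTime();
        float[] result = sumMatrixOnHost();
        long endNs = System.nanoTime();
        double durationMs = (endNs - startNs) / 1e6;

        // "Do something" with the result (here: print a checksum),
        // so that the JIT cannot optimize the computation away
        float checksum = 0;
        for (float value : result)
        {
            checksum += value;
        }
        System.out.println("That took " + durationMs
            + " milliseconds, checksum " + checksum);
    }

    // Placeholder: the real method would sum the actual matrices
    private static float[] sumMatrixOnHost()
    {
        float[] result = new float[1024];
        for (int i = 0; i < result.length; i++)
        {
            result[i] = i;
        }
        return result;
    }
}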