Tuning with JOCL

Fredrik · 12 September 2016 at 12:21

Hi.

I have been working with a simple example to tune my JOCL software.

Lots of things could be done, like using pinned memory, zero copy, single precision and relaxed math.

Right now I'm generating the input data in a separate CPU thread and doing the JOCL init in the main CPU thread. Generating the input data for the kernel sometimes takes 200-800 ms (super large input).

What I can't figure out is how to tune the JOCL startup. As of now it takes around 400-600 ms to initialize JOCL.

Is there some good way, like doing the startup in 2 threads, or some other solution? It would be very nice if this could be cut in half to 200-300 ms.

Thanks

//Fredrik

Marco13 · 13 September 2016 at 04:24

Hello

Admittedly, I never did (and hardly do) consider a startup time of 400 ms as critical, compared to the overhead of starting the JVM. It’s not entirely clear what the application pattern is, and where these 400 ms come from. IF (!) they are due to the initialization of OpenCL (obtaining platform, context and device), then there’s no way to avoid this (one could only try to do it earlier, at a point where 400 ms don’t hurt). The main overhead that could be caused by JOCL itself is the unpacking and loading of the native library (but the unpacking should only happen ONCE; after that, the library should be in the TEMP folder, and loaded from there). One could then try to load the JOCL class earlier, but that’s just a vague guess.
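The "do it earlier" idea could look roughly like the sketch below: kick off the initialization in a background task as the very first thing the application does, and only wait for it at the point where the result is really needed. This uses only the standard library. `initOpenCl` and `generateInput` are hypothetical stand-ins (not JOCL functions); in a real program `initOpenCl` would contain the platform, device and context setup and would return whatever handle object the application uses.

```java
import java.util.concurrent.CompletableFuture;

class EarlyInit {
    // Hypothetical stand-in for the OpenCL setup (platform, device, context).
    // The sleep only simulates the startup cost; it is not a JOCL call.
    static String initOpenCl() {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "opencl-ready";
    }

    // Hypothetical stand-in for the input generation.
    static float[] generateInput(int n) {
        float[] data = new float[n];
        for (int i = 0; i < n; i++) {
            data[i] = i * 0.5f;
        }
        return data;
    }

    // Starts the initialization on a background thread and returns immediately.
    static CompletableFuture<String> startEarly() {
        return CompletableFuture.supplyAsync(EarlyInit::initOpenCl);
    }

    public static void main(String[] args) {
        CompletableFuture<String> init = startEarly(); // first thing at startup
        float[] input = generateInput(1_000_000);      // other work overlaps with the init
        String handle = init.join();                   // blocks only if the init is not done yet
        System.out.println(handle + ", input length " + input.length);
    }
}
```

This does not make the initialization itself faster; it only hides it behind other work, so the gain is limited by how much other work there is to overlap.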

Do you know more exactly where these 400ms come from?

bye
Marco

Fredrik · 13 September 2016 at 08:12

Hi Marco

Sure, 400 or 600 ms is not critical. It would just be nice to make it shorter if possible. Let's say the calculations only take 300 ms; then another 400-600 ms is needed just for the default init.

But as you said, it's only needed once; after that the lib is loaded and can be reused.

The 400-600 ms comes from my own benchmark; this is the normal startup time on my PC.

It's everything between

cl_platform_id platforms[] = new cl_platform_id[1];
clGetPlatformIDs(platforms.length, platforms, null);
and…
clBuildProgram(program, 0, null, null, null, null);
kernel = clCreateKernel(program, "somekernel", null);

Thanks

//Fredrik

Marco13 · 13 September 2016 at 10:12

OK, then this also involves the compilation of the program. This can, in fact, take arbitrarily long. (IIRC there have been corner cases with excessively nested loops, where some compilers basically seemed to do an optimization that took O(n^2) (or even longer), and where the PC seemingly hung when trying to compile the program.)
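To see whether the compilation really dominates, the measured span can be split into named stages. The following is a minimal sketch using only `System.nanoTime`; the `StageTimer` class and its method names are made up for this illustration and are not part of JOCL.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class StageTimer {
    // Accumulated nanoseconds per stage, kept in the order the stages first ran.
    private final Map<String, Long> nanos = new LinkedHashMap<>();

    // Runs the body and adds its wall-clock duration to the given stage.
    void time(String stage, Runnable body) {
        long start = System.nanoTime();
        body.run();
        nanos.merge(stage, System.nanoTime() - start, Long::sum);
    }

    Map<String, Long> results() {
        return nanos;
    }

    // Prints one line per stage, in milliseconds.
    void print() {
        nanos.forEach((stage, ns) ->
            System.out.printf("%-16s %8.2f ms%n", stage, ns / 1e6));
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        // Placeholder bodies; in a real program each body would wrap one OpenCL call.
        timer.time("platform", () -> { });
        timer.time("build", () -> { });
        timer.print();
    }
}
```

In the program from this thread, one stage would wrap `clGetPlatformIDs`, another `clBuildProgram`, and another `clCreateKernel`. If the "build" line accounts for most of the 400-600 ms, the startup cost is mainly the compiler and not JOCL or the platform setup.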

If this is really time critical, you can consider compiling the code only once and then loading only the binaries, but this is a bit fiddly, less flexible, and it’s hard to tell beforehand whether this brings a noticeable speedup.
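The "compile once, then load the binaries" pattern could be organized roughly as sketched below. This only shows the caching logic on disk and deliberately contains no OpenCL calls: the `compiler` function is a hypothetical hook where the real build would happen. In OpenCL terms, the binary would be fetched after a normal build via `clGetProgramInfo` with `CL_PROGRAM_BINARIES`, and a cached binary would be loaded with `clCreateProgramWithBinary`. The class name, the file layout and the choice of cache key are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Function;

class ProgramBinaryCache {
    private final Path dir;

    ProgramBinaryCache(Path dir) {
        this.dir = dir;
    }

    // Cache key: hash over device name, build options and source, so that a
    // change in any of them leads to a fresh compilation.
    static String key(String deviceName, String options, String source) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            String all = deviceName + "\n" + options + "\n" + source;
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(all.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the cached binary if present; otherwise calls the (hypothetical)
    // compiler hook once and stores its result for the next start.
    byte[] getOrCompile(String deviceName, String options, String source,
                        Function<String, byte[]> compiler) {
        try {
            Files.createDirectories(dir);
            Path file = dir.resolve(key(deviceName, options, source) + ".bin");
            if (Files.exists(file)) {
                return Files.readAllBytes(file);
            }
            byte[] binary = compiler.apply(source);
            Files.write(file, binary);
            return binary;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Program binaries are specific to the device and usually to the driver version as well, so a real cache key would probably have to include the driver version too, and the code should fall back to compiling from source when loading a cached binary fails. That is part of what makes this approach fiddly.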